Date: May 2026
Hardware: RTX 4070 Laptop GPU (8GB VRAM), Intel Ultra 7 165H, 64GB DDR5
Model: turboderp/Qwen3.5-9B-exl3 @ 4.00bpw
Backend: tabbyAPI (ExLlamaV3 0.0.37)
The Goal
I wanted a fast, long-context 9B model running locally as an OpenAI-compatible API. Not Ollama — I needed raw performance and a proper context window, not convenience defaults. The candidate: Qwen3.5-9B at 4.0 bpw EXL3 quantisation via tabbyAPI in an Incus LXC container with GPU passthrough.
The question was never whether it would run. It was: what's the real context ceiling, and does the software stack matter as much as the marketing claims?
Why tabbyAPI over vLLM or Ollama?
Ollama's great when you want things to just work. vLLM is the right tool for batched multi-user workloads. But for a single-user API that needs maximum single-request throughput and fine-grained KV control:
- tabbyAPI (ExLlamaV3): paged KV cache, per-element KV quantisation (k_bits, v_bits independently), aggressive speculative decoding, GDN-aware caching
- vLLM: excellent batching, but the KV and context ceiling maths are more complex on 8GB with hybrid models
- Ollama: abstracts away all the knobs — which is a problem when the knobs are exactly what you want to tune
The Architecture Surprise
Before tuning anything I read the config.json:
"layer_types": ["linear", "linear", "linear", "full", ...]
"full_attention_interval": 4
Qwen3.5-9B is not a pure transformer. It's a Gated DeltaNet (GDN) hybrid: 24 linear attention (GatedDeltaNet) layers and only 8 full attention layers out of 32 total. That changes everything about the VRAM maths.
Standard transformer KV cache sizing assumes every layer has a full attention KV block. With only 8 full-attention layers:
bytes/token = 8 layers × 4 KV_heads × 256 head_dim × 2 (K+V) × bits/8
| cache_mode | bytes/token | tokens in 2.4 GB KV budget |
|---|---|---|
| 8,8 (Q8 K+V) | 16,384 B | ~155,000 |
| 8,4 (Q8 K, Q4 V) | 12,288 B | ~207,000 |
| 4,4 (Q4 K+V) | 8,192 B | ~310,000 |
The KV pressure is 4× lower than a pure transformer at the same parameter count. That meant 180K tokens at Q8,4 was plausible before a single probe had run.
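The bytes/token column can be reproduced with a few lines of Python. This is a minimal sketch using only the constants quoted above (8 full-attention layers, 4 KV heads, head_dim 256). The function names are mine, and K and V are counted separately so that mixed modes such as 8,4 come out right.

```python
# Illustrative KV sizing for a hybrid model; constants are the ones quoted above.
FULL_ATTENTION_LAYERS = 8
KV_HEADS = 4
HEAD_DIM = 256


def kv_bytes_per_token(k_bits: int, v_bits: int) -> int:
    """Bytes of KV cache per token, counting only full-attention layers."""
    elements = FULL_ATTENTION_LAYERS * KV_HEADS * HEAD_DIM  # per K, and again per V
    return elements * k_bits // 8 + elements * v_bits // 8


def tokens_in_budget(budget_bytes: int, k_bits: int, v_bits: int) -> int:
    """How many whole tokens of KV fit in a given byte budget."""
    return budget_bytes // kv_bytes_per_token(k_bits, v_bits)


if __name__ == "__main__":
    for mode in [(8, 8), (8, 4), (4, 4)]:
        print(mode, kv_bytes_per_token(*mode), "bytes/token")
```

Swapping FULL_ATTENTION_LAYERS to 32 gives the pure-transformer figure, which is where the 4× factor comes from.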
The GDN recurrent state (the 24 linear layers) doesn't live in VRAM at all. tabbyAPI serialises it to system RAM between requests via sysmem_recurrent_cache — an OrderedDict in Python process memory. At the default 4 GB, dozens of concurrent long sessions co-exist without touching the GPU.
Finding the Actual Context Ceiling
Theoretical maths tells you where to probe. It doesn't tell you the real ceiling. ExLlamaV3 has internal allocation overhead, batch workspace, and the model weights themselves — all competing for the same 8 GB.
I wrote a binary search probe script (tabbyapi_probe.py) that:
1. Writes a config with a candidate cache_size
2. Pushes it into the container and restarts the service
3. Polls journalctl for Application startup complete (success) or Insufficient VRAM in split for model and cache (OOM)
4. Handles the systemd stale-restart race (OOM → systemd auto-restart with old config → second OOM) before probing the next candidate
python3 tabbyapi_probe.py \
--container tabbyapi \
--model-name Qwen3.5-9B-exl3-4.0bpw \
--cache-mode 8,4 \
--lo 131072 --hi 262144 \
--config-template tabbyapi_config.yml \
--output probe_results.json
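The search loop itself is small. The sketch below is not the contents of tabbyapi_probe.py; it only illustrates a page-aligned binary search, assuming that loadability is monotonic in cache_size. The `fits` callback is a hypothetical stand-in for the "write config, restart, watch the journal" cycle, and the simulated threshold is chosen purely for the demo.

```python
from typing import Callable, List

PAGE_TOKENS = 256  # KV page size in tokens


def find_max_cache_size(
    lo: int, hi: int, fits: Callable[[int], bool], page: int = PAGE_TOKENS
) -> int:
    """Largest page-aligned cache_size in [lo, hi] for which fits() is True.

    Assumes fits() is monotonic: once a size fails, every larger size fails.
    """
    lo_pages, hi_pages = lo // page, hi // page
    if not fits(lo_pages * page):
        raise ValueError("lower bound does not load; widen the search range")
    if fits(hi_pages * page):
        return hi_pages * page  # upper bound already loads
    # Invariant from here on: lo_pages loads, hi_pages does not.
    while hi_pages - lo_pages > 1:
        mid = (lo_pages + hi_pages) // 2
        if fits(mid * page):
            lo_pages = mid
        else:
            hi_pages = mid
    return lo_pages * page


if __name__ == "__main__":
    probed: List[int] = []

    def simulated_fits(cache_size: int) -> bool:
        # Hypothetical predicate; a real probe would restart the service here.
        probed.append(cache_size)
        return cache_size <= 195_072

    print(find_max_cache_size(131_072, 262_144, simulated_fits))
    print(probed)
```

Searching in pages rather than tokens keeps every candidate aligned to the allocation granularity, so no probe is wasted on a size that would be rounded anyway.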
Q8,4 binary search results (both stacks — identical ceiling)
| cache_size | pages | outcome | VRAM used / free |
|---|---|---|---|
| 131,072 | 512 | ✓ success | 6,083 / 1,724 MiB |
| 262,144 | 1,024 | ✗ OOM | — |
| 196,608 | 768 | ✗ OOM | — |
| 163,840 | 640 | ✓ success | 6,499 / 1,308 MiB |
| 180,224 | 704 | ✓ success | 6,691 / 1,116 MiB |
| 188,416 | 736 | ✓ success | 6,787 / 1,020 MiB |
| 192,512 | 752 | ✓ success | 6,851 / 956 MiB |
| 194,560 | 760 | ✓ success | 6,883 / 924 MiB |
| 195,584 | 764 | ✗ OOM | — |
| 195,072 | 762 | ✓ confirmed max | 6,883 / 924 MiB |
Both Stack A (torch 2.9, CUDA 12.8) and Stack C revised (torch 2.11, CUDA 13.0) converged on the same ceiling. The KV page size is 256 tokens, so the allocation granularity is 256 × 12,288 bytes ≈ 3 MiB per page. CUDA runtime overhead differences between 12.8 and 13.0 are smaller than a single page.
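The page arithmetic is easy to check. A small sketch, using the 256-token page size and the 12,288 bytes/token figure for Q8,4 from the tables above; the helper names are illustrative only.

```python
PAGE_TOKENS = 256
BYTES_PER_TOKEN_Q8_4 = 12_288  # from the cache_mode table
MIB = 1024 * 1024


def page_bytes(bytes_per_token: int, page_tokens: int = PAGE_TOKENS) -> int:
    """Size of one KV page in bytes."""
    return bytes_per_token * page_tokens


def pages_for(cache_size: int, page_tokens: int = PAGE_TOKENS) -> int:
    """Whole pages covered by a cache_size given in tokens."""
    return cache_size // page_tokens


def kv_cache_mib(cache_size: int, bytes_per_token: int) -> float:
    """KV cache footprint in MiB for a page-aligned cache_size."""
    return pages_for(cache_size) * page_bytes(bytes_per_token) / MIB


if __name__ == "__main__":
    print(page_bytes(BYTES_PER_TOKEN_Q8_4) / MIB, "MiB per page")
    print(pages_for(195_072), "pages at the confirmed maximum")
    print(kv_cache_mib(195_072, BYTES_PER_TOKEN_Q8_4), "MiB of KV at the maximum")
```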
Q4,4 results (reference, from earlier work)
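Maximum confirmed in earlier work: 180,992 tokens at Q4,4. Q8,4 costs 1.5× more bytes per token, so on an identical setup its ceiling should fall below the Q4,4 one. The 195,072 measured above is higher, so the earlier Q4,4 figure presumably reflects a different setup and is not directly comparable.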
The Package Compatibility Maze
This is where things got interesting.
I'd been running the original stack fine: torch 2.9.0+cu128, ExLlamaV3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2. Clean. Everything working.
Then I went looking for whether a newer ExLlamaV3 with CUDA 13.2 kernels would improve things. The answer required understanding something fundamental about Python C extensions.
Why upgrading PyTorch is not like upgrading a Python package
Every .so extension compiled against PyTorch links against internal PyTorch symbols by name — things like _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib. Those symbols are not part of any stable ABI. They change between PyTorch minor versions. The result: upgrading torch from 2.9 to 2.11 silently breaks flash-attn and causal-conv1d even though they're "installed" and even though importlib.util.find_spec() says they're present.
That last point matters: tabbyAPI uses find_spec() to check optional dependencies. A package with a broken .so still passes find_spec. The crash only happens when the module is actually imported at runtime.
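The gap between "found" and "importable" is easy to demonstrate with the standard library alone. The sketch below is not tabbyAPI's dependency check; it fakes an ABI-broken extension with a throwaway module (the name broken_ext_demo is made up) that raises ImportError on load, which is the same exception type the loader raises for an unresolved symbol in a .so.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def check_module(name: str) -> Tuple[bool, Optional[str]]:
    """Return (found_by_find_spec, import_error_or_None)."""
    found = importlib.util.find_spec(name) is not None
    if not found:
        return False, None
    try:
        importlib.import_module(name)
    except ImportError as exc:  # a failed .so load surfaces here
        return True, str(exc)
    return True, None


def demo() -> Tuple[bool, Optional[str]]:
    # Hypothetical stand-in for an ABI-broken extension: present on disk,
    # but raising ImportError as soon as it is actually loaded.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "broken_ext_demo.py").write_text(
            'raise ImportError("undefined symbol (simulated)")\n'
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return check_module("broken_ext_demo")
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    print(demo())
```

A check that only calls find_spec() would report this module as available; only the real import reveals the problem.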
The compatibility matrix
| Stack | torch | ExLlamaV3 | flash-attn | causal-conv1d | FLA |
|---|---|---|---|---|---|
| A | 2.9.0+cu128 | 0.0.37+cu128 | 2.8.3+cu128 ✓ | 1.6.2.post1 ✓ | ✓ |
| B | 2.10.0+cu128 | 0.0.37+cu128 | 2.8.3+cu128 or +cu130 ✓ | 1.6.2.post1 ✓ | ✓ |
| C | 2.11.0+cu130 | 0.0.37+cu132 | 2.8.3+cu130 ✓ | ✗ no wheel | ✓ |
| D | 2.12.0+cu130 | 0.0.37+cu132 | 2.8.3+cu130 or +cu132 ✓ | ✗ | ✓ |
The story I originally told about Stack C was wrong. I removed flash-attn when upgrading to torch 2.11, assuming no compatible wheel existed. It does: flash_attn-2.8.3+cu130torch2.11 is available from mjun0812's prebuild repo. The find_spec() check in tabbyAPI would have passed (broken ABI .so still shows as installed), but the correct fix was to install a compatible wheel, not remove the package.
- causal-conv1d still has no torch 2.11+ wheel as of May 2026; this one is genuinely unavailable without building from source
- flash-linear-attention is pure Python + Triton — no ABI coupling, works with any torch version
What each missing package actually costs
flash-attn handles the 8 full-attention layers in Qwen3.5-9B. Without it, ExLlamaV3 dispatches to Triton paged attention (its next preference in the fallback list). The measured cost: 3% at 512-token context, growing to roughly 45% slower at 131K context (6 vs 11 tok/s; equivalently, flash-attn is 83% faster). The loss is entirely in the KV-cache decode path: flash-attn's fused kernel has much lower memory bandwidth overhead than Triton paged attention at long sequences.
causal-conv1d accelerates the conv1d operations inside the 24 GDN layers. Without it, ExLlamaV3's Triton kernel for GatedDeltaNet recurrence handles them instead. The measured cost on the RTX 4070: nothing beyond noise. Decode speed differs by 0 to 3% across all tested context lengths.
flash-linear-attention (FLA) accelerates the GatedDeltaNet forward pass via Triton kernels. This one stayed installed across all stacks.
The CUDA 13.2 upgrade — what actually changed
Upgrading to ExLlamaV3 0.0.36+cu132 gets you custom CUDA kernels compiled against CUDA 13.2. The ExLlamaV3 custom ops (quantised matmul, RoPE, etc.) are recompiled with newer compiler optimisations. Whether that recovers the flash-attn and causal-conv1d losses empirically is exactly what the benchmark below tests.
Other Tuning Applied
ngram_match_min: 3 (free speculative decoding)
tabbyAPI hardcodes ngram_match_min=0 in its AsyncGenerator constructor — the parameter exists in ExLlamaV3 but isn't exposed in config.yml. One line patch to /opt/tabbyapi/backends/exllamav3/model.py:
self.generator = AsyncGenerator(
    ...
    num_draft_tokens=self.draft_num_tokens,
    ngram_match_min=3,  # ← added
)
With value 3: when the last 3+ output tokens have appeared somewhere in the input context, the next token is drafted from that prior occurrence and the main model validates in parallel. Zero VRAM cost, zero draft model, purely context-driven. Best gains on structured or repetitive text — code, documents that quote themselves, long reasoning chains.
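To make the mechanism concrete, here is a minimal sketch of n-gram (prompt-lookup style) drafting over token IDs. It is not ExLlamaV3's implementation: the function name and the fixed-length match are my simplifications, and a real generator would also verify the draft against the main model before accepting it.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], match_min: int, num_draft: int) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Returns up to num_draft tokens that followed the most recent earlier
    occurrence of the last match_min tokens, or [] if there is no match.
    """
    if match_min <= 0 or len(tokens) <= match_min:
        return []
    suffix = list(tokens[-match_min:])
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - match_min - 1, -1, -1):
        if list(tokens[start:start + match_min]) == suffix:
            follow = start + match_min
            return list(tokens[follow:follow + num_draft])
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(ngram_draft(context, match_min=3, num_draft=2))
```

In the example the trailing [1, 2, 3] also appears at the start of the context, so the tokens that followed it there become the draft.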
sysmem_recurrent_cache
Confirmed system RAM, not VRAM. The RecurrentCache is a Python OrderedDict holding serialised GDN recurrent states between requests. Default 4 GB; at ~MB-scale per state this handles dozens of concurrent long sessions. Left at default.
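For intuition, an OrderedDict makes a byte-budgeted LRU cache almost trivial. The sketch below is an assumption about how such a cache could behave, not tabbyAPI's RecurrentCache: the class name, the string keys and the least-recently-used eviction policy are all mine.

```python
from collections import OrderedDict
from typing import Optional


class ByteBudgetLRU:
    """LRU cache of serialised states, evicting oldest entries past a byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, state: bytes) -> None:
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = state  # newest entries live at the end
        self.used_bytes += len(state)
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)  # drop least recent
            self.used_bytes -= len(evicted)

    def get(self, key: str) -> Optional[bytes]:
        state = self._items.get(key)
        if state is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return state
```

With MB-scale states and a 4 GB budget, a structure like this holds a large number of sessions before anything is evicted.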
max_batch_size
Left at default (4). TODO: measure VRAM savings at batch size 1 — with a single-user setup there's no batch parallelism to lose.
Benchmark: Stack A vs Stack C
Methodology
- 3 trials per context length, cached-run median reported (trial 1 is always a cold prefill — excluded from medians)
- Context lengths tested: 512, 4,096, 16,384, 65,536, 131,072 tokens
- Decode length: 512 tokens per request
- Prefill speed = prompt tokens / TTFT; decode speed = 512 / (total − TTFT)
- VRAM peak sampled during generation
Three stacks measured: Stack A (torch 2.9, full dependencies), Stack C original (torch 2.11, flash-attn accidentally removed), Stack C revised (torch 2.11, flash-attn restored).
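The two throughput definitions and the cached-run median translate directly into code. A minimal sketch; the function names are mine and the timings are assumed to be in seconds.

```python
from statistics import median
from typing import Sequence


def prefill_tps(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill speed = prompt tokens / time to first token."""
    return prompt_tokens / ttft_s


def decode_tps(total_s: float, ttft_s: float, decode_tokens: int = 512) -> float:
    """Decode speed = generated tokens / (total time - time to first token)."""
    return decode_tokens / (total_s - ttft_s)


def cached_median(per_trial: Sequence[float]) -> float:
    """Median over trials, skipping trial 1 (the cold prefill)."""
    if len(per_trial) < 2:
        raise ValueError("need at least one cached trial after the cold run")
    return median(per_trial[1:])
```

With 3 trials and the first excluded, each reported figure is the median of two cached runs, i.e. their mean.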
Results: decode throughput (tokens/sec)
| Context (tokens) | Stack A | Stack C orig (no flash-attn) | Stack C revised (flash-attn) |
|---|---|---|---|
| 512 | 34 | 33 | 33 |
| 4,096 | 33 | 30 | 32 |
| 16,384 | 28 | 23 | 28 |
| 65,536 | 16 | 11 | 16 |
| 131,072 | 11 | 6 | 11 |
Results: prefill throughput (tokens/sec, cached run)
| Context (tokens) | Stack A | Stack C revised | Delta |
|---|---|---|---|
| 512 | 237 | 236 | 0% |
| 4,096 | 10,224 | 10,401 | +2% |
| 16,384 | 37,432 | 36,808 | −2% |
| 65,536 | 77,584 | 78,290 | +1% |
| 131,072 | 80,867 | 94,006 | +16% |
Max context ceiling (Q8,4 cache_mode)
| | Stack A | Stack C revised |
|---|---|---|
| Max cache_size (tokens) | 195,072 | 195,072 |
| VRAM used at max (MiB) | 6,883 | 6,883 |
Analysis
flash-attn on 25% of layers is not a small thing
The hypothesis going in was that Triton paged attention on 8 out of 32 layers wouldn't be catastrophic. The data shows otherwise:
- At 512 tokens (minimal KV pressure, dominated by weight ops): 34 vs 33 tok/s — 3% difference, barely measurable
- At 131,072 tokens (maximum KV pressure): 11 vs 6 tok/s — 83% faster with flash-attn
The gap is entirely context-dependent. At short context, the 8 full-attention layers spend most of their time on the matmuls and barely touch the KV cache. At 131K context, those 8 layers read the entire ~131K-token KV cache for every generated token (O(n) work per decoded token), and flash-attn's fused CUDA kernel vs Triton's paged implementation is the difference between a usable and an unusable response time.
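Percentages like these depend on which side is the baseline: 11 vs 6 tok/s is 83% faster measured from the slow path, but about 45% slower measured from the fast one. A short sketch of both conventions (helper names are mine):

```python
def pct_faster(fast_tps: float, slow_tps: float) -> float:
    """How much faster the fast path is, relative to the slow one."""
    return (fast_tps / slow_tps - 1.0) * 100.0


def pct_slower(slow_tps: float, fast_tps: float) -> float:
    """How much throughput is lost, relative to the fast one."""
    return (1.0 - slow_tps / fast_tps) * 100.0


if __name__ == "__main__":
    print(round(pct_faster(11, 6)), "% faster with flash-attn at 131K")
    print(round(pct_slower(6, 11)), "% slower without it at 131K")
```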
causal-conv1d: no measurable impact
Stack A has causal-conv1d for the 24 GDN/linear layers, Stack C revised does not. The decode speed difference between the two is 0–3% at all context lengths — within noise. ExLlamaV3's own Triton kernel for the GatedDeltaNet recurrence is already well-optimised on this hardware. The package exists for older GPU generations and smaller models where the Triton path has more overhead.
CUDA 12.8 vs 13.0: identical context ceiling, near-identical throughput
Both stacks hit 195,072 tokens. CUDA 13.0 runtime overhead is below the 256-token page granularity (~3 MiB) for this model size. Prefill throughput differences are within 2% at all context lengths except one: Stack C revised shows +16% prefill throughput at 131K context (94,006 vs 80,867 tok/s).
This is real but narrowly applicable. The 131K cold prefill takes ~87 seconds and is susceptible to thermal variation. It's a single uncached run in each 3-trial set. Whether the CUDA 13.2 kernel compilation produces genuinely faster attention code at very long sequence lengths is worth testing with more trials.
Conclusion
Flash-attn is the only dependency that matters at long context. causal-conv1d, despite covering 75% of layers, makes no measurable difference. CUDA kernel generation (12.8 vs 13.2) makes no difference to decode throughput and has no effect on the context ceiling.
The production recommendation depends on your PyTorch version:
- If running torch 2.9 (CUDA 12.8): install flash-attn from the pre-built wheel. This is the simplest supported configuration and you get full performance.
- If running torch 2.11 (CUDA 13.0): flash_attn-2.8.3+cu130torch2.11 exists and is installable. It is not the default anyone reaches for, but it works. Install it and decode at 131K context becomes 11 tok/s instead of 6 tok/s. causal-conv1d has no torch 2.11 wheel; leave it absent.
The architecture insight holds: this model's context ceiling is not where you'd expect it. With only 8 full-attention layers out of 32, KV cache pressure is 4× lower than a pure transformer. The recurrent state of the 24 GDN layers lives in system RAM between requests. At Q8K/Q4V, 195K tokens fits in 8 GB alongside a 4.5 GB model, something that would be impossible with a pure-transformer architecture.
The ceiling you can't push past is quantisation quality in those 8 attention layers at long context, not VRAM.
Appendix: Key Commands
Run context ceiling probe:
python3 agents/tabbyapi_probe.py \
--container tabbyapi --model-name Qwen3.5-9B-exl3-4.0bpw \
--cache-mode 8,4 --lo 131072 --hi 262144 \
--config-template /home/user/.claude/jobs/f7398869/tabbyapi_config.yml \
--output probe_results.json --timeout 180
Downgrade to Stack A:
incus exec tabbyapi -- systemctl stop tabbyapi
incus exec tabbyapi -- /root/.local/bin/uv pip install \
"https://download.pytorch.org/whl/cu128/torch-2.9.0%2Bcu128-cp312-cp312-linux_x86_64.whl" \
--python /opt/tabbyapi/.venv/bin/python3
# then exllamav3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2.post1
Re-apply ngram patch after any tabbyAPI update:
incus exec tabbyapi -- sed -i \
's/ num_draft_tokens=self.draft_num_tokens,/ num_draft_tokens=self.draft_num_tokens,\n ngram_match_min=3,/' \
/opt/tabbyapi/backends/exllamav3/model.py
incus exec tabbyapi -- systemctl restart tabbyapi
Check package stack:
incus exec tabbyapi -- /opt/tabbyapi/.venv/bin/python3 -c "
import torch, exllamav3
print('torch:', torch.__version__)
print('exllamav3:', exllamav3.__version__)
# A real import (not find_spec) is what exposes a missing or ABI-broken .so
try:
    import flash_attn; print('flash_attn:', flash_attn.__version__)
except ImportError as e:
    print('flash_attn: NOT IMPORTABLE:', e)
try:
    import causal_conv1d; print('causal_conv1d: OK')
except ImportError as e:
    print('causal_conv1d: NOT IMPORTABLE:', e)
# flash-linear-attention installs under the module name fla
try:
    import fla; print('FLA: OK')
except ImportError as e:
    print('FLA: NOT IMPORTABLE:', e)
"