25 May 2026

tabbyAPI + Qwen3.5-9B on 8GB VRAM: Long Context, CUDA Upgrades, and What Actually Matters

Date: May 2026
Hardware: RTX 4070 Laptop GPU (8GB VRAM), Intel Ultra 7 165H, 64GB DDR5
Model: turboderp/Qwen3.5-9B-exl3 @ 4.00bpw
Backend: tabbyAPI (ExLlamaV3 0.0.37)


The Goal

I wanted a fast, long-context 9B model running locally as an OpenAI-compatible API. Not Ollama — I needed raw performance and a proper context window, not convenience defaults. The candidate: Qwen3.5-9B at 4.0 bpw EXL3 quantisation via tabbyAPI in an Incus LXC container with GPU passthrough.

The question was never whether it would run. It was: what's the real context ceiling, and does the software stack matter as much as the marketing claims?


Why tabbyAPI over vLLM or Ollama?

Ollama's great when you want things to just work. vLLM is the right tool for batched multi-user workloads. But for a single-user API that needs maximum single-request throughput and fine-grained KV control:

  • tabbyAPI (ExLlamaV3): paged KV cache, independent K and V cache quantisation (k_bits and v_bits set separately), aggressive speculative decoding, GDN-aware caching
  • vLLM: excellent batching, but the KV and context ceiling maths are more complex on 8GB with hybrid models
  • Ollama: abstracts away all the knobs — which is a problem when the knobs are exactly what you want to tune

The Architecture Surprise

Before tuning anything I read the config.json:

"layer_types": ["linear", "linear", "linear", "full", ...]
"full_attention_interval": 4

Qwen3.5-9B is not a pure transformer. It's a GDN (Gated DeltaNet) hybrid: 24 linear attention (GatedDeltaNet) layers and only 8 full attention layers out of 32 total. That changes everything about the VRAM maths.

Standard transformer KV cache sizing assumes every layer has a full attention KV block. With only 8 full-attention layers:

bytes/token = 8 layers × 4 KV_heads × 256 head_dim × (k_bits + v_bits) / 8

cache_mode         bytes/token   tokens in 2.4 GB KV budget
8,8 (Q8 K+V)       16,384 B      ~155,000
8,4 (Q8 K, Q4 V)   12,288 B      ~207,000
4,4 (Q4 K+V)       8,192 B       ~310,000

The KV pressure is 4× lower than a pure transformer at the same parameter count. That meant 180K tokens at Q8,4 was plausible before a single probe had run.
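
The per-token figures can be reproduced in a few lines of Python. This is a minimal sketch: the function name and its defaults are mine, the layer, head and head_dim values are the ones quoted above, and it ignores any scale or metadata overhead the quantised cache may carry.

```python
def kv_bytes_per_token(k_bits, v_bits, layers=8, kv_heads=4, head_dim=256):
    """Bytes of KV cache per token, counting only the full-attention layers."""
    elements = layers * kv_heads * head_dim  # elements in K (and again in V) per token
    return elements * (k_bits + v_bits) // 8


for k_bits, v_bits in [(8, 8), (8, 4), (4, 4)]:
    print(f"{k_bits},{v_bits}: {kv_bytes_per_token(k_bits, v_bits):,} B/token")

# Same cache mode, but pretending all 32 layers kept a full-attention KV block.
ratio = kv_bytes_per_token(8, 4, layers=32) / kv_bytes_per_token(8, 4)
print(f"pure-transformer KV cost: {ratio:.0f}x")
```

The last line is where the 4× figure comes from: it is simply 32 layers divided by 8.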

Between requests, the GDN recurrent state (the 24 linear layers) doesn't occupy VRAM at all. tabbyAPI serialises it to system RAM via sysmem_recurrent_cache, an OrderedDict in Python process memory. At the default 4 GB, dozens of concurrent long sessions co-exist without touching the GPU.


Finding the Actual Context Ceiling

Theoretical maths tells you where to probe. It doesn't tell you the real ceiling. ExLlamaV3 has internal allocation overhead, batch workspace, and the model weights themselves — all competing for the same 8 GB.

I wrote a binary search probe script (tabbyapi_probe.py) that:
1. Writes a config with a candidate cache_size
2. Pushes it into the container and restarts the service
3. Polls journalctl for "Application startup complete" (success) or "Insufficient VRAM in split for model and cache" (OOM)
4. Handles the systemd stale-restart race (OOM → systemd auto-restart with old config → second OOM) before probing the next candidate

python3 tabbyapi_probe.py \
  --container tabbyapi \
  --model-name Qwen3.5-9B-exl3-4.0bpw \
  --cache-mode 8,4 \
  --lo 131072 --hi 262144 \
  --config-template tabbyapi_config.yml \
  --output probe_results.json
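
The search loop at the heart of such a probe can be sketched as follows. This is not the contents of tabbyapi_probe.py: find_max_cache_size and fake_fits are hypothetical names, the real "fits" check (write config, restart, watch the journal) is replaced by a stand-in, and stepping in whole 256-token KV pages down to a single page is my own choice.

```python
PAGE_TOKENS = 256  # KV page size in tokens, so cache_size moves in 256-token steps


def find_max_cache_size(fits, lo, hi, page=PAGE_TOKENS):
    """Largest page-aligned cache_size for which fits(cache_size) is True.

    Assumes `lo` is known to load and `hi` is known to OOM.
    """
    lo_pages, hi_pages = lo // page, hi // page
    while hi_pages - lo_pages > 1:
        mid = (lo_pages + hi_pages) // 2
        if fits(mid * page):
            lo_pages = mid  # loaded: the ceiling is at or above mid
        else:
            hi_pages = mid  # OOM: the ceiling is below mid
    return lo_pages * page


probed = []


def fake_fits(cache_size, ceiling=195_072):
    """Stand-in for a real restart-and-poll probe; records what was tried."""
    probed.append(cache_size)
    return cache_size <= ceiling


print(find_max_cache_size(fake_fits, lo=131_072, hi=262_144))
print(probed)
```

Each real probe costs a full service restart, so halving the interval per step matters: the 512-page gap between the two bounds is resolved in nine probes.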

Q8,4 binary search results (both stacks — identical ceiling)

cache_size   pages   outcome           VRAM used / free
131,072      512     ✓ success         6,083 / 1,724 MiB
262,144      1,024   ✗ OOM
196,608      768     ✗ OOM
163,840      640     ✓ success         6,499 / 1,308 MiB
180,224      704     ✓ success         6,691 / 1,116 MiB
188,416      736     ✓ success         6,787 / 1,020 MiB
192,512      752     ✓ success         6,851 / 956 MiB
194,560      760     ✓ success         6,883 / 924 MiB
195,584      764     ✗ OOM
195,072      762     ✓ confirmed max   6,883 / 924 MiB

Both Stack A (torch 2.9, CUDA 12.8) and Stack C revised (torch 2.11, CUDA 13.0) converged on the same ceiling. The KV page size is 256 tokens, so the allocation granularity is 256 × 12,288 bytes ≈ 3 MiB per page. CUDA runtime overhead differences between 12.8 and 13.0 are smaller than a single page.

Q4,4 results (reference, from earlier work)

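Maximum confirmed in earlier work: 180,992 tokens at Q4,4. On an identical setup the Q8,4 ceiling should fall lower because of its 1.5× higher bytes/token, yet the probe above reached 195,072. The two figures were presumably not measured under the same conditions and should not be compared directly.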


The Package Compatibility Maze

This is where things got interesting.

I'd been running the original stack fine: torch 2.9.0+cu128, ExLlamaV3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2. Clean. Everything working.

Then I went looking for whether a newer ExLlamaV3 with CUDA 13.2 kernels would improve things. The answer required understanding something fundamental about Python C extensions.

Why upgrading PyTorch is not like upgrading a Python package

Every .so extension compiled against PyTorch links against internal PyTorch symbols by name — things like _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib. Those symbols are not part of any stable ABI. They change between PyTorch minor versions. The result: upgrading torch from 2.9 to 2.11 silently breaks flash-attn and causal-conv1d even though they're "installed" and even though importlib.util.find_spec() says they're present.

That last point matters: tabbyAPI uses find_spec() to check optional dependencies. A package with a broken .so still passes find_spec. The crash only happens when the module is actually imported at runtime.
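
A check that distinguishes "not installed" from "installed but broken" has to attempt the import. The sketch below shows one way to do that with the standard library; check_extension is a hypothetical helper of mine, not something tabbyAPI ships.

```python
import importlib
import importlib.util


def check_extension(module_name):
    """Classify a top-level module as 'missing', 'broken: ...' or 'ok'."""
    if importlib.util.find_spec(module_name) is None:
        return "missing"  # nothing on sys.path at all
    try:
        importlib.import_module(module_name)  # this is what loads the .so
    except ImportError as exc:
        # An undefined-symbol failure in a compiled extension lands here.
        return f"broken: {exc}"
    return "ok"


if __name__ == "__main__":
    for name in ("flash_attn", "causal_conv1d"):
        print(f"{name}: {check_extension(name)}")
```

The cost is that the import actually runs, so a check like this belongs in a diagnostic script rather than on a hot path.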

The compatibility matrix

Stack   torch          ExLlamaV3      flash-attn                causal-conv1d   FLA
A       2.9.0+cu128    0.0.37+cu128   2.8.3+cu128 ✓             1.6.2.post1 ✓
B       2.10.0+cu128   0.0.37+cu128   2.8.3+cu128 or +cu130 ✓   1.6.2.post1 ✓
C       2.11.0+cu130   0.0.37+cu132   2.8.3+cu130 ✓             ✗ no wheel
D       2.12.0+cu130   0.0.37+cu132   2.8.3+cu130 or +cu132 ✓

The story I originally told about Stack C was wrong. I removed flash-attn when upgrading to torch 2.11, assuming no compatible wheel existed. It does: flash_attn-2.8.3+cu130torch2.11 is available from mjun0812's prebuild repo. The find_spec() check in tabbyAPI would have passed (broken ABI .so still shows as installed), but the correct fix was to install a compatible wheel, not remove the package.

  • causal-conv1d still has no torch2.10+ wheel as of May 2026 — this one is genuinely unavailable without building from source
  • flash-linear-attention is pure Python + Triton — no ABI coupling, works with any torch version

What each missing package actually costs

flash-attn handles the 8 full-attention layers in Qwen3.5-9B. Without it, ExLlamaV3 dispatches to Triton paged attention (its next preference in the fallback list). The measured cost: decode is 3% slower at 512-token context and about 45% slower at 131K context (6 vs 11 tok/s; put the other way, flash-attn is 83% faster). The loss is entirely in the KV-cache decode path: flash-attn's fused kernel has much lower memory bandwidth overhead than Triton paged attention at long sequences.

causal-conv1d accelerates the conv1d operations inside the 24 GDN layers. Without it, ExLlamaV3's Triton kernel for GatedDeltaNet recurrence handles them instead. The measured cost on the RTX 4070: nothing measurable. Decode speed differs by 0–3% across context lengths, which is within noise.

flash-linear-attention (FLA) accelerates the GatedDeltaNet forward pass via Triton kernels. This one stayed installed across all stacks.

The CUDA 13.2 upgrade — what actually changed

Upgrading to ExLlamaV3 0.0.36+cu132 gets you custom CUDA kernels compiled against CUDA 13.2. The ExLlamaV3 custom ops (quantised matmul, RoPE, etc.) are recompiled with newer compiler optimisations. Whether that recovers the flash-attn and causal-conv1d losses empirically is exactly what the benchmark below tests.


Other Tuning Applied

ngram_match_min: 3 (free speculative decoding)

tabbyAPI hardcodes ngram_match_min=0 in its AsyncGenerator constructor — the parameter exists in ExLlamaV3 but isn't exposed in config.yml. One line patch to /opt/tabbyapi/backends/exllamav3/model.py:

self.generator = AsyncGenerator(
    ...
    num_draft_tokens=self.draft_num_tokens,
    ngram_match_min=3,   # ← added
)

With value 3: when the last 3+ output tokens have appeared somewhere in the input context, the next token is drafted from that prior occurrence and the main model validates in parallel. Zero VRAM cost, zero draft model, purely context-driven. Best gains on structured or repetitive text — code, documents that quote themselves, long reasoning chains.
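
To make the mechanism concrete, here is a toy version of n-gram drafting over plain token lists. It illustrates the idea only and is not ExLlamaV3's implementation: the function name, the num_draft parameter and the "most recent earlier match wins" rule are assumptions.

```python
def ngram_draft(tokens, match_min=3, num_draft=4):
    """Propose draft tokens by finding an earlier occurrence of the current suffix."""
    if match_min <= 0 or len(tokens) <= match_min:
        return []
    suffix = tokens[-match_min:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - match_min - 1, -1, -1):
        if tokens[start:start + match_min] == suffix:
            follow = start + match_min
            return tokens[follow:follow + num_draft]
    return []  # no match: fall back to ordinary one-token decoding


history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(ngram_draft(history))  # the suffix [1, 2, 3] appeared before, followed by 4, 5, 6, 1
```

The drafted tokens are only guesses. The main model still verifies them, so a bad draft costs a little wasted compute rather than a wrong output.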

sysmem_recurrent_cache

Confirmed system RAM, not VRAM. The RecurrentCache is a Python OrderedDict holding serialised GDN recurrent states between requests. Default 4 GB; at ~MB-scale per state this handles dozens of concurrent long sessions. Left at default.
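
As an illustration of how an OrderedDict can hold serialised states under a byte budget, here is a minimal sketch. It is not tabbyAPI's RecurrentCache code: the class name, the method names and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """LRU store capped by total payload bytes rather than entry count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()

    def put(self, key, blob):
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = blob
        self.used += len(blob)
        # Evict least recently used entries until we are back under budget.
        while self.used > self.max_bytes and self._items:
            _, old = self._items.popitem(last=False)
            self.used -= len(old)

    def get(self, key):
        blob = self._items.get(key)
        if blob is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return blob


cache = ByteBudgetLRU(max_bytes=10)
cache.put("session-a", b"12345")
cache.put("session-b", b"12345")
cache.get("session-a")          # touch a, so b becomes the eviction candidate
cache.put("session-c", b"123")  # 13 bytes > 10, so session-b is dropped
print(cache.get("session-b"), cache.used)
```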

max_batch_size

Left at default (4). TODO: measure VRAM savings at batch size 1 — with a single-user setup there's no batch parallelism to lose.


Benchmark: Stack A vs Stack C

Methodology

  • 3 trials per context length, cached-run median reported (trial 1 is always a cold prefill — excluded from medians)
  • Context lengths tested: 512, 4,096, 16,384, 65,536, 131,072 tokens
  • Decode length: 512 tokens per request
  • Prefill speed = prompt tokens / TTFT; decode speed = 512 / (total − TTFT)
  • VRAM peak sampled during generation

Three stacks measured: Stack A (torch 2.9, full dependencies), Stack C original (torch 2.11, flash-attn accidentally removed), Stack C revised (torch 2.11, flash-attn restored).
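
The two speed formulas and the cold-run exclusion are simple enough to state as code. This is a sketch of the arithmetic only, with hypothetical function names; the benchmark harness itself is not shown here.

```python
from statistics import median


def throughput(prompt_tokens, ttft_s, total_s, decode_tokens=512):
    """Return (prefill tok/s, decode tok/s) for one request."""
    prefill = prompt_tokens / ttft_s
    decode = decode_tokens / (total_s - ttft_s)
    return prefill, decode


def cached_median(per_trial_values):
    """Median over trials 2..n; trial 1 is the cold prefill and is dropped."""
    return median(per_trial_values[1:])


# Made-up timings purely to show the call shape.
print(throughput(prompt_tokens=1000, ttft_s=2.0, total_s=10.0))
print(cached_median([1.0, 5.0, 7.0]))
```

With 3 trials, the median is taken over just two cached runs, which is worth keeping in mind when reading small differences in the tables.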

Results: decode throughput (tokens/sec)

Context (tokens)   Stack A   Stack C orig (no flash-attn)   Stack C revised (flash-attn)
512                34        33                             33
4,096              33        30                             32
16,384             28        23                             28
65,536             16        11                             16
131,072            11        6                              11

Results: prefill throughput (tokens/sec, cached run)

Context (tokens)   Stack A   Stack C revised   Delta
512                237       236               0%
4,096              10,224    10,401            +2%
16,384             37,432    36,808            −2%
65,536             77,584    78,290            +1%
131,072            80,867    94,006            +16%

Max context ceiling (Q8,4 cache_mode)

                          Stack A   Stack C revised
Max cache_size (tokens)   195,072   195,072
VRAM used at max (MiB)    6,883     6,883

Analysis

flash-attn on 25% of layers is not a small thing

The hypothesis going in was that Triton paged attention on 8 out of 32 layers wouldn't be catastrophic. The data shows otherwise:

  • At 512 tokens (minimal KV pressure, dominated by weight ops): 34 vs 33 tok/s — 3% difference, barely measurable
  • At 131,072 tokens (maximum KV pressure): 11 vs 6 tok/s — 83% faster with flash-attn

The gap is entirely context-dependent. At short context, the 8 full-attention layers spend most of their time on the matmuls and barely touch the KV cache. At 131K context, every decoded token has to attend over a ~131K-token KV cache in each of those 8 layers, so per-token attention cost grows linearly with context, and flash-attn's fused CUDA kernel vs Triton's paged implementation is the difference between a usable and an unusable response time.
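
"Faster with" and "slower without" are different percentages for the same pair of numbers, which is easy to trip over. The snippet below recomputes both from the decode table (Stack A vs Stack C orig). The inputs are the rounded tok/s figures, so the results are approximate, and the helper names are mine.

```python
# context length: (tok/s with flash-attn, tok/s without)
decode_tok_s = {
    512: (34, 33),
    4_096: (33, 30),
    16_384: (28, 23),
    65_536: (16, 11),
    131_072: (11, 6),
}


def pct_faster(with_fa, without_fa):
    """How much faster decode is with flash-attn, relative to without."""
    return (with_fa / without_fa - 1) * 100


def pct_slower(with_fa, without_fa):
    """How much slower decode is without flash-attn, relative to with."""
    return (1 - without_fa / with_fa) * 100


for ctx, (a, c) in decode_tok_s.items():
    print(f"{ctx:>7,}: +{pct_faster(a, c):.0f}% with flash-attn, "
          f"-{pct_slower(a, c):.0f}% without")
```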

causal-conv1d: no measurable impact

Stack A has causal-conv1d for the 24 GDN/linear layers, Stack C revised does not. The decode speed difference between the two is 0–3% at all context lengths — within noise. ExLlamaV3's own Triton kernel for the GatedDeltaNet recurrence is already well-optimised on this hardware. The package exists for older GPU generations and smaller models where the Triton path has more overhead.

CUDA 12.8 vs 13.0: identical context ceiling, near-identical throughput

Both stacks hit 195,072 tokens. CUDA 13.0 runtime overhead is below the 256-token page granularity (~3 MiB) for this model size. Throughput differences are within about 3% at all context lengths, with one exception: Stack C revised shows +16% prefill throughput at 131K context (94,006 vs 80,867 tok/s).

This may be real, but the evidence is thin. Each 3-trial set contains one cold 131K prefill, which takes ~87 seconds and is susceptible to thermal variation, leaving only two cached runs behind each median. Whether the CUDA 13.2 kernel compilation produces genuinely faster attention code at very long sequence lengths is worth testing with more trials.


Conclusion

Flash-attn is the only dependency that matters at long context. causal-conv1d, despite covering 75% of layers, makes no measurable difference. CUDA kernel generation (12.8 vs 13.2) makes no difference to decode throughput and has no effect on the context ceiling.

The production recommendation depends on your PyTorch version:

  • If running torch 2.9 (CUDA 12.8): install flash-attn from the pre-built wheel. This is the simplest supported configuration and you get full performance.
  • If running torch 2.11 (CUDA 13.0): flash_attn-2.8.3+cu130torch2.11 exists and is installable — it is not the default anyone reaches for but it works. Install it and decode at 131K context becomes 11 tok/s instead of 6 tok/s. causal-conv1d has no torch 2.11 wheel; leave it absent.

The architecture insight holds: this model's context ceiling is not where you'd expect it. With only 8 full-attention layers out of 32, KV cache pressure is 4× lower than a pure transformer. The recurrent state of the 24 GDN layers is parked in system RAM between requests. At Q8 K / Q4 V, 195K tokens fit in 8 GB alongside a 4.5 GB model, something that would be impossible with a full-transformer architecture.

The ceiling you can't push past is quantisation quality in those 8 attention layers at long context, not VRAM.


Appendix: Key Commands

Run context ceiling probe:

python3 agents/tabbyapi_probe.py \
  --container tabbyapi --model-name Qwen3.5-9B-exl3-4.0bpw \
  --cache-mode 8,4 --lo 131072 --hi 262144 \
  --config-template /home/user/.claude/jobs/f7398869/tabbyapi_config.yml \
  --output probe_results.json --timeout 180

Downgrade to Stack A:

incus exec tabbyapi -- systemctl stop tabbyapi
incus exec tabbyapi -- /root/.local/bin/uv pip install \
  "https://download.pytorch.org/whl/cu128/torch-2.9.0%2Bcu128-cp312-cp312-linux_x86_64.whl" \
  --python /opt/tabbyapi/.venv/bin/python3
# then exllamav3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2.post1

Re-apply ngram patch after any tabbyAPI update:

incus exec tabbyapi -- sed -i \
  's/                num_draft_tokens=self.draft_num_tokens,/                num_draft_tokens=self.draft_num_tokens,\n                ngram_match_min=3,/' \
  /opt/tabbyapi/backends/exllamav3/model.py
incus exec tabbyapi -- systemctl restart tabbyapi

Check package stack:

incus exec tabbyapi -- /opt/tabbyapi/.venv/bin/python3 -c "
import torch, exllamav3
print('torch:', torch.__version__)
print('exllamav3:', exllamav3.__version__)
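# Import for real: find_spec() cannot detect a .so built against another torch ABI,
# and a broken extension also surfaces as ImportError.
try:
    import flash_attn; print('flash_attn:', flash_attn.__version__)
except ImportError as e:
    print('flash_attn: MISSING OR BROKEN:', e)
try:
    import causal_conv1d; print('causal_conv1d: OK')
except ImportError as e:
    print('causal_conv1d: MISSING OR BROKEN:', e)
try:
    import fla; print('FLA: OK')  # flash-linear-attention is imported as fla
except ImportError as e:
    print('FLA: MISSING OR BROKEN:', e)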
"
