
title: "Getting CUDA Graphs Working on vLLM with a GDN Hybrid Model (Qwen3.5-9B)"
date: 2026-06-26

Getting CUDA Graphs Working on vLLM with a GDN Hybrid Model (Qwen3.5-9B)

I've been running Qwen3.5-9B-AWQ on a Xenon box with an RTX 4060 Ti 16 GB via vLLM 0.20.2. Decode throughput was sitting at around 7–8 tok/s on single-request agentic use, which felt wrong — the 4060 Ti has 288 GB/s of VRAM bandwidth and a 9B AWQ model should be doing considerably better than that. The research notes from May pointed at --enforce-eager as the culprit (CUDA graph capture disabled, 5–10× penalty) but also noted it was required at 239K context for VRAM headroom. That was where things sat for a few weeks.
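As a rough sanity check on that expectation: single-stream decode is usually memory-bandwidth bound, so a crude ceiling is bandwidth divided by the bytes of weights read per token. The sketch below assumes 4-bit weights at about 0.5 bytes per parameter (my assumption, not a measured figure) and ignores KV cache reads, quantisation scales and any unquantised layers, so the real ceiling sits somewhat lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billions: float,
                         bytes_per_param: float = 0.5) -> float:
    """Upper bound on decode tok/s if every weight is read once per token."""
    # Assumed weight footprint: params * bytes per param (0.5 ~ 4-bit weights).
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb


ceiling = decode_ceiling_tok_s(bandwidth_gb_s=288, params_billions=9)
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tok/s")
```

Even with generous allowance for overhead, 7–8 tok/s is far enough below that bound to suggest something other than VRAM bandwidth was the limit.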

Today I did the actual sweep to find out what's achievable — and the answer turned out to be more nuanced than "remove the flag and you're done".

Some background on the architecture

Qwen3.5-9B is a GDN (GatedDeltaNet) hybrid, not a pure dense transformer. It has 32 layers: 8 standard attention layers (KV-cached, accelerated by AWQ Marlin) and 24 linear attention / SSM recurrent layers (no KV cache, run unconditionally on every token using the FLA kernel). The SSM layers are why the model behaves differently to a vanilla 9B: they eat into the VRAM budget in ways the standard vLLM probe logic doesn't account for. They're also why --enforce-eager is needed at high context. The FLA activation tensor grows with sequence length (~1 KB/token), and at GMU=0.98 (GMU being --gpu-memory-utilization) there simply isn't the headroom for CUDA graph capture on top of that.

One important flag note: do not use ngram speculative decoding on GDN hybrids — SSM state rollback on rejected speculative tokens corrupts the recurrent state (vLLM issues #39273 and #40875). MTP speculative decoding is safe because the draft heads are model-native.

The sweep

I ran five configurations, each benchmarked with three consecutive requests of 250 tokens and wall-clock timing. Run 1 is always slower (JIT/FLA kernel warmup on first request after startup), so I quote the mean of runs 2 and 3:
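For anyone wanting to repeat the measurement, here is a minimal sketch of that timing loop against vLLM's OpenAI-compatible /v1/completions endpoint. The function names, URL and prompt are hypothetical, and it is not the exact script used for the table. Wall-clock timing includes prefill, so it slightly understates pure decode speed.

```python
import json
import time
import urllib.request


def timed_completion(base_url: str, model: str, prompt: str, max_tokens: int = 250):
    """Send one non-streaming completion; return (completion_tokens, seconds)."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return payload["usage"]["completion_tokens"], elapsed


def steady_state_tok_s(runs):
    """Drop the first (warmup) run and average tok/s over the remaining runs."""
    rates = [tokens / seconds for tokens, seconds in runs[1:]]
    return sum(rates) / len(rates)


# Usage against a running server (URL, model name and prompt are placeholders):
# runs = [timed_completion("http://localhost:8000", "/mnt/models/Qwen3.5-9B-AWQ",
#                          "Explain CUDA graphs.") for _ in range(3)]
# print(f"{steady_state_tok_s(runs):.1f} tok/s")
```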

Config	max_model_len	enforce-eager	MTP	CUDA graphs	tok/s
Baseline	239,616	Yes	No	No	7.75
Tier 1	65,536	No	Yes (n=2)	Yes	31.5
Tier 2 ★	108,768	No	Yes (n=2)	Yes	31.3
Tier 3	168,000	Yes	Yes (n=2)	No	12.6
Tier 4	239,616	Yes	No	No	8.0

The key insight from Tier 1 vs Tier 2: throughput is effectively identical (31.5 vs 31.3 tok/s). CUDA graphs run at the same fixed capture sizes ([1, 2, 4, 8, 16, 24, 32, 40, 48] tokens) regardless of max context, so 108K costs nothing in decode speed compared with 65K; the only price is a higher GMU (0.95 rather than 0.93). That's the config I've settled on.

The MTP VRAM surprise

Adding --speculative-config '{"method":"mtp","num_speculative_tokens":2}' loads an MTP draft head that borrows the base model's embedding and lm_head weights but adds its own additional layer — costing roughly 0.5 GiB of VRAM. This is not reflected in the existing context ceiling probe table (which was built without MTP), so the real ceiling with MTP is lower than the probe table suggests.

Concretely: at GMU=0.90, the original probe gave 80K tokens without MTP. With MTP, the available KV cache at GMU=0.90 drops below what 65K requires (1.36 GiB needed, 1.02 GiB available). I had to nudge to GMU=0.93 to get 65K working, and GMU=0.95 for 108K. The updated ceiling table:

GMU	enforce-eager	MTP	Max context
0.90	No	Yes	~46K (OOM)
0.93	No	Yes	65,536 ✓
0.95	No	Yes	108,768 ✓
0.97	Yes	Yes	~170,000
0.99	Yes	No	239,616 ✓
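The 1.36 GiB / 1.02 GiB figures above also give a quick way to estimate a ceiling before booting the server. This is a naive sketch that assumes KV cache cost scales purely linearly with context and ignores any fixed overhead; the helper names are mine, not anything from vLLM.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(required_gib: float, context_tokens: int) -> float:
    """Per-token KV cost implied by a known (size, context) pair."""
    return required_gib * GIB / context_tokens


def max_context_for(available_gib: float, bytes_per_token: float) -> int:
    """Largest context that fits in the available KV cache under linear scaling."""
    return int(available_gib * GIB // bytes_per_token)


per_token = kv_bytes_per_token(1.36, 65_536)
print(f"~{per_token / 1024:.1f} KiB of KV cache per token")
print(f"~{max_context_for(1.02, per_token):,} tokens fit in 1.02 GiB")
```

The linear estimate lands at roughly 49K tokens, a little above the ~46K in the table, so treat it as an optimistic bound rather than a guarantee.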

The 200K+ context tiers I'd hoped to test with MTP turned out to be impossible on 16 GB — reaching 200K with MTP + enforce-eager would require more VRAM than even GMU=1.0 can provide. If you want 200K+ you have to drop MTP (Tier 4 above, 8.0 tok/s) or use a different backend entirely.

MTP does help even on enforce-eager configs though: compare Tier 3 (168K, eager, MTP) at 12.6 tok/s versus Tier 4 (239K, eager, no MTP) at 8.0 tok/s — the draft head adds about 1.6× at the cost of ~70K of context headroom.
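The ratios quoted here fall straight out of the sweep table; a small sketch with the numbers copied from it (the dictionary layout is just for illustration):

```python
# Context length and decode throughput copied from the sweep table.
sweep = {
    "Baseline": {"ctx": 239_616, "tok_s": 7.75},
    "Tier 2": {"ctx": 108_768, "tok_s": 31.3},
    "Tier 3": {"ctx": 168_000, "tok_s": 12.6},
    "Tier 4": {"ctx": 239_616, "tok_s": 8.0},
}


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two named configurations."""
    return sweep[fast]["tok_s"] / sweep[slow]["tok_s"]


mtp_gain = speedup("Tier 3", "Tier 4")
ctx_cost = sweep["Tier 4"]["ctx"] - sweep["Tier 3"]["ctx"]
print(f"MTP under eager: {mtp_gain:.2f}x for {ctx_cost:,} fewer tokens of context")
print(f"Tier 2 vs baseline: {speedup('Tier 2', 'Baseline'):.2f}x")
```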

First-startup compile time

With --enforce-eager removed, vLLM runs torch.compile (Inductor) and CUDA graph capture on startup. The first cold start for the backbone takes about 130 seconds; the MTP draft head takes another 40 seconds on top of that. Subsequent startups read from /root/.cache/vllm/torch_compile_cache/ and complete in about 16 seconds. The cache key includes GMU, so changing that value invalidates it and you pay the full compile time again.
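If you script deployments, it can be handy to know in advance whether a start is likely to be cold. A minimal sketch, assuming only that a warm cache means the directory above exists and is non-empty; it cannot tell whether the cached entries match the current flags, so a changed GMU will still recompile.

```python
from pathlib import Path


def compile_cache_is_warm(cache_dir: str = "/root/.cache/vllm/torch_compile_cache") -> bool:
    """True if the cache directory exists and contains at least one entry."""
    path = Path(cache_dir)
    return path.is_dir() and any(path.iterdir())


# Usage: pick a more patient startup timeout when the cache looks cold.
# timeout_s = 60 if compile_cache_is_warm() else 600
```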

Tier 2 also crashes on its first boot attempt and recovers on the auto-restart (a timing issue in the profiling phase, I suspect). With Restart=on-failure and RestartSec=30 in the service file, it comes up cleanly on the second try — a bit inelegant but consistent.
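Because the first boot can fail and restart, anything that depends on the server is better off polling for readiness than assuming it is up after a fixed delay. A sketch of that, using vLLM's /health endpoint; the function names, URL and timeouts are placeholders to adapt.

```python
import time
import urllib.error
import urllib.request


def health_ok(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is still starting or restarting.
        return False


def wait_until_ready(probe=health_ok, timeout_s: float = 600, interval_s: float = 5,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll probe() until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

The timeout needs to cover the failed first attempt, the 30-second restart delay and the second boot, so err on the generous side after a cold start.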

The systemd ExecStart gotcha

If you're writing the service file programmatically, be careful with the speculative-config JSON. Double-quoted backslash-escaped JSON in ExecStart:

ExecStart=... --speculative-config "{\"method\":\"mtp\",\"num_speculative_tokens\":2}"

...reached vLLM as {method:mtp,num_speculative_tokens:2} in my setup, with the inner quotes silently stripped along the way. That is not valid JSON, and it fails with Value cannot be converted to <function loads>. Single-quoted JSON works correctly:

ExecStart=... --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

I ended up writing service files with Python's open(...).write(...) to avoid any heredoc quoting nonsense:

MTP = "'" + '{"method":"mtp","num_speculative_tokens":2}' + "'"
# then embed MTP in the ExecStart f-string
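A slightly fuller sketch of the same idea: build the JSON with json.dumps so it is valid by construction, wrap it in single quotes, and check that the argument survives word-splitting. The check uses shlex, which follows POSIX shell rules rather than systemd's own parser, so it is only an approximation (the two agree for a plain single-quoted string with no embedded single quotes). The ExecStart line is shortened for illustration.

```python
import json
import shlex

spec = {"method": "mtp", "num_speculative_tokens": 2}

# Compact separators keep the argument free of spaces.
spec_json = json.dumps(spec, separators=(",", ":"))
# Single-quoting is only safe if the payload has no single quotes of its own.
assert "'" not in spec_json

exec_start = (
    "ExecStart=/opt/vllm-env/bin/vllm serve /mnt/models/Qwen3.5-9B-AWQ "
    f"--speculative-config '{spec_json}'"
)

# Approximate round-trip check: split the command as a POSIX shell would
# and confirm the last argument still parses as the original JSON.
argv = shlex.split(exec_start.partition("=")[2])
assert json.loads(argv[-1]) == spec
print(exec_start)
```

Writing exec_start into the unit file with open(...).write(...) then avoids any further layer of shell quoting.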

Production config (as of 2026-06-26)

# /etc/systemd/system/vllm.service inside LXC 8003 on Xenon (RTX 4060 Ti 16 GB)
[Service]
Environment=CUDA_VISIBLE_DEVICES=0
Environment=VLLM_WORKER_MULTIPROC_METHOD=spawn
Environment=PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Environment=VLLM_SLEEP_WHEN_IDLE=1
ExecStart=/opt/vllm-env/bin/vllm serve /mnt/models/Qwen3.5-9B-AWQ \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --max-model-len 108768 \
  --enable-prefix-caching \
  --language-model-only \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --max-num-seqs 8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Restart=on-failure
RestartSec=30
TimeoutStartSec=600

31 tok/s at 108K context, CUDA graphs and MTP both active, ~15.1 GB VRAM used at rest. That's a four-fold improvement over the --enforce-eager baseline for agentic single-stream use.

If you need context beyond 108K, the choices are: Tier 3 (168K, 12.6 tok/s, MTP on, enforce-eager) or Tier 4 (239K, 8.0 tok/s, MTP off, enforce-eager). Neither is as fast as Tier 2, but at least now there's a table rather than a guess.

References

vLLM v0.20.2 — OpenAI-compatible inference engine for LLMs; issues #39273 and #40875 cover ngram + SSM state corruption
Qwen3.5-9B-AWQ — Alibaba's 9B GDN hybrid model, AWQ-quantised; 32 layers (8 attention + 24 SSM/linear attention)
FLA (Flash Linear Attention) — CUDA kernels for GDN/GatedDeltaNet linear attention layers; the activation buffer grows with sequence length
torch.compile — PyTorch graph compilation; Inductor backend used by vLLM's CUDA graph capture
systemd.service(5) — specifically ExecStart argument quoting behaviour
turboderp/Qwen3.5-9B-exl3 — ExLlamaV3 quants for the same model (alternative backend if vLLM proves to be a ceiling)

Ta ta for now, and I hope this saves someone the same afternoon of flag-tweaking.

Tech Guinea Pig

26 June 2026