26 May 2026

Where Does Your LLM Actually Live? Model Quantisation, File Formats, and the GPU/RAM Memory Trap

If you've spent any time running large language models locally, you've probably heard terms like AWQ, GGUF, EXL3, vLLM, and ExLlamaV2 thrown around — often without much explanation of how they relate to each other, or why choosing the wrong combination can make your model five times slower than it needs to be.

This post aims to fix that. We'll cover what a model actually is in memory terms, how quantisation changes its footprint, which file formats carry which quantised models, which inference engines speak which formats, and — most importantly — the often-misunderstood question of where the model actually lives when it's running, and why a mixture of GPU and CPU is usually the worst outcome rather than a useful compromise.


What a Model Is in Memory

A language model is, at its core, a large collection of floating-point numbers called weights. A 9 billion parameter model has roughly 9 billion of these numbers. Each one, stored at full precision (FP32), occupies 4 bytes — so a raw 9B model would need about 36 GB of storage and memory. In practice, models are stored and loaded in 16-bit formats (BF16 or FP16), halving that to around 18 GB.

18 GB is already more than most consumer GPUs can hold. A typical gaming GPU has 8–16 GB of VRAM. This is where quantisation comes in.


Quantisation: Trading Precision for Space

Quantisation reduces the number of bits used to store each weight. The key insight is that neural networks are surprisingly tolerant of reduced precision — the quality loss from moving from 16-bit to 4-bit is often small enough to be irrelevant for practical use, while the memory saving is dramatic.

The main quantisation levels

| Precision | Bits per weight | 9B model size | Quality loss |
| --- | --- | --- | --- |
| FP16/BF16 | 16 | ~18 GB | None (reference) |
| FP8 | 8 | ~9 GB | Near-zero |
| INT8 / Q8_0 | 8 | ~9 GB | Minimal |
| INT4 / Q4 | 4 | ~5–6 GB | Small but noticeable |
| 3-bit | 3 | ~4 GB | Moderate |

4-bit quantisation is currently the practical sweet spot for most consumer hardware: a 9B model fits comfortably in an 8 GB GPU, and quality remains good enough for coding, writing, and reasoning tasks.
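
The sizes in the table follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below (the function name is illustrative) computes only the raw weight storage. Real quantised files also carry scale factors and usually keep a few tensors at higher precision, which is why the table quotes ~5–6 GB for 4-bit rather than the raw 4.5 GB.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    levels = [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4), ("3-bit", 3)]
    for label, bits in levels:
        print(f"{label:>10}: {weight_footprint_gb(9e9, bits):.1f} GB")
```

The same function answers the "will it fit?" question for any other model size: compare the result, plus some headroom for the KV cache, against your VRAM.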

It's not just about bit width

The method of quantisation matters as much as the bit width. Two 4-bit models of the same architecture can have meaningfully different output quality depending on how the quantisation was performed:

  • AWQ (Activation-aware Weight Quantization): Calibrates the quantisation using sample inputs, preserving weights that are most sensitive to rounding. Groups of 128 weights share a scale factor.
  • GPTQ: Uses approximate second-order (inverse Hessian) information to minimise quantisation error block by block. It does not explicitly protect activation-salient weights the way AWQ does, and is typically slightly lower quality than AWQ at the same bit width.
  • EXL3 (ExLlamaV3 format): Operates at the individual row level, solving for the optimal bit allocation per row to minimise output error. Can assign more bits to sensitive rows and fewer to robust ones. At 4 bits per weight, EXL3 typically outperforms both AWQ and GPTQ in measured perplexity.
  • GGUF quantisation (Q4_K_M, Q5_K_M, etc.): The K variants ("k-quants") quantise weights in super-blocks whose sub-blocks carry their own scales, with the _M (medium) suffix indicating a mixed strategy in which tensors deemed more important get higher precision. Well-calibrated and widely tested.
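
To make "groups of weights share a scale factor" concrete, here is a minimal sketch of plain round-to-nearest group quantisation. It is deliberately not AWQ, GPTQ, EXL3, or a GGUF k-quant: those methods differ in how they choose scales and which weights they protect. It only shows the shared-scale storage idea they build on. All function names are illustrative.

```python
def quantise_group(weights, bits=4):
    """Symmetric round-to-nearest quantisation of one group sharing a scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0        # one scale for the whole group
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantise_group(codes, scale):
    """Recover approximate weights from integer codes and the shared scale."""
    return [c * scale for c in codes]


def quantise(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantise each independently."""
    return [
        quantise_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]
```

Each weight is reconstructed to within half a quantisation step of its original value, and the step size is set by the largest weight in its group. That is why a single outlier in a group hurts every other weight in it, and why the calibration-based methods above spend effort on protecting sensitive weights.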

File Formats: The Container Around the Weights

Quantised weights are packaged in different file formats, each tied to a particular ecosystem.

GGUF

The format used by llama.cpp and everything built on it (Ollama, LM Studio, Jan). A GGUF file is self-contained: it includes the weights, the model architecture metadata, and tokenizer data in a single file.

GGUF supports a wide range of quantisation levels: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and many more. It's the most portable format — the same file runs on a CPU, a GPU, or a mixture of both.

Safetensors (HuggingFace format)

The standard format for HuggingFace model repositories. Models in AWQ or GPTQ quantisation are typically distributed as collections of .safetensors files alongside a config.json. This format is used by vLLM, transformers, and most Python-based inference stacks.

EXL3 / EXL2

Native formats of the ExLlama family. EXL2 is ExLlamaV2's format; EXL3, the current generation, belongs to ExLlamaV3. These are also safetensors files under the hood, but with ExLlama-specific quantisation data embedded. They cannot be loaded by vLLM or standard transformers; they require the matching ExLlama runtime.


Inference Engines: Who Speaks What

The inference engine is the software that actually loads the weights and runs the forward pass to generate tokens. Each engine has its own strengths, limitations, and supported formats.

Ollama

Built on llama.cpp. Supports GGUF only. Easiest setup — run ollama pull model-name and it downloads and serves the model immediately. Best for quick local use, development, and simple API access. Not designed for high-throughput serving or very long contexts.

vLLM

A production inference server designed for high-throughput serving of many concurrent users. Supports HuggingFace safetensors format, including AWQ, GPTQ, FP8, and unquantised models. Provides an OpenAI-compatible API. Has sophisticated memory management for long contexts (paged attention, chunked prefill).

Best suited for: serving multiple users simultaneously, very long context windows, production deployments.

Not suited for: models in ExLlama-specific quantisation formats (EXL2, EXL3), or single-user interactive use where its multi-request optimisations add overhead rather than help.

ExLlamaV2 / ExLlamaV3 / tabbyAPI

ExLlamaV2 and its successor ExLlamaV3 are CUDA inference libraries with custom kernels tuned for low-batch (single-user) decode. tabbyAPI wraps them in an OpenAI-compatible HTTP server. ExLlamaV2 loads EXL2 and GPTQ models; ExLlamaV3 loads EXL3.

For single-user interactive use, ExLlama is often faster than vLLM because vLLM is optimised for batched requests. ExLlama's kernels are specifically tuned for the batch-size-1 case that dominates personal use.

transformers (HuggingFace)

The reference implementation. Supports almost everything, but is the slowest option in production because it lacks the custom CUDA kernels of the specialised engines. Useful for research, fine-tuning, and running models before optimised backends exist.


The Format-to-Engine Matching Table

| You have | Use this engine |
| --- | --- |
| GGUF (Q4_K_M, Q5_K_M, etc.) | Ollama or llama.cpp directly |
| AWQ safetensors | vLLM |
| GPTQ safetensors | vLLM or ExLlamaV2 |
| EXL3 / EXL2 | ExLlamaV3 / ExLlamaV2 (via tabbyAPI) only |
| FP8 safetensors (official Qwen FP8 etc.) | vLLM |
| Unquantised BF16 safetensors | vLLM or transformers |

Trying to use the wrong engine with a given format either fails outright or forces a slow conversion at load time. The pairing matters.
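
If you script your downloads, the pairing can be written down as a small lookup so a mismatch fails early instead of at load time. This is only a sketch of the table above; the dictionary keys and engine labels are illustrative strings, not identifiers from any of these tools.

```python
# Format -> engines, following the matching table in this post.
ENGINE_FOR_FORMAT = {
    "gguf": ["ollama", "llama.cpp"],
    "awq": ["vllm"],
    "gptq": ["vllm", "tabbyapi"],
    "exl2": ["tabbyapi"],
    "exl3": ["tabbyapi"],
    "fp8": ["vllm"],
    "bf16": ["vllm", "transformers"],
}


def engines_for(fmt: str) -> list[str]:
    """Return the engines that can load a given format, or raise if unknown."""
    key = fmt.strip().lower()
    if key not in ENGINE_FOR_FORMAT:
        raise ValueError(f"no known engine for format {fmt!r}")
    return ENGINE_FOR_FORMAT[key]
```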


Where the Model Actually Lives: The Critical Question

This is where most guides go wrong by omission. A model's performance is determined not just by its quantisation, but by where its weights reside when the forward pass runs.

The three scenarios

Scenario 1: All weights in GPU VRAM

This is the ideal case. The GPU's memory bandwidth — typically 200–900 GB/s depending on the card — feeds weights to the compute cores without any external bottleneck. Token generation is fast.

For a 9B model at 4-bit AWQ (~5.5 GB), an 8 GB GPU holds all the weights comfortably with room left for the KV cache. Decode speed on an RTX 4070 Laptop GPU (8 GB) is 20+ tokens per second.

Scenario 2: All weights in CPU RAM

When a model is too large for VRAM and you configure the inference engine to run entirely on CPU, the CPU's memory subsystem handles everything. Modern DDR5 provides 80–100 GB/s bandwidth, which is slower than GPU memory but consistent. A full CPU inference run on a well-quantised 9B model at Q4 typically yields 3–8 tokens per second depending on the CPU.

Crucially: modern Intel CPUs with AVX_VNNI (like the Intel Core Ultra 7 series) have native INT8 dot product instructions. This means Q8_0 (8-bit quantisation) computes at nearly the same speed as Q4_K_M on these CPUs — the extra compute cost of INT8 is offset by the hardware acceleration. You get meaningfully better quality for free.
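
A rough way to see why bandwidth dominates the first two scenarios: at batch size 1, generating a token means reading essentially every weight once, so memory bandwidth divided by model size gives an upper bound on decode speed. The sketch below assumes exactly that and ignores KV-cache reads, compute time, and kernel efficiency, so real numbers land well below it. The function name is illustrative.

```python
def decode_ceiling_tps(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if every weight is read once per token."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 5.5  # the ~5.5 GB 4-bit 9B model used as the example above
    for label, bandwidth in [("GPU VRAM, 300 GB/s", 300), ("DDR5, 90 GB/s", 90)]:
        print(f"{label}: at most {decode_ceiling_tps(model_gb, bandwidth):.0f} tok/s")
```

The measured figures quoted in this post (20+ tok/s in VRAM, 3–8 tok/s on CPU) sit below these ceilings, and the ratio between them tracks the bandwidth ratio rather than any difference in compute.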

Scenario 3: Weights split across GPU and CPU RAM (the mixed case)

When a model is larger than VRAM, most inference engines will automatically offload some layers to CPU RAM and keep the rest on GPU. This sounds like a reasonable compromise. In practice, it is almost always the worst outcome.

Here's why. The forward pass through a transformer runs layers sequentially. If some layers are on the GPU and some are on the CPU, the computation must cross the PCIe bus at every GPU-CPU boundary:

GPU layer → compute (fast, ~hundreds of GB/s VRAM)
    ↓
PCIe transfer (bottleneck: ~32 GB/s per direction)
    ↓
CPU layer → compute (slower, but not the problem)
    ↓
PCIe transfer back
    ↓
GPU layer → compute...

PCIe Gen 4 x16 has a practical throughput of around 28–32 GB/s. Every token generated requires transferring the activations across this bus at every layer boundary. For a 9B model split 50/50, this happens dozens of times per token. The result: decode speed collapses to around 3 tokens per second — slower than running fully on CPU, and slower than running a smaller model fully on GPU.

The empirical evidence is stark. On an Intel Ultra 7 + RTX 4070 8GB machine:

| Configuration | Model | Tokens/sec |
| --- | --- | --- |
| All in GPU VRAM | Qwen3-8B Q4 | 20+ tok/s |
| Split GPU+CPU | Qwen3.5-27B Q4 | ~3 tok/s |
| Fully CPU | Q8_0 9B (AVX_VNNI) | ~4–5 tok/s |

The 27B model split across GPU and CPU is slower than the smaller 9B model running fully on CPU, so adding the GPU buys little or nothing here. The GPU is largely wasted: it spends most of its time waiting for PCIe transfers.

A special case: MoE models with expert offloading

Mixture-of-Experts (MoE) models introduce a nuance. Models like Qwen3.5-35B-A3B have 35 billion total parameters, but only about 3 billion are active on any given forward pass — the MoE routing selects a small subset of "expert" networks per token.

When the expert weights are offloaded to CPU RAM (via vLLM's --cpu-offload-params experts), only the active experts are transferred per token, not the full parameter set. This reduces the PCIe burden dramatically compared to a dense model. In practice, a 35B MoE model running on an 8 GB GPU with experts offloaded to RAM achieves 5–7 tokens per second — competitive with a smaller dense model entirely in VRAM.

This works because MoE expert routing selects only 8 of 256 experts per token. The PCIe transfer is proportional to the active parameter count, not the total. Dense models have no such relief — all weights are active every token, making the PCIe cost unavoidable.


Practical Decision Guide

When choosing how to run a model locally, the decision tree looks like this:

Does the quantised model fit in your GPU VRAM?
→ Yes: run it in VRAM. Use the best engine for your format.
→ No: continue below.

Is it a dense model (standard transformer)?
→ If it exceeds VRAM by a small margin: consider a smaller or more aggressively quantised version that fits. A Q4_K_M 9B fully in VRAM beats a Q4_K_M 14B split across GPU and CPU.
→ If no suitable version fits in VRAM: set the engine to use zero GPU layers and run fully on CPU. Slow but consistent.
→ Avoid the split if at all possible.

Is it a Mixture-of-Experts model?
→ Expert offloading via vLLM is viable and gives acceptable speed, because only active experts cross PCIe per token.
→ The larger the expert count relative to active experts, the better the ratio.

What file format do you have?
→ GGUF: Ollama. Simplest.
→ AWQ/GPTQ safetensors: vLLM. Best for long context and multi-user.
→ EXL3: tabbyAPI. Best for single-user interactive speed.


Summary

  • Quantisation reduces model size by lowering weight precision. 4-bit is the practical sweet spot for consumer GPUs. Quality varies by method: EXL3 > AWQ > GPTQ at equivalent bit widths.
  • File formats are tied to ecosystems: GGUF for Ollama/llama.cpp, safetensors for vLLM, EXL3 for ExLlamaV3 (EXL2 for ExLlamaV2). Mismatching format and engine either fails or adds overhead.
  • Where the model lives determines performance more than almost any other factor:
      • All in GPU VRAM: fast (20+ tok/s for 9B)
      • All in CPU RAM: slow but consistent (3–8 tok/s); Intel AVX_VNNI makes Q8_0 competitive
      • Split GPU+CPU: usually the worst outcome, because PCIe becomes the bottleneck and the GPU is underutilised
  • MoE models are the exception to the split-is-worst rule, because only active experts need to cross PCIe per token.
  • Match your model size to your VRAM. When in doubt, run a smaller model fully in VRAM rather than a larger model split across GPU and CPU.

The goal is to never let the PCIe bus become your bottleneck. Everything else — quantisation method, inference engine, file format — is secondary to keeping your weights on the right side of that bus.

"Where Do I Run This?" — A Surprisingly Interesting Answer

Published: 2026-05-15
Tags: claude-code, ai-agents, local-ai, meta


While setting up a large-context benchmark for our llama.cpp series, I asked Claude Code
to prepare a prompt for a sub-agent to run the long benchmark job autonomously. It did,
then added a note at the end:

"To launch: Agent(subagent_type="general-purpose", prompt=open(...).read())"

My immediate question: where do I run that?

The answer reframed something I thought I understood.


It's Not Your Code. It's Claude's.

Agent(...) isn't a Python library you install. It isn't a CLI command. It's a tool
that Claude Code calls internally, in the same category as Bash, Read, or Write.

When Claude runs Bash("nvidia-smi"), your terminal executes nvidia-smi. When Claude
calls Agent(...), a new AI agent spins up — with its own Bash, its own file access, its
own web search — and works through a task autonomously, just like Claude is working through
this conversation.

The pseudocode Claude wrote was essentially describing its own next action in notation a
programmer would recognise. It was talking about itself.


The Practical Shape of It

The flow looks like this:

You → Claude Code (this chat)
         └─ Agent(prompt="benchmark llama.cpp at 262K context...")
                └─ Sub-agent (no memory of your conversation)
                       ├─ writes bench_large_context.py
                       ├─ runs it (takes ~90 minutes)
                       ├─ reads results
                       └─ writes blog_large_context.md
         └─ "Done — decode speed drops 12% at 262K tokens. Blog post written."
You ← result

The sub-agent gets one thing: the prompt. It has no access to your conversation history.
That's why the prompt file we prepared was so detailed — it had to stand alone as a
complete briefing for someone who just walked into the room.


Why This Matters More Than It Looks

Most AI tooling has a clean boundary: the human decides what to do, the AI executes one
step. What's different here is that Claude can delegate to another Claude — and that
second agent can delegate further, run for an hour, write code, execute it, read the
output, and revise. The human isn't in the loop for each step.

That changes the unit of work. Instead of "ask AI to write a benchmark script," the unit
becomes "ask AI to run the benchmark campaign and deliver results." The script is an
implementation detail.

It also changes what a good prompt looks like. Writing for a sub-agent is closer to
writing a spec for a colleague than writing a prompt for a chatbot. It needs context,
constraints, expected outputs, and failure modes — because there's nobody to ask for
clarification once it starts.
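
One way to keep a brief honest is to make the four ingredients explicit fields, so an empty
one is obvious before the job starts. The sketch below is purely illustrative: the class and
field names are my own, it is not part of Claude Code, and it does nothing except render
text that could be handed over as the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgentBrief:
    """A standalone briefing for an agent with no access to the conversation."""
    goal: str
    context: str
    constraints: list[str] = field(default_factory=list)
    expected_outputs: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

    def render(self) -> str:
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items) if items else "- (none stated)"

        sections = [
            ("GOAL", self.goal),
            ("CONTEXT", self.context),
            ("CONSTRAINTS", bullets(self.constraints)),
            ("EXPECTED OUTPUTS", bullets(self.expected_outputs)),
            ("FAILURE MODES", bullets(self.failure_modes)),
        ]
        return "\n\n".join(f"{title}\n{body}" for title, body in sections)
```

A section that renders as "(none stated)" is a prompt that still depends on context the
sub-agent will never see.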


The Meta Moment

The most interesting part of this exchange wasn't the answer. It was the question.

"Where do I run this?" assumes that code is something humans execute. But in a system
where the AI has a shell, a file system, and the ability to spawn other AIs, that
assumption quietly stops being true. The code Claude wrote wasn't for me. It was a note
to itself about what to do next.

We're early enough in this that the boundary between "Claude explaining a thing" and
"Claude doing a thing" isn't always obvious. Paying attention to which side of that line
you're on turns out to be worth it.


This post is part of a series on running large language models locally on consumer
hardware. The benchmark it references — Qwen3.5-35B-A3B at 262K context on 8GB VRAM —
is covered in the companion posts in this series.

Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware

The Hardware

I benchmarked two very different local AI setups:

Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)

Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 Laptop GPU, 8GB VRAM (~300 GB/s GDDR6 bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM

The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.


The Benchmark Setup

I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:

  • Short: "What is 2 + 2? Answer in one word." — tests base throughput
  • Reasoning: A multi-step maths problem — tests sustained generation
  • Coding: Fibonacci with memoization in Python — tests structured output

Metric: tokens per second (TPS) for generation.
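
For reference, generation TPS can be read straight from Ollama's own timing fields. A non-streaming reply from the /api/generate endpoint includes eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below is not the benchmark_ollama.py script used for these results, which is not shown here; it is a minimal stand-in that assumes Ollama is listening on its default local port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def generation_tps(reply: dict) -> float:
    """Tokens per second for the generation phase; eval_duration is in nanoseconds."""
    return reply["eval_count"] / reply["eval_duration"] * 1e9
```

Usage is generation_tps(generate("qwen3:8b", "What is 2 + 2? Answer in one word.")), averaged over the three prompt types.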


Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)

Model Architecture Comparison (Q4_K_M unless noted)

| Model | Avg TPS | Total Params | Active Params | Type |
| --- | --- | --- | --- | --- |
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |

Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU

The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 uses 2× the memory bandwidth of Q4_K_M with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:

| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
| --- | --- | --- | --- |
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |

Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.


Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)

General and Reasoning Models

| Model | Avg TPS | Params | Notes |
| --- | --- | --- | --- |
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |

Vision Models (for ComfyUI and multimodal workflows)

| Model | Avg TPS | VRAM | Notes |
| --- | --- | --- | --- |
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance: fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse; short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size (VRAM load overhead) |

The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.


The Key Finding: Active Parameters Are What Matter

The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.

This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.

The rule that emerges from these results:

MoE speed advantage only materialises when active parameter count is kept low.

Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at roughly the same speed as the dense 14B model (3.5 vs 3.1 TPS), because the active compute is similar.

qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
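
The ordering in the results table falls out of a one-line estimate: on a bandwidth-limited machine, the bytes read per token are set by the active parameters, not the total. The sketch below assumes a nominal 4 bits per weight and the ~50 GB/s DDR4 figure quoted earlier; both are simplifications (real Q4_K_M files average slightly more than 4 bits, and routing, attention, and KV reads are ignored), so it gives ceilings rather than predictions.

```python
DDR4_BANDWIDTH_GB_S = 50   # this post's estimate for the 5800U's DDR4-3200
BITS_PER_WEIGHT = 4        # nominal Q4; an assumption, not a measured average

# (model, active parameters per token, measured average TPS from the table)
MODELS = [
    ("qwen3:30b-a3b", 3e9, 12.0),
    ("qwen3:8b", 8e9, 5.3),
    ("mixtral:8x7b", 12.9e9, 3.5),
    ("deepseek-r1:14b", 14e9, 3.1),
]


def ceiling_tps(active_params: float,
                bits: float = BITS_PER_WEIGHT,
                bandwidth_gb_s: float = DDR4_BANDWIDTH_GB_S) -> float:
    """Bandwidth-bound upper limit on decode speed for a given active size."""
    bytes_per_token = active_params * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    for name, active, measured in MODELS:
        print(f"{name:<18} ceiling {ceiling_tps(active):5.1f} tok/s, measured {measured}")
```

The measured speeds sit at roughly a third to a half of each ceiling, but the ranking is the same: models are ordered by active parameter count, regardless of total size or whether they are MoE.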


The Two Hardware Stories

Discrete GPU: Fast but VRAM-limited

The RTX 4070 hits 35–163 TPS for models that fit in 8GB VRAM. It's fast — bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, identical to the APU. The discrete GPU is a sprinter with a hard wall.

Shared-Memory APU: Slow but capacious

The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.

Where they meet

When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.

The MoE Sweet Spot for APUs

Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.


Practical Recommendations

For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0: in these tests Q4_K_M was 21–57% faster (put the other way, Q8_0 was roughly 18–36% slower) on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time

For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model


Tools Used

  • Ollama — inference server, Vulkan backend
  • llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
  • benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types

Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).

25 May 2026

tabbyAPI + Qwen3.5-9B on 8GB VRAM: Long Context, CUDA Upgrades, and What Actually Matters

Date: May 2026
Hardware: RTX 4070 Laptop GPU (8GB VRAM), Intel Ultra 7 165H, 64GB DDR5
Model: turboderp/Qwen3.5-9B-exl3 @ 4.00bpw
Backend: tabbyAPI (ExLlamaV3 0.0.37)


The Goal

I wanted a fast, long-context 9B model running locally as an OpenAI-compatible API. Not Ollama — I needed raw performance and a proper context window, not convenience defaults. The candidate: Qwen3.5-9B at 4.0 bpw EXL3 quantisation via tabbyAPI in an Incus LXC container with GPU passthrough.

The question was never whether it would run. It was: what's the real context ceiling, and does the software stack matter as much as the marketing claims?


Why tabbyAPI over vLLM or Ollama?

Ollama's great when you want things to just work. vLLM is the right tool for batched multi-user workloads. But for a single-user API that needs maximum single-request throughput and fine-grained KV control:

  • tabbyAPI (ExLlamaV3): paged KV cache, per-element KV quantisation (k_bits,v_bits independently), aggressive speculative decoding, GDN-aware caching
  • vLLM: excellent batching, but the KV and context ceiling maths are more complex on 8GB with hybrid models
  • Ollama: abstracts away all the knobs — which is a problem when the knobs are exactly what you want to tune

The Architecture Surprise

Before tuning anything I read the config.json:

"layer_types": ["linear", "linear", "linear", "full", ...]
"full_attention_interval": 4

Qwen3.5-9B is not a pure transformer. It's a Gated DeltaNet (GDN) hybrid: 24 linear-attention GDN layers and only 8 full-attention layers out of 32 total. That changes everything about the VRAM maths.

Standard transformer KV cache sizing assumes every layer has a full attention KV block. With only 8 full-attention layers:

bytes/token = 8 layers × 4 KV_heads × 256 head_dim × 2 (K+V) × bits/8
| cache_mode | bytes/token | tokens in 2.4 GB KV budget |
| --- | --- | --- |
| 8,8 (Q8 K+V) | 16,384 B | ~155,000 |
| 8,4 (Q8 K, Q4 V) | 12,288 B | ~207,000 |
| 4,4 (Q4 K+V) | 8,192 B | ~310,000 |

The KV pressure is 4× lower than a pure transformer at the same parameter count. That meant 180K tokens at Q8,4 was plausible before a single probe had run.
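
The sizing formula is easy to check in a few lines. The sketch below reproduces the bytes/token column exactly. For the token counts it assumes the 2.4 GB budget means 2.4 GiB, which lands within about 2% of the rounded figures in the table; with decimal gigabytes the counts come out a little lower. Function names are illustrative.

```python
def kv_bytes_per_token(full_attn_layers: int, kv_heads: int, head_dim: int,
                       k_bits: int, v_bits: int) -> int:
    """KV-cache bytes per token, counting only full-attention layers."""
    elements_per_layer = kv_heads * head_dim          # per K and per V
    return full_attn_layers * elements_per_layer * (k_bits + v_bits) // 8


def tokens_in_budget(budget_bytes: int, bytes_per_token: int) -> int:
    """Whole tokens that fit in a given KV budget."""
    return budget_bytes // bytes_per_token


if __name__ == "__main__":
    budget = int(2.4 * 1024**3)  # assumption: "2.4 GB" read as 2.4 GiB
    for k_bits, v_bits in [(8, 8), (8, 4), (4, 4)]:
        per_token = kv_bytes_per_token(8, 4, 256, k_bits, v_bits)
        print(f"{k_bits},{v_bits}: {per_token} B/token, "
              f"{tokens_in_budget(budget, per_token)} tokens")
```

Passing 32 instead of 8 for full_attn_layers gives the figure for a hypothetical pure transformer with the same head layout, which is where the 4× comes from.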

The GDN recurrent state (the 24 linear layers) doesn't live in VRAM at all. tabbyAPI serialises it to system RAM between requests via sysmem_recurrent_cache — an OrderedDict in Python process memory. At the default 4 GB, dozens of concurrent long sessions co-exist without touching the GPU.


Finding the Actual Context Ceiling

Theoretical maths tells you where to probe. It doesn't tell you the real ceiling. ExLlamaV3 has internal allocation overhead, batch workspace, and the model weights themselves — all competing for the same 8 GB.

I wrote a binary search probe script (tabbyapi_probe.py) that:
1. Writes a config with a candidate cache_size
2. Pushes it into the container and restarts the service
3. Polls journalctl for Application startup complete (success) or Insufficient VRAM in split for model and cache (OOM)
4. Handles the systemd stale-restart race (OOM → systemd auto-restart with old config → second OOM) before probing the next candidate

python3 tabbyapi_probe.py \
  --container tabbyapi \
  --model-name Qwen3.5-9B-exl3-4.0bpw \
  --cache-mode 8,4 \
  --lo 131072 --hi 262144 \
  --config-template tabbyapi_config.yml \
  --output probe_results.json
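
The search logic itself is small. The sketch below is not tabbyapi_probe.py; it shows only the bisection, working in whole KV pages (256 tokens each in this setup) and taking a hypothetical loads_ok callback that stands in for "write the config, restart the service, and watch the journal".

```python
from typing import Callable

PAGE_TOKENS = 256  # KV page size in tokens for this setup


def find_max_cache_size(lo: int, hi: int, loads_ok: Callable[[int], bool]) -> int:
    """Largest page-aligned cache_size that loads.

    Assumes lo is known to load and hi is known to fail, and that
    loading is monotonic: if a size fails, every larger size fails too.
    """
    lo_pages, hi_pages = lo // PAGE_TOKENS, hi // PAGE_TOKENS
    while hi_pages - lo_pages > 1:
        mid_pages = (lo_pages + hi_pages) // 2
        if loads_ok(mid_pages * PAGE_TOKENS):
            lo_pages = mid_pages      # success: ceiling is at or above mid
        else:
            hi_pages = mid_pages      # OOM: ceiling is below mid
    return lo_pages * PAGE_TOKENS
```

With the bounds from the command above, the search needs at most nine probes to pin the ceiling down to a single page. Because each probe costs a full model load, bisecting in pages rather than tokens avoids spending restarts on sizes that round to the same allocation.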

Q8,4 binary search results (both stacks — identical ceiling)

| cache_size | pages | outcome | VRAM used / free |
| --- | --- | --- | --- |
| 131,072 | 512 | ✓ success | 6,083 / 1,724 MiB |
| 262,144 | 1,024 | ✗ OOM | |
| 196,608 | 768 | ✗ OOM | |
| 163,840 | 640 | ✓ success | 6,499 / 1,308 MiB |
| 180,224 | 704 | ✓ success | 6,691 / 1,116 MiB |
| 188,416 | 736 | ✓ success | 6,787 / 1,020 MiB |
| 192,512 | 752 | ✓ success | 6,851 / 956 MiB |
| 194,560 | 760 | ✓ success | 6,883 / 924 MiB |
| 195,584 | 764 | ✗ OOM | |
| 195,072 | 762 | ✓ confirmed max | 6,883 / 924 MiB |

Both Stack A (torch 2.9, CUDA 12.8) and Stack C revised (torch 2.11, CUDA 13.0) converged on the same ceiling. The KV page size is 256 tokens, so the allocation granularity is 256 × 12,288 bytes ≈ 3 MiB per page. CUDA runtime overhead differences between 12.8 and 13.0 are smaller than a single page.

Q4,4 results (reference, from earlier work)

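Maximum confirmed: 180,992 tokens at Q4,4. Q8,4 uses 1.5× more bytes per token, so its ceiling would be expected to fall lower. The 195,072-token Q8,4 result above is higher, which suggests the earlier Q4,4 figure was measured under different conditions and is not directly comparable.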


The Package Compatibility Maze

This is where things got interesting.

I'd been running the original stack fine: torch 2.9.0+cu128, ExLlamaV3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2. Clean. Everything working.

Then I went looking for whether a newer ExLlamaV3 with CUDA 13.2 kernels would improve things. The answer required understanding something fundamental about Python C extensions.

Why upgrading PyTorch is not like upgrading a Python package

Every .so extension compiled against PyTorch links against internal PyTorch symbols by name — things like _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib. Those symbols are not part of any stable ABI. They change between PyTorch minor versions. The result: upgrading torch from 2.9 to 2.11 silently breaks flash-attn and causal-conv1d even though they're "installed" and even though importlib.util.find_spec() says they're present.

That last point matters: tabbyAPI uses find_spec() to check optional dependencies. A package with a broken .so still passes find_spec. The crash only happens when the module is actually imported at runtime.
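
The difference between "findable" and "importable" is easy to test for. The sketch below (the function name is illustrative, and this is not tabbyAPI's code) classifies a module by really importing it, which is the only check that exercises the dynamic linker.

```python
import importlib
import importlib.util


def extension_status(name: str) -> str:
    """Classify a top-level module as 'missing', 'broken: ...', or 'ok'."""
    if importlib.util.find_spec(name) is None:
        return "missing"              # not installed at all
    try:
        importlib.import_module(name)  # loads the .so, resolving its symbols
    except Exception as exc:           # ABI breaks usually surface as ImportError
        return f"broken: {type(exc).__name__}"
    return "ok"


if __name__ == "__main__":
    for module in ("flash_attn", "causal_conv1d", "fla"):
        print(f"{module}: {extension_status(module)}")
```

Running this after any torch upgrade shows at a glance which compiled extensions need a matching wheel. The trade-off is that the check has import side effects and takes longer than find_spec(), which is presumably why a lightweight presence check is tempting in the first place.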

The compatibility matrix

| Stack | torch | ExLlamaV3 | flash-attn | causal-conv1d | FLA |
| --- | --- | --- | --- | --- | --- |
| A | 2.9.0+cu128 | 0.0.37+cu128 | 2.8.3+cu128 ✓ | 1.6.2.post1 ✓ | |
| B | 2.10.0+cu128 | 0.0.37+cu128 | 2.8.3+cu128 or +cu130 ✓ | 1.6.2.post1 ✓ | |
| C | 2.11.0+cu130 | 0.0.37+cu132 | 2.8.3+cu130 ✓ | ✗ no wheel | |
| D | 2.12.0+cu130 | 0.0.37+cu132 | 2.8.3+cu130 or +cu132 ✓ | | |

The story I originally told about Stack C was wrong. I removed flash-attn when upgrading to torch 2.11, assuming no compatible wheel existed. It does: flash_attn-2.8.3+cu130torch2.11 is available from mjun0812's prebuild repo. The find_spec() check in tabbyAPI would have passed (broken ABI .so still shows as installed), but the correct fix was to install a compatible wheel, not remove the package.

  • causal-conv1d still has no torch2.10+ wheel as of May 2026 — this one is genuinely unavailable without building from source
  • flash-linear-attention is pure Python + Triton — no ABI coupling, works with any torch version

What each missing package actually costs

flash-attn handles the 8 full-attention layers in Qwen3.5-9B. Without it, ExLlamaV3 dispatches to Triton paged attention (its next preference in the fallback list). The measured cost: about 3% at 512-token context, growing to about 45% lower decode throughput at 131K context (6 vs 11 tok/s; put the other way, flash-attn is 83% faster). The loss is entirely in the KV-cache decode path: flash-attn's fused kernel has much lower memory bandwidth overhead than Triton paged attention at long sequences.

causal-conv1d accelerates the conv1d operations inside the 24 GDN layers. Without it, ExLlamaV3's Triton kernel for GatedDeltaNet recurrence handles them instead. The measured cost on RTX 4070: zero. No decode speed difference at any context length.

flash-linear-attention (FLA) accelerates the GatedDeltaNet forward pass via Triton kernels. This one stayed installed across all stacks.

The CUDA 13.2 upgrade — what actually changed

Upgrading to ExLlamaV3 0.0.37+cu132 gets you custom CUDA kernels compiled against CUDA 13.2. The ExLlamaV3 custom ops (quantised matmul, RoPE, etc.) are recompiled with newer compiler optimisations. Whether that recovers the flash-attn and causal-conv1d losses empirically is exactly what the benchmark below tests.


Other Tuning Applied

ngram_match_min: 3 (free speculative decoding)

tabbyAPI hardcodes ngram_match_min=0 in its AsyncGenerator constructor — the parameter exists in ExLlamaV3 but isn't exposed in config.yml. One line patch to /opt/tabbyapi/backends/exllamav3/model.py:

self.generator = AsyncGenerator(
    ...
    num_draft_tokens=self.draft_num_tokens,
    ngram_match_min=3,   # ← added
)

With value 3: when the last 3+ output tokens have appeared somewhere in the input context, the next token is drafted from that prior occurrence and the main model validates in parallel. Zero VRAM cost, zero draft model, purely context-driven. Best gains on structured or repetitive text — code, documents that quote themselves, long reasoning chains.
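
The idea can be sketched in a few lines. This is an illustration of n-gram lookup drafting in general, not ExLlamaV3's implementation: the function name and the max_draft parameter are my own, and a real engine works on token IDs inside its generation loop and verifies every drafted token with the main model before accepting it.

```python
def ngram_draft(tokens, match_min=3, max_draft=4):
    """Propose draft tokens by finding an earlier copy of the last match_min tokens."""
    if match_min <= 0 or len(tokens) <= match_min:
        return []
    tail = tokens[-match_min:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - match_min - 1, -1, -1):
        if tokens[start:start + match_min] == tail:
            follow = start + match_min
            return tokens[follow:follow + max_draft]   # what came next last time
    return []                                          # no match: decode normally
```

For example, with the sequence [1, 2, 3, 4, 5, 6, 1, 2, 3] the last three tokens also appear at the start, so the draft is whatever followed them there. If the model agrees with the draft, several tokens are accepted for the cost of one forward pass; if not, nothing is lost beyond the verification.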

sysmem_recurrent_cache

Confirmed system RAM, not VRAM. The RecurrentCache is a Python OrderedDict holding serialised GDN recurrent states between requests. Default 4 GB; at ~MB-scale per state this handles dozens of concurrent long sessions. Left at default.

max_batch_size

Left at default (4). TODO: measure VRAM savings at batch size 1 — with a single-user setup there's no batch parallelism to lose.


Benchmark: Stack A vs Stack C

Methodology

  • 3 trials per context length, cached-run median reported (trial 1 is always a cold prefill — excluded from medians)
  • Context lengths tested: 512, 4,096, 16,384, 65,536, 131,072 tokens
  • Decode length: 512 tokens per request
  • Prefill speed = prompt tokens / TTFT; decode speed = 512 / (total − TTFT)
  • VRAM peak sampled during generation
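The two throughput formulas and the cached-run median can be written down as a small helper. This is a sketch under assumptions: the function names are mine, and the timings in the example are made up for illustration, not taken from the benchmark runs.

```python
import statistics


def throughput(prompt_tokens, decode_tokens, ttft_s, total_s):
    """Return (prefill tok/s, decode tok/s) from one request's timings."""
    if ttft_s <= 0 or total_s <= ttft_s:
        raise ValueError("need 0 < ttft_s < total_s")
    prefill = prompt_tokens / ttft_s
    decode = decode_tokens / (total_s - ttft_s)
    return prefill, decode


def cached_median(values):
    """Median of all trials except the first, which is the cold prefill."""
    if len(values) < 2:
        raise ValueError("need at least one cached trial after the cold one")
    return statistics.median(values[1:])


if __name__ == "__main__":
    # Hypothetical timings: 4,096 prompt tokens, 512 decoded tokens.
    prefill, decode = throughput(4096, 512, ttft_s=0.4, total_s=16.4)
    print(f"prefill {prefill:.0f} tok/s, decode {decode:.1f} tok/s")
    print(cached_median([12.0, 30.0, 32.0]))
```

With three trials the cached median is simply the mean of trials 2 and 3, so a single outlier in either one shows up directly in the reported figure.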

Three stacks measured: Stack A (torch 2.9, full dependencies), Stack C original (torch 2.11, flash-attn accidentally removed), Stack C revised (torch 2.11, flash-attn restored).

Results: decode throughput (tokens/sec)

| Context (tokens) | Stack A | Stack C orig (no flash-attn) | Stack C revised (flash-attn) |
| --- | --- | --- | --- |
| 512 | 34 | 33 | 33 |
| 4,096 | 33 | 30 | 32 |
| 16,384 | 28 | 23 | 28 |
| 65,536 | 16 | 11 | 16 |
| 131,072 | 11 | 6 | 11 |

Results: prefill throughput (tokens/sec, cached run)

| Context (tokens) | Stack A | Stack C revised | Delta |
| --- | --- | --- | --- |
| 512 | 237 | 236 | 0% |
| 4,096 | 10,224 | 10,401 | +2% |
| 16,384 | 37,432 | 36,808 | −2% |
| 65,536 | 77,584 | 78,290 | +1% |
| 131,072 | 80,867 | 94,006 | +16% |

Max context ceiling (cache_mode 8,4: Q8 keys, Q4 values)

| Metric | Stack A | Stack C revised |
| --- | --- | --- |
| Max cache_size (tokens) | 195,072 | 195,072 |
| VRAM used at max (MiB) | 6,883 | 6,883 |

Analysis

flash-attn on 25% of layers is not a small thing

The hypothesis going in was that Triton paged attention on 8 out of 32 layers wouldn't be catastrophic. The data shows otherwise:

  • At 512 tokens (minimal KV pressure, dominated by weight ops): 34 vs 33 tok/s — 3% difference, barely measurable
  • At 131,072 tokens (maximum KV pressure): 11 vs 6 tok/s — 83% faster with flash-attn

The gap is entirely context-dependent. At short context, the 8 full-attention layers spend most of their time on the matmuls and barely touch the KV cache. At 131K context, every decode step in those 8 layers has to read and attend over a ~131K-token KV cache, so the per-token attention cost grows linearly with context, and flash-attn's fused CUDA kernel versus Triton's paged implementation is the difference between a usable and an unusable response time.
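"Faster" and "slower" percentages are easy to mix up because they use different baselines. The sketch below recomputes both from the decode table, taking the Stack A and Stack C orig columns as the with/without flash-attn pair, as the bullets above do. The helper name relative_change is mine.

```python
# Decode throughput in tok/s from the table: (with flash-attn, without flash-attn)
DECODE_TOK_S = {
    512: (34, 33),
    4096: (33, 30),
    16384: (28, 23),
    65536: (16, 11),
    131072: (11, 6),
}


def relative_change(with_fa, without_fa):
    """Return (percent faster with flash-attn, percent slower without it)."""
    faster = (with_fa / without_fa - 1) * 100   # baseline: the slow path
    slower = (1 - without_fa / with_fa) * 100   # baseline: the fast path
    return round(faster), round(slower)


if __name__ == "__main__":
    for context, (with_fa, without_fa) in DECODE_TOK_S.items():
        faster, slower = relative_change(with_fa, without_fa)
        print(f"{context:>7} tokens: {faster}% faster with, {slower}% slower without")
```

At 131,072 tokens the same pair of numbers reads as 83% faster with flash-attn or 45% slower without it, depending on which side you treat as the baseline.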

causal-conv1d: no measurable impact

Stack A has causal-conv1d for the 24 GDN/linear layers, Stack C revised does not. The decode speed difference between the two is 0–3% at all context lengths — within noise. ExLlamaV3's own Triton kernel for the GatedDeltaNet recurrence is already well-optimised on this hardware. The package exists for older GPU generations and smaller models where the Triton path has more overhead.

CUDA 12.8 vs 13.0: identical context ceiling, near-identical throughput

Both stacks hit 195,072 tokens. CUDA 13.0 runtime overhead is below the 256-token page granularity (~3 MiB) for this model size. Prefill throughput differences are within 2% at all context lengths except one: Stack C revised shows +16% prefill throughput at 131K context (94,006 vs 80,867 tok/s).

This is real but narrowly applicable. The 131K cold prefill takes ~87 seconds and is susceptible to thermal variation. It's a single uncached run in each 3-trial set. Whether the CUDA 13.2 kernel compilation produces genuinely faster attention code at very long sequence lengths is worth testing with more trials.


Conclusion

Flash-attn is the only dependency that matters at long context. causal-conv1d, despite covering 75% of layers, makes no measurable difference. CUDA kernel generation (12.8 vs 13.2) makes no difference to decode throughput and has no effect on the context ceiling.

The production recommendation depends on your PyTorch version:

  • If running torch 2.9 (CUDA 12.8): install flash-attn from the pre-built wheel. This is the simplest supported configuration and you get full performance.
  • If running torch 2.11 (CUDA 13.0): flash_attn-2.8.3+cu130torch2.11 exists and is installable — it is not the default anyone reaches for but it works. Install it and decode at 131K context becomes 11 tok/s instead of 6 tok/s. causal-conv1d has no torch 2.11 wheel; leave it absent.

The architecture insight holds: this model's context ceiling is not where you'd expect it. With only 8 full-attention layers out of 32, KV cache pressure is 4× lower than a pure transformer. The recurrent state of the 24 GDN layers lives entirely in system RAM. At Q8K/Q4V, 195K tokens fits in 8 GB alongside a 4.5 GB model, something that would be impossible with a full-transformer architecture.
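A rough way to sanity-check the 4× figure is to write the KV cache size as a function of the number of attention layers. The sketch below is an assumption-laden estimate: the kv_heads and head_dim values are placeholders rather than Qwen3.5-9B's real configuration, and quantisation scale overhead is ignored. The ratio between the two layouts does not depend on those placeholders.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, k_bits, v_bits):
    """Approximate KV cache size: one K and one V vector per token per layer."""
    bits_per_token_per_layer = kv_heads * head_dim * (k_bits + v_bits)
    return tokens * attn_layers * bits_per_token_per_layer // 8


if __name__ == "__main__":
    # Placeholder head layout (hypothetical, for illustration only).
    KV_HEADS, HEAD_DIM = 4, 256
    hybrid = kv_cache_bytes(195_072, 8, KV_HEADS, HEAD_DIM, k_bits=8, v_bits=4)
    full = kv_cache_bytes(195_072, 32, KV_HEADS, HEAD_DIM, k_bits=8, v_bits=4)
    print(f"8 attention layers:  {hybrid / 2**20:.0f} MiB")
    print(f"32 attention layers: {full / 2**20:.0f} MiB")
    print(f"ratio: {full / hybrid:.0f}x")
```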

The ceiling you can't push past is quantisation quality in those 8 attention layers at long context, not VRAM.


Appendix: Key Commands

Run context ceiling probe:

python3 agents/tabbyapi_probe.py \
  --container tabbyapi --model-name Qwen3.5-9B-exl3-4.0bpw \
  --cache-mode 8,4 --lo 131072 --hi 262144 \
  --config-template /home/user/.claude/jobs/f7398869/tabbyapi_config.yml \
  --output probe_results.json --timeout 180

Downgrade to Stack A:

incus exec tabbyapi -- systemctl stop tabbyapi
incus exec tabbyapi -- /root/.local/bin/uv pip install \
  "https://download.pytorch.org/whl/cu128/torch-2.9.0%2Bcu128-cp312-cp312-linux_x86_64.whl" \
  --python /opt/tabbyapi/.venv/bin/python3
# then exllamav3 0.0.34, flash-attn 2.8.3, causal-conv1d 1.6.2.post1

Re-apply ngram patch after any tabbyAPI update:

incus exec tabbyapi -- sed -i \
  's/                num_draft_tokens=self.draft_num_tokens,/                num_draft_tokens=self.draft_num_tokens,\n                ngram_match_min=3,/' \
  /opt/tabbyapi/backends/exllamav3/model.py
incus exec tabbyapi -- systemctl restart tabbyapi

Check package stack:

incus exec tabbyapi -- /opt/tabbyapi/.venv/bin/python3 -c "
import torch, exllamav3
print('torch:', torch.__version__)
print('exllamav3:', exllamav3.__version__)
try:
    import flash_attn; print('flash_attn:', flash_attn.__version__)
except ImportError:
    print('flash_attn: NOT INSTALLED')
try:
    import causal_conv1d; print('causal_conv1d: OK')
except ImportError:
    print('causal_conv1d: NOT INSTALLED')
try:
    import fla; print('FLA: OK')   # flash-linear-attention installs as the 'fla' module
except ImportError:
    print('FLA: NOT INSTALLED')
"

18 May 2026

Backing Up Your X99 BIOS on Linux with flashrom

A 5-minute job that could save your board one day.


If you have an ASUS X99-E WS (or any X99 board) running Linux and want a full BIOS backup before modding firmware — Intel's own Flash Programming Tool (FPT) is a dead end. The ME System Tools v9.1 package ships a Linux MEInfo binary but no Linux FPT. Intel only added FPT for Linux in v12+, which won't run on ME 9.x hardware.

The answer is flashrom — open-source, in Debian's repos, and purpose-built for this.

Install

apt install flashrom

Dump the full 16 MB SPI flash

flashrom -p internal -r x99_ews_backup.bin

On a stock X99-E WS you'll see all four flash regions reported as read-write:

Found chipset "Intel C610/X99 (Wellsburg)".
FREG0: Flash Descriptor region (0x00000000-0x00000fff) is read-write.
FREG1: BIOS region (0x00180000-0x00ffffff) is read-write.
FREG2: Management Engine region (0x00003000-0x0017ffff) is read-write.
FREG3: Gigabit Ethernet region (0x00001000-0x00002fff) is read-write.
Found Winbond flash chip "W25Q128.V" (16384 kB, SPI) mapped at physical address ...
Reading flash... done.

ASUS ships the X99-E WS with the SPI descriptor fully open — no read restrictions. This is not guaranteed on all boards.

Verify the dump

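Read the chip a second time into a separate file, then hash both dumps. The hashes must match:

flashrom -p internal -r x99_ews_backup_2.bin
md5sum x99_ews_backup.bin x99_ews_backup_2.bin

Two matching hashes from independent reads rule out SPI bus glitching or a partial read. Hashing the same file twice would not, since it only re-reads the copy already on disk.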
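If you read the chip into two separate files, the comparison can also be scripted. The sketch below is one way to do it in Python: the function names are mine, it uses SHA-256 rather than MD5, and it adds a size check against the 16 MB chip size quoted above.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB blocks so the whole image is never held in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def dumps_match(first, second, expected_size=16 * 1024 * 1024):
    """True when both dumps have the expected size and identical content."""
    first, second = Path(first), Path(second)
    if first.stat().st_size != expected_size:
        return False
    if second.stat().st_size != expected_size:
        return False
    return file_sha256(first) == file_sha256(second)
```

The size check catches a truncated read even in the unlikely case that both reads were truncated identically.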

What's in the 16 MB image

Region Range Contents
Descriptor 0x000000–0x000FFF Flash layout + region access rights
GbE 0x001000–0x002FFF Intel NIC MAC address + config
ME 0x003000–0x17FFFF Intel Management Engine firmware
BIOS 0x180000–0xFFFFFF UEFI firmware (AMI v4001)

The GbE region is why a full-chip backup beats a BIOS-region-only dump — your board's MAC address lives there. A BIOS-only restore zeros it out.
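To inspect the regions individually, the image can be sliced using the offsets from the table. This is a sketch that hard-codes this board's layout as listed above; on other boards the layout comes from the flash descriptor, so treat REGIONS as an assumption specific to this dump.

```python
# Inclusive byte ranges, copied from the region table for this X99-E WS dump.
REGIONS = {
    "descriptor": (0x000000, 0x000FFF),
    "gbe": (0x001000, 0x002FFF),
    "me": (0x003000, 0x17FFFF),
    "bios": (0x180000, 0xFFFFFF),
}

IMAGE_SIZE = 16 * 1024 * 1024


def split_regions(image: bytes) -> dict:
    """Return {region name: bytes} for a full 16 MB SPI flash image."""
    if len(image) != IMAGE_SIZE:
        raise ValueError(f"expected {IMAGE_SIZE} bytes, got {len(image)}")
    return {name: image[start:end + 1] for name, (start, end) in REGIONS.items()}
```

A typical use is to read x99_ews_backup.bin with Path.read_bytes(), call split_regions on it, and write the "gbe" slice out separately so the MAC address data has its own copy.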

Copy off the machine

scp root@pve2:/root/x99_ews_backup.bin ~/backups/

Keep a copy off-site. A backup that lives only on the machine you're about to mod isn't much of a backup.

Restore (if needed)

flashrom -p internal -w x99_ews_backup.bin

07 May 2026

Proxmox: Removing a Ghost Node from the Web UI

This is a follow-on to Proxmox Cluster Going Sluggish? Your Offline Node Has a Stale Config. That post covers nodes that are misbehaving but still real. This one covers nodes that don't exist at all.


After sorting out a stale corosync config on a returning node, I noticed the web UI was still showing an extra node — PVE9 — on every host in the cluster. It wasn't causing any problems, just sitting there looking wrong. Here's how to get rid of it.

What a Ghost Node Is

A ghost node is a stale directory in /etc/pve/nodes/ with no corresponding corosync membership. The Proxmox web UI reads from the shared cluster filesystem, not from corosync directly — so a leftover directory shows up as a node even if the machine is long gone. It can appear after a node was removed uncleanly, rebuilt under a different name, or never properly decommissioned.

Check It's Actually a Ghost

First confirm the node isn't just offline — check corosync:

pvecm status
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          3 10.140.3.10
0x00000002          1 10.140.3.80
0x00000003          1 10.140.3.70
0x00000004          1 10.140.3.82
0x00000006          1 10.140.3.20

If it's not in this list, it's a ghost. Confirm the directory exists:

grep pve9 /etc/pve/corosync.conf   # nothing
ls /etc/pve/nodes/
# pve1  pve2  pve7  pve8  pve9  xenon
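The same check can be automated by comparing the directory names against the node names in corosync.conf. A minimal sketch, assuming the usual nodelist layout where each node block has a name: line; the function names and the sample config in the usage note are illustrative, not taken from this cluster.

```python
import re


def corosync_node_names(conf_text):
    """Collect the 'name:' entries from the nodelist of a corosync.conf."""
    # Anchored at line start, so 'cluster_name:' in the totem block is ignored.
    return set(re.findall(r"^\s*name:\s*(\S+)", conf_text, flags=re.MULTILINE))


def ghost_nodes(node_dirs, conf_text):
    """Directories under /etc/pve/nodes with no matching corosync member."""
    members = {name.lower() for name in corosync_node_names(conf_text)}
    return sorted(d for d in node_dirs if d.lower() not in members)
```

On a cluster node you would feed it os.listdir("/etc/pve/nodes") and the text of /etc/pve/corosync.conf; anything it returns is a candidate ghost, still worth confirming by hand before deleting.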

Check for VMs Before Deleting

The node directory may contain VM or container configs:

ls /etc/pve/nodes/pve9/qemu-server/
ls /etc/pve/nodes/pve9/lxc/

If there are configs, check them before doing anything:

cat /etc/pve/nodes/pve9/qemu-server/102.conf

If the node is truly gone, any disks listed as local-lvm:vm-XXX-disk-Y were on that node's local storage and are already inaccessible. You won't be able to recover them. Make sure you're happy with that before proceeding.

Remove It

Try the proper route first:

pvecm delnode pve9

If the node was never in corosync you'll get:

Node/IP: pve9 is not a known host of the cluster.

In that case, remove the directory directly:

rm -rf /etc/pve/nodes/pve9

The deletion replicates across the cluster filesystem immediately. Verify:

ls /etc/pve/nodes/
# pve1  pve2  pve7  pve8  xenon

Reload the web UI — the ghost node is gone.

Quick Reference

Command Purpose
pvecm status Confirm node isn't in corosync
ls /etc/pve/nodes/ List all node directories
ls /etc/pve/nodes/<name>/qemu-server/ Check for VM configs
ls /etc/pve/nodes/<name>/lxc/ Check for container configs
pvecm delnode <name> Proper removal (works if node was in corosync)
rm -rf /etc/pve/nodes/<name> Manual removal for true ghost nodes

Proxmox Cluster Going Sluggish? Your Offline Node Has a Stale Config

You power on a node that's been offline for a while. Within seconds, the Proxmox web UI starts showing other nodes as dead. Management operations slow to a crawl. Nothing is obviously broken — all the nodes are still pinging — but something is clearly very wrong.

This is the stale corosync config problem, and it's easy to fix once you know what to look for.

What's Happening

Proxmox uses corosync to manage cluster membership. Every config change — adding a node, removing a node, changing votes — increments a config_version in /etc/corosync/corosync.conf. All cluster members must agree on this version.

When a node comes back online after missing several config changes, corosync starts up with an old config_version. The other nodes reject its packets. But corosync doesn't give up — it keeps retrying, flooding the network with rejected authentication attempts. This hammers pvedaemon on every node, causing the web UI to become sluggish and show phantom "dead" nodes even though the cluster itself is technically still quorate.

Diagnosis

First, confirm the cluster itself is still healthy from a node you trust:

pvecm status

If you see Quorate: Yes, the cluster is fine — the problem is the misbehaving node, not a genuine quorum loss. Note the Config Version value.

Then SSH into the suspect node and check:

ssh pve7 systemctl status corosync
ssh pve7 cat /etc/corosync/corosync.conf | grep config_version

Here's what it looks like when you've found the culprit:

● corosync.service - Corosync Cluster Engine
     Active: active (running) since Thu 2026-05-07 17:30:55 BST; 8min ago
   Main PID: 1146 (corosync)
     Memory: 155.9M (peak: 171.9M)
        CPU: 11.549s

May 07 17:39:36 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:37 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:38 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Sync members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [TOTEM ] A new membership (3.1a3c) was formed. Members
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [MAIN  ] Completed service synchronization, ready to provide service.

  config_version: 10

Three red flags:
- "Packet rejected" — every node is refusing this node's traffic
- "Sync members[1]: 3" — the node has formed its own single-node pseudo-cluster with just itself (nodeid 3)
- config_version: 10 while the live cluster is at version 19 or higher
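The version comparison is simple enough to script if you want to check a node before letting corosync start. A minimal sketch, assuming config_version appears on its own line in the totem block; the function names are mine.

```python
import re


def config_version(conf_text):
    """Extract config_version from the text of a corosync.conf."""
    match = re.search(r"^\s*config_version:\s*(\d+)", conf_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no config_version found")
    return int(match.group(1))


def is_stale(node_conf_text, cluster_version):
    """True when the node's config is older than the live cluster's."""
    return config_version(node_conf_text) < cluster_version
```

Here node_conf_text would be the contents of /etc/corosync/corosync.conf on the returning node, and cluster_version the Config Version reported by pvecm status on a healthy member.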

Fix

Step 1 — Stop corosync on the problem node

systemctl stop corosync

Verify it stopped:

systemctl status corosync

Expected output:

○ corosync.service - Corosync Cluster Engine
     Active: inactive (dead) since Thu 2026-05-07 17:40:45 BST; 48s ago

The web UI should recover almost immediately once the flood of rejected packets stops.

Step 2 — Check the corosync directory

ls -la /etc/corosync/

You'll likely see an authkey from the node's prior cluster membership:

drwxr-xr-x  3 root root 4096 Apr 25 17:21 .
-rw-r--r--  1 root root  256 Apr 25 16:14 authkey
-rw-r--r--  1 root root  639 Apr 25 17:21 corosync.conf

Don't manually delete it — pvecm add --force will handle it cleanly.

Step 3 — Rejoin the cluster

Run this from /tmp (pvecm refuses to run from inside /etc/pve/):

cd /tmp && pvecm add <lead-node-ip> --use_ssh --force

For example:

cd /tmp && pvecm add 10.140.3.10 --use_ssh --force

  • --use_ssh: uses existing SSH key trust instead of the API password prompt
  • --force: overrides warnings about existing config, authkey, and VMs (all expected for a rejoin)

You'll see output like:

detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests

WARNING : detected error but forced to continue!

copy corosync auth key
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1778172132.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'pve7' to cluster.

Step 4 — Verify

pvecm status

Healthy output looks like:

Cluster information
-------------------
Name:             pve
Config Version:   20
Transport:        knet
Secure auth:      on

Quorum information
------------------
Nodes:            5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          3 10.140.3.10
0x00000002          1 10.140.3.80
0x00000003          1 10.140.3.70 (local)
0x00000004          1 10.140.3.82
0x00000006          1 10.140.3.20

All nodes present, Quorate: Yes, config_version incremented by 2 (one increment per node during the join handshake — this is expected).
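The Quorum: 4 line follows directly from the vote counts: votequorum needs a strict majority of the expected votes. The sketch below reproduces that arithmetic with the votes from the membership table; the helper names are mine.

```python
# Votes per member, copied from the membership table above.
VOTES = {
    "10.140.3.10": 3,
    "10.140.3.80": 1,
    "10.140.3.70": 1,
    "10.140.3.82": 1,
    "10.140.3.20": 1,
}


def quorum_threshold(expected_votes):
    """Strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(online, votes=VOTES):
    """True when the online members together hold a majority of all votes."""
    present = sum(votes[name] for name in online)
    return present >= quorum_threshold(sum(votes.values()))


if __name__ == "__main__":
    print(quorum_threshold(sum(VOTES.values())))        # threshold for 7 votes
    print(is_quorate(["10.140.3.10", "10.140.3.80"]))   # 3 + 1 votes online
```

With this weighting the 3-vote node plus any one other member is enough for quorum, while the four 1-vote nodes together only just reach it.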

Why This Keeps Happening

Corosync is enabled by default and starts automatically on boot. If a node has been offline long enough to miss cluster config changes, it will always boot into this broken state. The node isn't malfunctioning — it's doing exactly what it's designed to do with the config it has. It's just that the config is stale.

Prevention: Before powering on a long-offline Proxmox node, check your current cluster's config_version with pvecm status. If it's significantly ahead of what the returning node last knew, plan for a rejoin rather than assuming it'll come back cleanly.

Quick Reference

Command Purpose
pvecm status Check cluster health and config version
systemctl status corosync Check corosync state on a node
grep config_version /etc/corosync/corosync.conf Check node's config version
systemctl stop corosync Stop the misbehaving corosync
cd /tmp && pvecm add <ip> --use_ssh --force Rejoin the cluster

15 April 2026

Kokoro Streaming Latency Investigation on Firefox

Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio

If you’ve been following along with my local AI setup, you’ll know I run most of my services within Proxmox VE LXC or Podman containers.  One of those is Kokoro, a self-hosted text-to-speech instance based on the Kokoro-82M ONNX model.  The audio generated is surprisingly good and the model is small so inference is possible and fast on a CPU.  There are fifty-nine voices covering a few languages.  An important point is that it exposes an OpenAI-compatible API that plugs straight into Open WebUI.

On its own this would not have warranted a blog post: it's a text-to-speech engine in a repo. But, no surprise, what started as a simple Firefox bug fix turned into a streaming pipeline investigation, with the usual benchmarks, agent-assisted code analysis, and a duplicate container as a sandbox. It concluded with a fix that meaningfully reduces time-to-first-audio for conversational use cases.


The Firefox Bug

It worked fine in Chrome, but clicking Generate Speech in Firefox produced an error.


The culprit was a single line in AudioService.js:

this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

It turns out Firefox's Media Source Extensions (MSE) implementation does not accept audio/mpeg as a SourceBuffer type. The fix was to test for support and fall back to a buffered path if MSE or the MIME type is not available:

if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
    await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
    return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src. It needs no MSE and works everywhere. Rather than rebuilding the image, the patched file was saved locally and injected using podman cp.

Benchmarking: Does Format or Voice Matter?

With the Firefox issue sorted, I ran a latency benchmark.  Then another and another across three formats (mp3, pcm and wav) and three voices.  The phrase was short and topical, since I'm seeking some consultancy:

“Hi Mediclinic, your EHR project sounds interesting and has the potential for a lot of impact.”

Three runs per combination, stream: false, measured with Python’s time.perf_counter().

By format (averaged across all voices)

Format Avg latency File size
WAV 1382 ms ~256 KB
PCM 1417 ms ~256 KB
MP3 1457 ms ~86 KB

By voice (averaged across all formats)

The choices were: English, Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese. 

Voice Description Avg latency
af_heart American English female 1379 ms
bm_fable British English male 1439 ms
ef_dora Spanish female 1438 ms

The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates; everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. MP3 encoding overhead was minimal, and its file size advantage makes it the right choice for web playback. The Spanish voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.

Can we go faster?

While reading the documentation (yes, it's a bad habit I developed when I was young), I spotted that the API has a stream: true parameter. For conversational applications I thought this would be useful and a simple switch to enable. You can probably guess that I was being naive: I assumed that with the flag enabled the server would stream the audio during generation, reducing perceived latency. It turns out that streaming works on sentences, so I split the test phrase to start with a nice short initial sentence:

“Hi Mediclinic.  Your EHR project sounds interesting and has the potential for a lot of impact.”

Then Claude wrote some Python to track exactly when each 1 KB chunk arrived at the client:

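import time
import urllib.request

# req is a urllib.request.Request for the speech endpoint, built beforehand (not shown)
t_start = time.perf_counter()
chunks = []  # (arrival time in ms since request start, bytes received)
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))

print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk:  {chunks[-1][0]}ms")
print(f"Chunks: {len(chunks)}")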

Results for stream: true, af_heart, MP3:

First chunk: 1462ms
Last chunk:  1464ms
Chunks: 89

All chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical, and so was PCM, which has zero encoder overhead. Eh? This doesn't seem right. What was going on? Was something buffering the audio before a single byte was sent?

The Rabbit Hole

I made a copy of the base container called kokoro-stream, on port 8881 as an isolated sandbox for Claude to play with.  The server code uses async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer, which is good practice.  The StreamingResponse even sets X-Accel-Buffering: no, so it should work.

Three hypotheses:


| ID | Hypothesis | Evidence for |
| --- | --- | --- |
| H1 | ONNX inference batches both sentences as one call | PCM (no encoder) also shows simultaneous delivery |
| H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields |
| H3 | PyAV MP3 encoder buffers early frames | Secondary: can't explain PCM behaviour |

What the code actually does

Inside tts_service.py, smart_split() splits the input text into chunks before inference, which is good. However, it batches sentences together when their combined token count is under 250 tokens. Guess what? The two-sentence test is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().

Inside kokoro_v1.py, the pipeline is called with split_pattern=r'\n+', meaning it only splits on newlines, not on full stops. And since there were no newlines, both sentences went through as a single inference call producing a single audio file. No amount of downstream async would fix that.

Even if the sentences had been processed separately, the for result in pipeline(...) loop is synchronous and never returns control to the asyncio event loop between sentences, so the HTTP layer has no opportunity to flush.

The Fix

Two changes:

inference/kokoro_v1.py 

Change the split pattern to include breaks on full stops:

# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'
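The effect of the two patterns on the test phrase is easy to see with re.split directly. This is only a stand-in for whatever the pipeline does internally with split_pattern; the helper name split_text is mine.

```python
import re

TEXT = ("Hi Mediclinic.  Your EHR project sounds interesting "
        "and has the potential for a lot of impact.")

NEWLINE_ONLY = r"\n+"             # the original default
SENTENCE_END = r"(?<=[.!?])\s+"   # split on whitespace that follows . ! or ?


def split_text(text, pattern):
    """Split text on pattern and drop any empty pieces."""
    return [part for part in re.split(pattern, text) if part]


if __name__ == "__main__":
    print(split_text(TEXT, NEWLINE_ONLY))
    print(split_text(TEXT, SENTENCE_END))
```

The newline-only pattern leaves the phrase as one piece, while the sentence pattern yields two. The lookbehind keeps the punctuation attached to its sentence, but it will also split after abbreviations such as "e.g.", which is a trade-off to be aware of.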

inference/kokoro_v1.py and services/tts_service.py 

Add yield points:

yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush

Time To First Audio (TTFA): before and after

Metric Before After
First chunk (TTFA) ~1400 ms ~575 ms
Last chunk ~1400 ms ~1400 ms
Gap ~2 ms ~1100 ms

The first audio now arrives after ~575 ms, while the second sentence is still being generated. The total generation time is unchanged, unsurprisingly, but the time to first audio is lower, and this is just what is needed for conversational applications, like calling several service centres to ask about the availability and cost of servicing a car. I was surprised that online systems here in South Africa don't show available slots; instead they are lead generation, and a human sends an email or calls you a few hours later.

Conclusion

A few things worth noting:

The architecture.  Kokoro uses async generators throughout, so the issue wasn’t bad design; it was two small configuration defaults that affect short inputs.  The token batching threshold (250 tokens) and the newline-only split pattern made sense in isolation, but together they eliminated sentence-level streaming for my test input.

PCM as a diagnostic tool.  Benchmarking PCM format (raw samples, no encoding) alongside MP3 was valuable: it identified and eliminated the audio encoder as a suspect early.  When PCM and MP3 show similar timings, the bottleneck is upstream of the encoder.

asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn’t actually sleep, it yields control to the event loop.  That’s enough for uvicorn to flush pending response bytes to the socket.  It’s a one-liner with impact on latency.
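A toy example makes the scheduling effect visible. It is a sketch, not Kokoro's code: producer stands in for the sentence loop and flusher for the HTTP layer that wants to send bytes, and both names are mine.

```python
import asyncio


async def producer(log, yield_control):
    """Stand-in for the sentence loop: each iteration 'generates' one chunk."""
    for i in range(3):
        log.append(f"chunk {i}")
        if yield_control:
            await asyncio.sleep(0)  # hand control back to the event loop


async def flusher(log):
    """Stand-in for the HTTP layer, which gets a turn only when others yield."""
    for _ in range(3):
        log.append("flush")
        await asyncio.sleep(0)


async def interleaving(yield_control):
    """Run both coroutines together and return the order in which they ran."""
    log = []
    await asyncio.gather(producer(log, yield_control), flusher(log))
    return log


if __name__ == "__main__":
    print(asyncio.run(interleaving(False)))
    print(asyncio.run(interleaving(True)))
```

Without the sleep, all three chunks are produced before the first flush happens. With it, chunks and flushes alternate, which is the behaviour the streaming fix relies on.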


Podman on Ubuntu 24.04. 

Kokoro image: ghcr.io/remsky/kokoro-fastapi-cpu:latest

Voices used: af_heart, bm_fable, ef_dora.

LiteLLM + Agent Teams: A Practical Guide

An aide memoire for using the local AI infrastructure day-to-day.


The big picture

You have three layers:

Your task (plain English)
        ↓
  Agent team (Python, OpenAI Agents SDK)
        ↓
  LiteLLM proxy  ←→  Ollama (local GPU)
                 ←→  OpenRouter (cloud free)
                 ←→  Anthropic (claude-haiku)

LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model is running locally on your GPU or fetched from a cloud provider. Your code never changes — only the model name string changes.

The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.


Part 1 — Using LiteLLM directly

From the command line (curl)

# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

From Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",   # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)

Choosing a model

Use case Model string Where it runs
Quick questions, triage qwen3.5:4b Local GPU (3.4 GB)
Writing code qwen2.5-coder:7b Local GPU (4.7 GB)
General analysis qwen3.5 Local GPU (6.6 GB)
Images / screenshots qwen3-vl Local GPU (6.1 GB)
Heavy reasoning nemotron-120b Cloud free (OpenRouter)
Reliable tool calling claude-haiku-4-5 Cloud (Anthropic/OpenRouter)
Best available free free Cloud free (auto-routed)

Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:

Alias Primary Fallback
fast qwen3.5:4b qwen2.5-coder:1.5b
coder qwen2.5-coder:7b qwen2.5-coder:1.5b
local qwen3.5 llama3.1
reasoning nemotron-120b gpt-oss-120b

Health check

curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs

Part 2 — Running the agent team

The one-liner

cd /home/user/claude/agents
.venv/bin/python team.py "your task here"

Example tasks

# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"

What happens under the hood

You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
  "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
  "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you

The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.

Watching it work

Add LITELLM_LOG=DEBUG to see every model call:

LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"

Or watch the LiteLLM proxy logs live in another terminal:

incus exec litellm -- journalctl -u litellm -f

Part 3 — Writing your own agents

Minimal single agent

import asyncio, os
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"]  = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())

Adding tools (things agents can do)

from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]   # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)

Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.

Handing off between agents

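import asyncio
from agents import Agent, Runner, handoff

# Assumes OPENAI_BASE_URL and OPENAI_API_KEY are set as in the minimal example above.
specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

async def main():
    # await is only valid inside a coroutine, so the run is wrapped in main()
    result = await Runner.run(orchestrator, "Analyse this data: ...")
    print(result.final_output)

asyncio.run(main())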

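handoff() is exposed to the orchestrator as a tool it can call. When it calls it, execution transfers to the specialist, which then owns the conversation and produces the final output unless it hands off again. For control to come back to the orchestrator, the specialist needs its own handoff back to it; alternatively, expose the specialist with as_tool() instead of handoff(), so the orchestrator stays in charge and receives the result as a tool output.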

The existing tools you can reuse

gpu_tools.py — for any agent that needs to know about the GPU:

from gpu_tools import vram_status, list_local_models, comfyui_status
agent = Agent(..., tools=[vram_status, list_local_models])

devops_tools.py — for agents that manage containers:

from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl
agent = Agent(..., tools=[container_run, http_probe])

Part 4 — Practical patterns

Pattern 1: Quick one-shot query

Use make_client() from litellm_client.py directly — no agent overhead:

from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

Pattern 2: Task with a deadline / retry limit

result = await Runner.run(agent, task, max_turns=10)

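max_turns caps the number of agent turns, which prevents infinite loops; if the cap is reached, Runner.run raises MaxTurnsExceeded. Note that it limits turns, not wall-clock time. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.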
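If you also want a wall-clock deadline, one option is to wrap the call in asyncio.wait_for. The sketch below is a minimal illustration using only the standard library: run_with_deadline is a hypothetical helper, and slow_task stands in for Runner.run(agent, task, max_turns=10) so the example runs without any model.

```python
import asyncio

async def run_with_deadline(coro, seconds):
    """Await coro; return None if it does not finish within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None

async def slow_task(delay, value):
    # Stand-in for `Runner.run(agent, task, max_turns=10)`
    await asyncio.sleep(delay)
    return value

async def main():
    fast = await run_with_deadline(slow_task(0.01, "done"), seconds=1.0)
    slow = await run_with_deadline(slow_task(1.0, "done"), seconds=0.05)
    return fast, slow

fast, slow = asyncio.run(main())
print(fast, slow)
```

wait_for cancels the wrapped coroutine on timeout, so a stalled run does not keep working in the background.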

Pattern 3: Streaming output

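from agents import Runner
from openai.types.responses import ResponseTextDeltaEvent

# run_streamed() returns a result object; iterate its stream_events()
result = Runner.run_streamed(agent, task)
async for event in result.stream_events():
    # raw_response_event carries the model's text deltas
    if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
        print(event.data.delta, end="", flush=True)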

Pattern 4: DevOps / automation agent

See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work

agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",   # must use haiku — local models can't do tool-calling
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)

Part 5 — Gotchas and tips

Local models can't do structured tool-calling

qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).

Only one large model fits in VRAM at a time

The RTX 4070 has 8 GB of VRAM. If the orchestrator hands off to a 6.6 GB local model while a 4.7 GB model is still loaded, the two cannot coexist (11.3 GB in total), so Ollama unloads the resident model first. Expect a ~5–15 second cold-load delay while the new model loads. This is normal.

Free cloud models are rate-limited

nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.
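The switch to an alternative can be automated. This is a minimal sketch of the idea, not code from the agent team: first_responder and fake_call are hypothetical names, and fake_call simulates a stalled model so the example runs offline. The timeout is shortened from minutes to milliseconds for the same reason.

```python
import asyncio

# Model names taken from the text above; the order is a judgment call.
FALLBACKS = ["nemotron-120b", "gpt-oss-120b", "qwen3-80b"]

async def first_responder(models, call, timeout_s):
    """Try each model in order; return (model, reply) from the first that answers in time."""
    for model in models:
        try:
            reply = await asyncio.wait_for(call(model), timeout=timeout_s)
            return model, reply
        except (asyncio.TimeoutError, ConnectionError):
            continue  # stalled or refused: move on to the next model
    raise RuntimeError("all models timed out or failed")

async def fake_call(model):
    # Hypothetical stand-in for a real chat-completion request.
    if model == "nemotron-120b":
        await asyncio.sleep(1.0)  # simulate a rate-limited queue
    return f"reply from {model}"

model, reply = asyncio.run(first_responder(FALLBACKS, fake_call, timeout_s=0.05))
print(model, reply)
```

In a real pipeline, call would wrap the chat-completion request and timeout_s would be closer to the two-minute threshold mentioned above.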

The free model alias changes

openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.

Ollama keep-alive

Models stay in VRAM for 15 minutes after last use (KEEP_ALIVE=15m). If you want to free VRAM immediately:

curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'
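The same request can be built from Python with the standard library. This sketch only constructs the request, so it runs without a live Ollama server; build_unload_request is a hypothetical helper name, and the host is the one from the curl command.

```python
import json
import urllib.request

OLLAMA_URL = "http://10.140.20.1:11434"  # host taken from the curl command above

def build_unload_request(model, base_url=OLLAMA_URL):
    """Build (but do not send) a request asking Ollama to unload `model` now."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_unload_request("qwen3.5")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```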

Part 6 — Agent Team in Open WebUI

The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.

Open WebUI chat
      ↓  (selects "Agent Team" model)
Pipelines server  (host: 10.140.20.1:9099)
      ↓
Agent orchestrator (claude-haiku)
      ↓  handoffs
Specialist agents (local GPU / cloud free)

Architecture files

File                                          Purpose
agents/pipelines/agent_team.py                The pipeline class that wraps the agent team
agents/run_pipelines.sh                       Manual start script
/etc/systemd/system/owui-pipelines.service    Systemd service (starts on boot)

Managing the pipelines server

sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f

Connecting to Open WebUI (one-time setup)

  1. Open http://localhost:3001
  2. Top-right avatar → Admin Panel
  3. Settings → Connections → Pipelines
  4. Add a connection with:
     • URL: http://10.140.20.1:9099
     • API Key: 0p3n-w3bu!
  5. Click Save. "Agent Team" now appears in the model picker

Using it

Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.

The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.

Adding more pipelines

Drop a new .py file with a Pipeline class into agents/pipelines/, then:

sudo systemctl restart owui-pipelines

The new pipeline appears as a model in Open WebUI immediately.
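As a rough guide to what such a file looks like, here is a minimal sketch modelled on the example scaffolds in the Open WebUI Pipelines project. The method names and the pipe() signature are an assumption based on those examples, so check them against the version you run; the echo behaviour is made up for illustration.

```python
from typing import Generator, Iterator, List, Union

class Pipeline:
    def __init__(self):
        # Name shown in Open WebUI's model picker
        self.name = "Echo Pipeline"

    async def on_startup(self):
        # Called when the pipelines server starts
        pass

    async def on_shutdown(self):
        # Called when the pipelines server stops
        pass

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Generator, Iterator]:
        # `messages` holds the conversation history passed in by Open WebUI
        turns = len(messages)
        return f"[{turns} messages in history] You said: {user_message}"

# Local smoke test, no server needed
p = Pipeline()
print(p.pipe("hi", "echo", [{"role": "user", "content": "hi"}], {}))
```

The agent team pipeline follows the same shape, with pipe() handing the message and history to the orchestrator instead of echoing it.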


Quick reference card

# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart

File map

/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY

CPU vs. GPU: Is Hardware Acceleration Always Faster for Real-Time TTS?

Following up on my last post about fixing progressive streaming in Kokoro FastAPI, I decided to take things a step further. If the goal is minimizing latency for a conversational AI assistant, shouldn't throwing a dedicated GPU at the problem make it even faster?

I spent the afternoon duplicating my streaming container and configuring it to run on a local NVIDIA GeForce RTX 4070 (8GB). The results were... surprising. It turns out that for real-time, sentence-by-sentence streaming, "faster" hardware doesn't always translate to a better user experience.


The Setup: Moving to Incus and CUDA

While my previous tests were in Podman, I've recently moved to Incus for better resource management. I duplicated the kokoro-stream container to a new sandbox named kokoro-stream-gpu and passed through the GPU:

incus config device add kokoro-stream-gpu mygpu gpu uid=1000 gid=1000
incus config set kokoro-stream-gpu nvidia.runtime true
incus config set kokoro-stream-gpu nvidia.driver.capabilities compute,utility,video

Inside the container, I switched the backend from the ONNX CPU runtime to the PyTorch GPU version. I also had to port over the same split_pattern and asyncio.sleep(0) fixes from the last session to ensure I was comparing apples to apples (sentence-level streaming vs. sentence-level streaming).


The Benchmark: Short vs. Long Form

I ran two tests using the British English male voice (bm_fable): one with a short two-sentence phrase (~90 chars) and one with the full text of my last blog post (~8,700 chars).

Metric                    CPU (ONNX)    GPU (RTX 4070)    Speedup
TTFA (Short Text)         ~557 ms       ~508 ms           1.1x
Total Time (Long Text)    ~289 s        ~15 s             19.2x
Throughput (Long Text)    ~30 char/s    ~580 char/s       19.2x
System RAM Usage          1.21 GiB      1.92 GiB          -
Video RAM (VRAM)          0 MB          ~850 MB           -
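The throughput and speedup rows follow directly from the character count and the total times. A quick sanity check using the rounded figures from this post (the variable names are just for illustration):

```python
def throughput(chars, seconds):
    """Characters synthesized per second."""
    return chars / seconds

CHARS = 8700            # approximate length of the long text
CPU_S, GPU_S = 289, 15  # rounded total times from the table

cpu_rate = throughput(CHARS, CPU_S)
gpu_rate = throughput(CHARS, GPU_S)
speedup = CPU_S / GPU_S

print(f"CPU {cpu_rate:.0f} char/s, GPU {gpu_rate:.0f} char/s, speedup {speedup:.1f}x")
```

With these rounded inputs the speedup comes out at about 19.3x; the small gap from the table's 19.2x is consistent with the table having been computed from unrounded timings.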

Reflections: When is the GPU worth it?

The results tell two very different stories depending on what you're doing.

1. Conversational AI (Short Sentences)

If you're building a real-time voice assistant that speaks one or two sentences at a time, the CPU is the better choice. Time to First Audio (TTFA) is virtually identical (~557 ms on CPU vs. ~508 ms on GPU), most likely because the fixed overhead of the GPU pipeline offsets its compute advantage on such short inputs. For this use case, the GPU is just an expensive way to use more memory: roughly 0.7 GiB of extra system RAM plus ~850 MB of VRAM.

2. Long-Form Content (Articles, Blog Posts)

This is where the RTX 4070 absolutely screams. When I threw the full 8,700-character blog post at it, the GPU version finished the entire synthesis in 15 seconds. The CPU version was still grinding away at nearly the 5-minute mark.

At 580 characters per second, the GPU isn't just "faster"—it changes the nature of the service. You can listen to an entire article almost as soon as you click "Generate."

The Verdict

  • Stick with CPU for: Open WebUI, chatbots, home assistants, and low-RAM servers.
  • Switch to GPU for: Audiobook generation, long-form reading, or high-concurrency environments.

The kokoro-stream-gpu container is now my go-to for "reading" long documentation, while the CPU version remains my daily driver for conversational chat.


The Evidence: Benchmarking Code

To keep things evidence-based, here are the two Python scripts used to capture these metrics. The first probes the streaming API and measures exactly when the first and last chunks arrive; the second saves the generated audio and hashes it.

1. Throughput & Latency Probe (benchmark_long.py)

import time
import requests

# Ports
GPU_URL = "http://localhost:8881/v1/audio/speech"
CPU_URL = "http://localhost:8882/v1/audio/speech"

# Load long text
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def run_benchmark(name, url):
    print(f"\n--- Benchmarking {name} ---")
    start_time = time.time()
    first_chunk_time = None

    payload = {
        "input": LONG_TEXT,
        "voice": "bm_fable",
        "response_format": "mp3",
        "stream": True
    }

    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk and first_chunk_time is None:
                first_chunk_time = time.time() - start_time

        total_time = time.time() - start_time

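    # Guard against an empty response so the rounding below cannot fail on None
    if first_chunk_time is None:
        raise RuntimeError(f"{name}: no audio chunks received")

    return {
        "ttfa_ms": round(first_chunk_time * 1000, 2),
        "total_s": round(total_time, 2),
        "char_s": round(len(LONG_TEXT) / total_time, 2)
    }

if __name__ == "__main__":
    # Benchmark the CPU container first, then the GPU one
    for name, url in [("CPU", CPU_URL), ("GPU", GPU_URL)]:
        print(run_benchmark(name, url))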

2. Evidence Audio Generation (generate_evidence.py)

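import requests
import hashlib

# Assumed to be the same input file as in benchmark_long.py
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def generate_and_hash(url, filename):
    """Synthesize LONG_TEXT, save the audio to filename, and return its MD5 hex digest."""
    r = requests.post(url, json={"input": LONG_TEXT, "voice": "bm_fable"})
    r.raise_for_status()  # fail loudly rather than hashing an error page
    with open(filename, "wb") as f:
        f.write(r.content)
    return hashlib.md5(r.content).hexdigest()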

# Results:
# CPU Hash: a22fe5e4d70a2888d755e0f8df7dae8f
# GPU Hash: e5ccba5c22ef3edf594aabaa2c08bb5f

Running Incus on Ubuntu 24.04. Hardware: NVIDIA GeForce RTX 4070 8GB. Frameworks: ONNX Runtime (CPU) vs. PyTorch 2.6+CUDA 12.4 (GPU).