17 June 2026

Speed vs Quality: Comparing LFM2, Qwen3, and DeepSeek-R1 on a Local APU

The Question

In a previous post I found that qwen3:30b-a3b runs at 12 TPS on my AMD APU (Ryzen 7 5800U, 64GB DDR4), beating every other model on the system by exploiting Mixture-of-Experts to keep active parameters low. Then lfm2:24b-a2b arrived and hit 17.8 TPS, about 48% faster, using a hybrid SSM+MoE architecture.

But tokens per second is only half the story. A model that answers faster but worse is not a better model. This post compares three architecturally distinct models on the same hardware across reasoning, synthesis, and analytical tasks.


What

Model | Architecture | TPS (Matt-Mini) | What makes it distinctive
lfm2:24b-a2b | Hybrid SSM+MoE | 17.8 | No KV cache (SSM), ~2B active params
qwen3:30b-a3b | MoE Transformer | 12.0 | 3B active params, built-in thinking mode
deepseek-r1:14b | Dense Transformer | 3.1 | Reinforcement-learning trained reasoner

All three run on Matt-Mini: AMD Ryzen 7 5800U, 64GB DDR4 (~50 GB/s bandwidth), AMD Vega 8 iGPU via Ollama's Vulkan backend.
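
The ordering in the TPS column follows from how many bytes each model has to read from memory per token. A rough sketch of that ceiling is below. It assumes decode is purely bandwidth-bound and that weights are stored at about 4.5 bits each; both are my assumptions rather than measurements, and the function name is hypothetical.

```python
def bandwidth_ceiling_tps(active_params_billion, bandwidth_gb_s=50.0, bits_per_weight=4.5):
    """Upper bound on tokens/sec if every active weight is read once per token."""
    # 1e9 parameters at (bits / 8) bytes each is that many GB per token.
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token

# Active parameter counts and measured TPS are taken from the table above.
for name, active_b, measured in [
    ("lfm2:24b-a2b", 2.0, 17.8),
    ("qwen3:30b-a3b", 3.0, 12.0),
    ("deepseek-r1:14b", 14.0, 3.1),
]:
    ceiling = bandwidth_ceiling_tps(active_b)
    print(f"{name}: ceiling ~{ceiling:.1f} TPS, measured {measured} TPS")
```

The measured figures sit well below these ceilings, which is expected for an idealised bound, but the ranking is the same: fewer active parameters means fewer bytes per token and more tokens per second.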


How

Test 1: Formal Logic (Einstein's Puzzle)

The classic five-houses logic puzzle — 15 clues, five attributes per house, one correct solution. This tests structured multi-step deduction: can the model hold state, backtrack when it hits a contradiction, and reach a definite answer?

LFM2:24b-a2b — 17.8 TPS

LFM2 engaged immediately with a clear systematic approach, correctly anchoring on the fixed clues (Norwegian in house 1, house 3 drinks milk, house 2 is blue), then working through colour placement. Crucially, when it hit a contradiction — incorrectly placing green at house 3 which conflicted with the coffee/milk clue — it identified the error and self-corrected:

"Wait — green house is house 3, drinks milk — but clue 5 says green house drinks coffee. Contradiction! So our earlier assumption must be wrong. Let's recheck green/white placement."

It then correctly revised to green=4, white=5, and continued building the solution. At 2000 tokens it was still mid-deduction (the puzzle typically requires 2500–3500 tokens to complete), but the reasoning quality was coherent throughout with no hallucinated leaps. The self-correction under contradiction is the key capability here — many smaller models simply assert a wrong answer.

Qwen3:30b-a3b — 12.0 TPS

Qwen3's thinking mode is enabled by default. Given a 2000-token budget, the model consumed all 2000 tokens in internal reasoning and produced no visible output. This is not a failure of reasoning — it is a consequence of extended chain-of-thought: Qwen3 thinks before it writes, and for a complex puzzle, that thinking budget was exhausted before the answer appeared.

The practical implication: At 12 TPS, giving Qwen3 enough tokens to complete a hard logic problem (say 4000 tokens total: 2000 thinking, 2000 answer) means waiting ~5–6 minutes in total, with nothing visible on screen for roughly the first three while the model thinks. For interactive use, this requires either disabling thinking mode (/no_think suffix or think: false in the API) or accepting the latency.

DeepSeek-R1:14b — 3.1 TPS

Same issue, worse TPS. At 3.1 TPS, 4000 tokens takes about 21.5 minutes. DeepSeek-R1 is a dedicated reasoning model, and the chain-of-thought is the point, but on bandwidth-constrained hardware the combination of dense architecture (no MoE savings), slow TPS, and long thinking chains makes it painful for interactive reasoning tasks. It is the right tool for problems where correctness matters more than speed and you are willing to wait.


The Thinking-Model Problem on Slow Hardware

This test uncovered a fundamental tension that does not appear in benchmark numbers:

Thinking models have a token overhead that multiplies with your hardware's latency.

On a GPU running at 50 TPS, Qwen3 spending 1000 tokens on internal reasoning costs 20 seconds. On an APU at 12 TPS, the same thinking costs 83 seconds, and you see nothing during that time. DeepSeek-R1 at 3.1 TPS spending 2000 tokens thinking costs about 10.8 minutes of silence.
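
The arithmetic is simply hidden tokens divided by generation speed. A minimal sketch, using a hypothetical helper name and the token counts and speeds quoted above:

```python
def silent_wait_seconds(thinking_tokens, tokens_per_second):
    """Seconds spent generating hidden reasoning before any answer text appears."""
    return thinking_tokens / tokens_per_second

cases = [
    ("GPU at 50 TPS, 1000 thinking tokens", 1000, 50.0),
    ("qwen3:30b-a3b at 12 TPS, 1000 thinking tokens", 1000, 12.0),
    ("deepseek-r1:14b at 3.1 TPS, 2000 thinking tokens", 2000, 3.1),
]
for label, tokens, tps in cases:
    wait = silent_wait_seconds(tokens, tps)
    print(f"{label}: {wait:.0f} s ({wait / 60:.1f} min)")
```

Because the wait scales inversely with TPS, the same thinking budget that is a short pause on a fast GPU becomes a multi-minute wait on the APU.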

This creates a counterintuitive outcome: LFM2, which does not use extended chain-of-thought by default, feels smarter in interactive use on slow hardware — not because it reasons better, but because it shows its work in real-time rather than batching it invisibly. The perceived quality advantage is partly architectural and partly latency psychology.

For batch or automated tasks (where you submit a prompt and retrieve the answer later), thinking models regain their advantage. For interactive research assistance where you are iterating on prompts, fast transparent reasoning often beats slow invisible reasoning.


Test 2: Research Synthesis

Prompt: Advise on running the best LLM for general-purpose research on an AMD APU (Ryzen 5800U, 64GB DDR4, ~50 GB/s). Compare (a) large MoE with few active params, (b) hybrid SSM+MoE, (c) small dense transformer. Cover speed, quality, context handling, and give a recommendation.

Note: Qwen3 and DeepSeek-R1 were run with thinking disabled (think: false) so their responses are direct rather than chain-of-thought.

LFM2:24b-a2b — 17.8 TPS

LFM2 gave a well-structured response covering all three categories with accurate technical descriptions. On MoE it correctly identified the active-parameter advantage and low-bandwidth benefit. On SSM+MoE it correctly described the KV cache elimination benefit. On dense transformers it correctly explained the bandwidth bottleneck.

One notable weakness: it mentioned "AMD XDNA or ROCm kernels for optimal speed" — technically inaccurate for an Ollama/Vulkan setup. It conflated the architecture's theoretical requirements with platform-specific details it was uncertain about. This is a common failure mode in synthesis tasks: confident-sounding specifics that are slightly wrong.

Qwen3:30b-a3b — 12.0 TPS (thinking disabled)

With thinking disabled, Qwen3 produced a response that read as an externalised internal monologue — thinking-as-prose rather than a structured answer. It walked through the problem step-by-step in first person ("Let me break this down... Wait, the user said..."), ultimately recommending a small dense model like Mistral 7B.

This recommendation is wrong for this specific hardware, and interestingly, Qwen3 is itself running on this hardware. It gave generic advice calibrated to typical CPU inference bottlenecks rather than the actual measured results. It also suggested llama.cpp "15–25 TPS on a 5800U for a 7B Q4_K_M model" — a reasonable estimate for CPU-only inference, but ignoring the iGPU Vulkan backend that gives the measured 12 TPS for a 30B MoE model.

The knowledge cutoff problem: Qwen3 has no training data on LFM2 or the specific benchmark results for this hardware. It gave the best answer it could from general principles, and those principles pointed it toward the wrong conclusion. The model recommended a 7B dense model when the empirical data shows a 24B hybrid SSM+MoE outperforms it by a significant margin on this exact configuration.

DeepSeek-R1:14b — 3.1 TPS (thinking disabled)

DeepSeek-R1's response was clean and well-structured, but contained a clear hallucination: it described SSM as "Switched Sparse Memory" — a fabricated expansion of the acronym. SSM stands for State Space Model, a fundamentally different concept.

Its recommendation, a large MoE transformer, was directionally correct, but the justification it cited, that "~50 GB/s is high bandwidth", is precisely backwards: 50 GB/s is the bottleneck on this hardware, not an advantage. The model interpreted the bandwidth figure in absolute terms rather than relative to the model's memory requirements.

Despite these errors, the structural reasoning was sound: MoE's active-parameter reduction is the right lever to pull on bandwidth-constrained hardware. It got to the right answer via partially wrong reasoning.


Test 3: Critical Analysis

Prompt: Analyse the claim: "Q4 always reduces model quality compared to Q8, so for serious research you should always use Q8." Give a rigorous assessment.

This is a well-defined analytical task with a clear correct answer. All three models had relevant training data. This is where quality differences should be most visible.

LFM2:24b-a2b — 17.8 TPS

LFM2 identified three flaws in the claim:

  1. Universal claim without qualification — Q4 vs Q8 impact depends on architecture, task, calibration, and hardware. "Always" is unjustified.
  2. Ignores hardware acceleration — AVX_VNNI provides INT8 dot product acceleration; on AVX_VNNI CPUs, Q8_0 carries no compute overhead, changing the tradeoff entirely.
  3. Task-specific sensitivity — Coarse reasoning is less sensitive to precision loss than fine-grained factual recall.

The hardware-specific point about AVX_VNNI is precisely the kind of nuance that matters for the actual decision. This was accurate and practically useful.

Qwen3:30b-a3b — 12.0 TPS (thinking disabled)

Qwen3 gave the most thorough response of the three, leading with the logical structure: the word "always" is a universal quantifier that is falsified by a single counterexample. It cited concrete benchmark evidence — Q4 on Llama 3 and Mistral models incurs less than 1% absolute accuracy loss on MMLU vs Q8 — and identified multiple factors the claim ignores: quantisation-aware training, task sensitivity, model architecture, and implicit regularisation effects.

It also identified a counterintuitive scenario: some models fine-tuned with quantisation-aware training show no significant degradation at Q4, or in edge cases slight improvements due to regularisation. This is a more complete analysis than LFM2's.

The response was cut at 1000 tokens mid-sentence, suggesting there was more to come. The quality of what was produced was high.

DeepSeek-R1:14b — 3.1 TPS (thinking disabled)

DeepSeek-R1 produced the most concise response. It correctly identified the binary framing as the flaw (Q4 offers 16 values, Q8 offers 256 — but this alone doesn't determine quality impact), and noted that the practical effect depends on the model and task. The response was shorter and less detailed than the others, but accurate within its scope.

At 3.1 TPS, the time cost of a thorough analysis at DeepSeek-R1's depth is high. For tasks where analysis quality scales with thoroughness, the slow TPS compounds against you.


Why?

On factual accuracy

All three models made errors on the synthesis task — LFM2 got a platform detail wrong, Qwen3 gave the wrong recommendation, DeepSeek-R1 hallucinated an acronym expansion. The analysis task showed higher accuracy across all three, because the claim being analysed is well within their training distribution. Models are more reliable on tasks that resemble their training data. Novel hardware configurations and cutting-edge model architectures fall outside that zone.

On reasoning quality

For the logic puzzle, LFM2's transparent step-by-step reasoning with explicit self-correction was more useful interactively than Qwen3's silent exhausted thinking budget. On the analysis task, Qwen3 produced the most thorough and structured response when thinking was disabled — the base model quality is high when it actually produces output.

On the thinking-mode tradeoff

Thinking mode is a quality multiplier. But on hardware where generation is slow, it is also a latency multiplier — and one that applies silently before you see any output. The practical rule for APU-class hardware:

  • Interactive use: Disable thinking (think: false or /no_think). You get the answer faster, and for most tasks the quality loss is modest.
  • Batch / overnight analysis: Enable thinking and set a high token budget. You submit the job, come back later, get a more thorough answer.

LFM2 sidesteps this entirely: its architecture doesn't separate thinking from output, so you see the reasoning in real-time as it generates.


Summary: When to Use Each Model

Scenario | Best choice | Why
Interactive research, iterative queries | lfm2:24b-a2b | Fastest, transparent reasoning, good synthesis
Hard reasoning, batch mode | deepseek-r1:14b | Dedicated reasoner; worth the wait
Thorough structured analysis, batch mode | qwen3:30b-a3b (thinking enabled) | Best analysis quality when given token budget
Code generation | qwen3-coder:30b-a3b-q4_K_M | Specialised fine-tune, 12 TPS
Long document analysis | lfm2:24b-a2b | SSM avoids KV cache growth at long context
Uncensored / sensitive research | qwen3.5-abliterated:35b-a3b-q4_K | No guardrails, 4.65 TPS
Quick simple queries | qwen3:8b | 5.3 TPS, low overhead

The key finding

Speed and quality are not independent on bandwidth-constrained hardware. A model that is four times slower doesn't just cost you time — it changes the nature of the interaction. Thinking modes that are near-free on a GPU become minutes-long commitments on an APU, turning iterative exploration into batch processing. LFM2's hybrid SSM+MoE architecture produces the best combination of speed and quality for interactive use on this hardware, not because it is the most capable model in isolation, but because it delivers its capability at a speed that keeps research workflows fluid.


Appendix: Hardware and Setup

Matt-Mini:
- CPU: AMD Ryzen 7 5800U (Zen 3, no AVX_VNNI)
- iGPU: AMD Radeon Vega 8 (shared DDR4, ~50 GB/s)
- RAM: 64GB DDR4-3200
- Inference: Ollama 0.20.6 + Vulkan backend

API note: Thinking can be disabled per-request via "think": false in the Ollama generate payload, or by appending /no_think to the prompt for Qwen3 models. DeepSeek-R1 respects think: false at the API level.
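
As a sketch of what that request can look like from Python, the snippet below builds a non-streaming /api/generate body with thinking disabled and a token cap. The URL assumes Ollama's default local port, and the helper names, prompt text, and 1800-second timeout are my own choices, not part of the original test harness.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local endpoint

def build_payload(model, prompt, think=False, max_tokens=2000):
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": think,                          # False skips the hidden reasoning phase
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }

def generate(payload, url=OLLAMA_URL, timeout=1800):
    """POST the payload to Ollama and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))

payload = build_payload("qwen3:30b-a3b", "Summarise the Q4 vs Q8 tradeoff.")
print(json.dumps(payload, indent=2))
```

With an Ollama server running, generate(payload)["response"] holds the answer text. The long timeout is deliberate: at APU speeds a full-length reply can take several minutes.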

Tested April 2026.

A throwaway Linux browsing desktop in an LXC over RDP (and why I gave up on Chrome)

I wanted a disposable little Linux desktop I could reach from a Windows machine over RDP, purely for web browsing — the sort of thing you spin up, point at the internet, and don't much care about. It lives as an unprivileged LXC container on one of my Proxmox boxes. "How hard can it be?" I thought, which as ever should have been my first warning.

It turned out to be straightforward in every respect bar one — the browser — and the browser is rather the point. So this is partly a build note and partly a cautionary tale about Chrome and containers.

The plan

  • An unprivileged LXC (Ubuntu 24.04) on Proxmox, on the LAN via DHCP.
  • A lightweight desktop, reachable over RDP from Windows' built-in Remote Desktop client.
  • A browser. I reached for Google Chrome out of habit.

Unprivileged is the right default here — it's a browser box I'd rather not hand the host's keys to. Keep that word "unprivileged" in mind, because it's where the trouble lives.

Getting the desktop up

The container itself is the easy bit:

pct create 9006 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname penny-desktop --cores 4 --memory 4096 --swap 512 \
  --rootfs local-lvm:30 --net0 name=eth0,bridge=vmbr0,ip=dhcp,firewall=1 \
  --unprivileged 1 --features nesting=1

(nesting=1 is worth having — it lets the in-container systemd and a few desktop bits behave themselves.)

Then a desktop and an RDP server. I started with XFCE and xrdp, which is the classic pairing. Here's the first gotcha, and it's a quiet one: on a minimal Ubuntu 24.04 you also need the Xorg backend for xrdp explicitly, or sessions die the instant you connect with a blank grey screen and nothing else:

apt-get install -y xorg xorgxrdp

Without that, /usr/lib/xorg/Xorg simply isn't present, xrdp can't start an X server, and the session collapses. The sesman log spells it out if you go looking — Error starting X server on display 10 — but the symptom from the Windows end is just an empty desktop with no panel and no right-click menu, which is maddeningly uninformative.

A small XFCE annoyance worth knowing about

XFCE 4.18 (which is what 24.04 ships) won't run a .desktop launcher off the desktop until you mark it "trusted", and the only blessed way to set that flag is GIO metadata — which needs a running gvfs daemon that the minimal image doesn't have. So your nice Chrome icon sits there doing absolutely nothing when you click it. You can work around it with a plain executable shell script instead, but at this point I decided XFCE wasn't earning its keep and switched to MATE, which behaves like a Windows user expects (taskbar at the bottom, launchers that just launch) and has none of the trust faff. caja, MATE's file manager, wants ~/.config/caja to exist and be writable — create it and chown it to your user or you'll get a grumble on login.

So far, so fixable. And then the actual browser.

The part where Chrome quietly refuses to work

Chrome installed fine. It launched fine. And then every single page reported the machine was offline — "This site can't be reached", the lot — despite the container having perfectly good internet. This is the bit that cost me the evening.

The maddening thing is that the container itself was demonstrably online:

ping -c2 google.com      # fine, 8ms
curl -sI https://google.com   # HTTP/2 200, no problem at all

Both as root and as the desktop user. DNS resolved, TLS handshook, the lot. But Chrome — and only Chrome — was convinced it had no network. A headless --dump-dom confirmed it: the page came back with class="offline", Chrome's internal "there is no internet" state, not a TLS error or a DNS error. It wasn't failing to load a page; it had decided the network didn't exist.

I did what one does. I threw flags at it:

  • --no-sandbox — the usual incantation for Chrome in a container.
  • --disable-setuid-sandbox
  • --disable-dev-shm-usage — in case /dev/shm was too small (it wasn't, it was 63GB).
  • --disable-features=NetworkServiceSandbox

None of it helped. I even went as far as setting the container's AppArmor profile to unconfined, which did look like it might be heading somewhere — but that's a real security downgrade on a box whose entire job is to talk to the open internet, and that's precisely the wrong direction. The moment I typed it I knew I was bodging.

Stepping back

Here's the thing I should have seen sooner. Every single problem in that last hour traced to one cause: Chrome is about the hardest browser there is to run inside an unprivileged LXC.

Chrome isn't one process. It's a browser process, a GPU process, a network service process, and a renderer per tab — and each runs under its own seccomp-bpf syscall filter. That's lovely for security on a normal desktop and a genuine nuisance inside a container that also applies a seccomp filter. The two filters interact, and the network service ends up unable to make the syscalls it needs. And — this is the crucial bit — --no-sandbox only relaxes the renderer sandbox. The network service keeps its own, which is exactly why connectivity stayed broken no matter how many flags I added. I was relaxing the wrong sandbox.

You can beat it into submission with a privileged container or an unconfined AppArmor profile, but then you've thrown away the very thing (unprivileged isolation) that made this a sensible idea. For a turnkey browsing box that's a bad trade.

The fix: stop using Chrome

Firefox has none of this. Its process model is lighter, its sandbox degrades gracefully rather than refusing to network, and it runs in an unprivileged LXC with no flags, no AppArmor downgrade, no /dev/shm trickery — nothing. It just works.

One wrinkle: don't take Ubuntu's firefox package, because on 24.04 it's a transitional stub that installs Firefox as a snap, and snapd's own confinement is its own special headache inside an unprivileged container. Pull the real .deb straight from Mozilla and pin apt to prefer it:

install -d -m 0755 /etc/apt/keyrings
wget -q https://packages.mozilla.org/apt/repo-signing-key.gpg \
  -O /etc/apt/keyrings/packages.mozilla.org.asc
echo "deb [signed-by=/etc/apt/keyrings/packages.mozilla.org.asc] https://packages.mozilla.org/apt mozilla main" \
  > /etc/apt/sources.list.d/mozilla.list
printf 'Package: *\nPin: origin packages.mozilla.org\nPin-Priority: 1000\n' \
  > /etc/apt/preferences.d/mozilla
apt-get update && apt-get install -y firefox

I then reverted the AppArmor change (back to a clean, ordinary unprivileged container), removed Chrome entirely, and dropped Firefox's stock .desktop file onto the desktop — which, this being MATE, launches when you click it like a civilised thing.

To be sure I wasn't handing over another dud, I took a headless screenshot from inside the container before declaring victory:

firefox --headless --screenshot /tmp/test.png https://www.google.com

And there was Google's cookie-consent page, in English, served from the UK. Online. Et voilà.

In short

Symptom | Cause | Fix
Blank grey desktop on RDP connect | xrdp has no Xorg backend | apt install xorg xorgxrdp
Desktop icon won't launch (XFCE) | 4.18 "trusted launcher" + no gvfs | Use MATE, or a plain executable script
caja complains on login | ~/.config/caja missing/unwritable | create it, chown to the user
Chrome says "offline" though curl works | network-service seccomp sandbox vs LXC | use Firefox

The real lesson isn't about any one flag — it's that if you find yourself disabling security layer after security layer to make a tool work, the tool is probably the wrong one. Chrome is magnificent on a real desktop and a poor houseguest in an unprivileged container. Firefox is the turnkey answer here, and the box has been quietly behaving itself ever since.

I hope this saves someone an evening. Ta ta for now.

26 May 2026

Where Does Your LLM Actually Live? Model Quantisation, File Formats, and the GPU/RAM Memory Trap

If you've spent any time running large language models locally, you've probably heard terms like AWQ, GGUF, EXL3, vLLM, and ExLlamaV2 thrown around — often without much explanation of how they relate to each other, or why choosing the wrong combination can make your model five times slower than it needs to be.

This post aims to fix that. We'll cover what a model actually is in memory terms, how quantisation changes its footprint, which file formats carry which quantised models, which inference engines speak which formats, and — most importantly — the often-misunderstood question of where the model actually lives when it's running, and why a mixture of GPU and CPU is usually the worst outcome rather than a useful compromise.


What a Model Is in Memory

A language model is, at its core, a large collection of floating-point numbers called weights. A 9 billion parameter model has roughly 9 billion of these numbers. Each one, stored at full precision (FP32), occupies 4 bytes — so a raw 9B model would need about 36 GB of storage and memory. In practice, models are stored and loaded in 16-bit formats (BF16 or FP16), halving that to around 18 GB.

18 GB is already more than most consumer GPUs can hold. A typical gaming GPU has 8–16 GB of VRAM. This is where quantisation comes in.
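
The sizes above are just parameter count times bytes per weight. A minimal sketch, with a hypothetical function name and 1 GB taken as 10^9 bytes:

```python
def weights_gb(params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("BF16/FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weights_gb(9e9, bits):.1f} GB")
```

Real quantised files come out somewhat larger than the nominal figure, because scale factors and any tensors kept at higher precision add overhead.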


Quantisation: Trading Precision for Space

Quantisation reduces the number of bits used to store each weight. The key insight is that neural networks are surprisingly tolerant of reduced precision — the quality loss from moving from 16-bit to 4-bit is often small enough to be irrelevant for practical use, while the memory saving is dramatic.

The main quantisation levels

Precision | Bits per weight | 9B model size | Quality loss
FP16/BF16 | 16 | ~18 GB | None (reference)
FP8 | 8 | ~9 GB | Near-zero
INT8 / Q8_0 | 8 | ~9 GB | Minimal
INT4 / Q4 | 4 | ~5–6 GB | Small but noticeable
3-bit | 3 | ~4 GB | Moderate

4-bit quantisation is currently the practical sweet spot for most consumer hardware: a 9B model fits comfortably in an 8 GB GPU, and quality remains good enough for coding, writing, and reasoning tasks.

It's not just about bit width

The method of quantisation matters as much as the bit width. Two 4-bit models of the same architecture can have meaningfully different output quality depending on how the quantisation was performed:

  • AWQ (Activation-aware Weight Quantization): Calibrates the quantisation using sample inputs, preserving weights that are most sensitive to rounding. Groups of 128 weights share a scale factor.
  • GPTQ: Uses the inverse Hessian to minimise quantisation error block by block. Doesn't account for activation magnitudes, so typically slightly lower quality than AWQ at the same bit width.
  • EXL3 (ExLlamaV3 format; EXL2 is the older ExLlamaV2 format): Operates at the individual row level, solving for the optimal bit allocation per row to minimise output error. Can assign more bits to sensitive rows and fewer to robust ones. At 4 bits per weight, EXL3 typically outperforms both AWQ and GPTQ in measured perplexity.
  • GGUF quantisation (Q4_K_M, Q5_K_M, etc.): The K variants (k-quants) group weights into super-blocks whose per-block scales are themselves quantised, with the _M (medium) suffix indicating a mixed strategy: tensors deemed more important get higher precision. Well-calibrated and widely tested.
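
To make "a group of weights sharing a scale factor" concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantisation with groups of 128. It is deliberately not AWQ, GPTQ, EXL3, or any library's implementation: there is no calibration step, the weights are random stand-ins, and all names are illustrative.

```python
import random

def quantise_group(weights, bits=4):
    """Symmetric round-to-nearest quantisation of one group sharing a scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed codes
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantise_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def quantise(weights, group_size=128, bits=4):
    """Split the weights into fixed-size groups and quantise each separately."""
    return [
        quantise_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # stand-in weights
groups = quantise(weights)
restored = [w for codes, scale in groups for w in dequantise_group(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, worst rounding error {max_error:.6f}")
```

The worst-case error per weight is half the group's scale, which is why one outlier in a group hurts every other weight in it. The calibrated methods above differ mainly in how they choose scales or bit allocations to limit that damage.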

File Formats: The Container Around the Weights

Quantised weights are packaged in different file formats, each tied to a particular ecosystem.

GGUF

The format used by llama.cpp and everything built on it (Ollama, LM Studio, Jan). A GGUF file is self-contained: it includes the weights, the model architecture metadata, and tokenizer data in a single file.

GGUF supports a wide range of quantisation levels: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and many more. It's the most portable format — the same file runs on a CPU, a GPU, or a mixture of both.

Safetensors (HuggingFace format)

The standard format for HuggingFace model repositories. Models in AWQ or GPTQ quantisation are typically distributed as collections of .safetensors files alongside a config.json. This format is used by vLLM, transformers, and most Python-based inference stacks.

EXL3 / EXL2

The ExLlama project's native formats. EXL2 belongs to ExLlamaV2; EXL3 is the current generation and belongs to its successor, ExLlamaV3. These are also safetensors files under the hood, but with ExLlama-specific quantisation data embedded. They cannot be loaded by vLLM or standard transformers; they require the matching ExLlama runtime.


Inference Engines: Who Speaks What

The inference engine is the software that actually loads the weights and runs the forward pass to generate tokens. Each engine has its own strengths, limitations, and supported formats.

Ollama

Built on llama.cpp. Supports GGUF only. Easiest setup — run ollama pull model-name and it downloads and serves the model immediately. Best for quick local use, development, and simple API access. Not designed for high-throughput serving or very long contexts.

vLLM

A production inference server designed for high-throughput serving of many concurrent users. Supports HuggingFace safetensors format, including AWQ, GPTQ, FP8, and unquantised models. Provides an OpenAI-compatible API. Has sophisticated memory management for long contexts (paged attention, chunked prefill).

Best suited for: serving multiple users simultaneously, very long context windows, production deployments.

Not suited for: models distributed only in ExLlama formats (EXL2, EXL3), or single-user interactive use where its multi-request optimisations add overhead rather than help.

ExLlamaV2 / tabbyAPI

ExLlamaV2 is a CUDA inference library with custom kernels tuned for low-batch (single-user) decode. tabbyAPI wraps it in an OpenAI-compatible HTTP server. Supports EXL3 (through the successor library, ExLlamaV3), EXL2, GPTQ, and some GGUF.

For single-user interactive use, ExLlamaV2 is often faster than vLLM because vLLM is optimised for batched requests. ExLlamaV2's kernels are specifically tuned for the batch-size-1 case that dominates personal use.

transformers (HuggingFace)

The reference implementation. Supports almost everything, but is the slowest option in production because it lacks the custom CUDA kernels of the specialised engines. Useful for research, fine-tuning, and running models before optimised backends exist.


The Format-to-Engine Matching Table

You have | Use this engine
GGUF (Q4_K_M, Q5_K_M, etc.) | Ollama or llama.cpp directly
AWQ safetensors | vLLM
GPTQ safetensors | vLLM or ExLlamaV2
EXL3 / EXL2 | ExLlama (V3 for EXL3, V2 for EXL2) / tabbyAPI only
FP8 safetensors (official Qwen FP8 etc.) | vLLM
Unquantised BF16 safetensors | vLLM or transformers

Trying to use the wrong engine with a given format either fails outright or forces a slow conversion at load time. The pairing matters.


Where the Model Actually Lives: The Critical Question

This is where most guides go wrong by omission. A model's performance is determined not just by its quantisation, but by where its weights reside when the forward pass runs.

The three scenarios

Scenario 1: All weights in GPU VRAM

This is the ideal case. The GPU's memory bandwidth — typically 200–900 GB/s depending on the card — feeds weights to the compute cores without any external bottleneck. Token generation is fast.

For a 9B model at 4-bit AWQ (~5.5 GB), an 8 GB GPU holds all the weights comfortably with room left for the KV cache. Decode speed on an RTX 4070 (8 GB) is 20+ tokens per second.

Scenario 2: All weights in CPU RAM

When a model is too large for VRAM and you configure the inference engine to run entirely on CPU, the CPU's memory subsystem handles everything. Modern DDR5 provides 80–100 GB/s bandwidth, which is slower than GPU memory but consistent. A full CPU inference run on a well-quantised 9B model at Q4 typically yields 3–8 tokens per second depending on the CPU.

Crucially: modern Intel CPUs with AVX_VNNI (like the Intel Core Ultra 7 series) have native INT8 dot product instructions. This means Q8_0 (8-bit quantisation) computes at nearly the same speed as Q4_K_M on these CPUs — the extra compute cost of INT8 is offset by the hardware acceleration. You get meaningfully better quality for free.

Scenario 3: Weights split across GPU and CPU RAM (the mixed case)

When a model is larger than VRAM, most inference engines will automatically offload some layers to CPU RAM and keep the rest on GPU. This sounds like a reasonable compromise. In practice, it is almost always the worst outcome.

Here's why. The forward pass through a transformer runs layers sequentially. If some layers are on the GPU and some are on the CPU, the computation must cross the PCIe bus at every GPU-CPU boundary:

GPU layer → compute (fast, ~hundreds of GB/s VRAM)
    ↓
PCIe transfer (bottleneck: ~32 GB/s in both directions)
    ↓
CPU layer → compute (slower, but not the problem)
    ↓
PCIe transfer back
    ↓
GPU layer → compute...

PCIe Gen 4 x16 has a practical throughput of around 28–32 GB/s. Every token generated requires transferring the activations across this bus at every layer boundary. For a 9B model split 50/50, this happens dozens of times per token. The result: decode speed collapses to around 3 tokens per second — slower than running fully on CPU, and slower than running a smaller model fully on GPU.

The empirical evidence is stark. On an Intel Ultra 7 + RTX 4070 8GB machine:

Configuration | Model | Tokens/sec
All in GPU VRAM | Qwen3-8B Q4 | 20+ tok/s
Split GPU+CPU | Qwen3.5-27B Q4 | ~3 tok/s
Fully CPU | Q8_0 9B (AVX_VNNI) | ~4–5 tok/s

The 27B model split across GPU and CPU is slower than the 9B model running fully on CPU, and far slower than the 8B model running fully in VRAM, despite using the GPU. The GPU is largely wasted: it spends most of its time waiting for PCIe transfers.

A special case: MoE models with expert offloading

Mixture-of-Experts (MoE) models introduce a nuance. Models like Qwen3.5-35B-A3B have 35 billion total parameters, but only about 3 billion are active on any given forward pass — the MoE routing selects a small subset of "expert" networks per token.

When the expert weights are offloaded to CPU RAM (via vLLM's --cpu-offload-params experts), only the active experts are transferred per token, not the full parameter set. This reduces the PCIe burden dramatically compared to a dense model. In practice, a 35B MoE model running on an 8 GB GPU with experts offloaded to RAM achieves 5–7 tokens per second — competitive with a smaller dense model entirely in VRAM.

This works because MoE expert routing selects only 8 of 256 experts per token. The PCIe transfer is proportional to the active parameter count, not the total. Dense models have no such relief — all weights are active every token, making the PCIe cost unavoidable.
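
A back-of-envelope sketch of why the active-parameter count matters so much here. It assumes 4-bit weights, a 30 GB/s PCIe link (the middle of the range quoted earlier), and the worst case in which every active parameter is fetched across the bus for each token. All three are my simplifying assumptions, and the function name is hypothetical.

```python
def per_token_transfer(param_billion, bits_per_weight=4.0, pcie_gb_s=30.0):
    """Return (GB moved, seconds) if these weights cross PCIe once per token."""
    gigabytes = param_billion * bits_per_weight / 8
    return gigabytes, gigabytes / pcie_gb_s

expert_share = 8 / 256                           # experts routed per token, from the text
active_gb, active_s = per_token_transfer(3.0)    # ~3B active parameters
total_gb, total_s = per_token_transfer(35.0)     # all 35B parameters

print(f"experts used per token: {expert_share:.1%}")
print(f"active only: {active_gb:.2f} GB, {active_s * 1000:.0f} ms per token")
print(f"everything:  {total_gb:.2f} GB, {total_s * 1000:.0f} ms per token")
```

Under these assumptions the bus cost per token drops by the ratio of total to active parameters, roughly 12x for this model. Measured speeds will be lower than the transfer time alone suggests, because compute and routing also take time.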


Practical Decision Guide

When choosing how to run a model locally, the decision tree looks like this:

Does the quantised model fit in your GPU VRAM?
→ Yes: run it in VRAM. Use the best engine for your format.
→ No: continue below.

Is it a dense model (standard transformer)?
→ If it exceeds VRAM by a small margin: consider a smaller or more aggressively quantised version that fits. A Q4_K_M 9B fully in VRAM beats a Q4_K_M 14B split across GPU and CPU.
→ If part of it would have to run on CPU anyway: set the engine to use zero GPU layers and run fully on CPU. Slow but consistent.
→ Avoid the split if at all possible.

Is it a Mixture-of-Experts model?
→ Expert offloading via vLLM is viable and gives acceptable speed, because only active experts cross PCIe per token.
→ The larger the expert count relative to active experts, the better the ratio.

What file format do you have?
→ GGUF: Ollama. Simplest.
→ AWQ/GPTQ safetensors: vLLM. Best for long context and multi-user.
→ EXL3: tabbyAPI. Best for single-user interactive speed.
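
The decision tree above can be written down as two small functions. This is a minimal sketch: the function names and the size figures in the usage lines are hypothetical, and the VRAM check compares weight size only, ignoring the KV cache and runtime overhead, so leave yourself some headroom in practice.

```python
# Engine suggestions keyed by file format, as listed in the guide.
ENGINE_BY_FORMAT = {
    "gguf": "Ollama",
    "awq": "vLLM",
    "gptq": "vLLM",
    "exl3": "tabbyAPI",
}


def choose_placement(model_gb: float, vram_gb: float, is_moe: bool) -> str:
    """Suggest where the weights should live, following the guide's order."""
    if model_gb <= vram_gb:
        return "all in VRAM"
    if is_moe:
        # Only the active experts cross PCIe per token.
        return "experts offloaded to CPU RAM"
    # Dense model that does not fit: avoid the GPU+CPU split.
    return "smaller or more aggressively quantised model that fits, else fully on CPU"


def choose_engine(file_format: str) -> str:
    """Suggest an inference engine for the file format you already have."""
    return ENGINE_BY_FORMAT.get(file_format.lower(), "unknown format")


# Hypothetical sizes in GB, for illustration only.
print(choose_placement(model_gb=5.5, vram_gb=8, is_moe=False))
print(choose_placement(model_gb=20.0, vram_gb=8, is_moe=True))
print(choose_placement(model_gb=16.0, vram_gb=8, is_moe=False))
print(choose_engine("GGUF"))
```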


Summary

  • Quantisation reduces model size by lowering weight precision. 4-bit is the practical sweet spot for consumer GPUs. Quality varies by method: EXL3 > AWQ > GPTQ at equivalent bit widths.
  • File formats are tied to ecosystems: GGUF for Ollama/llama.cpp, AWQ/GPTQ safetensors for vLLM, EXL3 for ExLlamaV3 (served through tabbyAPI). Mismatching format and engine either fails or adds overhead.
  • Where the model lives determines performance more than almost any other factor:
      • All in GPU VRAM: fast (20+ tok/s for 9B)
      • All in CPU RAM: slow but consistent (3–8 tok/s); Intel AVX_VNNI makes Q8_0 competitive
      • Split GPU+CPU: usually the worst outcome, because PCIe becomes the bottleneck and the GPU is underutilised
  • MoE models are the exception to the split-is-worst rule, because only active experts need to cross PCIe per token.
  • Match your model size to your VRAM. When in doubt, run a smaller model fully in VRAM rather than a larger model split across GPU and CPU.

The goal is to never let the PCIe bus become your bottleneck. Everything else — quantisation method, inference engine, file format — is secondary to keeping your weights on the right side of that bus.

"Where Do I Run This?" — A Surprisingly Interesting Answer

Published: 2026-05-15
Tags: claude-code, ai-agents, local-ai, meta


While setting up a large-context benchmark for our llama.cpp series, I asked Claude Code
to prepare a prompt for a sub-agent to run the long benchmark job autonomously. It did,
then added a note at the end:

"To launch: Agent(subagent_type="general-purpose", prompt=open(...).read())"

My immediate question: where do I run that?

The answer reframed something I thought I understood.


It's Not Your Code. It's Claude's.

Agent(...) isn't a Python library you install. It isn't a CLI command. It's a tool
that Claude Code calls internally, in the same category as Bash, Read, or Write.

When Claude runs Bash("nvidia-smi"), your terminal executes nvidia-smi. When Claude
calls Agent(...), a new AI agent spins up — with its own Bash, its own file access, its
own web search — and works through a task autonomously, just like Claude is working through
this conversation.

The pseudocode Claude wrote was essentially describing its own next action in notation a
programmer would recognise. It was talking about itself.


The Practical Shape of It

The flow looks like this:

You → Claude Code (this chat)
         └─ Agent(prompt="benchmark llama.cpp at 262K context...")
                └─ Sub-agent (no memory of your conversation)
                       ├─ writes bench_large_context.py
                       ├─ runs it (takes ~90 minutes)
                       ├─ reads results
                       └─ writes blog_large_context.md
         └─ "Done — decode speed drops 12% at 262K tokens. Blog post written."
You ← result

The sub-agent gets one thing: the prompt. It has no access to your conversation history.
That's why the prompt file we prepared was so detailed — it had to stand alone as a
complete briefing for someone who just walked into the room.


Why This Matters More Than It Looks

Most AI tooling has a clean boundary: the human decides what to do, the AI executes one
step. What's different here is that Claude can delegate to another Claude — and that
second agent can delegate further, run for an hour, write code, execute it, read the
output, and revise. The human isn't in the loop for each step.

That changes the unit of work. Instead of "ask AI to write a benchmark script," the unit
becomes "ask AI to run the benchmark campaign and deliver results." The script is an
implementation detail.

It also changes what a good prompt looks like. Writing for a sub-agent is closer to
writing a spec for a colleague than writing a prompt for a chatbot. It needs context,
constraints, expected outputs, and failure modes — because there's nobody to ask for
clarification once it starts.
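
One way to make that spec-like structure concrete is to assemble the brief from named parts before handing it over. The sketch below is only an illustration: the class and field names are hypothetical, it is not part of Claude Code, and the constraint and failure-mode strings are invented examples. It simply renders a plain-text briefing that can be read cold.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgentBrief:
    """A standalone briefing for a sub-agent; every name here is illustrative."""
    goal: str
    context: str
    constraints: list[str] = field(default_factory=list)
    expected_outputs: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as plain text with one labelled section per part."""
        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none given)"
            return f"{title}:\n{body}"

        return "\n\n".join([
            f"Goal: {self.goal}",
            f"Context: {self.context}",
            section("Constraints", self.constraints),
            section("Expected outputs", self.expected_outputs),
            section("If something goes wrong", self.failure_modes),
        ])


brief = SubAgentBrief(
    goal="Benchmark llama.cpp at 262K context",
    context="You have no access to the earlier conversation; everything you need is here.",
    constraints=["The run takes roughly 90 minutes"],            # invented example
    expected_outputs=["bench_large_context.py", "blog_large_context.md"],
    failure_modes=["If the run fails, report the error and stop"],  # invented example
)
print(brief.render())
```

Filling in every section forces the questions a colleague would otherwise ask out loud; an empty section shows up as "(none given)" rather than silently disappearing.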


The Meta Moment

The most interesting part of this exchange wasn't the answer. It was the question.

"Where do I run this?" assumes that code is something humans execute. But in a system
where the AI has a shell, a file system, and the ability to spawn other AIs, that
assumption quietly stops being true. The code Claude wrote wasn't for me. It was a note
to itself about what to do next.

We're early enough in this that the boundary between "Claude explaining a thing" and
"Claude doing a thing" isn't always obvious. Paying attention to which side of that line
you're on turns out to be worth it.


This post is part of a series on running large language models locally on consumer
hardware. The benchmark it references — Qwen3.5-35B-A3B at 262K context on 8GB VRAM —
is covered in the companion posts in this series.