If you've spent any time running large language models locally, you've probably heard terms like AWQ, GGUF, EXL3, vLLM, and ExLlamaV2 thrown around — often without much explanation of how they relate to each other, or why choosing the wrong combination can make your model five times slower than it needs to be.
This post aims to fix that. We'll cover what a model actually is in memory terms, how quantisation changes its footprint, which file formats carry which quantised models, which inference engines speak which formats, and — most importantly — the often-misunderstood question of where the model actually lives when it's running, and why a mixture of GPU and CPU is usually the worst outcome rather than a useful compromise.
What a Model Is in Memory
A language model is, at its core, a large collection of floating-point numbers called weights. A 9 billion parameter model has roughly 9 billion of these numbers. Each one, stored at full precision (FP32), occupies 4 bytes — so a raw 9B model would need about 36 GB of storage and memory. In practice, models are stored and loaded in 16-bit formats (BF16 or FP16), halving that to around 18 GB.
18 GB is already more than most consumer GPUs can hold. A typical gaming GPU has 8–16 GB of VRAM. This is where quantisation comes in.
Quantisation: Trading Precision for Space
Quantisation reduces the number of bits used to store each weight. The key insight is that neural networks are surprisingly tolerant of reduced precision — the quality loss from moving from 16-bit to 4-bit is often small enough to be irrelevant for practical use, while the memory saving is dramatic.
The main quantisation levels
| Precision | Bits per weight | 9B model size | Quality loss |
|---|---|---|---|
| FP16/BF16 | 16 | ~18 GB | None (reference) |
| FP8 | 8 | ~9 GB | Near-zero |
| INT8 / Q8_0 | 8 | ~9 GB | Minimal |
| INT4 / Q4 | 4 | ~5–6 GB | Small but noticeable |
| 3-bit | 3 | ~4 GB | Moderate |
4-bit quantisation is currently the practical sweet spot for most consumer hardware: a 9B model fits comfortably in an 8 GB GPU, and quality remains good enough for coding, writing, and reasoning tasks.
It's not just about bit width
The method of quantisation matters as much as the bit width. Two 4-bit models of the same architecture can have meaningfully different output quality depending on how the quantisation was performed:
- AWQ (Activation-aware Weight Quantization): Calibrates the quantisation using sample inputs, preserving weights that are most sensitive to rounding. Groups of 128 weights share a scale factor.
- GPTQ: Uses the inverse Hessian to minimise quantisation error block by block. Doesn't account for activation magnitudes, so typically slightly lower quality than AWQ at the same bit width.
- EXL3 (ExLlamaV2 format): Operates at the individual row level, solving for the optimal bit allocation per row to minimise output error. Can assign more bits to sensitive rows and fewer to robust ones. At 4 bits per weight, EXL3 typically outperforms both AWQ and GPTQ in measured perplexity.
- GGUF quantisation (Q4_K_M, Q5_K_M, etc.): The
Kvariants use k-means clustering per block, with the_Msuffix indicating a mixed-importance strategy — layers deemed more important get higher precision. Well-calibrated and widely tested.
File Formats: The Container Around the Weights
Quantised weights are packaged in different file formats, each tied to a particular ecosystem.
GGUF
The format used by llama.cpp and everything built on it (Ollama, LM Studio, Jan). A GGUF file is self-contained: it includes the weights, the model architecture metadata, and tokenizer data in a single file.
GGUF supports a wide range of quantisation levels: Q4_0, Q4_K_M, Q5_K_M, Q8_0, and many more. It's the most portable format — the same file runs on a CPU, a GPU, or a mixture of both.
Safetensors (HuggingFace format)
The standard format for HuggingFace model repositories. Models in AWQ or GPTQ quantisation are typically distributed as collections of .safetensors files alongside a config.json. This format is used by vLLM, transformers, and most Python-based inference stacks.
EXL3 / EXL2
ExLlamaV2's native formats. EXL3 is the current generation. These are also safetensors files under the hood, but with ExLlamaV2-specific quantisation data embedded. They cannot be loaded by vLLM or standard transformers — they require the ExLlamaV2 runtime.
Inference Engines: Who Speaks What
The inference engine is the software that actually loads the weights and runs the forward pass to generate tokens. Each engine has its own strengths, limitations, and supported formats.
Ollama
Built on llama.cpp. Supports GGUF only. Easiest setup — run ollama pull model-name and it downloads and serves the model immediately. Best for quick local use, development, and simple API access. Not designed for high-throughput serving or very long contexts.
vLLM
A production inference server designed for high-throughput serving of many concurrent users. Supports HuggingFace safetensors format, including AWQ, GPTQ, FP8, and unquantised models. Provides an OpenAI-compatible API. Has sophisticated memory management for long contexts (paged attention, chunked prefill).
Best suited for: serving multiple users simultaneously, very long context windows, production deployments.
Not suited for: models that require ExLlamaV2 quantisation (EXL3), or single-user interactive use where its multi-request optimisations add overhead rather than help.
ExLlamaV2 / tabbyAPI
ExLlamaV2 is a CUDA inference library with custom kernels tuned for low-batch (single-user) decode. tabbyAPI wraps it in an OpenAI-compatible HTTP server. Supports EXL3, EXL2, GPTQ, and some GGUF.
For single-user interactive use, ExLlamaV2 is often faster than vLLM because vLLM is optimised for batched requests. ExLlamaV2's kernels are specifically tuned for the batch-size-1 case that dominates personal use.
transformers (HuggingFace)
The reference implementation. Supports almost everything, but is the slowest option in production because it lacks the custom CUDA kernels of the specialised engines. Useful for research, fine-tuning, and running models before optimised backends exist.
The Format-to-Engine Matching Table
| You have | Use this engine |
|---|---|
| GGUF (Q4_K_M, Q5_K_M, etc.) | Ollama or llama.cpp directly |
| AWQ safetensors | vLLM |
| GPTQ safetensors | vLLM or ExLlamaV2 |
| EXL3 / EXL2 | ExLlamaV2 / tabbyAPI only |
| FP8 safetensors (official Qwen FP8 etc.) | vLLM |
| Unquantised BF16 safetensors | vLLM or transformers |
Trying to use the wrong engine with a given format either fails outright or forces a slow conversion at load time. The pairing matters.
Where the Model Actually Lives: The Critical Question
This is where most guides go wrong by omission. A model's performance is determined not just by its quantisation, but by where its weights reside when the forward pass runs.
The three scenarios
Scenario 1: All weights in GPU VRAM
This is the ideal case. The GPU's memory bandwidth — typically 200–900 GB/s depending on the card — feeds weights to the compute cores without any external bottleneck. Token generation is fast.
For a 9B model at 4-bit AWQ (~5.5 GB), an 8 GB GPU holds all the weights comfortably with room left for the KV cache. Decode speed on an RTX 4070 (8 GB) is 20+ tokens per second.
Scenario 2: All weights in CPU RAM
When a model is too large for VRAM and you configure the inference engine to run entirely on CPU, the CPU's memory subsystem handles everything. Modern DDR5 provides 80–100 GB/s bandwidth, which is slower than GPU memory but consistent. A full CPU inference run on a well-quantised 9B model at Q4 typically yields 3–8 tokens per second depending on the CPU.
Crucially: modern Intel CPUs with AVX_VNNI (like the Intel Core Ultra 7 series) have native INT8 dot product instructions. This means Q8_0 (8-bit quantisation) computes at nearly the same speed as Q4_K_M on these CPUs — the extra compute cost of INT8 is offset by the hardware acceleration. You get meaningfully better quality for free.
Scenario 3: Weights split across GPU and CPU RAM (the mixed case)
When a model is larger than VRAM, most inference engines will automatically offload some layers to CPU RAM and keep the rest on GPU. This sounds like a reasonable compromise. In practice, it is almost always the worst outcome.
Here's why. The forward pass through a transformer runs layers sequentially. If some layers are on the GPU and some are on the CPU, the computation must cross the PCIe bus at every GPU-CPU boundary:
GPU layer → compute (fast, ~hundreds of GB/s VRAM)
↓
PCIe transfer (bottleneck: ~32 GB/s in both directions)
↓
CPU layer → compute (slower, but not the problem)
↓
PCIe transfer back
↓
GPU layer → compute...
PCIe Gen 4 x16 has a practical throughput of around 28–32 GB/s. Every token generated requires transferring the activations across this bus at every layer boundary. For a 9B model split 50/50, this happens dozens of times per token. The result: decode speed collapses to around 3 tokens per second — slower than running fully on CPU, and slower than running a smaller model fully on GPU.
The empirical evidence is stark. On an Intel Ultra 7 + RTX 4070 8GB machine:
| Configuration | Model | Tokens/sec |
|---|---|---|
| All in GPU VRAM | Qwen3-8B Q4 | 20+ tok/s |
| Split GPU+CPU | Qwen3.5-27B Q4 | ~3 tok/s |
| Fully CPU | Q8_0 9B (AVX_VNNI) | ~4–5 tok/s |
The 27B model split across GPU and CPU is slower than running a smaller model fully on CPU, and only marginally faster than the CPU-only run despite using the GPU. The GPU is largely wasted — it spends most of its time waiting for PCIe transfers.
A special case: MoE models with expert offloading
Mixture-of-Experts (MoE) models introduce a nuance. Models like Qwen3.5-35B-A3B have 35 billion total parameters, but only about 3 billion are active on any given forward pass — the MoE routing selects a small subset of "expert" networks per token.
When the expert weights are offloaded to CPU RAM (via vLLM's --cpu-offload-params experts), only the active experts are transferred per token, not the full parameter set. This reduces the PCIe burden dramatically compared to a dense model. In practice, a 35B MoE model running on an 8 GB GPU with experts offloaded to RAM achieves 5–7 tokens per second — competitive with a smaller dense model entirely in VRAM.
This works because MoE expert routing selects only 8 of 256 experts per token. The PCIe transfer is proportional to the active parameter count, not the total. Dense models have no such relief — all weights are active every token, making the PCIe cost unavoidable.
Practical Decision Guide
When choosing how to run a model locally, the decision tree looks like this:
Does the quantised model fit in your GPU VRAM?
→ Yes: run it in VRAM. Use the best engine for your format.
→ No: continue below.
Is it a dense model (standard transformer)?
→ If it exceeds VRAM by a small margin: consider a smaller or more aggressively quantised version that fits. A Q4_K_M 9B fully in VRAM beats a Q4_K_M 14B split across GPU and CPU.
→ If you must run it partially on CPU: set the engine to use zero GPU layers and run fully on CPU. Slow but consistent.
→ Avoid the split if at all possible.
Is it a Mixture-of-Experts model?
→ Expert offloading via vLLM is viable and gives acceptable speed, because only active experts cross PCIe per token.
→ The larger the expert count relative to active experts, the better the ratio.
What file format do you have?
→ GGUF: Ollama. Simplest.
→ AWQ/GPTQ safetensors: vLLM. Best for long context and multi-user.
→ EXL3: tabbyAPI. Best for single-user interactive speed.
Summary
- Quantisation reduces model size by lowering weight precision. 4-bit is the practical sweet spot for consumer GPUs. Quality varies by method: EXL3 > AWQ > GPTQ at equivalent bit widths.
- File formats are tied to ecosystems: GGUF for Ollama/llama.cpp, safetensors for vLLM, EXL3 for ExLlamaV2. Mismatching format and engine either fails or adds overhead.
- Where the model lives determines performance more than almost any other factor:
- All in GPU VRAM: fast (20+ tok/s for 9B)
- All in CPU RAM: slow but consistent (3–8 tok/s); Intel AVX_VNNI makes Q8_0 competitive
- Split GPU+CPU: usually the worst outcome — PCIe becomes the bottleneck and the GPU is underutilised
- MoE models are the exception to the split-is-worst rule, because only active experts need to cross PCIe per token.
- Match your model size to your VRAM. When in doubt, run a smaller model fully in VRAM rather than a larger model split across GPU and CPU.
The goal is to never let the PCIe bus become your bottleneck. Everything else — quantisation method, inference engine, file format — is secondary to keeping your weights on the right side of that bus.
No comments:
Post a Comment