Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware
The Hardware
I benchmarked two very different local AI setups:
Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)
Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM
The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.
The Benchmark Setup
I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:
- Short: "What is 2 + 2? Answer in one word." — tests base throughput
- Reasoning: a multi-step maths problem — tests sustained generation
- Coding: Fibonacci with memoization in Python — tests structured output
Metric: tokens per second (TPS) for generation.
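The metric itself is simple to compute. Below is a minimal sketch of how generation TPS can be derived from a token count and wall-clock timestamps; the actual benchmark_ollama.py may differ in how it obtains these values (e.g. from Ollama's streamed response stats), so treat this as illustrative:

```python
import time

def tokens_per_second(token_count: int, start: float, end: float) -> float:
    """Generation throughput: tokens emitted divided by wall-clock seconds."""
    elapsed = end - start
    if elapsed <= 0:
        raise ValueError("end must be after start")
    return token_count / elapsed

# Example: 256 tokens generated in 21.3 seconds
print(round(tokens_per_second(256, 0.0, 21.3), 1))  # → 12.0
```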
Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)
Model Architecture Comparison (Q4_K_M unless noted)

| Model | Avg TPS | Total Params | Active Params | Type |
|---|---|---|---|---|
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |
Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU
The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 uses 2× the memory bandwidth of Q4_K_M with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:
| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
|---|---|---|---|
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |
Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
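A quick back-of-envelope model shows why quantization width matters so much here. If generating one token requires streaming every active weight through memory once, bandwidth divided by bytes-per-token gives a throughput ceiling. The bit-widths below (≈4.5 bits for Q4_K_M, ≈8.5 for Q8_0, including quantization metadata) are approximations, and real TPS lands well under the ceiling due to compute, KV-cache, and activation traffic:

```python
def roofline_tps(bandwidth_gbps: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper-bound TPS if each token must stream all active weights once.
    bandwidth_gbps: memory bandwidth in GB/s; active_params_b: active params in billions."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# DDR4 APU at ~50 GB/s, 3B active params (approximate effective bit-widths)
q4 = roofline_tps(50, 3, 4.5)  # Q4_K_M ≈ 4.5 bits/weight
q8 = roofline_tps(50, 3, 8.5)  # Q8_0  ≈ 8.5 bits/weight
print(f"Q4 ceiling ≈ {q4:.0f} TPS, Q8 ceiling ≈ {q8:.0f} TPS, ratio ≈ {q4 / q8:.2f}x")
```

The predicted Q4/Q8 ratio of roughly 1.9× is in the same direction as the measured +57% gap for qwen3-coder:30b-a3b, consistent with the workload being bandwidth-bound.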
Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)
General and Reasoning Models
| Model | Avg TPS | Params | Notes |
|---|---|---|---|
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |
Vision Models (for ComfyUI and multimodal workflows)
| Model | Avg TPS | VRAM | Notes |
|---|---|---|---|
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance — fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse — short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size — VRAM load overhead |
The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.
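The cliff can be sketched with a simple two-tier streaming model: the VRAM-resident portion of the weights streams at GPU bandwidth, the spilled remainder at PCIe bandwidth, and the per-token time is the sum. The specific figures below (~16 GB of Q4 weights for a 27B model, ~300 GB/s VRAM, ~32 GB/s PCIe) are rough assumptions, not measurements:

```python
def spill_tps(total_gb: float, vram_gb: float, vram_bw: float, pcie_bw: float) -> float:
    """Back-of-envelope TPS when a model partially spills past VRAM.
    Streams the resident part at VRAM bandwidth and the rest over PCIe per token."""
    resident = min(total_gb, vram_gb)
    spilled = max(0.0, total_gb - vram_gb)
    seconds_per_token = resident / vram_bw + spilled / pcie_bw
    return 1.0 / seconds_per_token

# ~16 GB Q4 weights, 8 GB VRAM, ~300 GB/s GDDR6X, ~32 GB/s PCIe (assumed figures)
print(round(spill_tps(16, 8, 300, 32), 1))  # → 3.6
```

Even with only half the model spilled, the PCIe term dominates and the estimate lands near the measured 3 TPS.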
The Key Finding: Active Parameters Are What Matter
The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.
This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.
The rule that emerges from these results:
MoE speed advantage only materialises when active parameter count is kept low.
Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at roughly the same speed as the dense 14B model (3.5 vs 3.1 TPS) — because the active compute and bandwidth cost per token are similar.
qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
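Putting the benchmarked models side by side makes the rule concrete: per-token weight traffic scales with active parameters, not total. The 4.5 bits/weight figure for Q4_K_M is an approximation (mixtral's table entry is treated the same way for comparability):

```python
MODELS = {
    # name: (total_params_b, active_params_b) — figures from the tables above
    "qwen3:30b-a3b": (30, 3),
    "qwen3:8b": (8, 8),
    "mixtral:8x7b": (46.7, 12.9),
    "deepseek-r1:14b": (14, 14),
}

def gb_per_token(active_b: float, bits: float = 4.5) -> float:
    """Approximate weight traffic per generated token, assuming all active
    weights stream through memory once (active_b in billions)."""
    return active_b * bits / 8  # result in GB

for name, (total, active) in MODELS.items():
    print(f"{name}: {gb_per_token(active):.2f} GB/token (total {total}B)")
```

qwen3:30b-a3b moves about a third of the data per token that qwen3:8b does, which is why it generates faster despite nearly 4× the total parameters.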
The Two Hardware Stories
Discrete GPU: Fast but VRAM-limited
The RTX 4070 hits 35–163 TPS for models that fit in 8GB VRAM. It's fast — bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, identical to the APU. The discrete GPU is a sprinter with a hard wall.
Shared-Memory APU: Slow but capacious
The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.
Where they meet
When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.
The MoE Sweet Spot for APUs
Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.
Practical Recommendations
For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0 — Q8_0 is 20–57% slower on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time
For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model
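Recommendation 1 can be turned into a rough pre-flight check. The overhead allowance below is a placeholder — real KV-cache size depends on context length and model architecture, so this is a sketch, not a guarantee:

```python
def fits_in_vram(params_b: float, bits_per_weight: float, vram_gb: float,
                 overhead_gb: float = 1.0) -> bool:
    """Rough check: quantized weights plus a flat overhead allowance
    (KV cache, buffers — a placeholder value) must fit in VRAM."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, 4.5, 8))   # 7B Q4_K_M ≈ 3.9 GB weights → True
print(fits_in_vram(27, 4.5, 8))  # 27B Q4_K_M ≈ 15.2 GB weights → False
```

This matches the tables above: 7B-class models stay comfortably resident on the 4070, while qwen3.5:27b cannot avoid spilling.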
Tools Used
- Ollama — inference server, Vulkan backend
- llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
- benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types
Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).