Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware
The Hardware
I benchmarked two very different local AI setups:
Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)
Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM
The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.
The Benchmark Setup
I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:
- Short: "What is 2 + 2? Answer in one word." — tests base throughput
- Reasoning: a multi-step maths problem — tests sustained generation
- Coding: Fibonacci with memoization in Python — tests structured output
Metric: tokens per second (TPS) for generation.
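The metric itself is simple to compute. Below is a minimal sketch of how generation TPS can be derived from a token count and wall-clock timestamps; the actual benchmark_ollama.py may differ in how it obtains these values (e.g. from Ollama's streamed response stats), so treat this as illustrative:

```python
import time

def tokens_per_second(token_count: int, start: float, end: float) -> float:
    """Generation throughput: tokens emitted divided by wall-clock seconds."""
    elapsed = end - start
    if elapsed <= 0:
        raise ValueError("end must be after start")
    return token_count / elapsed

# Example: 256 tokens generated in 21.3 seconds
print(round(tokens_per_second(256, 0.0, 21.3), 1))  # → 12.0
```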
Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)
Model Architecture Comparison (Q4_K_M unless noted)

| Model | Avg TPS | Total Params | Active Params | Type |
|---|---|---|---|---|
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |
Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU
The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 uses 2× the memory bandwidth of Q4_K_M with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:
| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
|---|---|---|---|
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |
Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
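A quick back-of-envelope model shows why quantization width matters so much here. If generating one token requires streaming every active weight through memory once, bandwidth divided by bytes-per-token gives a throughput ceiling. The bit-widths below (≈4.5 bits for Q4_K_M, ≈8.5 for Q8_0, including quantization metadata) are approximations, and real TPS lands well under the ceiling due to compute, KV-cache, and activation traffic:

```python
def roofline_tps(bandwidth_gbps: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper-bound TPS if each token must stream all active weights once.
    bandwidth_gbps: memory bandwidth in GB/s; active_params_b: active params in billions."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# DDR4 APU at ~50 GB/s, 3B active params (approximate effective bit-widths)
q4 = roofline_tps(50, 3, 4.5)  # Q4_K_M ≈ 4.5 bits/weight
q8 = roofline_tps(50, 3, 8.5)  # Q8_0  ≈ 8.5 bits/weight
print(f"Q4 ceiling ≈ {q4:.0f} TPS, Q8 ceiling ≈ {q8:.0f} TPS, ratio ≈ {q4 / q8:.2f}x")
```

The predicted Q4/Q8 ratio of roughly 1.9× is in the same direction as the measured +57% gap for qwen3-coder:30b-a3b, consistent with the workload being bandwidth-bound.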
Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)
General and Reasoning Models
| Model | Avg TPS | Params | Notes |
|---|---|---|---|
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |
Vision Models (for ComfyUI and multimodal workflows)
| Model | Avg TPS | VRAM | Notes |
|---|---|---|---|
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance — fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse — short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size — VRAM load overhead |
The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.
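The cliff can be sketched with a simple two-tier streaming model: the VRAM-resident portion of the weights streams at GPU bandwidth, the spilled remainder at PCIe bandwidth, and the per-token time is the sum. The specific figures below (~16 GB of Q4 weights for a 27B model, ~300 GB/s VRAM, ~32 GB/s PCIe) are rough assumptions, not measurements:

```python
def spill_tps(total_gb: float, vram_gb: float, vram_bw: float, pcie_bw: float) -> float:
    """Back-of-envelope TPS when a model partially spills past VRAM.
    Streams the resident part at VRAM bandwidth and the rest over PCIe per token."""
    resident = min(total_gb, vram_gb)
    spilled = max(0.0, total_gb - vram_gb)
    seconds_per_token = resident / vram_bw + spilled / pcie_bw
    return 1.0 / seconds_per_token

# ~16 GB Q4 weights, 8 GB VRAM, ~300 GB/s GDDR6X, ~32 GB/s PCIe (assumed figures)
print(round(spill_tps(16, 8, 300, 32), 1))  # → 3.6
```

Even with only half the model spilled, the PCIe term dominates and the estimate lands near the measured 3 TPS.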
The Key Finding: Active Parameters Are What Matter
The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.
This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.
The rule that emerges from these results:
MoE speed advantage only materialises when active parameter count is kept low.
Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at roughly the same speed as the dense 14B model (3.5 vs 3.1 TPS) — because the active compute and bandwidth cost per token are similar.
qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
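Putting the benchmarked models side by side makes the rule concrete: per-token weight traffic scales with active parameters, not total. The 4.5 bits/weight figure for Q4_K_M is an approximation (mixtral's table entry is treated the same way for comparability):

```python
MODELS = {
    # name: (total_params_b, active_params_b) — figures from the tables above
    "qwen3:30b-a3b": (30, 3),
    "qwen3:8b": (8, 8),
    "mixtral:8x7b": (46.7, 12.9),
    "deepseek-r1:14b": (14, 14),
}

def gb_per_token(active_b: float, bits: float = 4.5) -> float:
    """Approximate weight traffic per generated token, assuming all active
    weights stream through memory once (active_b in billions)."""
    return active_b * bits / 8  # result in GB

for name, (total, active) in MODELS.items():
    print(f"{name}: {gb_per_token(active):.2f} GB/token (total {total}B)")
```

qwen3:30b-a3b moves about a third of the data per token that qwen3:8b does, which is why it generates faster despite nearly 4× the total parameters.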
The Two Hardware Stories
Discrete GPU: Fast but VRAM-limited
The RTX 4070 hits 35–163 TPS for models that fit in 8GB VRAM. It's fast — bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, identical to the APU. The discrete GPU is a sprinter with a hard wall.
Shared-Memory APU: Slow but capacious
The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.
Where they meet
When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.
The MoE Sweet Spot for APUs
Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.
Practical Recommendations
For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0 — Q8_0 is 20–57% slower on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time
For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model
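Recommendation 1 can be turned into a rough pre-flight check. The overhead allowance below is a placeholder — real KV-cache size depends on context length and model architecture, so this is a sketch, not a guarantee:

```python
def fits_in_vram(params_b: float, bits_per_weight: float, vram_gb: float,
                 overhead_gb: float = 1.0) -> bool:
    """Rough check: quantized weights plus a flat overhead allowance
    (KV cache, buffers — a placeholder value) must fit in VRAM."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits_in_vram(7, 4.5, 8))   # 7B Q4_K_M ≈ 3.9 GB weights → True
print(fits_in_vram(27, 4.5, 8))  # 27B Q4_K_M ≈ 15.2 GB weights → False
```

This matches the tables above: 7B-class models stay comfortably resident on the 4070, while qwen3.5:27b cannot avoid spilling.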
Tools Used
- Ollama — inference server, Vulkan backend
- llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
- benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types
Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).