15 April 2026

Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware

The Hardware

I benchmarked two very different local AI setups:

Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)

Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM

The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.


The Benchmark Setup

I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:

  • Short: "What is 2 + 2? Answer in one word." — tests base throughput
  • Reasoning: A multi-step maths problem — tests sustained generation
  • Coding: Fibonacci with memoization in Python — tests structured output

Metric: tokens per second (TPS) for generation.
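The TPS metric falls straight out of Ollama's non-streaming /api/generate response, which reports eval_count (generated tokens) and eval_duration (in nanoseconds). A minimal sketch of the calculation, not the actual benchmark_ollama.py script, with illustrative numbers:

```python
def generation_tps(resp: dict) -> float:
    """Generation tokens/second from an Ollama /api/generate response.

    Ollama reports eval_count (tokens generated) and eval_duration
    (nanoseconds spent generating) in every non-streaming response.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative response fragment (not real benchmark data):
sample = {"eval_count": 120, "eval_duration": 10_000_000_000}  # 10 s
print(round(generation_tps(sample), 1))  # 12.0 tokens/second
```

The benchmark averages this figure across the three prompts per model.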


Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)

Model Architecture Comparison (all Q4_K_M)

| Model | Avg TPS | Total Params | Active Params | Type |
|---|---|---|---|---|
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |

Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU

The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 moves nearly twice as many bytes per token as Q4_K_M, with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:

| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
|---|---|---|---|
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |

Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
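The Q4 advantage can be sanity-checked with a back-of-the-envelope bandwidth model: each generated token must stream all the active weights from memory, so bandwidth divided by bytes-per-token gives a hard ceiling on TPS. A sketch, assuming ~4.5 effective bits per weight for Q4_K_M and ~8.5 for Q8_0 (typical GGUF figures, not measured values):

```python
# Upper bound on generation speed when memory bandwidth is the bottleneck:
#   TPS_max ≈ bandwidth / (active_params * bytes_per_weight)
# Effective bits per weight (~4.5 for Q4_K_M, ~8.5 for Q8_0) and the
# 50 GB/s DDR4 figure are assumptions, not measurements.

def tps_ceiling(active_params: float, bits_per_weight: float,
                bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

q4 = tps_ceiling(3e9, 4.5, 50)   # 3B active params, Q4_K_M, DDR4
q8 = tps_ceiling(3e9, 8.5, 50)   # same model at Q8_0
print(f"Q4_K_M ceiling: {q4:.1f} TPS, Q8_0 ceiling: {q8:.1f} TPS")
```

The measured 12.1 and 7.73 TPS sit below these ceilings, as expected with real-world overhead, but the Q4/Q8 ratio comes out in the same ballpark.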


Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)

General and Reasoning Models

| Model | Avg TPS | Params | Notes |
|---|---|---|---|
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| llama3.1:latest | 36 | 8B | |
| qwen3-vl:8b | 35 | 8B | Vision model |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |

Vision Models (for ComfyUI and multimodal workflows)

| Model | Avg TPS | VRAM | Notes |
|---|---|---|---|
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance — fast, high quality, leaves headroom |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse — short responses on text tasks |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size — VRAM load overhead |

The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.
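A toy two-term model reproduces the shape of this cliff: the portion of the weights resident in VRAM streams at GPU bandwidth, while the spilled remainder must cross PCIe on every token. The bandwidth figures are the rough numbers quoted in this post, and the model ignores caching and compute overlap, so it is only a sketch:

```python
# Toy model of the VRAM cliff. Per token, in-VRAM weights stream at GPU
# bandwidth; spilled weights cross PCIe. Bandwidths (300 GB/s GDDR6X,
# 32 GB/s PCIe) are the rough figures from this post, not measurements.

def est_tps(model_gb: float, vram_gb: float = 8.0,
            vram_bw: float = 300.0, pcie_bw: float = 32.0) -> float:
    in_vram = min(model_gb, vram_gb)
    spilled = max(model_gb - vram_gb, 0.0)
    seconds_per_token = in_vram / vram_bw + spilled / pcie_bw
    return 1.0 / seconds_per_token

print(f"7 GB model:  {est_tps(7):.0f} TPS")   # fits: VRAM-bandwidth bound
print(f"16 GB model: {est_tps(16):.1f} TPS")  # spills: PCIe term dominates
```

Even this crude estimate collapses from tens of TPS to low single digits the moment a few gigabytes spill, which is exactly the observed behaviour.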


The Key Finding: Active Parameters Are What Matter

The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.

This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.

The rule that emerges from these results:

MoE speed advantage only materialises when active parameter count is kept low.

Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at the same speed as the dense 14B model — because the active compute is similar.

qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
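The routing mechanism can be sketched in a few lines. The experts below are trivial stand-in functions (real experts are feed-forward blocks with billions of parameters); the point is that per-token compute, and hence weight traffic, scales with k, the number of selected experts, not with the total expert count:

```python
# Minimal MoE routing sketch: a gate scores every expert for the current
# token, but only the top-k actually run. Experts here are toy functions;
# the structure, not the arithmetic, is the point.

def moe_forward(x, experts, gate_scores, k=2):
    # Select the k highest-scoring experts for this token.
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    total = sum(gate_scores[i] for i in top)
    # Weighted sum over only the selected experts' outputs.
    out = sum(gate_scores[i] / total * experts[i](x) for i in top)
    return out, top

# 8 "experts", each a cheap stand-in for a feed-forward block.
experts = [lambda x, w=w: w * x for w in range(1, 9)]
gate = [0.05, 0.1, 0.0, 0.4, 0.02, 0.3, 0.08, 0.05]
out, used = moe_forward(1.0, experts, gate, k=2)
print(used)  # only 2 of 8 experts ran: [3, 5]
```

With 8 experts and k=2, only a quarter of the expert weights are touched per token; qwen3:30b-a3b pushes the same idea to a 3B-active slice of a 30B model.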


The Two Hardware Stories

Discrete GPU: Fast but VRAM-limited

The RTX 4070 hits 35–163 TPS for models that fit in 8GB VRAM. It's fast — bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, identical to the APU. The discrete GPU is a sprinter with a hard wall.

Shared-Memory APU: Slow but capacious

The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.

Where they meet

When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.

The MoE Sweet Spot for APUs

Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.


Practical Recommendations

For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0 — Q8_0 is 20–57% slower on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time

For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model
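A rough pre-download fit check follows from the same arithmetic: quantized model size ≈ params × bits ÷ 8, plus some runtime headroom. The effective bits-per-weight values and the headroom figure below are assumptions for illustration, not Ollama internals:

```python
# Rough check of whether a quantized model will stay fully in VRAM.
# Effective bits per weight (~4.5 Q4_K_M, ~8.5 Q8_0) and the 0.5 GB
# headroom figure are assumptions, not exact GGUF or Ollama numbers.

BITS = {"q4_k_m": 4.5, "q8_0": 8.5, "f16": 16.0}

def fits_in_vram(params: float, quant: str, vram_gb: float = 8.0,
                 headroom_gb: float = 0.5) -> bool:
    size_gb = params * BITS[quant] / 8 / 1e9
    return size_gb + headroom_gb <= vram_gb

print(fits_in_vram(7e9, "q4_k_m"))   # 7B at Q4: roughly 4 GB, fits
print(fits_in_vram(27e9, "q4_k_m"))  # 27B at Q4: roughly 15 GB, spills
```

This predicts the observed results: 7–8B Q4 models stay comfortably inside 8GB, while qwen3.5:27b cannot avoid the spill at any reasonable quantization.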


Tools Used

  • Ollama — inference server, Vulkan backend
  • llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
  • benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types

Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).
