23 June 2026

Speed vs Quality: Comparing LFM2, Qwen3, and DeepSeek-R1 on a Local APU

The Question

In a previous post I found that qwen3:30b-a3b runs at 12 TPS on my AMD APU (Ryzen 7 5800U, 64GB DDR4), beating every other model on the system by exploiting Mixture-of-Experts to keep active parameters low. Then lfm2:24b-a2b arrived and hit 17.8 TPS, about 48% faster, using a hybrid SSM+MoE architecture.

But tokens per second is only half the story. A model that answers faster but worse is not a better model. This post compares three architecturally distinct models on the same hardware across reasoning, synthesis, and analytical tasks.


The Contenders

| Model | Architecture | TPS (Matt-Mini) | What makes it distinctive |
| --- | --- | --- | --- |
| lfm2:24b-a2b | Hybrid SSM+MoE | 17.8 | No KV cache (SSM), ~2B active params |
| qwen3:30b-a3b | MoE Transformer | 12.0 | 3B active params, built-in thinking mode |
| deepseek-r1:14b | Dense Transformer | 3.1 | Reinforcement-learning trained reasoner |

All three ran on Matt-Mini: AMD Ryzen 7 5800U, 64GB DDR4 (~50 GB/s bandwidth), AMD Radeon Vega 8 iGPU via Ollama's Vulkan backend.


Test 1: Formal Logic (Einstein's Puzzle)

The classic five-houses logic puzzle — 15 clues, five attributes per house, one correct solution. This tests structured multi-step deduction: can the model hold state, backtrack when it hits a contradiction, and reach a definite answer?

LFM2:24b-a2b — 17.8 TPS

LFM2 engaged immediately with a clear systematic approach, correctly anchoring on the fixed clues (Norwegian in house 1, house 3 drinks milk, house 2 is blue), then working through colour placement. Crucially, when it hit a contradiction — incorrectly placing green at house 3 which conflicted with the coffee/milk clue — it identified the error and self-corrected:

"Wait — green house is house 3, drinks milk — but clue 5 says green house drinks coffee. Contradiction! So our earlier assumption must be wrong. Let's recheck green/white placement."

It then correctly revised to green=4, white=5, and continued building the solution. At 2000 tokens it was still mid-deduction (the puzzle typically requires 2500–3500 tokens to complete), but the reasoning quality was coherent throughout with no hallucinated leaps. The self-correction under contradiction is the key capability here — many smaller models simply assert a wrong answer.

Qwen3:30b-a3b — 12.0 TPS

Qwen3's thinking mode is enabled by default. Given a 2000-token budget, the model consumed all 2000 tokens in internal reasoning and produced no visible output. This is not a failure of reasoning — it is a consequence of extended chain-of-thought: Qwen3 thinks before it writes, and for a complex puzzle, that thinking budget was exhausted before the answer appeared.

The practical implication: at 12 TPS, giving Qwen3 enough tokens to complete a hard logic problem (say 4000 tokens total: 2000 thinking, 2000 answer) means waiting about 5.6 minutes for the full response, with nothing visible on screen for roughly the first 2.8 minutes while the model thinks. For interactive use, this requires either disabling thinking mode (/no_think suffix or think: false in the API) or accepting the latency.

DeepSeek-R1:14b — 3.1 TPS

Same issue, worse TPS. At 3.1 TPS, 4000 tokens takes 21 minutes. DeepSeek-R1 is a dedicated reasoning model — the chain-of-thought is the point — but on bandwidth-constrained hardware the combination of dense architecture (no MoE savings), slow TPS, and long thinking chains makes it painful for interactive reasoning tasks. It is the right tool for problems where correctness matters more than speed and you are willing to wait.


The Thinking-Model Problem on Slow Hardware

This test uncovered a fundamental tension that does not appear in benchmark numbers:

Thinking models have a token overhead that multiplies with your hardware's latency.

On a GPU running at 50 TPS, Qwen3 spending 1000 tokens on internal reasoning costs 20 seconds. On an APU at 12 TPS, the same thinking costs 83 seconds — and you see nothing during that time. DeepSeek-R1 at 3.1 TPS spending 2000 tokens thinking costs 10.7 minutes of silence.
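The arithmetic behind those figures is simply hidden tokens divided by generation speed. A minimal sketch in Python, using only the numbers quoted above (the function name is illustrative, not part of any tool used here):

```python
def silent_wait_seconds(thinking_tokens: int, tps: float) -> float:
    """Seconds spent generating hidden thinking tokens at a given generation speed."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return thinking_tokens / tps


# Scenarios quoted in the text: (label, thinking tokens, tokens per second).
cases = [
    ("GPU at 50 TPS, 1000 thinking tokens", 1000, 50.0),
    ("qwen3:30b-a3b at 12 TPS, 1000 thinking tokens", 1000, 12.0),
    ("deepseek-r1:14b at 3.1 TPS, 2000 thinking tokens", 2000, 3.1),
]

for label, tokens, tps in cases:
    print(f"{label}: {silent_wait_seconds(tokens, tps):.0f} s of silence")
```

The same function gives a quick way to decide whether a given thinking budget is tolerable before starting an interactive session.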

This creates a counterintuitive outcome: LFM2, which does not use extended chain-of-thought by default, feels smarter in interactive use on slow hardware — not because it reasons better, but because it shows its work in real-time rather than batching it invisibly. The perceived quality advantage is partly architectural and partly latency psychology.

For batch or automated tasks (where you submit a prompt and retrieve the answer later), thinking models regain their advantage. For interactive research assistance where you are iterating on prompts, fast transparent reasoning often beats slow invisible reasoning.


Test 2: Research Synthesis

Prompt: Advise on running the best LLM for general-purpose research on an AMD APU (Ryzen 5800U, 64GB DDR4, ~50 GB/s). Compare (a) large MoE with few active params, (b) hybrid SSM+MoE, (c) small dense transformer. Cover speed, quality, context handling, and give a recommendation.

Note: Qwen3 and DeepSeek-R1 were run with thinking disabled (think: false) so their responses are direct rather than chain-of-thought.

LFM2:24b-a2b — 17.8 TPS

LFM2 gave a well-structured response covering all three categories with accurate technical descriptions. On MoE it correctly identified the active-parameter advantage and low-bandwidth benefit. On SSM+MoE it correctly described the KV cache elimination benefit. On dense transformers it correctly explained the bandwidth bottleneck.

One notable weakness: it mentioned "AMD XDNA or ROCm kernels for optimal speed" — technically inaccurate for an Ollama/Vulkan setup. It conflated the architecture's theoretical requirements with platform-specific details it was uncertain about. This is a common failure mode in synthesis tasks: confident-sounding specifics that are slightly wrong.

Qwen3:30b-a3b — 12.0 TPS (thinking disabled)

With thinking disabled, Qwen3 produced a response that read as an externalised internal monologue — thinking-as-prose rather than a structured answer. It walked through the problem step-by-step in first person ("Let me break this down... Wait, the user said..."), ultimately recommending a small dense model like Mistral 7B.

This recommendation is wrong for this specific hardware, and interestingly, Qwen3 is itself running on this hardware. It gave generic advice calibrated to typical CPU inference bottlenecks rather than the actual measured results. It also suggested llama.cpp "15–25 TPS on a 5800U for a 7B Q4_K_M model" — a reasonable estimate for CPU-only inference, but ignoring the iGPU Vulkan backend that gives the measured 12 TPS for a 30B MoE model.

The knowledge cutoff problem: Qwen3 has no training data on LFM2 or the specific benchmark results for this hardware. It gave the best answer it could from general principles, and those principles pointed it toward the wrong conclusion. The model recommended a 7B dense model when the empirical data shows a 24B hybrid SSM+MoE outperforms it by a significant margin on this exact configuration.

DeepSeek-R1:14b — 3.1 TPS (thinking disabled)

DeepSeek-R1's response was clean and well-structured, but contained a clear hallucination: it described SSM as "Switched Sparse Memory" — a fabricated expansion of the acronym. SSM stands for State Space Model, a fundamentally different concept.

Its recommendation — large MoE transformer — was directionally correct, citing that "~50 GB/s is high bandwidth." This is precisely backwards: 50 GB/s is the bottleneck on this hardware, not an advantage. The model interpreted the bandwidth figure in absolute terms rather than relative to the model's memory requirements.

Despite these errors, the structural reasoning was sound: MoE's active-parameter reduction is the right lever to pull on bandwidth-constrained hardware. It got to the right answer via partially wrong reasoning.


Test 3: Critical Analysis

Prompt: Analyse the claim: "Q4 always reduces model quality compared to Q8, so for serious research you should always use Q8." Give a rigorous assessment.

This is a well-defined analytical task with a clear correct answer. All three models had relevant training data. This is where quality differences should be most visible.

LFM2:24b-a2b — 17.8 TPS

LFM2 identified three flaws in the claim:

  1. Universal claim without qualification — Q4 vs Q8 impact depends on architecture, task, calibration, and hardware. "Always" is unjustified.
  2. Ignores hardware acceleration — AVX_VNNI provides INT8 dot product acceleration; on AVX_VNNI CPUs, Q8_0 carries no compute overhead, changing the tradeoff entirely.
  3. Task-specific sensitivity — Coarse reasoning is less sensitive to precision loss than fine-grained factual recall.

The hardware-specific point about AVX_VNNI is precisely the kind of nuance that matters for the actual decision. This was accurate and practically useful.

Qwen3:30b-a3b — 12.0 TPS (thinking disabled)

Qwen3 gave the most thorough response of the three, leading with the logical structure: the word "always" is a universal quantifier that is falsified by a single counterexample. It cited concrete benchmark evidence — Q4 on Llama 3 and Mistral models incurs less than 1% absolute accuracy loss on MMLU vs Q8 — and identified multiple factors the claim ignores: quantisation-aware training, task sensitivity, model architecture, and implicit regularisation effects.

It also identified a counterintuitive scenario: some models fine-tuned with quantisation-aware training show no significant degradation at Q4, or in edge cases slight improvements due to regularisation. This is a more complete analysis than LFM2's.

The response was cut at 1000 tokens mid-sentence, suggesting there was more to come. The quality of what was produced was high.

DeepSeek-R1:14b — 3.1 TPS (thinking disabled)

DeepSeek-R1 produced the most concise response. It correctly identified the binary framing as the flaw (Q4 offers 16 values, Q8 offers 256 — but this alone doesn't determine quality impact), and noted that the practical effect depends on the model and task. The response was shorter and less detailed than the others, but accurate within its scope.

At 3.1 TPS, the time cost of a thorough analysis at DeepSeek-R1's depth is high. For tasks where analysis quality scales with thoroughness, the slow TPS compounds against you.


What the Tests Reveal

On factual accuracy

All three models made errors on the synthesis task — LFM2 got a platform detail wrong, Qwen3 gave the wrong recommendation, DeepSeek-R1 hallucinated an acronym expansion. The analysis task showed higher accuracy across all three, because the claim being analysed is well within their training distribution. Models are more reliable on tasks that resemble their training data. Novel hardware configurations and cutting-edge model architectures fall outside that zone.

On reasoning quality

For the logic puzzle, LFM2's transparent step-by-step reasoning with explicit self-correction was more useful interactively than Qwen3's silent exhausted thinking budget. On the analysis task, Qwen3 produced the most thorough and structured response when thinking was disabled — the base model quality is high when it actually produces output.

On the thinking-mode tradeoff

Thinking mode is a quality multiplier. But on hardware where generation is slow, it is also a latency multiplier — and one that applies silently before you see any output. The practical rule for APU-class hardware:

  • Interactive use: Disable thinking (think: false or /no_think). You get the answer faster, and for most tasks the quality loss is modest.
  • Batch / overnight analysis: Enable thinking and set a high token budget. You submit the job, come back later, get a more thorough answer.

LFM2 sidesteps this entirely: its architecture doesn't separate thinking from output, so you see the reasoning in real-time as it generates.


Summary: When to Use Each Model

| Scenario | Best choice | Why |
| --- | --- | --- |
| Interactive research, iterative queries | lfm2:24b-a2b | Fastest, transparent reasoning, good synthesis |
| Hard reasoning, batch mode | deepseek-r1:14b | Dedicated reasoner; worth the wait |
| Thorough structured analysis, batch mode | qwen3:30b-a3b (thinking enabled) | Best analysis quality when given token budget |
| Code generation | qwen3-coder:30b-a3b-q4_K_M | Specialised fine-tune, 12 TPS |
| Long document analysis | lfm2:24b-a2b | SSM avoids KV cache growth at long context |
| Uncensored / sensitive research | qwen3.5-abliterated:35b-a3b-q4_K | No guardrails, 4.65 TPS |
| Quick simple queries | qwen3:8b | 5.3 TPS, low overhead |

The key finding

Speed and quality are not independent on bandwidth-constrained hardware. A model that is four times slower doesn't just cost you time — it changes the nature of the interaction. Thinking modes that are near-free on a GPU become minutes-long commitments on an APU, turning iterative exploration into batch processing. LFM2's hybrid SSM+MoE architecture produces the best combination of speed and quality for interactive use on this hardware, not because it is the most capable model in isolation, but because it delivers its capability at a speed that keeps research workflows fluid.


Appendix: Hardware and Setup

Matt-Mini:
- CPU: AMD Ryzen 7 5800U (Zen 3, no AVX_VNNI)
- iGPU: AMD Radeon Vega 8 (shared DDR4, ~50 GB/s)
- RAM: 64GB DDR4-3200
- Inference: Ollama 0.20.6 + Vulkan backend

API note: Thinking can be disabled per-request via "think": false in the Ollama generate payload, or by appending /no_think to the prompt for Qwen3 models. DeepSeek-R1 respects think: false at the API level.
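As a rough illustration of the per-request switch, here is a minimal Python sketch that builds a non-streaming request for Ollama's /api/generate endpoint with thinking disabled. The URL assumes Ollama's default local port, and the helper names and the prompt are hypothetical; only the payload construction runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local endpoint


def build_generate_payload(model: str, prompt: str, think: bool, max_tokens: int) -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": think,                          # False skips the hidden reasoning phase
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }


def generate(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_generate_payload(
    "qwen3:30b-a3b", "Summarise MoE in two sentences.", think=False, max_tokens=1000
)
print(json.dumps(payload, indent=2))
# With a server running: print(generate(payload)["response"])
```

For batch runs the same payload with think set to True and a larger num_predict matches the "submit and come back later" pattern described earlier.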

Tested April 2026.

Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware

The Hardware

I benchmarked two very different local AI setups:

Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)

Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM

The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.


The Benchmark Setup

I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:

  • Short: "What is 2 + 2? Answer in one word." — tests base throughput
  • Reasoning: A multi-step maths problem — tests sustained generation
  • Coding: Fibonacci with memoization in Python — tests structured output

Metric: tokens per second (TPS) for generation.
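One way to derive this metric, sketched in Python: Ollama's final response object reports eval_count (tokens generated) and eval_duration (nanoseconds spent generating), so generation TPS is their ratio. This is not necessarily how benchmark_ollama.py does it, and the sample values below are made up for illustration rather than measured.

```python
def generation_tps(response: dict) -> float:
    """Tokens per second for the generation phase of one Ollama response."""
    eval_count = response["eval_count"]           # tokens generated
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def average_tps(responses: list[dict]) -> float:
    """Mean of the per-prompt TPS values (one response per prompt)."""
    rates = [generation_tps(r) for r in responses]
    return sum(rates) / len(rates)


# Illustrative values only (NOT measurements): short, reasoning, coding prompts.
sample = [
    {"eval_count": 12, "eval_duration": 1_000_000_000},
    {"eval_count": 600, "eval_duration": 50_000_000_000},
    {"eval_count": 360, "eval_duration": 30_000_000_000},
]
print(f"{average_tps(sample):.1f} TPS")
```

Using eval_duration rather than wall-clock time keeps model load and prompt processing out of the figure.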


Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)

Model Architecture Comparison (Q4_K_M unless noted)

| Model | Avg TPS | Total Params | Active Params | Type |
| --- | --- | --- | --- | --- |
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |

Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU

The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 uses 2× the memory bandwidth of Q4_K_M with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:

| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
| --- | --- | --- | --- |
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |

Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
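The percentages depend on which direction you measure. A short Python sketch, using only the TPS values from the table above, shows both "Q4_K_M is X% faster" and the equivalent "Q8_0 is Y% slower" (function names are illustrative):

```python
def percent_faster(fast_tps: float, slow_tps: float) -> float:
    """How much faster the fast run is, relative to the slow run."""
    return (fast_tps / slow_tps - 1.0) * 100.0


def percent_slower(fast_tps: float, slow_tps: float) -> float:
    """How much slower the slow run is, relative to the fast run."""
    return (1.0 - slow_tps / fast_tps) * 100.0


# (Q4_K_M TPS, Q8_0 TPS) as measured on Matt-Mini.
measured = {
    "qwen3-coder:30b-a3b": (12.1, 7.73),
    "qwen3.5-abliterated:35b-a3b": (4.65, 3.83),
}

for model, (q4, q8) in measured.items():
    print(
        f"{model}: Q4_K_M is {percent_faster(q4, q8):.0f}% faster, "
        f"Q8_0 is {percent_slower(q4, q8):.0f}% slower"
    )
```

The two framings describe the same measurements, so it is worth stating which one a quoted percentage refers to.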


Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)

General and Reasoning Models

| Model | Avg TPS | Params | Notes |
| --- | --- | --- | --- |
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |

Vision Models (for ComfyUI and multimodal workflows)

| Model | Avg TPS | VRAM | Notes |
| --- | --- | --- | --- |
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance: fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse: short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size: VRAM load overhead |

The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.


The Key Finding: Active Parameters Are What Matter

The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.

This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.

The rule that emerges from these results:

MoE speed advantage only materialises when active parameter count is kept low.

Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at roughly the same speed as the dense 14B model (3.5 vs 3.1 TPS), because the active compute is similar.

qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.


The Two Hardware Stories

Discrete GPU: Fast but VRAM-limited

The RTX 4070 hits 35–163 TPS for most models that fit comfortably in 8GB VRAM. It's fast, and bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, on par with the dense 14B model on the APU. The discrete GPU is a sprinter with a hard wall.

Shared-Memory APU: Slow but capacious

The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.

Where they meet

When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.

The MoE Sweet Spot for APUs

Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.
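The "less bandwidth per token" argument can be turned into a back-of-envelope ceiling: if every active weight must be read from memory once per token, TPS cannot exceed bandwidth divided by bytes read per token. The Python sketch below applies that to the models benchmarked above. The 4.8 bits-per-weight figure for Q4_K_M is an assumption (a rough average; real files vary), and the estimate ignores KV cache reads, routing overhead, and compute, so treat it as an upper bound rather than a prediction.

```python
def bandwidth_bound_tps(active_params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on TPS if every active weight is read from memory once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 50.0      # DDR4 figure quoted in the text
Q4_BITS_PER_WEIGHT = 4.8   # ASSUMPTION: rough average for Q4_K_M

# (active parameters in billions, measured TPS from the results table)
models = {
    "qwen3:30b-a3b": (3.0, 12.0),
    "qwen3:8b": (8.0, 5.3),
    "mixtral:8x7b": (12.9, 3.5),
    "deepseek-r1:14b": (14.0, 3.1),
}

for name, (active_b, measured_tps) in models.items():
    bound = bandwidth_bound_tps(active_b, Q4_BITS_PER_WEIGHT, BANDWIDTH_GB_S)
    print(f"{name}: bound {bound:.1f} TPS, measured {measured_tps} TPS")
```

Under these assumptions the measured speeds rank in the same order as the bounds, which is consistent with active parameter count, not total size, driving throughput on this machine.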


Practical Recommendations

For AMD APU systems with 32GB+ unified memory (Ryzen 7 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0: Q4_K_M is 21–57% faster on bandwidth-limited DDR4 (equivalently, Q8_0 is 18–36% slower), and AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time

For discrete GPU systems (e.g. RTX 4070 8GB, Intel Core Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model


Tools Used

  • Ollama — inference server, Vulkan backend
  • llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
  • benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types

Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).

ADS1115 16bit 4ch ADC with PGA

The ADS1115 is a 4-channel delta-sigma ADC with a programmable gain amplifier (PGA), capable of 16-bit resolution at up to 860 samples/second over I2C. The inputs can be configured as four single-ended channels or as two differential channels.
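To make the PGA and 16-bit figures concrete, here is a small Python sketch of the conversion arithmetic (plain arithmetic, not code from the ADS1x15 Arduino library). The conversion result is a signed 16-bit value, so one count is the selected full-scale range divided by 32768; the function names are illustrative.

```python
# Full-scale ranges (in volts, plus or minus) selectable through the ADS1115 PGA.
PGA_FULL_SCALE_V = (6.144, 4.096, 2.048, 1.024, 0.512, 0.256)


def lsb_microvolts(full_scale_v: float) -> float:
    """Size of one count in microvolts for a given full-scale range."""
    return full_scale_v / 32768.0 * 1e6


def raw_to_volts(raw: int, full_scale_v: float) -> float:
    """Convert a signed 16-bit conversion result to volts."""
    if not -32768 <= raw <= 32767:
        raise ValueError("raw must be a signed 16-bit value")
    return raw * full_scale_v / 32768.0


for fsr in PGA_FULL_SCALE_V:
    print(f"+/-{fsr} V range: {lsb_microvolts(fsr):.4f} uV per count")

print(raw_to_volts(16384, 4.096))  # half of positive full scale on the 4.096 V range
```

Picking the smallest range that still covers the expected signal gives the finest step size.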

The library is called ADS1x15 and can be installed in Library Manager of the Arduino IDE: https://www.dropbox.com/s/1v2qd45e4np8oc0/ADS1X15-master.zip


Connecting and Configuring HC-05 Bluetooth Module

Power the Bluetooth module and connect it to a serial port on your computer. My connections using an FTDI TTL-232R-3V3 cable are:

| HC-05 | TTL-232R-3V3 |
| --- | --- |
| VCC | VCC (Red) |
| GND | GND (Black) |
| RXD | TXD (Orange) |
| TXD | RXD (Yellow) |

Remember that the logic levels of the HC-05 are not natively 5V tolerant, so a level converter should be used to avoid letting out the magic smoke. My cable uses 3V3 logic so a converter is not needed.

Voltage Notation

As an aside, I have long followed the convention of putting the V for voltage where the decimal point would go. I find this communicates the voltage quickly and clearly where a decimal point could be missed or smudged; I grew up when CAD was relatively new and expensive and drawings were completed by hand. I understand this could be confusing, hence this quick explanation.

| Voltage | Equivalent |
| --- | --- |
| 2.5V | 2V5 |
| 3.3V | 3V3 |
| 5V or 5.0V | 5V |
| 12V | 12V |

Configuration

Now that the connections have been made, the LED on the HC-05 should be blinking. A short blink every two seconds suggests the module is in AT command mode; with any other pattern, AT commands will not work.
Commands are case sensitive.

Serial settings: 38400-8-N-1
EOL (End of Line): Both NL and CR




AT commands take a similar form for reading and writing parameters. The difference is that to write a parameter you append =NewParameterValue after the AT command. For example:

| Read Parameter | Write Parameter |
| --- | --- |
| AT+NAME | AT+NAME=NewName |

Where NewName is the new name of the module.
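The read and write forms, together with the serial settings given earlier (38400-8-N-1, CR plus NL line ending), can be sketched in Python as follows. The formatting helper runs on its own; send_at relies on the third-party pyserial package, and the port name in the final comment is a hypothetical placeholder for whatever your system assigns to the cable.

```python
from typing import Optional


def build_at_command(name: str, value: Optional[str] = None) -> bytes:
    """Format an AT command: read form without a value, write form with '=value'."""
    command = f"AT+{name}" if value is None else f"AT+{name}={value}"
    return command.encode("ascii") + b"\r\n"  # EOL is both CR and NL


def send_at(port: str, command: bytes, timeout_s: float = 1.0) -> bytes:
    """Send one command at 38400-8-N-1 and return the module's first reply line."""
    import serial  # pyserial, third-party; imported here so the helper above works without it

    with serial.Serial(
        port, baudrate=38400, bytesize=8, parity="N", stopbits=1, timeout=timeout_s
    ) as link:
        link.write(command)
        return link.readline()


print(build_at_command("NAME"))
print(build_at_command("NAME", "NewName"))
# With the module attached (port name is hypothetical):
# print(send_at("COM5", build_at_command("VERSION")))
```

Because commands are case sensitive, the helper passes the name through unchanged rather than normalising it.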

Read/Write PIN Code
AT+PSWD

Check firmware version
AT+VERSION

Check module mode
AT+ROLE
0 Slave
1 Master
2 Slave loopback

AT Command List
AT+RESET will exit AT mode.

Establishing A Bluetooth Connection 

The Bluetooth module appears as JDY-30 when adding a Bluetooth device to your system. The default pairing code is 1234. Drivers will be installed to create a virtual serial port for the connection.