26 June 2026

Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware

The Hardware

I benchmarked two very different local AI setups:

Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)

Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM

The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.


The Benchmark Setup

I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:

  • Short: "What is 2 + 2? Answer in one word." — tests base throughput
  • Reasoning: A multi-step maths problem — tests sustained generation
  • Coding: Fibonacci with memoization in Python — tests structured output

Metric: tokens per second (TPS) for generation.


Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)

Model Architecture Comparison (Q4_K_M unless noted)

| Model | Avg TPS | Total Params | Active Params | Type |
|---|---|---|---|---|
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |

Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU

The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 stores roughly twice as many bits per weight as Q4_K_M, so each generated token moves roughly twice as much data through memory, with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:

| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
|---|---|---|---|
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |

Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
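
The "Q4 faster by" column is just the ratio of the two throughput figures. A minimal sketch of that arithmetic, using the numbers from the table above (the function name is illustrative, not part of the benchmark script):

```python
def speedup_percent(fast_tps: float, slow_tps: float) -> int:
    """Return how much faster fast_tps is than slow_tps, as a whole percent."""
    return round((fast_tps / slow_tps - 1) * 100)

# Q4_K_M vs Q8_0 throughput from the table above
print(speedup_percent(12.1, 7.73))   # qwen3-coder:30b-a3b -> 57
print(speedup_percent(4.65, 3.83))   # qwen3.5-abliterated:35b-a3b -> 21
```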


Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)

General and Reasoning Models

| Model | Avg TPS | Params | Notes |
|---|---|---|---|
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |

Vision Models (for ComfyUI and multimodal workflows)

| Model | Avg TPS | VRAM | Notes |
|---|---|---|---|
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance: fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse: short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size: VRAM load overhead |

The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.


The Key Finding: Active Parameters Are What Matter

The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.

This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.

The rule that emerges from these results:

MoE speed advantage only materialises when active parameter count is kept low.

Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at roughly the same speed as the dense 14B model (3.5 vs 3.1 TPS), because the active compute is similar.

qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
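
A back-of-envelope way to see why active parameters dominate: if every active weight has to be read from memory once per token, memory bandwidth sets a ceiling on tokens per second. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M (an approximation) and ignores KV-cache traffic, compute time, and runtime overhead, so it gives an upper bound rather than a prediction. The function name is illustrative.

```python
def bandwidth_ceiling_tps(bandwidth_gb_s: float, active_params_b: float,
                          bits_per_weight: float = 4.85) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_per_token = active_params_b * bits_per_weight / 8  # billions of params -> GB
    return bandwidth_gb_s / gb_per_token

DDR4_GB_S = 50  # approximate bandwidth quoted for the Matt-Mini

for name, active_b, measured in [
    ("qwen3:30b-a3b", 3.0, 12.0),
    ("qwen3:8b", 8.0, 5.3),
    ("mixtral:8x7b", 12.9, 3.5),
    ("deepseek-r1:14b", 14.0, 3.1),
]:
    ceiling = bandwidth_ceiling_tps(DDR4_GB_S, active_b)
    print(f"{name:18s} ceiling ~{ceiling:4.1f} TPS, measured {measured} TPS")
```

Under these assumptions the measured figures land at roughly half the ceiling in each case, and the ordering of the models matches the ordering by active parameter count, not by total size.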


The Two Hardware Stories

Discrete GPU: Fast but VRAM-limited

The RTX 4070 hits 35–163 TPS for most models that fit comfortably in 8GB VRAM. It's fast, and bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, the same as the APU's slowest results. The discrete GPU is a sprinter with a hard wall.

Shared-Memory APU: Slow but capacious

The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.

Where they meet

When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.

The MoE Sweet Spot for APUs

Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.


Practical Recommendations

For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0 — Q8_0 is 20–57% slower on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time

For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model
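
The sizing rule for discrete GPUs (keep the model under ~7.5GB on an 8GB card) can be written as a simple check. This is a sketch only: the 0.5GB headroom is derived from that rule of thumb, and the example sizes are illustrative rather than measured.

```python
def placement(model_gb: float, vram_gb: float = 8.0, headroom_gb: float = 0.5) -> str:
    """Classify a model as fitting in VRAM or spilling to system RAM."""
    budget = vram_gb - headroom_gb   # 8 GB card -> ~7.5 GB usable
    return "fits in VRAM" if model_gb <= budget else "spills to system RAM"

for size_gb in (4.7, 6.6, 17.0):     # illustrative model sizes, not measurements
    print(f"{size_gb:>5.1f} GB -> {placement(size_gb)}")
```

Anything classified as spilling is a candidate for the APU instead.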


Tools Used

  • Ollama — inference server, Vulkan backend
  • llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
  • benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types

Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).

LiteLLM + Agent Teams: A Practical Guide

An aide memoire for using the local AI infrastructure day-to-day.


The big picture

You have three layers:

Your task (plain English)
        ↓
  Agent team (Python, OpenAI Agents SDK)
        ↓
  LiteLLM proxy  ←→  Ollama (local GPU)
                 ←→  OpenRouter (cloud free)
                 ←→  Anthropic (claude-haiku)

LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model is running locally on your GPU or fetched from a cloud provider. Your code never changes — only the model name string changes.

The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.


Part 1 — Using LiteLLM directly

From the command line (curl)

# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

From Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",   # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)

Choosing a model

| Use case | Model string | Where it runs |
|---|---|---|
| Quick questions, triage | qwen3.5:4b | Local GPU (3.4 GB) |
| Writing code | qwen2.5-coder:7b | Local GPU (4.7 GB) |
| General analysis | qwen3.5 | Local GPU (6.6 GB) |
| Images / screenshots | qwen3-vl | Local GPU (6.1 GB) |
| Heavy reasoning | nemotron-120b | Cloud free (OpenRouter) |
| Reliable tool calling | claude-haiku-4-5 | Cloud (Anthropic/OpenRouter) |
| Best available free | free | Cloud free (auto-routed) |

Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:

| Alias | Primary | Fallback |
|---|---|---|
| fast | qwen3.5:4b | qwen2.5-coder:1.5b |
| coder | qwen2.5-coder:7b | qwen2.5-coder:1.5b |
| local | qwen3.5 | llama3.1 |
| reasoning | nemotron-120b | gpt-oss-120b |
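
Conceptually, an alias is an ordered list of models tried in turn. The real fallback happens inside the LiteLLM proxy and is driven by its config, so the sketch below is only an illustration of the behaviour; `call_with_fallback`, `call_model`, and `flaky_call` are hypothetical names, not LiteLLM APIs.

```python
from typing import Callable

# Alias -> ordered candidates, copied from the table above
ALIASES = {
    "fast": ["qwen3.5:4b", "qwen2.5-coder:1.5b"],
    "coder": ["qwen2.5-coder:7b", "qwen2.5-coder:1.5b"],
    "local": ["qwen3.5", "llama3.1"],
    "reasoning": ["nemotron-120b", "gpt-oss-120b"],
}

def call_with_fallback(alias: str, prompt: str,
                       call_model: Callable[[str, str], str]) -> str:
    """Try each model behind an alias in order; return the first successful reply."""
    last_error = None
    for model in ALIASES[alias]:
        try:
            return call_model(model, prompt)
        except Exception as exc:      # e.g. timeout or model unavailable
            last_error = exc
    raise RuntimeError(f"all models for alias {alias!r} failed") from last_error

def flaky_call(model: str, prompt: str) -> str:
    """Stand-in backend: the primary 'fast' model is busy, the fallback answers."""
    if model == "qwen3.5:4b":
        raise TimeoutError("model busy")
    return f"{model}: ok"

print(call_with_fallback("fast", "hello", flaky_call))
```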

Health check

curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs

Part 2 — Running the agent team

The one-liner

cd /home/user/claude/agents
.venv/bin/python team.py "your task here"

Example tasks

# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"

What happens under the hood

You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
  "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
  "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you

The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.

Watching it work

Add LITELLM_LOG=DEBUG to see every model call:

LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"

Or watch the LiteLLM proxy logs live in another terminal:

incus exec litellm -- journalctl -u litellm -f

Part 3 — Writing your own agents

Minimal single agent

import asyncio, os
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"]  = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())

Adding tools (things agents can do)

from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]   # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)

Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.
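
To see why the docstring and type hints matter, here is a rough sketch of how a tool description could be derived from a plain function. It is not the Agents SDK's actual implementation; `describe_tool` is a hypothetical helper that only illustrates which parts of your function end up in front of the model.

```python
import inspect

def describe_tool(func) -> dict:
    """Build a simple tool description from a function's signature and docstring."""
    sig = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: getattr(p.annotation, "__name__", str(p.annotation))
            for name, p in sig.parameters.items()
        },
    }

def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    return ""   # body omitted; only the signature and docstring matter here

print(describe_tool(get_url))
```

A vague docstring or a missing type hint shows up directly as a vague tool description, which makes it harder for the model to decide when and how to call the tool.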

Handing off between agents

import asyncio
from agents import Agent, Runner, handoff

specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

async def main():
    # await is only valid inside an async function
    result = await Runner.run(orchestrator, "Analyse this data: ...")
    print(result.final_output)

asyncio.run(main())

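handoff() is exposed to the orchestrator as a tool it can call. When it calls it, execution transfers to the specialist, which then owns the conversation. Control only comes back to the orchestrator if the specialist has its own handoff to it; otherwise the specialist's reply becomes the final output.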

The existing tools you can reuse

gpu_tools.py — for any agent that needs to know about the GPU:

from gpu_tools import vram_status, list_local_models, comfyui_status
agent = Agent(..., tools=[vram_status, list_local_models])

devops_tools.py — for agents that manage containers:

from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl
agent = Agent(..., tools=[container_run, http_probe])

Part 4 — Practical patterns

Pattern 1: Quick one-shot query

Use make_client() from litellm_client.py directly — no agent overhead:

from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

Pattern 2: Task with a deadline / retry limit

result = await Runner.run(agent, task, max_turns=10)

max_turns prevents infinite loops. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.

Pattern 3: Streaming output

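from agents import Runner
from openai.types.responses import ResponseTextDeltaEvent

# Inside an async function; run_streamed() returns a result object, not an iterator
result = Runner.run_streamed(agent, task)
async for event in result.stream_events():
    # Raw response events carry the token-level text deltas
    if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
        print(event.data.delta, end="", flush=True)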

Pattern 4: DevOps / automation agent

See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work

agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",   # use haiku: local models are unreliable at tool-calling
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)

Part 5 — Gotchas and tips

Local models can't do structured tool-calling

qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).
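
If you want to measure how often a given model garbles its tool calls, a small validator is enough. This is a sketch under the assumption that tool arguments arrive as a JSON string; `parse_tool_arguments` is a hypothetical helper, not part of the Agents SDK.

```python
import json
from typing import Optional

def parse_tool_arguments(raw: str, required: set) -> Optional[dict]:
    """Return parsed arguments if raw is a JSON object with all required keys."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return None                       # not JSON at all
    if not isinstance(args, dict) or not required.issubset(args):
        return None                       # wrong shape or missing keys
    return args

print(parse_tool_arguments('{"url": "https://example.com"}', {"url"}))  # valid
print(parse_tool_arguments('get_url(url="https://example.com")', {"url"}))  # None
print(parse_tool_arguments('{"link": "https://example.com"}', {"url"}))  # None
```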

Only one large model fits in VRAM at a time

The RTX 4070 has 8 GB. If you ask the orchestrator to hand off to a 6.6 GB local model while another 4.7 GB model is loaded, Ollama unloads the first one. There is a ~5–15 second cold-load delay. This is normal.

Free cloud models are rate-limited

nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.

The free model alias changes

openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.

Ollama keep-alive

Models stay in VRAM for 15 minutes after last use (OLLAMA_KEEP_ALIVE=15m). If you want to free VRAM immediately:

curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'
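
The same unload request can be built from Python with only the standard library, which is handy inside an agent tool. A minimal sketch, assuming the Ollama host used throughout this guide; `build_unload_request` is an illustrative name, and the send is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

def build_unload_request(model: str,
                         host: str = "http://10.140.20.1:11434") -> urllib.request.Request:
    """Build the POST that asks Ollama to unload a model (keep_alive = 0)."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_unload_request("qwen3.5")
print(req.get_method(), req.full_url, req.data.decode())
# urllib.request.urlopen(req)  # uncomment to actually send it
```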

Part 6 — Agent Team in Open WebUI

The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.

Open WebUI chat
      ↓  (selects "Agent Team" model)
Pipelines server  (host: 10.140.20.1:9099)
      ↓
Agent orchestrator (claude-haiku)
      ↓  handoffs
Specialist agents (local GPU / cloud free)

Architecture files

| File | Purpose |
|---|---|
| agents/pipelines/agent_team.py | The pipeline class; wraps the agent team |
| agents/run_pipelines.sh | Manual start script |
| /etc/systemd/system/owui-pipelines.service | Systemd service (starts on boot) |

Managing the pipelines server

sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f

Connecting to Open WebUI (one-time setup)

  1. Open http://localhost:3001
  2. Top-right avatar → Admin Panel
  3. Settings → Connections → Pipelines
  4. Add a connection with:
     - URL: http://10.140.20.1:9099
     - API Key: 0p3n-w3bu!
  5. Click Save. "Agent Team" now appears in the model picker

Using it

Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.

The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.

Adding more pipelines

Drop a new .py file with a Pipeline class into agents/pipelines/, then:

sudo systemctl restart owui-pipelines

The new pipeline appears as a model in Open WebUI immediately.


Quick reference card

# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart

File map

/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY

CPU vs. GPU: Is Hardware Acceleration Always Faster for Real-Time TTS?

Following up on my last post about fixing progressive streaming in Kokoro FastAPI, I decided to take things a step further. If the goal is minimizing latency for a conversational AI assistant, shouldn't throwing a dedicated GPU at the problem make it even faster?

I spent the afternoon duplicating my streaming container and configuring it to run on a local NVIDIA GeForce RTX 4070 (8GB). The results were... surprising. It turns out that for real-time, sentence-by-sentence streaming, "faster" hardware doesn't always translate to a better user experience.


The Setup: Moving to Incus and CUDA

While my previous tests were in Podman, I've recently moved to Incus for better resource management. I duplicated the kokoro-stream container to a new sandbox named kokoro-stream-gpu and passed through the GPU:

incus config device add kokoro-stream-gpu mygpu gpu uid=1000 gid=1000
incus config set kokoro-stream-gpu nvidia.runtime true
incus config set kokoro-stream-gpu nvidia.driver.capabilities compute,utility,video

Inside the container, I switched the backend from the ONNX CPU runtime to the PyTorch GPU version. I also had to port over the same split_pattern and asyncio.sleep(0) fixes from the last session to ensure I was comparing apples to apples (sentence-level streaming vs. sentence-level streaming).


The Benchmark: Short vs. Long Form

I ran two tests using the British English male voice (bm_fable): one with a short two-sentence phrase (~90 chars) and one with the full text of my last blog post (~8,700 chars).

| Metric | CPU (ONNX) | GPU (RTX 4070) | Speedup |
|---|---|---|---|
| TTFA (Short Text) | ~557 ms | ~508 ms | 1.1x |
| Total Time (Long Text) | ~289 s | ~15 s | 19.2x |
| Throughput (Long Text) | ~30 char/s | ~580 char/s | 19.2x |
| System RAM Usage | 1.21 GiB | 1.92 GiB | - |
| Video RAM (VRAM) | 0 MB | ~850 MB | - |
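
The throughput and speedup rows follow directly from the total times. A quick sketch of the arithmetic, using the rounded figures from the table (so the last digit can differ slightly from the table's values, which were computed from unrounded timings):

```python
TEXT_CHARS = 8700          # approximate length of the long test text

def chars_per_second(total_seconds: float, chars: int = TEXT_CHARS) -> float:
    """Characters synthesised per second of wall-clock time."""
    return chars / total_seconds

cpu_s, gpu_s = 289, 15     # rounded total times from the table
print(f"CPU: {chars_per_second(cpu_s):.0f} char/s")
print(f"GPU: {chars_per_second(gpu_s):.0f} char/s")
print(f"Speedup: {cpu_s / gpu_s:.1f}x")
```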

Reflections: When is the GPU worth it?

The results tell two very different stories depending on what you're doing.

1. Conversational AI (Short Sentences)

If you're building a real-time voice assistant that speaks one or two sentences at a time, the CPU is the clear winner. The Time to First Audio (TTFA) is virtually identical because the overhead of initializing the GPU pipeline eats up any compute gains. For this use case, the GPU is just an expensive way to use more RAM.

2. Long-Form Content (Articles, Blog Posts)

This is where the RTX 4070 absolutely screams. When I threw the full 8,700-character blog post at it, the GPU version finished the entire synthesis in 15 seconds. The CPU version was still grinding away at nearly the 5-minute mark.

At 580 characters per second, the GPU isn't just "faster"—it changes the nature of the service. You can listen to an entire article almost as soon as you click "Generate."

The Verdict

  • Stick with CPU for: Open WebUI, chatbots, home assistants, and low-RAM servers.
  • Switch to GPU for: Audiobook generation, long-form reading, or high-concurrency environments.

The kokoro-stream-gpu container is now my go-to for "reading" long documentation, while the CPU version remains my daily driver for conversational chat.


The Evidence: Benchmarking Code

To keep things evidence-based, here is the Python script used to capture these metrics. It probes the streaming API and measures exactly when the first and last chunks arrive.

1. Throughput & Latency Probe (benchmark_long.py)

import time
import requests

# Ports
GPU_URL = "http://localhost:8881/v1/audio/speech"
CPU_URL = "http://localhost:8882/v1/audio/speech"

# Load long text
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def run_benchmark(name, url):
    """Stream one synthesis request and time the first and last chunk."""
    print(f"\n--- Benchmarking {name} ---")
    start_time = time.perf_counter()
    first_chunk_time = None

    payload = {
        "input": LONG_TEXT,
        "voice": "bm_fable",
        "response_format": "mp3",
        "stream": True
    }

    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            # Record the arrival time of the first non-empty chunk only
            if chunk and first_chunk_time is None:
                first_chunk_time = time.perf_counter() - start_time

    total_time = time.perf_counter() - start_time
    if first_chunk_time is None:
        raise RuntimeError(f"{name}: no audio received")

    return {
        "ttfa_ms": round(first_chunk_time * 1000, 2),
        "total_s": round(total_time, 2),
        "char_s": round(len(LONG_TEXT) / total_time, 2)
    }

if __name__ == "__main__":
    for name, url in (("GPU", GPU_URL), ("CPU", CPU_URL)):
        print(run_benchmark(name, url))

2. Evidence Audio Generation (generate_evidence.py)

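import requests
import hashlib

# Same long text as benchmark_long.py
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def generate_and_hash(url, filename):
    """Synthesise LONG_TEXT, save the audio, and return its MD5 digest."""
    r = requests.post(url, json={"input": LONG_TEXT, "voice": "bm_fable"})
    r.raise_for_status()
    with open(filename, "wb") as f:
        f.write(r.content)
    return hashlib.md5(r.content).hexdigest()

# Results:
# CPU Hash: a22fe5e4d70a2888d755e0f8df7dae8f
# GPU Hash: e5ccba5c22ef3edf594aabaa2c08bb5f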

Running Incus on Ubuntu 24.04. Hardware: NVIDIA GeForce RTX 4070 8GB. Frameworks: ONNX Runtime (CPU) vs. PyTorch 2.6+CUDA 12.4 (GPU).

Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio

If you've been following along with my local AI setup, you'll know I run most of my services in Podman containers on a home server: Ollama, Open WebUI, FasterWhisper, and a handful of other tools. One of those is Kokoro FastAPI, a self-hosted text-to-speech server based on the Kokoro-82M ONNX model. It produces surprisingly good speech, supports multiple voices and languages, and exposes an OpenAI-compatible endpoint that works with Open WebUI.

This post covers a productive session (it makes a change) in which what started as a simple Firefox bug turned into a full streaming pipeline investigation, with benchmarks, a duplicate container sandbox, and a fix that meaningfully reduces time-to-first-audio for conversational use cases.


Firefox MediaSource Bug

First things first: the web UI at kokoro-web.lan worked fine in Chrome but threw this in Firefox when you clicked Generate Speech:

MediaSource.addSourceBuffer: Type not supported in MediaSource

The culprit was this single line in AudioService.js:

this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

Firefox does not support audio/mpeg in the MediaSource Extensions (MSE) API.  Chrome does.  Why not just use Chrome?  Well, I simply prefer Firefox...  The fix was to check for support in the script first, and fall back to a simpler approach when MSE isn't available:

if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
    await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
    return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src — no MSE required, works everywhere. The patched file is saved locally and injected via podman cp rather than rebuilding the image.


Benchmarking: Does Format or Voice Matter?

With the Firefox issue sorted, I ran latency benchmarks across the three supported output formats and three voices, using a consistent test phrase:

"I love Mediclinic, but I think there is a lot of scope for the EHR development to go awry."

Three runs per combination, stream: false, measured with Python's time.perf_counter().

By format (averaged across all voices)

| Format | Avg latency | File size |
|---|---|---|
| WAV | 1382 ms | ~256 KB |
| PCM | 1417 ms | ~256 KB |
| MP3 | 1457 ms | ~86 KB |

By voice (averaged across all formats)

| Voice | Description | Avg latency |
|---|---|---|
| af_heart | American English female | 1379 ms |
| bm_fable | British English male | 1439 ms |
| ef_dora | Spanish female | 1438 ms |

The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates, and everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. MP3 is still the right default for web playback given its file size advantage. The Spanish voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.
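
The methodology (three runs per combination, timed with time.perf_counter()) boils down to a small harness like the one below. This is a sketch rather than the script I used: `average_latency_ms` is an illustrative name, and the sleep is a stand-in so it runs without a TTS server.

```python
import time
from statistics import mean
from typing import Callable

def average_latency_ms(request_fn: Callable[[], object], runs: int = 3) -> float:
    """Call request_fn `runs` times and return the mean wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()                      # e.g. one non-streaming TTS request
        samples.append((time.perf_counter() - start) * 1000)
    return mean(samples)

# Stand-in workload so the sketch runs without a TTS server
print(f"{average_latency_ms(lambda: time.sleep(0.01)):.1f} ms")
```

In a real run, `request_fn` would wrap one POST per format and voice combination.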


When stream: true doesn't stream

The Kokoro API has a stream: true parameter. For a conversational application, this should mean the server sends the first sentence's audio while it's still generating the second — reducing perceived latency significantly. I modified the test phrase to have two clear sentences:

"I love Mediclinic. But I think there is a lot of scope for the EHR development to go awry."

Then I wrote a Python probe to track exactly when each 1 KB chunk arrived at the client:

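import json, time, urllib.request

TEXT = "I love Mediclinic. But I think there is a lot of scope for the EHR development to go awry."
# Assumed endpoint and payload (production container on port 8880)
payload = {"input": TEXT, "voice": "af_heart", "response_format": "mp3", "stream": True}
req = urllib.request.Request("http://localhost:8880/v1/audio/speech",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})

t_start = time.perf_counter()
chunks = []   # (arrival time in ms, chunk size in bytes)
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk: break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))

print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk:  {chunks[-1][0]}ms")
print(f"Chunks: {len(chunks)}")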

Results for stream: true, af_heart, MP3:

First chunk: 1462ms
Last chunk:  1464ms
Chunks: 89

All 89 chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical. Even PCM format — which has zero encoder overhead — showed the same pattern. Something was buffering the entire audio before sending a single byte.


Chunks all arriving at once is not streaming

I spun up a duplicate container, kokoro-stream, on port 8881 as an isolated sandbox, and set about tracing the pipeline.  The server code is actually well-architected: async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer. The StreamingResponse even sets X-Accel-Buffering: no. On paper, it should stream.

I identified three hypotheses:


| ID | Hypothesis | Evidence for |
|---|---|---|
| H1 | ONNX inference batches both sentences as one call | PCM (no encoder) also shows simultaneous delivery |
| H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields |
| H3 | PyAV MP3 encoder buffers early frames | Secondary: can't explain PCM behaviour |

What the code actually does

Inside tts_service.py, smart_split() splits the input text into chunks before inference — good. But it batches sentences together when their combined token count is under 250 tokens. The two-sentence test input is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().
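
The batching behaviour can be sketched as a greedy merge. This is not the actual smart_split() code, just an illustration of the effect described above; `batch_sentences` is a hypothetical function and the character count is a crude stand-in for the real tokeniser.

```python
def batch_sentences(sentences, count_tokens, max_tokens=250):
    """Greedily merge consecutive sentences while the batch stays under max_tokens."""
    batches, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n >= max_tokens:
            batches.append(" ".join(current))   # close the batch before it overflows
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        batches.append(" ".join(current))
    return batches

sentences = ["I love Mediclinic.",
             "But I think there is a lot of scope for the EHR development to go awry."]

# len() is a stand-in token counter, not the real phoneme tokeniser
print(len(batch_sentences(sentences, len)))                 # 1: both merged
print(len(batch_sentences(sentences, len, max_tokens=50)))  # 2: kept separate
```

With a 250-token threshold, any short conversational reply collapses into one batch, which is exactly the case where sentence-level streaming matters most.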

Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+' — meaning it would only split on newlines. Since there were no newlines, both sentences went through a single ONNX inference call and produced a single audio yield. No amount of async wiring downstream could fix that.

Even if the sentences had been processed separately, the for result in pipeline(...) loop is synchronous — it never returns control to the asyncio event loop between sentences, so the HTTP layer has no opportunity to flush.

The fix

Two minimal changes to kokoro-stream only:

inference/kokoro_v1.py — change the pipeline split pattern to break on sentence-ending punctuation:

# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'
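
The difference between the two patterns is easy to check in isolation. The sketch below applies them with re.split to the test phrase; whether the Kokoro pipeline applies the pattern in exactly this way internally is an assumption, but the regex behaviour itself is what matters here.

```python
import re

text = ("I love Mediclinic. But I think there is a lot of scope "
        "for the EHR development to go awry.")

newline_only = re.split(r'\n+', text)            # original pattern
sentence_end = re.split(r'(?<=[.!?])\s+', text)  # patched pattern

print(len(newline_only))   # 1: the whole input is one segment
print(sentence_end)        # two segments, one per sentence
```

One caveat of the new pattern: it also splits after abbreviations such as "e.g." when they are followed by whitespace, which is harmless for streaming but can add a short pause mid-sentence.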

inference/kokoro_v1.py and services/tts_service.py — add asyncio yield points between sentence yields:

yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush

Before and after

| Metric | Before | After |
|---|---|---|
| First chunk (TTFA) | ~1400 ms | ~575 ms |
| Last chunk | ~1400 ms | ~1400 ms |
| Gap | ~2 ms | ~1100 ms |

First sentence audio now arrives at the client at ~575 ms while the second sentence is still being synthesised. Total generation time is unchanged — we're not making the model faster, we're just not making the user wait for everything before delivering anything.


Setup

Both containers are now accessible via .lan hostnames using Caddy as a reverse proxy:

| URL | Container | Port | Notes |
|---|---|---|---|
| https://kokoro-web.lan | kokoro-tts | 8880 | Production |
| https://kokoro-stream.lan | kokoro-stream | 8881 | Streaming-optimised |
Open WebUI is configured to use the production container at port 8880. The streaming container is available for direct use and API calls where lower TTFA matters.


Conclusions

A few things worth noting from this session:

The architecture was already correct. The Kokoro FastAPI codebase uses async generators properly throughout — the issue wasn't bad design, it was two small configuration defaults that compounded badly for short inputs. The token batching threshold (250 tokens) and the newline-only split pattern made sense in isolation but combined to eliminate sentence-level streaming entirely for typical conversational inputs.

PCM as a diagnostic tool. Benchmarking PCM format (raw samples, no encoding) alongside MP3 was valuable precisely because it let us eliminate the audio encoder as a suspect early. When PCM and MP3 showed identical behaviour, we knew the bottleneck was upstream of the encoder.

asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn't actually sleep — it just yields control back to the event loop. That's enough to let uvicorn flush pending response bytes to the socket. It's a one-line fix with a meaningful impact on perceived latency.
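
The effect is easy to reproduce without any TTS code. In this toy sketch (all names are illustrative), a generator stands in for the synthesis loop and a background task stands in for the HTTP layer flushing bytes. Without the sleep, nothing is flushed until every chunk has been produced; with it, each chunk is flushed as soon as it exists.

```python
import asyncio

async def synthesise(sentences, yield_control):
    """Stand-in for the TTS generator: one chunk per sentence."""
    for sentence in sentences:
        yield sentence.upper()          # pretend this is synthesised audio
        if yield_control:
            await asyncio.sleep(0)      # hand control back to the event loop

async def run(yield_control):
    log, pending, done = [], [], False

    async def flusher():
        # Stand-in for the HTTP layer writing pending chunks to the socket
        while not done or pending:
            while pending:
                log.append(f"flush {pending.pop(0)}")
            await asyncio.sleep(0)

    task = asyncio.create_task(flusher())
    async for chunk in synthesise(["a", "b"], yield_control):
        log.append(f"produce {chunk}")
        pending.append(chunk)
    done = True
    await task
    return log

print(asyncio.run(run(False)))  # ['produce A', 'produce B', 'flush A', 'flush B']
print(asyncio.run(run(True)))   # ['produce A', 'flush A', 'produce B', 'flush B']
```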

The full benchmark data, pipeline analysis, and change logs are all documented if you want to replicate this setup.


Running Podman on Ubuntu 24.04. Kokoro FastAPI image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Voices used: af_heart, bm_fable, ef_dora.