15 April 2026

Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio

If you've been following along with my local AI setup, you'll know I run most of my services in Podman containers on a home server — Ollama, Open WebUI, Whisper, and a handful of other tools. One of those is Kokoro FastAPI, a self-hosted text-to-speech server based on the Kokoro-82M ONNX model. It produces surprisingly good speech, supports multiple voices and languages, and exposes an OpenAI-compatible API that plugs straight into Open WebUI.

This post covers a productive session in which a simple Firefox bug turned into a full streaming-pipeline investigation — with benchmarks, a duplicate container sandbox, and a fix that meaningfully reduced time-to-first-audio for conversational use cases.


The Firefox Bug

First things first: the web UI at kokoro-web.lan worked fine in Chrome but threw this in Firefox when you clicked Generate Speech:

MediaSource.addSourceBuffer: Type not supported in MediaSource

The culprit was a single line in AudioService.js:

this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

Firefox simply does not support audio/mpeg in the MediaSource Extensions (MSE) API. Chrome does. The fix was to check for support first, and fall back to a simpler approach when MSE isn't available:

if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
    await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
    return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src — no MSE required, works everywhere. The patched file is saved locally and injected via podman cp rather than rebuilding the image.


Benchmarking: Does Format or Voice Matter?

With the Firefox issue sorted, I ran a proper latency benchmark across the three supported output formats and three voices, using a consistent test phrase:

"I love mediclinic, but I think there is a lot of scope for the EHR development to go awry."

Three runs per combination, stream: false, measured with Python's time.perf_counter().
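For reproducibility, the harness looked roughly like this — a sketch with a hypothetical make_request helper; the /v1/audio/speech path and payload fields follow the OpenAI-compatible convention, and the base URL is my .lan hostname, so adjust both for your deployment:

```python
import json
import time
import urllib.request

BASE = "http://kokoro-web.lan"
PHRASE = ("I love mediclinic, but I think there is a lot of scope "
          "for the EHR development to go awry.")

def make_request(voice, fmt):
    # Build a non-streaming synthesis request for one voice/format combination
    body = json.dumps({
        "model": "kokoro",
        "input": PHRASE,
        "voice": voice,
        "response_format": fmt,
        "stream": False,
    }).encode()
    return urllib.request.Request(
        f"{BASE}/v1/audio/speech", data=body,
        headers={"Content-Type": "application/json"})

def time_once(voice, fmt):
    # Measure wall-clock latency until the full response body has arrived
    t0 = time.perf_counter()
    with urllib.request.urlopen(make_request(voice, fmt)) as resp:
        audio = resp.read()
    return round((time.perf_counter() - t0) * 1000), len(audio)
```

Each (voice, format) pair was timed three times and averaged.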

By format (averaged across all voices)

Format | Avg latency | File size
WAV    | 1382 ms     | ~256 KB
PCM    | 1417 ms     | ~256 KB
MP3    | 1457 ms     | ~86 KB

By voice (averaged across all formats)

Voice    | Description             | Avg latency
af_heart | American English female | 1379 ms
bm_fable | British English male    | 1439 ms
ef_dora  | Dutch female            | 1438 ms

The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates — everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. MP3 is still the right default for web playback given its file size advantage. The Dutch voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.


The Streaming Mystery

The Kokoro API has a stream: true parameter. For a conversational application, this should mean the server sends the first sentence's audio while it's still generating the second — reducing perceived latency significantly. I modified the test phrase to have two clear sentences:

"I love mediclinic. But I think there is a lot of scope for the EHR development to go awry."

Then I wrote a Python probe to track exactly when each 1 KB chunk arrived at the client:

import time
import urllib.request

# req is the POST request to the speech endpoint, built with stream: true
t_start = time.perf_counter()
chunks = []
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))

print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk:  {chunks[-1][0]}ms")
print(f"Chunks:      {len(chunks)}")

Results for stream: true, af_heart, MP3:

First chunk: 1462ms
Last chunk:  1464ms
Chunks: 89

All 89 chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical. Even PCM format — which has zero encoder overhead — showed the same pattern. Something was buffering the entire audio before sending a single byte.


The Investigation

I spun up a duplicate container, kokoro-stream, on port 8881 as an isolated sandbox, and set about tracing the pipeline. The server code is actually well-architected: async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer. The StreamingResponse even sets X-Accel-Buffering: no. On paper, it should stream.

I identified three hypotheses:

ID | Hypothesis                                          | Evidence
H1 | ONNX inference batches both sentences as one call   | PCM (no encoder) also shows simultaneous delivery
H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields
H3 | PyAV MP3 encoder buffers early frames               | Secondary; can't explain PCM behaviour

What the code actually does

Inside tts_service.py, smart_split() splits the input text into chunks before inference — good. But it batches sentences together when their combined token count is under 250 tokens. The two-sentence test input is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().
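A simplified sketch of that batching behaviour (hypothetical names, with a word-count stand-in for the real tokeniser) shows why short conversational inputs collapse into a single chunk:

```python
def batch_sentences(sentences, max_tokens=250, n_tokens=lambda s: len(s.split())):
    """Greedy batching: merge consecutive sentences while under the token budget."""
    batch, count = [], 0
    for s in sentences:
        t = n_tokens(s)
        if batch and count + t > max_tokens:
            yield " ".join(batch)
            batch, count = [], 0
        batch.append(s)
        count += t
    if batch:
        yield " ".join(batch)

# Both test sentences together sit far below 250 tokens,
# so they come out as one combined chunk for inference.
chunks = list(batch_sentences([
    "I love mediclinic.",
    "But I think there is a lot of scope for the EHR development to go awry.",
]))
```

With a budget that generous, sentence-level streaming never gets a chance for typical conversational turns.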

Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+' — meaning it would only split on newlines. Since there were no newlines, both sentences went through a single ONNX inference call and produced a single audio yield. No amount of async wiring downstream could fix that.

Even if the sentences had been processed separately, the for result in pipeline(...) loop is synchronous — it never returns control to the asyncio event loop between sentences, so the HTTP layer has no opportunity to flush.

The fix

Two minimal changes to kokoro-stream only:

inference/kokoro_v1.py — change the pipeline split pattern to break on sentence-ending punctuation:

# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'
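The difference is easy to check in a Python REPL: with the old pattern the two-sentence test input stays in one piece, while the new one splits it cleanly at the sentence boundary:

```python
import re

text = ("I love mediclinic. But I think there is a lot of scope "
        "for the EHR development to go awry.")

# Old pattern: splits only on newlines, so the whole input is one chunk
old_chunks = re.split(r'\n+', text)           # 1 chunk

# New pattern: split after sentence-ending punctuation followed by whitespace
new_chunks = re.split(r'(?<=[.!?])\s+', text)  # 2 sentences
```

The lookbehind keeps the punctuation attached to its sentence, so the audio chunks correspond to natural prosodic units.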

inference/kokoro_v1.py and services/tts_service.py — add asyncio yield points between sentence yields:

yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush
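To see why that zero-duration sleep matters, here's a self-contained toy model of the situation: a "flush" task scheduled after each chunk can only run because the generator awaits asyncio.sleep(0) between yields:

```python
import asyncio

events = []

async def sentence_stream():
    # Stand-in for the TTS pipeline: one audio chunk per sentence
    for audio in (b"sentence-1", b"sentence-2"):
        yield audio
        await asyncio.sleep(0)  # hand control back to the event loop

async def flusher():
    # Stand-in for the HTTP layer flushing buffered bytes to the socket
    events.append("flush")

async def main():
    async for chunk in sentence_stream():
        events.append(chunk)
        # Schedule a "flush"; it only runs when the generator yields
        # control to the event loop via asyncio.sleep(0)
        asyncio.get_running_loop().create_task(flusher())

asyncio.run(main())
print(events)  # → [b'sentence-1', 'flush', b'sentence-2', 'flush']
```

Remove the sleep(0) and both flushes pile up at the end — exactly the all-chunks-at-once pattern the probe observed.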

Before and after

Metric             | Before   | After
First chunk (TTFA) | ~1400 ms | ~575 ms
Last chunk         | ~1400 ms | ~1400 ms
First-to-last gap  | ~2 ms    | ~1100 ms

First sentence audio now arrives at the client at ~575 ms while the second sentence is still being synthesised. Total generation time is unchanged — we're not making the model faster, we're just not making the user wait for everything before delivering anything.


Setup

Both containers are now accessible via .lan hostnames using Caddy as a reverse proxy:

URL                       | Container     | Port | Notes
https://kokoro-web.lan    | kokoro-tts    | 8880 | Production
https://kokoro-stream.lan | kokoro-stream | 8881 | Streaming-optimised

Open WebUI is configured to use the production container at port 8880. The streaming container is available for direct use and API calls where lower TTFA matters.
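For reference, the Caddy side amounts to a few lines per hostname. A sketch, assuming the containers publish ports 8880 and 8881 on the Caddy host, with tls internal issuing local certificates for the .lan names:

```
kokoro-web.lan {
    tls internal
    reverse_proxy localhost:8880
}

kokoro-stream.lan {
    tls internal
    reverse_proxy localhost:8881
}
```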


Reflections

A few things worth noting from this session:

The architecture was already correct. The Kokoro FastAPI codebase uses async generators properly throughout — the issue wasn't bad design, it was two small configuration defaults that compounded badly for short inputs. The token batching threshold (250 tokens) and the newline-only split pattern made sense in isolation but combined to eliminate sentence-level streaming entirely for typical conversational inputs.

PCM as a diagnostic tool. Benchmarking PCM format (raw samples, no encoding) alongside MP3 was valuable precisely because it let us eliminate the audio encoder as a suspect early. When PCM and MP3 showed identical behaviour, we knew the bottleneck was upstream of the encoder.

asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn't actually sleep — it just yields control back to the event loop. That's enough to let uvicorn flush pending response bytes to the socket. It's a one-line fix with a meaningful impact on perceived latency.

The full benchmark data, pipeline analysis, and change logs are all documented if you want to replicate this setup.


Running Podman on Ubuntu 24.04. Kokoro FastAPI image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Voices used: af_heart, bm_fable, ef_dora.
