Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio
If you've been following along with my local AI setup, you'll know I run most of my services in Podman containers on a home server — Ollama, Open WebUI, Whisper, and a handful of other tools. One of those is Kokoro FastAPI, a self-hosted text-to-speech server based on the Kokoro-82M ONNX model. It produces surprisingly good speech, supports multiple voices and languages, and exposes an OpenAI-compatible API that plugs straight into Open WebUI.
This post covers a productive session where what started as a simple Firefox bug turned into a full streaming pipeline investigation — with benchmarks, a duplicate container sandbox, and a fix that meaningfully reduced time-to-first-audio for conversational use cases.
The Firefox Bug
First things first: the web UI at kokoro-web.lan worked fine in Chrome but threw this error in Firefox when you clicked Generate Speech:
MediaSource.addSourceBuffer: Type not supported in MediaSource
The culprit was a single line in AudioService.js:
this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');
Firefox simply does not support audio/mpeg in the MediaSource Extensions (MSE) API. Chrome does. The fix was to check for support first, and fall back to a simpler approach when MSE isn't available:
if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
    await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
    return;
}
The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src — no MSE required, works everywhere. The patched file is saved locally and injected via podman cp rather than rebuilding the image.
Benchmarking: Does Format or Voice Matter?
With the Firefox issue sorted, I ran a proper latency benchmark across the three supported output formats and three voices, using a consistent test phrase:
"I love mediclinic, but I think there is a lot of scope for the EHR development to go awry."
Three runs per combination, stream: false, measured with Python's time.perf_counter().
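The harness looked roughly like this. It is a sketch, not the exact script: the /v1/audio/speech path and the payload fields follow the OpenAI-compatible schema, and BASE_URL is a placeholder for your own deployment.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8880"  # placeholder; adjust for your deployment

def build_payload(voice: str, response_format: str, text: str,
                  stream: bool = False) -> dict:
    """Request body for the OpenAI-compatible /v1/audio/speech endpoint."""
    return {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
    }

def time_tts(voice: str, response_format: str, text: str) -> float:
    """End-to-end latency in milliseconds for one non-streaming request."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(build_payload(voice, response_format, text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the full body before stopping the clock
    return (time.perf_counter() - t0) * 1000
```

Each voice/format pair was measured three times and the runs averaged.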
By format (averaged across all voices)
| Format | Avg latency | File size |
|---|---|---|
| WAV | 1382 ms | ~256 KB |
| PCM | 1417 ms | ~256 KB |
| MP3 | 1457 ms | ~86 KB |
By voice (averaged across all formats)
| Voice | Description | Avg latency |
|---|---|---|
| af_heart | American English female | 1379 ms |
| bm_fable | British English male | 1439 ms |
| ef_dora | Dutch female | 1438 ms |
The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates — everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. MP3 is still the right default for web playback given its file size advantage. The Dutch voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.
The Streaming Mystery
The Kokoro API has a stream: true parameter. For a conversational application, this should mean the server sends the first sentence's audio while it's still generating the second — reducing perceived latency significantly. I modified the test phrase to have two clear sentences:
"I love mediclinic. But I think there is a lot of scope for the EHR development to go awry."
Then I wrote a Python probe to track exactly when each 1 KB chunk arrived at the client:
import time
import urllib.request

# req: a prepared urllib.request.Request for /v1/audio/speech (stream: true)
t_start = time.perf_counter()
chunks = []
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))
print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk: {chunks[-1][0]}ms")
Results for stream: true, af_heart, MP3:
First chunk: 1462ms
Last chunk: 1464ms
Chunks: 89
All 89 chunks arrived within 2 ms of each other, after a full 1.4-second wait. stream: false was identical. Even PCM format — which has zero encoder overhead — showed the same pattern. Something was buffering the entire audio before sending a single byte.
The Investigation
I spun up a duplicate container, kokoro-stream, on port 8881 as an isolated sandbox, and set about tracing the pipeline. The server code is actually well-architected: async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer. The StreamingResponse even sets X-Accel-Buffering: no. On paper, it should stream.
I identified three hypotheses:
| # | Hypothesis | Evidence for |
|---|---|---|
| H1 | ONNX inference batches both sentences as one call | PCM (no encoder) also shows simultaneous delivery |
| H2 | Uvicorn buffers the response body below a threshold | No asyncio yield points between sentence yields |
| H3 | PyAV MP3 encoder buffers early frames | Secondary — can't explain PCM behaviour |
What the code actually does
Inside tts_service.py, smart_split() splits the input text into chunks before inference — good. But it batches sentences together when their combined token count is under 250 tokens. The two-sentence test input is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().
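As a rough illustration only (this is not the actual smart_split code, and a naive word count stands in for the real tokenizer), the batching behaviour looks like this:

```python
MAX_TOKENS = 250  # the batching threshold described above

def naive_token_count(text: str) -> int:
    # Stand-in for the real tokenizer: a rough word count.
    return len(text.split())

def batch_sentences(sentences: list[str]) -> list[str]:
    """Greedily merge consecutive sentences while under the token budget."""
    batches: list[str] = []
    current = ""
    for s in sentences:
        candidate = f"{current} {s}".strip()
        if current and naive_token_count(candidate) > MAX_TOKENS:
            batches.append(current)  # budget exceeded: flush current batch
            current = s
        else:
            current = candidate      # still under budget: keep merging
    if current:
        batches.append(current)
    return batches
```

Two short sentences totalling ~105 tokens fall well under the 250-token budget, so they come out as a single batch, which is exactly why the streaming path never saw a sentence boundary.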
Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+' — meaning it would only split on newlines. Since there were no newlines, both sentences went through a single ONNX inference call and produced a single audio yield. No amount of async wiring downstream could fix that.
Even if the sentences had been processed separately, the for result in pipeline(...) loop is synchronous — it never returns control to the asyncio event loop between sentences, so the HTTP layer has no opportunity to flush.
The fix
Two minimal changes to kokoro-stream only:
inference/kokoro_v1.py — change the pipeline split pattern to break on sentence-ending punctuation:
# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'
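A quick sanity check of the new pattern against the two-sentence test input:

```python
import re

SPLIT_PATTERN = r'(?<=[.!?])\s+'  # break after sentence-ending punctuation

text = ("I love mediclinic. But I think there is a lot of scope "
        "for the EHR development to go awry.")
sentences = re.split(SPLIT_PATTERN, text)
print(sentences)
# → ['I love mediclinic.', 'But I think there is a lot of scope for the EHR development to go awry.']
```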
inference/kokoro_v1.py and services/tts_service.py — add asyncio yield points between sentence yields:
yield AudioChunk(...)
await asyncio.sleep(0) # return control to event loop → HTTP layer can flush
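Put together, the patched loop has roughly this shape. Here synthesize is a hypothetical stand-in for the per-sentence ONNX call, not the real API:

```python
import asyncio

def synthesize(sentence: str) -> bytes:
    # Hypothetical stand-in for the blocking per-sentence ONNX inference.
    return sentence.encode()

async def stream_sentences(sentences: list[str]):
    """One yield per sentence, followed by an explicit checkpoint so the
    HTTP layer gets a chance to flush bytes before the next inference."""
    for s in sentences:
        yield synthesize(s)
        await asyncio.sleep(0)  # return control to the event loop
```

The generator still does the same total work; the only change is that each sentence's audio becomes visible to the response writer as soon as it exists.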
Before and after
| Metric | Before | After |
|---|---|---|
| First chunk (TTFA) | ~1400 ms | ~575 ms |
| Last chunk | ~1400 ms | ~1400 ms |
| Gap | ~2 ms | ~1100 ms |
First sentence audio now arrives at the client at ~575 ms while the second sentence is still being synthesised. Total generation time is unchanged — we're not making the model faster, we're just not making the user wait for everything before delivering anything.
Setup
Both containers are now accessible via .lan hostnames using Caddy as a reverse proxy:
| URL | Container | Port | Notes |
|---|---|---|---|
| https://kokoro-web.lan | kokoro-tts | 8880 | Production |
| https://kokoro-stream.lan | kokoro-stream | 8881 | Streaming-optimised |
Open WebUI is configured to use the production container at port 8880. The streaming container is available for direct use and API calls where lower TTFA matters.
Reflections
A few things worth noting from this session:
The architecture was already correct. The Kokoro FastAPI codebase uses async generators properly throughout — the issue wasn't bad design, it was two small configuration defaults that compounded badly for short inputs. The token batching threshold (250 tokens) and the newline-only split pattern made sense in isolation but combined to eliminate sentence-level streaming entirely for typical conversational inputs.
PCM as a diagnostic tool. Benchmarking PCM format (raw samples, no encoding) alongside MP3 was valuable precisely because it let us eliminate the audio encoder as a suspect early. When PCM and MP3 showed identical behaviour, we knew the bottleneck was upstream of the encoder.
asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn't actually sleep — it just yields control back to the event loop. That's enough to let uvicorn flush pending response bytes to the socket. It's a one-line fix with a meaningful impact on perceived latency.
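You can see the effect in isolation with a toy producer and a background task (both hypothetical, just for illustration). Without the sleep(0), "event loop ran" would only appear after both yields:

```python
import asyncio

events: list[str] = []

async def produce():
    for s in ["One.", "Two."]:
        events.append(f"yield {s}")
        await asyncio.sleep(0)  # give the event loop a chance to run other work

async def other_work():
    # Stands in for event-loop work such as flushing response bytes.
    events.append("event loop ran")

async def main():
    task = asyncio.create_task(other_work())
    await produce()
    await task

asyncio.run(main())
print(events)
# → ['yield One.', 'event loop ran', 'yield Two.']
```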
The full benchmark data, pipeline analysis, and change logs are all documented if you want to replicate this setup.
Running Podman on Ubuntu 24.04. Kokoro FastAPI image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Voices used: af_heart, bm_fable, ef_dora.