15 April 2026

CPU vs. GPU: Is Hardware Acceleration Always Faster for Real-Time TTS?


Following up on my last post about fixing progressive streaming in Kokoro FastAPI, I decided to take things a step further. If the goal is minimizing latency for a conversational AI assistant, shouldn't throwing a dedicated GPU at the problem make it even faster?

I spent the afternoon duplicating my streaming container and configuring it to run on a local NVIDIA GeForce RTX 4070 (8GB). The results were... surprising. It turns out that for real-time, sentence-by-sentence streaming, "faster" hardware doesn't always translate to a better user experience.


The Setup: Moving to Incus and CUDA

While my previous tests were in Podman, I've recently moved to Incus for better resource management. I duplicated the kokoro-stream container to a new sandbox named kokoro-stream-gpu and passed through the GPU:

incus config device add kokoro-stream-gpu mygpu gpu uid=1000 gid=1000
incus config set kokoro-stream-gpu nvidia.runtime true
incus config set kokoro-stream-gpu nvidia.driver.capabilities compute,utility,video

Inside the container, I switched the backend from the ONNX CPU runtime to the PyTorch GPU version. I also had to port over the same split_pattern and asyncio.sleep(0) fixes from the last session to ensure I was comparing apples to apples (sentence-level streaming vs. sentence-level streaming).
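For reference, the shape of those fixes looks roughly like this. This is a minimal sketch, not Kokoro FastAPI's actual internals: the regex and function names are illustrative, and the real code yields synthesized audio rather than the sentence text.

```python
import re
import asyncio

# Illustrative sentence-boundary pattern: flush after ., !, or ? plus whitespace
split_pattern = re.compile(r"(?<=[.!?])\s+")

async def stream_sentences(text):
    """Yield one chunk per sentence instead of buffering the full request."""
    for sentence in split_pattern.split(text):
        if not sentence.strip():
            continue
        # The real service would synthesize audio here; we yield the text itself
        yield sentence
        # Hand control back to the event loop so each chunk flushes immediately
        await asyncio.sleep(0)

async def collect(text):
    return [s async for s in stream_sentences(text)]
```

The `asyncio.sleep(0)` is the important part: without it, a tight synthesis loop can starve the event loop and chunks pile up before being sent.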


The Benchmark: Short vs. Long Form

I ran two tests using the British English male voice (bm_fable): one with a short two-sentence phrase (~90 chars) and one with the full text of my last blog post (~8,700 chars).

Metric                    CPU (ONNX)    GPU (RTX 4070)   Speedup
TTFA (Short Text)         ~557 ms       ~508 ms          1.1x
Total Time (Long Text)    ~289 s        ~15 s            19.2x
Throughput (Long Text)    ~30 char/s    ~580 char/s      19.2x
System RAM Usage          1.21 GiB      1.92 GiB         -
Video RAM (VRAM)          0 MB          ~850 MB          -

Reflections: When is the GPU worth it?

The results tell two very different stories depending on what you're doing.

1. Conversational AI (Short Sentences)

If you're building a real-time voice assistant that speaks one or two sentences at a time, the CPU is the clear winner. The Time to First Audio (TTFA) is virtually identical because the overhead of initializing the GPU pipeline eats up any compute gains. For this use case, the GPU is just an expensive way to use more RAM.

2. Long-Form Content (Articles, Blog Posts)

This is where the RTX 4070 absolutely screams. When I threw the full 8,700-character blog post at it, the GPU version finished the entire synthesis in 15 seconds. The CPU version was still grinding away at nearly the 5-minute mark.

At 580 characters per second, the GPU isn't just "faster"—it changes the nature of the service. You can listen to an entire article almost as soon as you click "Generate."
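The throughput figures are just the measured times divided into the text length (the character count is approximate):

```python
chars = 8_700        # approximate length of the long-form test input
cpu_seconds = 289
gpu_seconds = 15

cpu_rate = chars / cpu_seconds   # characters per second on CPU
gpu_rate = chars / gpu_seconds   # characters per second on GPU

print(round(cpu_rate), round(gpu_rate))  # 30 580
```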

The Verdict

  • Stick with CPU for: Open WebUI, chatbots, home assistants, and low-RAM servers.
  • Switch to GPU for: Audiobook generation, long-form reading, or high-concurrency environments.

The kokoro-stream-gpu container is now my go-to for "reading" long documentation, while the CPU version remains my daily driver for conversational chat.
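In practice, routing between the two containers can be as simple as a length check. A minimal sketch; the threshold and helper name are my own, not part of Kokoro FastAPI:

```python
CPU_URL = "http://localhost:8882/v1/audio/speech"  # ONNX CPU container
GPU_URL = "http://localhost:8881/v1/audio/speech"  # CUDA container

# Rough cutoff: a couple of sentences stays on CPU, long-form goes to GPU
LONG_FORM_THRESHOLD = 500  # characters; tune to taste

def pick_endpoint(text: str) -> str:
    """Route short conversational turns to CPU, long documents to GPU."""
    return GPU_URL if len(text) > LONG_FORM_THRESHOLD else CPU_URL
```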


The Evidence: Benchmarking Code

To keep things evidence-based, here is the Python script used to capture these metrics. It probes the streaming API and measures exactly when the first and last chunks arrive.

1. Throughput & Latency Probe (benchmark_long.py)

import time
import requests

# Streaming endpoints (GPU container on 8881, CPU on 8882)
GPU_URL = "http://localhost:8881/v1/audio/speech"
CPU_URL = "http://localhost:8882/v1/audio/speech"

# Load long text
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def run_benchmark(name, url):
    print(f"\n--- Benchmarking {name} ---")
    start_time = time.time()
    first_chunk_time = None

    payload = {
        "input": LONG_TEXT,
        "voice": "bm_fable",
        "response_format": "mp3",
        "stream": True
    }

    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            # Record the moment the first audio bytes arrive (TTFA)
            if chunk and first_chunk_time is None:
                first_chunk_time = time.time() - start_time

        total_time = time.time() - start_time

    return {
        "ttfa_ms": round(first_chunk_time * 1000, 2),
        "total_s": round(total_time, 2),
        "char_s": round(len(LONG_TEXT) / total_time, 2)
    }

if __name__ == "__main__":
    for name, url in [("CPU", CPU_URL), ("GPU", GPU_URL)]:
        print(run_benchmark(name, url))

2. Evidence Audio Generation (generate_evidence.py)

import requests
import hashlib

CPU_URL = "http://localhost:8882/v1/audio/speech"
GPU_URL = "http://localhost:8881/v1/audio/speech"

# Same long-form input used in the benchmark
with open("blog_post.md", "r") as f:
    LONG_TEXT = f.read()

def generate_and_hash(url, filename):
    """Save the synthesized audio and return its MD5 fingerprint."""
    r = requests.post(url, json={"input": LONG_TEXT, "voice": "bm_fable"})
    r.raise_for_status()
    with open(filename, "wb") as f:
        f.write(r.content)
    return hashlib.md5(r.content).hexdigest()

# Results:
# CPU Hash: a22fe5e4d70a2888d755e0f8df7dae8f
# GPU Hash: e5ccba5c22ef3edf594aabaa2c08bb5f

Running Incus on Ubuntu 24.04. Hardware: NVIDIA GeForce RTX 4070 8GB. Frameworks: ONNX Runtime (CPU) vs. PyTorch 2.6+CUDA 12.4 (GPU).
