07 May 2026

Proxmox: Removing a Ghost Node from the Web UI

This is a follow-on to Proxmox Cluster Going Sluggish? Your Offline Node Has a Stale Config. That post covers nodes that are misbehaving but still real. This one covers nodes that don't exist at all.


After sorting out a stale corosync config on a returning node, I noticed the web UI was still showing an extra node — PVE9 — on every host in the cluster. It wasn't causing any problems, just sitting there looking wrong. Here's how to get rid of it.

What a Ghost Node Is

A ghost node is a stale directory in /etc/pve/nodes/ with no corresponding corosync membership. The Proxmox web UI reads from the shared cluster filesystem, not from corosync directly — so a leftover directory shows up as a node even if the machine is long gone. It can appear after a node was removed uncleanly, rebuilt under a different name, or never properly decommissioned.

Check It's Actually a Ghost

First confirm the node isn't just offline — check corosync:

pvecm status
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          3 10.140.3.10
0x00000002          1 10.140.3.80
0x00000003          1 10.140.3.70
0x00000004          1 10.140.3.82
0x00000006          1 10.140.3.20

If it's not in this list, it's a ghost. Confirm the directory exists:

cat /etc/pve/corosync.conf | grep pve9   # nothing
ls /etc/pve/nodes/
# pve1  pve2  pve7  pve8  pve9  xenon
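
The same check, scripted. A minimal sketch in Python (run it on any cluster node; it just compares directory names against corosync.conf, exactly as above):

import pathlib

conf = pathlib.Path("/etc/pve/corosync.conf").read_text()
for node_dir in sorted(pathlib.Path("/etc/pve/nodes").iterdir()):
    flag = "in corosync.conf" if node_dir.name in conf else "GHOST: not in corosync.conf"
    print(f"{node_dir.name:10} {flag}")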

Check for VMs Before Deleting

The node directory may contain VM or container configs:

ls /etc/pve/nodes/pve9/qemu-server/
ls /etc/pve/nodes/pve9/lxc/

If there are configs, check them before doing anything:

cat /etc/pve/nodes/pve9/qemu-server/102.conf

If the node is truly gone, any disks listed as local-lvm:vm-XXX-disk-Y were on that node's local storage and are already inaccessible. You won't be able to recover them. Make sure you're happy with that before proceeding.
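
If you want a quick inventory of what you'd be abandoning, here's a rough sketch that lists any leftover configs and the disks they reference (pve9 and the regex are illustrative):

import pathlib, re

ghost = pathlib.Path("/etc/pve/nodes/pve9")
for sub in ("qemu-server", "lxc"):
    for conf in sorted((ghost / sub).glob("*.conf")):
        disks = re.findall(r"\S+:vm-\d+-disk-\d+", conf.read_text())
        print(f"{sub}/{conf.name}: {', '.join(disks) or 'no disks found'}")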

Remove It

Try the proper route first:

pvecm delnode pve9

If the node was never in corosync you'll get:

Node/IP: pve9 is not a known host of the cluster.

In that case, remove the directory directly:

rm -rf /etc/pve/nodes/pve9

The deletion replicates across the cluster filesystem immediately. Verify:

ls /etc/pve/nodes/
# pve1  pve2  pve7  pve8  xenon

Reload the web UI — the ghost node is gone.

Quick Reference

Command                                  Purpose
pvecm status                             Confirm node isn't in corosync
ls /etc/pve/nodes/                       List all node directories
ls /etc/pve/nodes/<name>/qemu-server/    Check for VM configs
ls /etc/pve/nodes/<name>/lxc/            Check for container configs
pvecm delnode <name>                     Proper removal (works if node was in corosync)
rm -rf /etc/pve/nodes/<name>             Manual removal for true ghost nodes

Proxmox Cluster Going Sluggish? Your Offline Node Has a Stale Config

You power on a node that's been offline for a while. Within seconds, the Proxmox web UI starts showing other nodes as dead. Management operations slow to a crawl. Nothing is obviously broken — all the nodes are still pinging — but something is clearly very wrong.

This is the stale corosync config problem, and it's easy to fix once you know what to look for.

What's Happening

Proxmox uses corosync to manage cluster membership. Every config change — adding a node, removing a node, changing votes — increments a config_version in /etc/corosync/corosync.conf. All cluster members must agree on this version.

When a node comes back online after missing several config changes, corosync starts up with an old config_version. The other nodes reject its packets. But corosync doesn't give up — it keeps retrying, flooding the network with rejected authentication attempts. This hammers pvedaemon on every node, causing the web UI to become sluggish and show phantom "dead" nodes even though the cluster itself is technically still quorate.

Diagnosis

First, confirm the cluster itself is still healthy from a node you trust:

pvecm status

If you see Quorate: Yes, the cluster is fine — the problem is the misbehaving node, not a genuine quorum loss. Note the Config Version value.

Then SSH into the suspect node and check:

ssh pve7 systemctl status corosync
ssh pve7 cat /etc/corosync/corosync.conf | grep config_version

Here's what it looks like when you've found the culprit:

● corosync.service - Corosync Cluster Engine
     Active: active (running) since Thu 2026-05-07 17:30:55 BST; 8min ago
   Main PID: 1146 (corosync)
     Memory: 155.9M (peak: 171.9M)
        CPU: 11.549s

May 07 17:39:36 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:37 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:38 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Sync members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [TOTEM ] A new membership (3.1a3c) was formed. Members
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [MAIN  ] Completed service synchronization, ready to provide service.

  config_version: 10

Three red flags:
- "Packet rejected" — every node is refusing this node's traffic
- "Sync members[1]: 3" — the node has formed its own single-node pseudo-cluster with just itself (nodeid 3)
- config_version: 10 while the live cluster is at version 19 or higher
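
With a handful of nodes, a scripted pass saves repeating the grep. A rough sketch, assuming SSH key access and using example hostnames:

import subprocess

nodes = ["pve1", "pve2", "pve7", "pve8"]   # example hostnames
for node in nodes:
    result = subprocess.run(
        ["ssh", node, "grep", "config_version", "/etc/corosync/corosync.conf"],
        capture_output=True, text=True,
    )
    print(f"{node}: {result.stdout.strip() or 'no response'}")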

Fix

Step 1 — Stop corosync on the problem node

systemctl stop corosync

Verify it stopped:

systemctl status corosync

Expected output:

○ corosync.service - Corosync Cluster Engine
     Active: inactive (dead) since Thu 2026-05-07 17:40:45 BST; 48s ago

The web UI should recover almost immediately once the flood of rejected packets stops.

Step 2 — Check the corosync directory

ls -la /etc/corosync/

You'll likely see an authkey from the node's prior cluster membership:

drwxr-xr-x  3 root root 4096 Apr 25 17:21 .
-rw-r--r--  1 root root  256 Apr 25 16:14 authkey
-rw-r--r--  1 root root  639 Apr 25 17:21 corosync.conf

Don't manually delete it — pvecm add --force will handle it cleanly.

Step 3 — Rejoin the cluster

Run this from /tmp (pvecm refuses to run from inside /etc/pve/):

cd /tmp && pvecm add <lead-node-ip> --use_ssh --force

For example:

cd /tmp && pvecm add 10.140.3.10 --use_ssh --force
  • --use_ssh — uses existing SSH key trust instead of the API password prompt
  • --force — overrides warnings about existing config, authkey, and VMs (all expected for a rejoin)

You'll see output like:

detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests

WARNING : detected error but forced to continue!

copy corosync auth key
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1778172132.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'pve7' to cluster.

Step 4 — Verify

pvecm status

Healthy output looks like:

Cluster information
-------------------
Name:             pve
Config Version:   20
Transport:        knet
Secure auth:      on

Quorum information
------------------
Nodes:            5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          3 10.140.3.10
0x00000002          1 10.140.3.80
0x00000003          1 10.140.3.70 (local)
0x00000004          1 10.140.3.82
0x00000006          1 10.140.3.20

All nodes present, Quorate: Yes, config_version incremented by 2 (one increment per node during the join handshake — this is expected).

Why This Keeps Happening

Corosync is enabled by default and starts automatically on boot. If a node has been offline long enough to miss cluster config changes, it will always boot into this broken state. The node isn't malfunctioning — it's doing exactly what it's designed to do with the config it has. It's just that the config is stale.

Prevention: Before powering on a long-offline Proxmox node, check your current cluster's config_version with pvecm status. If it's significantly ahead of what the returning node last knew, plan for a rejoin rather than assuming it'll come back cleanly.

Quick Reference

Command                                             Purpose
pvecm status                                        Check cluster health and config version
systemctl status corosync                           Check corosync state on a node
grep config_version /etc/corosync/corosync.conf     Check a node's config version
systemctl stop corosync                             Stop the misbehaving corosync
cd /tmp && pvecm add <ip> --use_ssh --force         Rejoin the cluster

15 April 2026

Making Local TTS Actually Stream: Fixing Kokoro FastAPI for Real-Time Audio

If you’ve been following along with my local AI setup, you’ll know I run most of my services in Proxmox VE LXC or Podman containers. One of those is Kokoro, a self-hosted text-to-speech instance based on the Kokoro-82M ONNX model. The generated audio is surprisingly good, and the model is small enough that inference is fast even on a CPU. There are fifty-nine voices covering a handful of languages. An important point is that it exposes an OpenAI-compatible API that plugs straight into Open WebUI.

On its own that wouldn't have warranted a blog post (it's a text-to-speech engine in a repo), but, no surprise, what started as a simple Firefox bug fix turned into a streaming pipeline investigation: the usual benchmarks, agent-assisted code analysis, a duplicate container sandbox, and finally a fix that meaningfully reduces time-to-first-audio for conversational use cases.


The Firefox Bug

The web UI worked fine in Chrome, but produced an error in Firefox when clicking Generate Speech.


The culprit was a single line in AudioService.js:

this.sourceBuffer = this.mediaSource.addSourceBuffer('audio/mpeg');

It turns out Firefox's MediaSource implementation does not support audio/mpeg. The fix was to test for support and fall back when Media Source Extensions (MSE) can't handle it:

if (!window.MediaSource || !MediaSource.isTypeSupported('audio/mpeg')) {
    await this.setupBufferedStream(stream, response, onProgress, estimatedChunks);
    return;
}

The setupBufferedStream fallback collects all incoming audio chunks into a Blob and sets it as a plain audio.src. No MSE required, and it works everywhere. Rather than rebuilding the image, the patched file was saved locally and injected using podman cp.

Benchmarking: Does Format or Voice Matter?

With the Firefox issue sorted, I ran a latency benchmark.  Then another and another across three formats (mp3, pcm and wav) and three voices.  The phrase was short and topical, since I'm seeking some consultancy:

“Hi Mediclinic, your EHR project sounds interesting and has the potential for a lot of impact.”

Three runs per combination, stream: false, measured with Python’s time.perf_counter().
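
For reference, here's a minimal sketch of one run. The endpoint, port and helper below are assumptions for a local kokoro-fastapi instance, not the exact benchmark script, which looped this over every format and voice:

import json, time, urllib.request

def speech_request(text, fmt="mp3", voice="af_heart", stream=False):
    """Build a POST to the OpenAI-compatible speech endpoint (URL assumed)."""
    payload = {"model": "kokoro", "voice": voice, "input": text,
               "response_format": fmt, "stream": stream}
    return urllib.request.Request(
        "http://localhost:8880/v1/audio/speech",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

phrase = ("Hi Mediclinic, your EHR project sounds interesting "
          "and has the potential for a lot of impact.")
t0 = time.perf_counter()
audio = urllib.request.urlopen(speech_request(phrase, fmt="wav")).read()
print(f"latency: {(time.perf_counter() - t0) * 1000:.0f} ms, {len(audio)} bytes")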

By format (averaged across all voices)

Format    Avg latency    File size
WAV       1382 ms        ~256 KB
PCM       1417 ms        ~256 KB
MP3       1457 ms        ~86 KB

By voice (averaged across all formats)

The language choices are English, Japanese, Mandarin, Spanish, French, Hindi, Italian, and Brazilian Portuguese.

Voice      Description                Avg latency
af_heart   American English female    1379 ms
bm_fable   British English male       1439 ms
ef_dora    Spanish female             1438 ms

The takeaway: format and voice choice barely matter for latency. The ONNX inference dominates; everything else (MP3 encoding, voice model differences) contributes at most ~80 ms. MP3 encoding overhead was minimal, and MP3 is the right choice for web playback because of its much smaller file size. The Spanish voice (ef_dora) performs on par with the English voices, which is a good sign for multilingual deployments.

Can we go faster?

While reading the documentation (yes, it's a bad habit I developed when I was young) I spotted that the API has a stream: true parameter. For conversational applications I thought this would be useful, and a simple switch to enable... You can probably guess that I was being naive: I assumed I could just enable the flag and the server would stream the audio during generation, reducing perceived latency. It turns out that streaming works sentence by sentence, so I split the test phrase to start with a nice short initial sentence:

“Hi Mediclinic.  Your EHR project sounds interesting and has the potential for a lot of impact.”

Then Claude wrote some Python to track exactly when each 1 KB chunk arrived at the client:

import time, urllib.request

# Reuse the speech_request() helper from the benchmark sketch above,
# this time with stream=True and the two-sentence phrase.
req = speech_request(
    "Hi Mediclinic.  Your EHR project sounds interesting and has the potential for a lot of impact.",
    stream=True,
)

t_start = time.perf_counter()
chunks = []
with urllib.request.urlopen(req) as resp:
    while True:
        chunk = resp.read(1024)
        if not chunk:
            break
        t = round((time.perf_counter() - t_start) * 1000)
        chunks.append((t, len(chunk)))

print(f"First chunk: {chunks[0][0]}ms")
print(f"Last chunk:  {chunks[-1][0]}ms")

Results for stream: true, af_heart, MP3:

First chunk: 1462ms
Last chunk:  1464ms
Chunks: 89

All chunks arrived within 2 ms of each other, after a full 1.4 second wait. stream: false was identical, and even PCM, which has zero encoder overhead, behaved the same. Eh? That didn't seem right. What was going on? Was something buffering the audio before a single byte was sent?

The Rabbit Hole

I made a copy of the base container, called kokoro-stream, on port 8881, as an isolated sandbox for Claude to play with. The server code uses async generators and yield statements all the way from the HTTP handler down to the ONNX inference layer, which is good practice. The StreamingResponse even sets X-Accel-Buffering: no, so it should work.

Three hypotheses:


H1: ONNX inference batches both sentences as one call. Evidence for: PCM (no encoder) also shows simultaneous delivery.
H2: Uvicorn buffers the response body below a threshold. Evidence for: no asyncio yield points between sentence yields.
H3: The PyAV MP3 encoder buffers early frames. Secondary at best: it can't explain the PCM behaviour.

What the code actually does

Inside tts_service.py, smart_split() splits the input text into chunks before inference, which is good. However, it batches sentences together when their combined token count is under 250 tokens. Guess what? The two-sentence test phrase is only 105 tokens, so both sentences were delivered as a single string to KokoroV1.generate().

Inside kokoro_v1.py, the pipeline was called with split_pattern=r'\n+', meaning it split only on newlines, not on full stops. And since there were no newlines, both sentences went through as a single inference call producing a single audio file. No amount of downstream async would fix that.

Even if the sentences had been processed separately, the for result in pipeline(...) loop is synchronous and never returns control to the asyncio event loop between sentences, so the HTTP layer has no opportunity to flush.

The Fix

Two changes:

inference/kokoro_v1.py 

Change the split pattern to also break on sentence-ending punctuation:

# before
split_pattern=r'\n+'
# after
split_pattern=r'(?<=[.!?])\s+'
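
A quick illustration of the difference (not part of the patch), using Python's re on the test phrase:

import re

text = "Hi Mediclinic.  Your EHR project sounds interesting and has the potential for a lot of impact."
print(re.split(r'\n+', text))            # one element  -> one inference call
print(re.split(r'(?<=[.!?])\s+', text))  # two elements -> two calls, the first one short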

inference/kokoro_v1.py and services/tts_service.py 

Add yield points:

yield AudioChunk(...)
await asyncio.sleep(0)  # return control to event loop → HTTP layer can flush

Before and after: Time To First Audio (TTFA)

Metric               Before      After
First chunk (TTFA)   ~1400 ms    ~575 ms
Last chunk           ~1400 ms    ~1400 ms
Gap                  ~2 ms       ~1100 ms

The first audio now arrives after ~575 ms, while the second sentence is still being generated. The total generation time is unchanged, unsurprisingly, but the perceived latency is lower, and that is just what is needed for conversational applications, like calling several service centres to ask about the availability and cost of servicing a car. I was surprised that online systems here in South Africa don't show available slots; instead they're lead-generation forms, and a human sends an email or calls you a few hours later.

Conclusion

A few things worth noting:

The architecture. Kokoro uses async generators throughout, so the issue wasn't bad design; it was two small configuration defaults that affect short inputs. The token batching threshold (250 tokens) and the newline-only split pattern made sense in isolation, but together they eliminate sentence-level streaming for my test input.

PCM as a diagnostic tool. Benchmarking PCM (raw samples, no encoding) alongside MP3 was valuable for identifying and eliminating the audio encoder as a suspect early. When PCM and MP3 show similar timings, the bottleneck is upstream of the encoder.

asyncio.sleep(0) is surprisingly powerful. A zero-duration sleep doesn't actually sleep; it yields control to the event loop. That's enough for uvicorn to flush pending response bytes to the socket. It's a one-liner with a real impact on latency.
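
A toy illustration of that behaviour (nothing Kokoro-specific): a blocking producer only lets another coroutine run when it hits a yield point.

import asyncio, time

async def produce(with_yield_point):
    for i in range(3):
        time.sleep(0.1)               # stands in for blocking inference
        print(f"chunk {i} ready")
        if with_yield_point:
            await asyncio.sleep(0)    # hand control back to the event loop

async def flusher():
    # stands in for the HTTP layer waiting for a chance to send bytes
    for _ in range(3):
        print("event loop got control")
        await asyncio.sleep(0)

async def demo(with_yield_point):
    print(f"--- with_yield_point={with_yield_point} ---")
    await asyncio.gather(produce(with_yield_point), flusher())

asyncio.run(demo(False))   # all three chunks print before the flusher runs once
asyncio.run(demo(True))    # chunk and flush lines interleave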


Podman on Ubuntu 24.04. 

Kokoro image: ghcr.io/remsky/kokoro-fastapi-cpu:latest

Voices used: af_heart, bm_fable, ef_dora.

LiteLLM + Agent Teams: A Practical Guide

An aide memoire for using the local AI infrastructure day-to-day.


The big picture

You have three layers:

Your task (plain English)
        ↓
  Agent team (Python, OpenAI Agents SDK)
        ↓
  LiteLLM proxy  ←→  Ollama (local GPU)
                 ←→  OpenRouter (cloud free)
                 ←→  Anthropic (claude-haiku)

LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model is running locally on your GPU or fetched from a cloud provider. Your code never changes — only the model name string changes.

The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.


Part 1 — Using LiteLLM directly

From the command line (curl)

# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

From Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",   # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)

Choosing a model

Use case                  Model string        Where it runs
Quick questions, triage   qwen3.5:4b          Local GPU (3.4 GB)
Writing code              qwen2.5-coder:7b    Local GPU (4.7 GB)
General analysis          qwen3.5             Local GPU (6.6 GB)
Images / screenshots      qwen3-vl            Local GPU (6.1 GB)
Heavy reasoning           nemotron-120b       Cloud free (OpenRouter)
Reliable tool calling     claude-haiku-4-5    Cloud (Anthropic/OpenRouter)
Best available free       free                Cloud free (auto-routed)

Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:

Alias       Primary             Fallback
fast        qwen3.5:4b          qwen2.5-coder:1.5b
coder       qwen2.5-coder:7b    qwen2.5-coder:1.5b
local       qwen3.5             llama3.1
reasoning   nemotron-120b       gpt-oss-120b
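
Calling an alias looks exactly like calling a specific model. For example, with the same Python client from Part 1:

response = client.chat.completions.create(
    model="fast",   # LiteLLM resolves this to qwen3.5:4b, or the fallback if it's busy
    messages=[{"role": "user", "content": "One-line summary: what does a BGP route reflector do?"}],
)
print(response.choices[0].message.content)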

Health check

curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs

Part 2 — Running the agent team

The one-liner

cd /home/user/claude/agents
.venv/bin/python team.py "your task here"

Example tasks

# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"

What happens under the hood

You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
  "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
  "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you

The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.

Watching it work

Add LITELLM_LOG=DEBUG to see every model call:

LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"

Or watch the LiteLLM proxy logs live in another terminal:

incus exec litellm -- journalctl -u litellm -f

Part 3 — Writing your own agents

Minimal single agent

import asyncio, os
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"]  = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())

Adding tools (things agents can do)

from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]   # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)

Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.
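
Running it is the same as running any other agent, for example:

import asyncio
from agents import Runner

async def main():
    # WebReader decides on its own whether to call get_url for this question
    result = await Runner.run(agent, "What does https://example.com say about its purpose?")
    print(result.final_output)

asyncio.run(main())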

Handing off between agents

from agents import Agent, Runner, handoff

specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

result = await Runner.run(orchestrator, "Analyse this data: ...")

handoff() is itself a tool the orchestrator can call. When it calls it, execution transfers to the specialist; when the specialist finishes, control returns to the orchestrator.

The existing tools you can reuse

gpu_tools.py — for any agent that needs to know about the GPU:

from gpu_tools import vram_status, list_local_models, comfyui_status
agent = Agent(..., tools=[vram_status, list_local_models])

devops_tools.py — for agents that manage containers:

from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl
agent = Agent(..., tools=[container_run, http_probe])

Part 4 — Practical patterns

Pattern 1: Quick one-shot query

Use make_client() from litellm_client.py directly — no agent overhead:

from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

Pattern 2: Task with a deadline / retry limit

result = await Runner.run(agent, task, max_turns=10)

max_turns prevents infinite loops. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.

Pattern 3: Streaming output

from agents import Runner

async for event in Runner.run_streamed(agent, task):
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True)

Pattern 4: DevOps / automation agent

See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work

agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",   # must use haiku — local models can't do tool-calling
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)

Part 5 — Gotchas and tips

Local models can't do structured tool-calling

qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).

Only one large model fits in VRAM at a time

The RTX 4070 has 8 GB. If you ask the orchestrator to hand off to a 6.6 GB local model while another 4.7 GB model is loaded, Ollama unloads the first one. There is a ~5–15 second cold-load delay. This is normal.
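
If you want to see what's loaded before kicking off a handoff, Ollama's /api/ps endpoint reports it. A small sketch (the size_vram field name is my reading of its JSON; adjust if it differs):

import json, urllib.request

with urllib.request.urlopen("http://10.140.20.1:11434/api/ps") as resp:
    for m in json.load(resp).get("models", []):
        print(f'{m["name"]}: {m.get("size_vram", 0) / 2**30:.1f} GiB in VRAM')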

Free cloud models are rate-limited

nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.

The free model alias changes

openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.

Ollama keep-alive

Models stay in VRAM for 15 minutes after last use (KEEP_ALIVE=15m). If you want to free VRAM immediately:

curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'

Part 6 — Agent Team in Open WebUI

The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.

Open WebUI chat
      ↓  (selects "Agent Team" model)
Pipelines server  (host: 10.140.20.1:9099)
      ↓
Agent orchestrator (claude-haiku)
      ↓  handoffs
Specialist agents (local GPU / cloud free)

Architecture files

File                                          Purpose
agents/pipelines/agent_team.py                The pipeline class — wraps the agent team
agents/run_pipelines.sh                       Manual start script
/etc/systemd/system/owui-pipelines.service    Systemd service (starts on boot)

Managing the pipelines server

sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f

Connecting to Open WebUI (one-time setup)

  1. Open http://localhost:3001
  2. Top-right avatar → Admin Panel
  3. Settings → Connections → Pipelines
  4. Add:
  5. URL: http://10.140.20.1:9099
  6. API Key: 0p3n-w3bu!
  7. Click Save — "Agent Team" now appears in the model picker

Using it

Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.

The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.

Adding more pipelines

Drop a new .py file with a Pipeline class into agents/pipelines/, then:

sudo systemctl restart owui-pipelines

The new pipeline appears as a model in Open WebUI immediately.
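
A skeleton to start from (a sketch of the open-webui pipelines interface as I understand it; agents/pipelines/agent_team.py is the working reference):

class Pipeline:
    def __init__(self):
        self.name = "My New Pipeline"   # shown in the Open WebUI model picker

    async def on_startup(self):
        pass                            # runs when the pipelines server starts

    async def on_shutdown(self):
        pass

    def pipe(self, user_message: str, model_id: str, messages: list, body: dict):
        # Do the work here; return a string (or a generator for streaming)
        return f"echo: {user_message}"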


Quick reference card

# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart

File map

/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY