15 April 2026

LiteLLM + Agent Teams: A Practical Guide


An aide-mémoire for using the local AI infrastructure day-to-day.


The big picture

You have three layers:

Your task (plain English)
        ↓
  Agent team (Python, OpenAI Agents SDK)
        ↓
  LiteLLM proxy  ←→  Ollama (local GPU)
                 ←→  OpenRouter (cloud free)
                 ←→  Anthropic (claude-haiku)

LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model runs locally on your GPU or is served by a cloud provider. Your code never changes — only the model name string changes.

The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.


Part 1 — Using LiteLLM directly

From the command line (curl)

# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

From Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",   # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)

Choosing a model

Use case                  Model string        Where it runs
Quick questions, triage   qwen3.5:4b          Local GPU (3.4 GB)
Writing code              qwen2.5-coder:7b    Local GPU (4.7 GB)
General analysis          qwen3.5             Local GPU (6.6 GB)
Images / screenshots      qwen3-vl            Local GPU (6.1 GB)
Heavy reasoning           nemotron-120b       Cloud free (OpenRouter)
Reliable tool calling     claude-haiku-4-5    Cloud (Anthropic/OpenRouter)
Best available free       free                Cloud free (auto-routed)

Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:

Alias      Primary             Fallback
fast       qwen3.5:4b          qwen2.5-coder:1.5b
coder      qwen2.5-coder:7b    qwen2.5-coder:1.5b
local      qwen3.5             llama3.1
reasoning  nemotron-120b       gpt-oss-120b

Health check

curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs

Part 2 — Running the agent team

The one-liner

cd /home/user/claude/agents
.venv/bin/python team.py "your task here"

Example tasks

# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"

What happens under the hood

You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
  "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
  "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you

The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.

Watching it work

Add LITELLM_LOG=DEBUG to see every model call:

LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"

Or watch the LiteLLM proxy logs live in another terminal:

incus exec litellm -- journalctl -u litellm -f

Part 3 — Writing your own agents

Minimal single agent

import asyncio, os
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"]  = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())

Adding tools (things agents can do)

from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]   # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)

Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.

Handing off between agents

from agents import Agent, Runner, handoff

specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

# inside an async function:
result = await Runner.run(orchestrator, "Analyse this data: ...")

handoff() is itself a tool the orchestrator can call. When it calls it, execution transfers to the specialist; when the specialist finishes, control returns to the orchestrator.

The existing tools you can reuse

gpu_tools.py — for any agent that needs to know about the GPU:

from gpu_tools import vram_status, list_local_models, comfyui_status
agent = Agent(..., tools=[vram_status, list_local_models])

devops_tools.py — for agents that manage containers:

from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl
agent = Agent(..., tools=[container_run, http_probe])

Part 4 — Practical patterns

Pattern 1: Quick one-shot query

Use make_client() from litellm_client.py directly — no agent overhead:

from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

Pattern 2: Task with a deadline / retry limit

result = await Runner.run(agent, task, max_turns=10)

max_turns prevents infinite loops. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.

Pattern 3: Streaming output

from agents import Runner
from openai.types.responses import ResponseTextDeltaEvent

result = Runner.run_streamed(agent, task)   # returns immediately; iterate the events
async for event in result.stream_events():
    if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
        print(event.data.delta, end="", flush=True)

Pattern 4: DevOps / automation agent

See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work

agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",   # use haiku — local models are unreliable at tool-calling
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)

Part 5 — Gotchas and tips

Local models can't do structured tool-calling

qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).

Only one large model fits in VRAM at a time

The RTX 4070 has 8 GB. If you ask the orchestrator to hand off to a 6.6 GB local model while another 4.7 GB model is loaded, Ollama unloads the first one. There is a ~5–15 second cold-load delay. This is normal.
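To see what is actually resident before kicking off a task, Ollama's /api/ps endpoint reports loaded models and their VRAM footprint. A small helper (standard library only; the function names are mine, the response fields are Ollama's):

```python
import json
import urllib.request

OLLAMA_URL = "http://10.140.20.1:11434"

def summarise_ps(data: dict) -> list:
    """Reduce an /api/ps response to model names and VRAM use in GB."""
    return [
        {"name": m["name"], "gb": round(m.get("size_vram", 0) / 1e9, 1)}
        for m in data.get("models", [])
    ]

def loaded_models() -> list:
    """Fetch the models Ollama currently holds in VRAM."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps", timeout=5) as resp:
        return summarise_ps(json.load(resp))
```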

Free cloud models are rate-limited

nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.
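One way to make that switch automatic in your own scripts is a small fallback loop: try each model in order and move on when a call raises. A hedged sketch (the helper and its names are mine, not part of team.py):

```python
import time

def call_with_fallback(models, make_call, backoff=1.0):
    """Try each model in order until one call succeeds.

    make_call(model) is any function that sends the request, e.g. a
    wrapper around client.chat.completions.create. A rate-limited or
    timed-out model raises, and we move on to the next one."""
    last_err = None
    for model in models:
        try:
            return make_call(model)
        except Exception as err:
            last_err = err
            time.sleep(backoff)   # brief pause before the next attempt
    raise RuntimeError(f"all models failed, last error: {last_err}")
```

For example, `call_with_fallback(["nemotron-120b", "gpt-oss-120b"], my_call)` tries the free reasoning model first and quietly falls back.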

The free model alias changes

openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.

Ollama keep-alive

Models stay in VRAM for 15 minutes after last use (OLLAMA_KEEP_ALIVE=15m). If you want to free VRAM immediately:

curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'

Part 6 — Agent Team in Open WebUI

The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.

Open WebUI chat
      ↓  (selects "Agent Team" model)
Pipelines server  (host: 10.140.20.1:9099)
      ↓
Agent orchestrator (claude-haiku)
      ↓  handoffs
Specialist agents (local GPU / cloud free)

Architecture files

File                                          Purpose
agents/pipelines/agent_team.py                The pipeline class — wraps the agent team
agents/run_pipelines.sh                       Manual start script
/etc/systemd/system/owui-pipelines.service    Systemd service (starts on boot)

Managing the pipelines server

sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f

Connecting to Open WebUI (one-time setup)

  1. Open http://localhost:3001
  2. Top-right avatar → Admin Panel
  3. Settings → Connections → Pipelines
  4. Add a connection:
       URL: http://10.140.20.1:9099
       API Key: 0p3n-w3bu!
  5. Click Save — "Agent Team" now appears in the model picker

Using it

Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.

The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.

Adding more pipelines

Drop a new .py file with a Pipeline class into agents/pipelines/, then:

sudo systemctl restart owui-pipelines

The new pipeline appears as a model in Open WebUI immediately.
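The minimum a pipeline file needs is a Pipeline class with a name and a pipe() method. A sketch (signature as in the open-webui pipelines examples; the echo behaviour is just a placeholder):

```python
from typing import Generator, Iterator, List, Union

class Pipeline:
    def __init__(self):
        # Shown as the model name in Open WebUI's picker
        self.name = "Echo Demo"

    async def on_startup(self):
        # Called once when the pipelines server loads this file
        pass

    async def on_shutdown(self):
        pass

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Generator, Iterator]:
        # Return a string (or a generator for streaming); this becomes
        # the assistant reply in the chat
        return f"echo: {user_message}"
```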


Quick reference card

# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart
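The config.yaml entry for a newly pulled model looks roughly like this (a sketch assuming LiteLLM's standard model_list format; the model name is a placeholder):

```yaml
model_list:
  - model_name: my-new-model            # the name clients send to the proxy
    litellm_params:
      model: ollama/my-new-model        # route via the Ollama provider
      api_base: http://10.140.20.1:11434
```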

File map

/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY
