LiteLLM + Agent Teams: A Practical Guide
An aide-mémoire for using the local AI infrastructure day-to-day.
The big picture
You have three layers:
```
Your task (plain English)
        ↓
Agent team (Python, OpenAI Agents SDK)
        ↓
LiteLLM proxy ←→ Ollama (local GPU)
              ←→ OpenRouter (cloud free)
              ←→ Anthropic (claude-haiku)
```
LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model is running locally on your GPU or fetched from a cloud provider. Your code never changes — only the model name string changes.
The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.
Part 1 — Using LiteLLM directly
From the command line (curl)
```bash
# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'
```
From Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",  # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)
```
Choosing a model
| Use case | Model string | Where it runs |
|---|---|---|
| Quick questions, triage | qwen3.5:4b | Local GPU (3.4 GB) |
| Writing code | qwen2.5-coder:7b | Local GPU (4.7 GB) |
| General analysis | qwen3.5 | Local GPU (6.6 GB) |
| Images / screenshots | qwen3-vl | Local GPU (6.1 GB) |
| Heavy reasoning | nemotron-120b | Cloud free (OpenRouter) |
| Reliable tool calling | claude-haiku-4-5 | Cloud (Anthropic/OpenRouter) |
| Best available free | free | Cloud free (auto-routed) |
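If scripts need to choose among these, a small dispatch helper keeps the table in one place. A minimal sketch — the model strings come from the table above, but the use-case keys and the `pick_model` helper are illustrative, not part of the stack:

```python
# Sketch: map a use case to a model string from the table above.
# The keys ("fast", "code", ...) are illustrative names, not part of the stack.
MODEL_FOR = {
    "fast": "qwen3.5:4b",          # quick questions, triage
    "code": "qwen2.5-coder:7b",    # writing code
    "analysis": "qwen3.5",         # general analysis
    "vision": "qwen3-vl",          # images / screenshots
    "reasoning": "nemotron-120b",  # heavy reasoning (cloud free)
    "tools": "claude-haiku-4-5",   # reliable tool calling (cloud)
}

def pick_model(use_case: str) -> str:
    """Return the model string for a use case, defaulting to the fast model."""
    return MODEL_FOR.get(use_case, MODEL_FOR["fast"])
```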
Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:
| Alias | Primary | Fallback |
|---|---|---|
| fast | qwen3.5:4b | qwen2.5-coder:1.5b |
| coder | qwen2.5-coder:7b | qwen2.5-coder:1.5b |
| local | qwen3.5 | llama3.1 |
| reasoning | nemotron-120b | gpt-oss-120b |
Health check
```bash
curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs
```
Part 2 — Running the agent team
The one-liner
```bash
cd /home/user/claude/agents
.venv/bin/python team.py "your task here"
```
Example tasks
```bash
# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"
```
What happens under the hood
```
You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
    "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
    "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you
```
The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.
Watching it work
Set LITELLM_LOG=DEBUG to see every model call:

```bash
LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"
```
Or watch the LiteLLM proxy logs live in another terminal:
```bash
incus exec litellm -- journalctl -u litellm -f
```
Part 3 — Writing your own agents
Minimal single agent
```python
import asyncio, os

# Point the SDK at the LiteLLM proxy before importing agents
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"] = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())
```
Adding tools (things agents can do)
```python
from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]  # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)
```
Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.
Handing off between agents
```python
from agents import Agent, Runner, handoff

specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

# Inside an async function:
result = await Runner.run(orchestrator, "Analyse this data: ...")
```
handoff() is itself a tool the orchestrator can call. When it calls it, execution transfers to the specialist; when the specialist finishes, control returns to the orchestrator.
The existing tools you can reuse
gpu_tools.py — for any agent that needs to know about the GPU:
```python
from gpu_tools import vram_status, list_local_models, comfyui_status

agent = Agent(..., tools=[vram_status, list_local_models])
```
devops_tools.py — for agents that manage containers:
```python
from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl

agent = Agent(..., tools=[container_run, http_probe])
```
Part 4 — Practical patterns
Pattern 1: Quick one-shot query
Use make_client() from litellm_client.py directly — no agent overhead:
```python
from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```
Pattern 2: Task with a deadline / retry limit
```python
result = await Runner.run(agent, task, max_turns=10)
```
max_turns prevents infinite loops. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.
Pattern 3: Streaming output
```python
from agents import Runner
from openai.types.responses import ResponseTextDeltaEvent

# run_streamed() returns a streaming result; iterate its stream_events()
result = Runner.run_streamed(agent, task)
async for event in result.stream_events():
    if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
        print(event.data.delta, end="", flush=True)
```
Pattern 4: DevOps / automation agent
See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work
```python
agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",  # haiku is reliable at tool-calling; local models often are not
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)
```
Part 5 — Gotchas and tips
Local models can't do structured tool-calling
qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).
Only one large model fits in VRAM at a time
The RTX 4070 has 8 GB. If you ask the orchestrator to hand off to a 6.6 GB local model while another 4.7 GB model is loaded, Ollama unloads the first one. There is a ~5–15 second cold-load delay. This is normal.
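If the cold-load delay matters, you can check what is already loaded before handing off to a local model. A sketch that parses Ollama's /api/ps response — the `models` / `size_vram` field names follow the Ollama API, and the 8 GB budget is this box's RTX 4070:

```python
import json
import urllib.request

VRAM_BUDGET = 8 * 1024**3  # RTX 4070: 8 GB

def loaded_vram(ps_json: dict) -> int:
    """Sum the VRAM used by currently loaded Ollama models (/api/ps payload)."""
    return sum(m.get("size_vram", 0) for m in ps_json.get("models", []))

def fits(ps_json: dict, model_bytes: int) -> bool:
    """Would loading another model of this size stay inside the VRAM budget?
    If not, Ollama evicts the loaded model and you pay the cold-load delay."""
    return loaded_vram(ps_json) + model_bytes <= VRAM_BUDGET

# Live usage (needs the Ollama host reachable):
# ps = json.load(urllib.request.urlopen("http://10.140.20.1:11434/api/ps"))
# print(fits(ps, int(6.6 * 1024**3)))
```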
Free cloud models are rate-limited
nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.
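Rather than babysitting a stalled run, you can loop over the alternatives. A minimal sketch — `with_fallback` and the `demo` stand-in are illustrative; in practice `call` would be an async completion request with a timeout, and the model list mirrors the alternatives above:

```python
import asyncio

async def with_fallback(call, models):
    """Try each model in turn; return the first successful result.
    `call` is any async function taking a model name and raising on failure."""
    last_err = None
    for model in models:
        try:
            return await call(model)
        except Exception as err:  # timeout, rate limit, ...
            last_err = err
    raise RuntimeError(f"all models failed: {last_err}")

# Stand-in call simulating a rate-limited primary (replace with a real request):
async def demo(model):
    if model == "nemotron-120b":
        raise TimeoutError("rate-limited")
    return f"answered by {model}"

print(asyncio.run(with_fallback(demo, ["nemotron-120b", "gpt-oss-120b", "qwen3-80b"])))
# → answered by gpt-oss-120b
```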
The free model alias changes
openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.
Ollama keep-alive
Models stay in VRAM for 15 minutes after last use (KEEP_ALIVE=15m). If you want to free VRAM immediately:
```bash
curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'
```
Part 6 — Agent Team in Open WebUI
The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.
```
Open WebUI chat
        ↓ (selects "Agent Team" model)
Pipelines server (host: 10.140.20.1:9099)
        ↓
Agent orchestrator (claude-haiku)
        ↓ handoffs
Specialist agents (local GPU / cloud free)
```
Architecture files
| File | Purpose |
|---|---|
| agents/pipelines/agent_team.py | The pipeline class — wraps the agent team |
| agents/run_pipelines.sh | Manual start script |
| /etc/systemd/system/owui-pipelines.service | Systemd service (starts on boot) |
Managing the pipelines server
```bash
sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f
```
Connecting to Open WebUI (one-time setup)
1. Open http://localhost:3001
2. Top-right avatar → Admin Panel
3. Settings → Connections → Pipelines
4. Add:
   - URL: http://10.140.20.1:9099
   - API Key: 0p3n-w3bu!
5. Click Save — "Agent Team" now appears in the model picker
Using it
Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.
The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.
Adding more pipelines
Drop a new .py file with a Pipeline class into agents/pipelines/, then:
```bash
sudo systemctl restart owui-pipelines
```
The new pipeline appears as a model in Open WebUI immediately.
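A new pipeline only needs a class named Pipeline with a name and a pipe() method. A minimal echo skeleton, following the interface the open-webui pipelines package expects — check agents/pipelines/agent_team.py for the exact signature in use here:

```python
from typing import List

class Pipeline:
    def __init__(self):
        # Shown in the Open WebUI model picker
        self.name = "Echo Demo"

    async def on_startup(self):
        pass  # runs when the pipelines server loads this file

    async def on_shutdown(self):
        pass

    def pipe(self, user_message: str, model_id: str,
             messages: List[dict], body: dict) -> str:
        # Replace with real work; the return value becomes the chat reply
        return f"echo: {user_message}"
```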
Quick reference card
```bash
# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart
```
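The config.yaml entry for a newly pulled Ollama model looks roughly like this — a sketch of LiteLLM's model_list format with a hypothetical model name; check the existing entries in /etc/litellm/config.yaml for the exact conventions used here:

```yaml
model_list:
  - model_name: my-new-model           # the string clients will use
    litellm_params:
      model: ollama/my-new-model       # ollama/ prefix routes to the Ollama backend
      api_base: http://10.140.20.1:11434
```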
File map
```
/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY
```