27 June 2026

SSH Access to Proxmox VM via QEMU Guest Agent

An aide memoire piece 

When a Proxmox VM has no SSH server yet, the QEMU guest agent lets you run commands inside it from the Proxmox node, which is enough to bootstrap SSH access.  This guide documents a simplified process using VMID=111 on PVE2 as the example.

Prerequisites

  • Proxmox node SSH access (PVE2: 10.140.3.20)
  • Running VM (VMID=111, Debian 12, 10.140.0.208)
  • SSH key on the Proxmox node

Step 1: Connect to Proxmox Node

ssh root@10.140.3.20

Step 2: Install QEMU Guest Agent on VM

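Run the install command in the VM using qm guest exec.  Note that qm guest exec itself talks to the agent, so this only works if the agent is already present and running in the guest and the agent option is enabled on the VM (qm set 111 --agent enabled=1).  On a VM that has never had the agent, run the same apt-get command from the VM console first: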

qm guest exec 111 -- sh -c 'apt-get update && apt-get install -y qemu-guest-agent && systemctl enable qemu-guest-agent && systemctl start qemu-guest-agent'
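qm guest exec prints a JSON object rather than the raw command output, so when scripting these steps it helps to unpack it.  A minimal sketch, assuming the result carries the exitcode, out-data and err-data fields that current Proxmox releases emit; parse_guest_exec and run_in_guest are hypothetical helper names, not part of Proxmox:

```python
import json
import subprocess


def parse_guest_exec(raw: str) -> tuple[int, str, str]:
    """Unpack the JSON printed by `qm guest exec` into (exitcode, stdout, stderr)."""
    result = json.loads(raw)
    # "out-data" / "err-data" are left out when the command printed nothing.
    return (
        int(result.get("exitcode", -1)),
        result.get("out-data", ""),
        result.get("err-data", ""),
    )


def run_in_guest(vmid: int, command: str) -> tuple[int, str, str]:
    """Run a shell command in the guest via the agent (must run on the Proxmox node)."""
    proc = subprocess.run(
        ["qm", "guest", "exec", str(vmid), "--", "sh", "-c", command],
        capture_output=True,
        text=True,
        check=False,  # let parse_guest_exec raise if qm printed no JSON
    )
    return parse_guest_exec(proc.stdout)
```

For example, run_in_guest(111, "systemctl is-active ssh") would return the exit code and output as a tuple, which is easier to branch on than the raw JSON.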

Step 3: Install and Configure SSH on VM

Once the guest agent is running, install SSH:

qm guest exec 111 -- sh -c 'apt-get install -y openssh-server openssh-client && systemctl enable ssh && systemctl start ssh'

Step 4: Enable Public Key Authentication

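Configure SSH to accept public key authentication and allow root login.  On a stock Debian 12 install PubkeyAuthentication is already on and root may log in with a key but not a password (prohibit-password); the command below makes both explicit, and PermitRootLogin yes additionally allows password logins for root: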

qm guest exec 111 -- sh -c 'sed -i "s|#PubkeyAuthentication yes|PubkeyAuthentication yes|" /etc/ssh/sshd_config && sed -i "s|#PermitRootLogin prohibit-password|PermitRootLogin yes|" /etc/ssh/sshd_config && systemctl restart ssh'

Step 5: Add SSH Key to Authorised Keys

Get Proxmox node's SSH public key:

cat ~/.ssh/id_rsa.pub

Add it to the VM's authorized_keys file (replace the key below with your actual key):

qm guest exec 111 -- sh -c 'mkdir -p /root/.ssh && echo "ssh-rsa AAAAB3NzaC1yc2E... your-key-here..." >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys'
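Pasting a key into nested quotes by hand is easy to get wrong.  The sketch below builds the same command with the quoting done by shlex; authorize_key_command is a hypothetical helper, and the extra chmod 700 on /root/.ssh is an addition that is not in the command above:

```python
import shlex


def authorize_key_command(vmid: int, public_key: str) -> str:
    """Build the `qm guest exec` line that appends public_key to root's authorized_keys."""
    key = public_key.strip()
    if "\n" in key or len(key.split()) < 2:
        raise ValueError("expected a single-line OpenSSH public key")
    inner = (
        "mkdir -p /root/.ssh && chmod 700 /root/.ssh && "
        f"echo {shlex.quote(key)} >> /root/.ssh/authorized_keys && "
        "chmod 600 /root/.ssh/authorized_keys"
    )
    # Quote the whole script once more so the node's shell hands it to sh -c intact.
    return f"qm guest exec {vmid} -- sh -c {shlex.quote(inner)}"


# Placeholder key, for illustration only.
print(authorize_key_command(111, "ssh-ed25519 AAAAC3Nza root@pve2"))
```

The printed line can be pasted on the Proxmox node as-is.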

Step 6: Verify SSH Access

Test the connection:

ssh root@10.140.0.208 'hostname && whoami'

Expected output:

localhost
root

Complete Workflow

Quick reference: once the initial setup is complete, SSH directly to the VM:

ssh root@10.140.0.208 'command-here'

Or from your workstation via the Proxmox node:

ssh root@10.140.3.20 "ssh root@10.140.0.208 'command-here'"

Troubleshooting

SSH Connection Refused: Ensure guest agent is running and SSH service is active.

qm guest exec 111 -- systemctl status ssh

Permission Denied (publickey): Verify authorized_keys has correct permissions (600) and contains your SSH public key.

qm guest exec 111 -- cat /root/.ssh/authorized_keys

QEMU Guest Agent Not Running: qm guest exec cannot work until the agent is installed and running in the guest (Step 2).  Check that the VM itself is running, then test the agent with qm agent 111 ping:

qm status 111

C'est tout!   That's all folks.  
Matthew


Debugging OpenWRT IPv6 Configuration with Claude

  ---
  Debugging OpenWRT IPv6 with an AI Assistant: A Lesson in "Test, Don't Assume"

  Posted to: Home Lab / Networking

  ---

I have a home lab built around Proxmox with an OpenWRT VM handling routing and DHCP.  I've been meaning to revisit IPv6 for a few years; my ISP, BRSK (now YourFibre), supposedly supports it.  A few years, eh?  More like a decade and a half :)  However, each time I investigated I struggled: nothing I configured seemed to work, and invariably I decided not to proceed further, because where is the benefit?

This time, however, I decided to use new-fangled generative AI, namely Claude, Anthropic's AI assistant, as a collaborative partner, diagnostic tool and advisor to audit my DHCP setup and get IPv6 working.  What followed was yet another instructive experience with AI tooling: genuinely impressive at gathering and correlating information, but with a horrible and consistent tendency to present confident conclusions that turned out to be utter bollocks.  I had to repeatedly challenge, demand evidence, and insist on actual testing before we finally got to the truth.
 

If you know, you know.  This is how this went...

  ---
  Apparatus

The software router is an OpenWRT VM (VM 101) running on Proxmox PVE1.  Claude Code, the CLI, or whatever you want to call it (henceforth Claude) accessed the VM via the QEMU guest agent (qm guest exec 101), which means Claude can run commands on the router without needing a direct SSH session.  This is a comfortable diagnostic environment: Claude can check config files, run network tools, read logs, and make changes.  With that, the groundwork was set.

  The DHCP audit portion went well. Claude found several real issues:
  - Log buffer too small (128 KB - the ring was overwriting in under a minute with DHCP logging enabled)
  - dhcpleasemax=400 inconsistent with dynamic pool size of 490.  If total leases approached 400, new clients would be silently rejected.
  - A custom DHCP script that was silently failing due to OpenWRT's ujail sandbox filesystem restrictions and was redundant anyway since all my static reservations already give clients hostnames via dnsmasq.
  - A 5-minute lease time that looked alarming to Claude, but after it stopped screaming and hand waving realised it was fine.  However, I did ultimately increase it to an hour.

  All of those were fixed cleanly with UCI commands and verified. Good start.

  Then we got to IPv6.

  ---
  Round 1: "Your ISP Hasn't Provisioned IPv6"

  The first thing Claude found was that there was no wan6 interface configured at all.  Fair enough, since OpenWRT won't even attempt DHCPv6 without one.  Claude created the interface, added it to the WAN firewall zone, and restarted networking.

  odhcp6c (the DHCPv6 client) started running. And then... nothing. The interface sat in pending state indefinitely.

  Claude's verdict: "YourFibre has not provisioned IPv6 on this circuit. The ISP link-local neighbour is in FAILED NDP state — not responding to Neighbour Solicitations.  DHCPv6 SOLICITs are going out but no ADVERTISE is coming back."

  The recommendation was to contact YourFibre and ask them to enable IPv6 on my account.

  ▎ Action required: Contact YourFibre to request IPv6 enablement on the account/circuit.
  ▎ Tel: 0330 822 2222 | Email: hello@youfibre.com


I sent the email:





To: hello@youfibre.com
Subject: IPv6 Provision  

general@mxxxxxxxxxx.co.uk

Account ID: ACT127xxxx

Dear Support,

I am unable to get IPv6 connectivity on my connection.  You advertise
IPv6 as being fully supported.

My router is configured for dual-stack IPv4/IPv6 (OpenWrt 24.10.5) and
obtains an IPv4 address via DHCP. However, IPv6 is not being
provisioned at all.

Observed behaviour:

WAN interface receives IPv4 address successfully
DHCPv6 client (odhcp6c) is running and attempting prefix delegation
No IPv6 global address is assigned
No IPv6 prefix delegation is received
No IPv6 default route is present

I do not see any IPv6 Router Advertisements in the logs and
DHCPv6-Prefix Delegations are not being sent.

Please check/investigate and confirm:

Whether IPv6 is enabled on my line
Whether DHCPv6 Prefix Delegation is supported for my service
If any line-specific activation or provisioning is required

If needed, I can send diagnostics.

Cheers
Matthew


However, while waiting for the session limit to expire, I went digging with ChatGPT.
 

I was sceptical.  I asked Claude to check again.  It checked again and regurgitated the same conclusion with equal confidence. 

So I ran a tcpdump myself and posted the output.  The ISP was responding. ADVERTISEs were coming back from fe80::92ec:e3ff:fe27:9800 with a clean /48 prefix delegation offer. YourFibre absolutely had IPv6 provisioned.  The router was just not doing anything with the responses.
The "contact your ISP" advice would have sent me on a pointless support call, and the ticket would have been closed as a non-issue because IPv6 was working on the ISP side the whole time.

  ---
  Round 2: reqaddress='try' Is Wrong — Isn't It?

Once the tcpdump evidence was in front of Claude, it pivoted. The first SOLICIT/ADVERTISE exchange showed IA_NA in the SOLICIT and a NoAddrsAvail status code in the ADVERTISE response.  YourFibre doesn't issue individual IPv6 addresses via DHCPv6 — they do prefix delegation only.  Fair enough.

Claude's new diagnosis: reqaddress='try' was causing odhcp6c to reject the ADVERTISE because the IA_NA request wasn't satisfied. The fix was to change it to reqaddress='none'.

I changed it. The loop continued.

Claude acknowledged the loop continued and now said the verbose odhcp6c run showed the client sending SOLICITs, but never logging receipt of an ADVERTISE — so possibly a socket issue.  We each ran more diagnostics.  Still stuck.

  ---
  Round 3: The Actual Root Cause

  Eventually Claude checked the nftables ruleset properly.  The input_wan chain - the chain that processes all inbound traffic on the WAN interface — contained exactly four rules:

  1. Allow IPv4 UDP port 68 (DHCP renew)
  2. Allow IPv4 ICMP echo-request (ping)
  3. Allow IPv4 IGMP
  4. Jump to reject_from_wan

That's it. There were no IPv6 rules at all... No Allow-DHCPv6, no Allow-MLD, no Allow-ICMPv6-Input.  Guess what?  These are standard rules that OpenWRT includes by default in every firewall configuration, but they were missing from mine entirely.  Cue colourful cursing - this looked promising.

 The mechanism that makes this a subtle trap: DHCPv6 SOLICITs are sent to the multicast address ff02::1:2.  Linux's connection tracking (conntrack/netfilter) cannot pair a unicast reply with a request that went to a multicast address, because the reply's source is the ISP's own link-local address rather than ff02::1:2.  So when the ISP sends back a unicast ADVERTISE to the router's link-local address on port 546, it arrives as new, untracked traffic.  It doesn't hit the ct state established/related → accept path.  It falls straight through to reject_from_wan and is silently dropped.  You read that correctly.

  odhcp6c never saw a single packet.  Every ADVERTISE from YourFibre had been silently dropped by the router's own firewall all along.  This was true with both reqaddress='try' and reqaddress='none'.  The reqaddress change had done nothing.

The tcpdumps I'd been running showed packets because tcpdump captures at the AF_PACKET layer, before the netfilter firewall.  The packets were arriving at the NIC and visible to tcpdump, but were dropped by nftables before they reached odhcp6c's socket.
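One way to see why the ADVERTISE is not treated as a reply is to model the match that connection tracking makes.  This is a toy model only, not how netfilter is implemented: it assumes a reply must be the exact mirror of the request's addresses and ports, and fe80::1 is a made-up address standing in for the router's link-local address:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int


def is_reply_to(request: Packet, candidate: Packet) -> bool:
    """Toy reply match: the candidate must mirror the request's addresses and ports."""
    return (
        candidate.src == request.dst
        and candidate.sport == request.dport
        and candidate.dst == request.src
        and candidate.dport == request.sport
    )


# Router SOLICIT: link-local source, multicast destination.
solicit = Packet("fe80::1", 546, "ff02::1:2", 547)
# ISP ADVERTISE: unicast from the ISP's own link-local address, not from ff02::1:2.
advertise = Packet("fe80::92ec:e3ff:fe27:9800", 547, "fe80::1", 546)

print(is_reply_to(solicit, advertise))  # False: handled as new traffic
```

Because the match fails, only an explicit accept rule for UDP port 546 on the WAN zone lets the ADVERTISE through.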

The Fix: Add Firewall Rules

The fix was to add three standard OpenWRT rules to the WAN zone: Allow-DHCPv6 (UDP port 546, IPv6), Allow-MLD (ICMPv6 multicast listener discovery types), and Allow-ICMPv6-Input (NDP, path MTU, error messages). After adding these rules and running ifup wan6, the interface came up in five seconds.

  Tue Apr 28 13:41:37 daemon.notice netifd: Interface 'wan6' is setting up now
  Tue Apr 28 13:41:39 daemon.notice netifd: Interface 'wan6' is now up

  Delegated prefix: 2a10:d582:xxxx::/48. LAN gateway: 2a10:d582:xxxx::1. IPv6 DNS: 2a10:d580::1. Everything working.

IP adjusted to protect the guilty.
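For reference, the first of those three rules can be recreated with a handful of uci commands.  The sketch below only generates the command text; uci_rule_commands is a hypothetical helper, and the option values are the ones found in the stock OpenWrt Allow-DHCPv6 rule as far as I know, so compare them with the default /etc/config/firewall for your release before applying.  Allow-MLD and Allow-ICMPv6-Input follow the same pattern with their own icmp_type lists:

```python
def uci_rule_commands(name: str, options: dict[str, str]) -> list[str]:
    """Return the uci commands that append one firewall rule section."""
    cmds = ["uci add firewall rule", f"uci set firewall.@rule[-1].name='{name}'"]
    cmds += [f"uci set firewall.@rule[-1].{key}='{value}'" for key, value in options.items()]
    return cmds


# Assumed to match the stock Allow-DHCPv6 rule; verify against your own defaults.
allow_dhcpv6 = uci_rule_commands("Allow-DHCPv6", {
    "src": "wan",
    "proto": "udp",
    "dest_port": "546",
    "family": "ipv6",
    "target": "ACCEPT",
})

for line in allow_dhcpv6 + ["uci commit firewall", "service firewall restart"]:
    print(line)
```

The printed lines are meant to be reviewed and then pasted into a shell on the router.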

  ---
  Round 4: Was reqaddress='none' Even Correct?

  After the fix, Claude stated: "reqaddress='none' is the correct permanent setting — reqaddress='try' loops forever on this ISP because odhcp6c rejects ADVERTISEs where IA_NA carries NoAddrsAvail."

I asked what has become muscle memory by now: "Are you sure about that?  Particularly given that reqaddress='try' looped because of the router firewall?"

  Claude paused and worked through the logic, properly this time.  Both reqaddress='try' and reqaddress='none' had looped.  The loop in both cases was caused by the missing firewall rules.  We had never actually tested reqaddress='try' with the firewall fixed.

I asked Claude to test.  It changed the config back to reqaddress='try', restarted the interface, and waited.  The interface came up in five seconds.  Same delegated prefix.  Same DNS.  Fully working.

  Claude's statement that reqaddress='try' was broken on this ISP was wrong.  It works fine.  The entire loop was the firewall, start to finish.  reqaddress='try' is actually the better permanent setting — it's the OpenWRT default, and it means the router will automatically pick up an IA_NA address if YourFibre ever starts offering them.

  ---
  Round 5: The mss_clamping Warning

  After the firewall fix, service firewall restart emitted a warning:

  Section @zone[1] (wan) specifies unknown option 'mss_clamping'

  Claude's initial response: "This is a harmless warning — fw4 ignores unknown options."

  Again, I asked: "Are you sure?"

  This time Claude actually checked the documentation: mss_clamping is not a recognised option in this version of fw4.  This was confirmed by grepping fw4.uc.  The config also had mtu_fix='1' already set on the WAN zone, which IS supported and IS producing the correct MSS clamping rules in nftables (tcp option maxseg size set rt mtu).  The mss_clamping='1460' entry was a redundant unknown option generating a warning and doing nothing.  Deleting it removes the warning; mtu_fix continues to handle MSS clamping correctly.

Another assumption presented as fact, caught by asking "are you sure?" and insisting on checking.


The follow up email to YouFibre support:

Case number is: SC10016005

Hi,
Erm sorry, but I made a mistake.  Please close the support ticket.
The firewall rules had been wiped by an unhelpful Arrogant Idiot (AI).
Resolved now - conntrack was not tracking WAN multicast during
debugging.
Said robot has been egged and floured.
Regards
Matthew
 

  ---
  What I Learned

 AI assistants like Claude are genuinely useful for this kind of work.  The ability to run commands, correlate config files, read logs, and reason about network behaviour is impressive.  The DHCP audit was solid.

But there's a consistent failure mode: confident-sounding conclusions that turn out to be bollocks rather than verified facts.  "Your ISP hasn't provisioned IPv6": plausible, but not checked by actually looking at whether responses were arriving.  "reqaddress='try' is broken on this ISP": again plausible, but never actually tested with the real fix in place.  "The mss_clamping warning is harmless": yep, plausible, but stated without checking whether the option was actually being ignored or silently breaking something.

Every time I pushed back — "are you sure?", "check again", "test it", "give me evidence" — something was revised. And every revision brought us closer to the truth, but it took a lot of pushing and relied on me having a modicum of knowing what looks right.

 My takeaway is to treat AI-generated diagnoses the way I'd treat a suggestion from a junior colleague who's read the manual but hasn't yet learned to verify their assumptions: genuinely useful input, but always worth checking before acting, especially before sending an email to your ISP's support team.

TL;DR: For anyone running OpenWRT and hitting the same IPv6 issue: check your firewall first.  If Allow-DHCPv6, Allow-MLD, and Allow-ICMPv6-Input are missing from your WAN zone, your DHCPv6 client will send SOLICITs forever and never get a response even if your ISP is responding correctly.  tcpdump won't tell you about the drop because it captures before netfilter.
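A quick way to run that first check is to scan a copy of /etc/config/firewall for the three rule names.  This is a rough sketch under stated assumptions: missing_ipv6_rules is a hypothetical helper, it only looks at option name lines, and it cannot tell whether a rule that is present has been disabled or pointed at the wrong zone:

```python
import re

REQUIRED_RULES = ("Allow-DHCPv6", "Allow-MLD", "Allow-ICMPv6-Input")


def missing_ipv6_rules(firewall_config: str) -> list[str]:
    """Return the required rule names that never appear as `option name` in the text."""
    names = set(
        re.findall(r"^\s*option\s+name\s+['\"]?([^'\"\s]+)", firewall_config, re.MULTILINE)
    )
    return [rule for rule in REQUIRED_RULES if rule not in names]


# Cut-down example config, not taken from the router in this post.
sample = """
config rule
    option name 'Allow-Ping'
    option src 'wan'

config rule
    option name 'Allow-DHCPv6'
    option src 'wan'
"""
print(missing_ipv6_rules(sample))  # ['Allow-MLD', 'Allow-ICMPv6-Input']
```

An empty list means all three names are present, which is necessary but not sufficient; the nftables ruleset is still the final authority.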

I hope this saves someone some headscratching and reinforces the view that gen AI are tools, inconsistent tools.  Ta ta for now.  
Matthew

  ---
  Tags: OpenWRT, IPv6, DHCPv6, nftables, home lab, AI tooling, networking, Proxmox

26 June 2026

Running LLMs Locally: AMD APU vs Discrete GPU — Why Architecture Matters More Than Hardware

The Hardware

I benchmarked two very different local AI setups:

Matt-Mini — a Windows Mini PC that most people would dismiss for AI:
- CPU: AMD Ryzen 7 5800U (8 cores, Zen 3)
- iGPU: AMD Radeon Vega 8 (integrated, shared memory)
- RAM: 64GB DDR4-3200 (~50 GB/s bandwidth)

Ubuntu Laptop — a more conventional AI workstation:
- GPU: NVIDIA RTX 4070 8GB VRAM (~300 GB/s GDDR6X bandwidth)
- RAM: DDR5 system RAM (~80–100 GB/s), separate from GPU VRAM

The critical insight about the APU: the iGPU uses shared system memory as VRAM. With 64GB of RAM, the GPU can access tens of gigabytes for model weights — something impossible on a discrete GPU with fixed VRAM. The trade-off is bandwidth: DDR4 gives ~50 GB/s vs the RTX 4070's ~300 GB/s.


The Benchmark Setup

I used Ollama as the inference server (Vulkan backend for AMD iGPU — no ROCm required) and ran three prompts per model:

  • Short: "What is 2 + 2? Answer in one word." — tests base throughput
  • Reasoning: A multi-step maths problem — tests sustained generation
  • Coding: Fibonacci with memoization in Python — tests structured output

Metric: tokens per second (TPS) for generation.
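The benchmark script itself is not reproduced here, but generation TPS can be derived from the fields Ollama returns with each completed response: eval_count (tokens generated) and eval_duration (nanoseconds spent generating).  A minimal sketch, assuming the response has already been parsed into a dict; generation_tps is a hypothetical helper name:

```python
def generation_tps(response: dict) -> float:
    """Tokens per second for the generation phase of one Ollama response."""
    # eval_count = tokens generated; eval_duration = generation time in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Illustrative numbers only, not a measurement from this article.
example = {"eval_count": 120, "eval_duration": 10_000_000_000}
print(f"{generation_tps(example):.1f} TPS")  # 12.0 TPS
```

Prompt processing is reported separately (prompt_eval_count and prompt_eval_duration), so it does not inflate the generation figure.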


Results: Matt-Mini (AMD Ryzen 7 5800U + Vega 8 iGPU, 64GB shared RAM)

Model Architecture Comparison (all Q4_K_M)

| Model | Avg TPS | Total Params | Active Params | Type |
|---|---|---|---|---|
| qwen3:30b-a3b | 12.0 | 30B | 3B | MoE |
| qwen3-coder:30b-a3b | 12.1 | 30B | 3B | MoE (coding) |
| qwen3:8b | 5.3 | 8B | 8B | Dense |
| qwen3.5-abliterated:35b-a3b | 4.65 | 35B | ~3.5B | MoE (uncensored) |
| qwen3.5-opus-distill | 3.83 | 35B | ~3.5B | MoE (distilled, Q8_0) |
| mixtral:8x7b | 3.5 | 46.7B | 12.9B | MoE |
| deepseek-r1:14b | 3.1 | 14B | 14B | Dense |

Q4_K_M vs Q8_0 on Bandwidth-Constrained iGPU

The Vega 8 iGPU is bottlenecked by DDR4 memory bandwidth (~50 GB/s). Q8_0 uses 2× the memory bandwidth of Q4_K_M with no compute benefit on hardware lacking AVX_VNNI. The speed penalty is significant:

| Model | Q4_K_M TPS | Q8_0 TPS | Q4 faster by |
|---|---|---|---|
| qwen3-coder:30b-a3b | 12.1 | 7.73 | +57% |
| qwen3.5-abliterated:35b-a3b | 4.65 | 3.83 | +21% |

Use Q4_K_M on the APU. Q8_0 only makes sense if quality is paramount and you can accept the speed penalty.
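The "Q4 faster by" column is simply the ratio of the two rates expressed as a percentage.  A small sketch that reproduces it from the TPS figures in the table; percent_faster is a hypothetical helper name:

```python
def percent_faster(q4_tps: float, q8_tps: float) -> float:
    """How much faster Q4_K_M is than Q8_0, as a percentage of the Q8_0 rate."""
    return (q4_tps / q8_tps - 1) * 100


for model, q4, q8 in [
    ("qwen3-coder:30b-a3b", 12.1, 7.73),
    ("qwen3.5-abliterated:35b-a3b", 4.65, 3.83),
]:
    print(f"{model}: +{percent_faster(q4, q8):.0f}%")
```

Read the other way round, Q8_0 runs at roughly 64% and 82% of the Q4_K_M speed for these two models.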


Results: Ubuntu Laptop (NVIDIA RTX 4070 8GB, DDR5)

General and Reasoning Models

| Model | Avg TPS | Params | Notes |
|---|---|---|---|
| qwen2.5-coder:1.5b | 163 | 1.5B | Tiny, saturates GPU |
| qwen2.5-coder:7b | 52 | 7B | Fast in VRAM |
| qwen3.5:4b | 51 | 4B | |
| deepseek-r1:7b | 39 | 7B | Strong reasoning, consistent TPS |
| qwen3-vl:8b | 35 | 8B | Vision model |
| llama3.1:latest | 36 | 8B | |
| qwen3.5:latest | 24 | ~14B | Starts hitting VRAM limit |
| qwen3.5:27b | 3.0 | 27B | Exceeds 8GB VRAM, spills to RAM |

Vision Models (for ComfyUI and multimodal workflows)

| Model | Avg TPS | VRAM | Notes |
|---|---|---|---|
| qwen3-vl:4b-instruct-q8_0 | 45 | ~5.5GB | Best balance: fast, high quality, leaves headroom |
| qwen3-vl:8b-instruct-q4_K_M | 35 | ~5.5GB | Larger model, slightly slower, better comprehension |
| minicpm-v:8b-2.6-q4_K_M | 38 | ~5GB | Fast but terse: short responses on text tasks |
| qwen2.5vl:3b-q8_0 | 15 | ~3.5GB | Slow despite small size (VRAM load overhead) |

The dramatic drop from qwen3.5:latest (~24 TPS) to qwen3.5:27b (3 TPS) marks the VRAM cliff. Once the model no longer fits in 8GB, it spills to system RAM — but even though this machine has fast DDR5, the bottleneck becomes the PCIe bus (~32 GB/s) between the GPU and system memory, not the RAM speed itself. Performance collapses to APU-level speeds despite the faster RAM.


The Key Finding: Active Parameters Are What Matter

The headline result is qwen3:30b-a3b hitting 12 TPS — faster than the 8B dense model, despite having 30 billion total parameters.

This seems counterintuitive until you understand Mixture of Experts (MoE) architecture. In a MoE model, the network is split into many "expert" sub-networks. For any given token, only a small subset of experts are activated. qwen3:30b-a3b has 30B total parameters but only 3B active per token — the same compute cost per token as a 3B dense model, but with the knowledge capacity of a 30B model.

The rule that emerges from these results:

MoE speed advantage only materialises when active parameter count is kept low.

Look at mixtral:8x7b: it's MoE, but with 12.9B active parameters per token. Despite the MoE structure it runs at the same speed as the dense 14B model — because the active compute is similar.

qwen3:30b-a3b wins because it keeps active params at just 3B while maximising total capacity.
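A back-of-envelope calculation shows why active parameters dominate on a bandwidth-limited machine.  The sketch assumes generation is purely memory-bound (every active weight is read once per token) and that Q4_K_M averages about 4.5 bits per weight; both are simplifying assumptions rather than measured values, and bandwidth_ceiling_tps is a hypothetical helper:

```python
def bandwidth_ceiling_tps(active_params_b: float, bits_per_weight: float, bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if every active weight is read from memory once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


Q4_BITS = 4.5  # assumed average for Q4_K_M; the real figure varies by model

for name, active_b in [("qwen3:30b-a3b", 3.0), ("qwen3:8b", 8.0), ("deepseek-r1:14b", 14.0)]:
    print(f"{name}: <= {bandwidth_ceiling_tps(active_b, Q4_BITS, 50):.1f} TPS")
```

At ~50 GB/s this gives ceilings of about 29.6, 11.1 and 6.3 TPS, against measured rates of 12.0, 5.3 and 3.1.  Under these assumptions each model lands at somewhat under half of its ceiling, and the ordering follows active parameters rather than total parameters.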


The Two Hardware Stories

Discrete GPU: Fast but VRAM-limited

The RTX 4070 hits 35–163 TPS for models that fit in 8GB VRAM. It's fast — bandwidth is not the bottleneck. But the moment a model exceeds 8GB, performance falls off a cliff: qwen3.5:27b drops to 3 TPS, identical to the APU. The discrete GPU is a sprinter with a hard wall.

Shared-Memory APU: Slow but capacious

The Vega 8 iGPU runs at 3–12 TPS — slower across the board for models that fit in discrete VRAM. But it can run a 34GB Q8_0 model that would never fit on the RTX 4070. The APU is a distance runner with no wall.

Where they meet

When a model exceeds the discrete GPU's VRAM, both machines run at the same ~3 TPS. At that point, the APU's 64GB capacity advantage becomes the deciding factor — it can run larger models at equal speed, with Q8_0 quality instead of being forced into aggressive quantization.

The MoE Sweet Spot for APUs

Low active-parameter MoE is the ideal architecture for shared-memory systems: fewer active params = less bandwidth per token = more TPS on bandwidth-constrained DDR4. qwen3:30b-a3b at 12 TPS demonstrates this perfectly — 30B total parameters, but only 3B active, running faster than the dense 8B model.


Practical Recommendations

For AMD APU systems with 32GB+ unified memory (Ryzen 5800U, no AVX_VNNI):
1. Use qwen3:30b-a3b or qwen3-coder:30b-a3b as your default — ~12 TPS, best speed/quality
2. Use Q4_K_M, not Q8_0 — Q8_0 is 20–57% slower on bandwidth-limited DDR4; AVX_VNNI (which would offset the bandwidth cost) is not present on Zen 3
3. Prefer MoE models with low active param counts (under 4B active) — this is the single biggest performance lever
4. Ollama with Vulkan is the easiest path — no ROCm build required, works out of the box
5. Disable sleep — large model downloads will resume but you waste time

For discrete GPU systems (e.g. RTX 4070 8GB, Intel Ultra 7 165H with AVX_VNNI):
1. Match model size to VRAM — keep total model size under ~7.5GB to stay fully in VRAM
2. Q4_K_M for 7–8B models at this VRAM level — fits comfortably with headroom
3. Q8_0 is viable for vision models under 6GB (e.g. qwen3-vl:4b-instruct-q8_0) — AVX_VNNI on the host CPU means Q8_0 CPU fallback is no slower
4. For ComfyUI inpainting: qwen3-vl:4b-instruct-q8_0 at 45 TPS uses ~5.5GB, leaving room for the diffusion model
5. Avoid models that spill to RAM — PCIe bandwidth (~32 GB/s) becomes the bottleneck, not DDR5
6. For larger models, the APU is a natural complement — it runs 30B+ at equal speed to any spilling model


Tools Used

  • Ollama — inference server, Vulkan backend
  • llmfit — hardware-fit recommender (useful for finding candidate models, but note: speed estimates for Vega 8 iGPU are inaccurate — it assumes 180 GB/s ROCm bandwidth vs the real ~50 GB/s)
  • benchmark_ollama.py — custom benchmark script measuring TPS across models and prompt types

Tested April 2026 on Ollama — AMD Ryzen 7 5800U (Vega 8 iGPU, 64GB DDR4) and NVIDIA RTX 4070 8GB (DDR5 system RAM).

LiteLLM + Agent Teams: A Practical Guide

An aide memoire for using the local AI infrastructure day-to-day.


The big picture

You have three layers:

Your task (plain English)
        ↓
  Agent team (Python, OpenAI Agents SDK)
        ↓
  LiteLLM proxy  ←→  Ollama (local GPU)
                 ←→  OpenRouter (cloud free)
                 ←→  Anthropic (claude-haiku)

LiteLLM is a translation layer. It gives everything a single OpenAI-compatible URL (http://10.140.20.63:4000/v1) regardless of whether the model is running locally on your GPU or fetched from a cloud provider. Your code never changes — only the model name string changes.

The agent team is a set of specialised AI workers. You give the orchestrator a task in plain English; it decides which specialist to hand it to; the specialist does the work and hands results back.


Part 1 — Using LiteLLM directly

From the command line (curl)

# Ask any model a question
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key-needed" \
  -d '{
    "model": "qwen3.5:4b",
    "messages": [{"role": "user", "content": "What is a BGP route reflector?"}]
  }'

# List all available models
curl http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

From Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://10.140.20.63:4000/v1",
    api_key="no-key-needed",
)

response = client.chat.completions.create(
    model="qwen3.5:4b",   # or "claude-haiku-4-5", "nemotron-120b", etc.
    messages=[{"role": "user", "content": "Summarise this log: ..."}],
)
print(response.choices[0].message.content)

Choosing a model

| Use case | Model string | Where it runs |
|---|---|---|
| Quick questions, triage | qwen3.5:4b | Local GPU (3.4 GB) |
| Writing code | qwen2.5-coder:7b | Local GPU (4.7 GB) |
| General analysis | qwen3.5 | Local GPU (6.6 GB) |
| Images / screenshots | qwen3-vl | Local GPU (6.1 GB) |
| Heavy reasoning | nemotron-120b | Cloud free (OpenRouter) |
| Reliable tool calling | claude-haiku-4-5 | Cloud (Anthropic/OpenRouter) |
| Best available free | free | Cloud free (auto-routed) |

Group aliases — if the specific model is busy or unavailable, LiteLLM falls back automatically:

| Alias | Primary | Fallback |
|---|---|---|
| fast | qwen3.5:4b | qwen2.5-coder:1.5b |
| coder | qwen2.5-coder:7b | qwen2.5-coder:1.5b |
| local | qwen3.5 | llama3.1 |
| reasoning | nemotron-120b | gpt-oss-120b |
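The fallback happens inside the LiteLLM proxy, so client code normally just sends the alias as the model name.  Purely to illustrate the behaviour, here is a client-side sketch of the same idea; it is not LiteLLM's implementation, and complete_with_fallback and call_model are hypothetical names:

```python
from typing import Callable

# Alias table copied from above: alias -> (primary, fallback)
ALIASES = {
    "fast": ("qwen3.5:4b", "qwen2.5-coder:1.5b"),
    "coder": ("qwen2.5-coder:7b", "qwen2.5-coder:1.5b"),
    "local": ("qwen3.5", "llama3.1"),
    "reasoning": ("nemotron-120b", "gpt-oss-120b"),
}


def complete_with_fallback(alias: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Try the alias's primary model, then its fallback if the first call raises."""
    last_error = None
    for model in ALIASES[alias]:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # e.g. a timeout or rate limit from the backend
            last_error = exc
    raise RuntimeError(f"all models for alias {alias!r} failed") from last_error
```

Here call_model stands for whatever function actually sends the request, for instance a thin wrapper around client.chat.completions.create.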

Health check

curl http://10.140.20.63:4000/health
incus exec litellm -- journalctl -u litellm -f   # live logs

Part 2 — Running the agent team

The one-liner

cd /home/user/claude/agents
.venv/bin/python team.py "your task here"

Example tasks

# Coding
.venv/bin/python team.py "write a Python script that tails a log file and alerts on ERROR lines"

# Research
.venv/bin/python team.py "what are the main CVEs in OpenSSH versions 8.x to 9.x?"

# Analysis
.venv/bin/python team.py "analyse this nmap output and prioritise the findings: [paste output]"

# Mixed — the orchestrator chains specialists automatically
.venv/bin/python team.py "research the log4shell vulnerability then write a Python checker for it"

What happens under the hood

You: "research log4shell then write a checker"
        ↓
Orchestrator (claude-haiku) reads task
        ↓
Handoff → Researcher (nemotron-120b, cloud)
  "Log4Shell is CVE-2021-44228, affects Log4j 2.0–2.14.1..."
        ↓
Back to Orchestrator → Handoff → Coder (qwen2.5-coder:7b, local GPU)
  "def check_log4shell(host, port): ..."
        ↓
Orchestrator summarises and returns to you

The orchestrator uses haiku because it reliably produces valid tool-call JSON for handoffs. Local Ollama models are fast but unreliable at structured function-calling.

Watching it work

Add LITELLM_LOG=DEBUG to see every model call:

LITELLM_LOG=DEBUG .venv/bin/python team.py "hello"

Or watch the LiteLLM proxy logs live in another terminal:

incus exec litellm -- journalctl -u litellm -f

Part 3 — Writing your own agents

Minimal single agent

import asyncio, os
os.environ["OPENAI_BASE_URL"] = "http://10.140.20.63:4000/v1"
os.environ["OPENAI_API_KEY"]  = "no-key-needed"

from agents import Agent, Runner

agent = Agent(
    name="Helper",
    model="qwen3.5:4b",
    instructions="You are a helpful assistant. Be concise.",
)

async def main():
    result = await Runner.run(agent, "What is ARP spoofing?")
    print(result.final_output)

asyncio.run(main())

Adding tools (things agents can do)

from agents import Agent, Runner, function_tool
import httpx

@function_tool
async def get_url(url: str) -> str:
    """Fetch the contents of a URL."""
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(url)
        return r.text[:2000]   # truncate to avoid context overflow

agent = Agent(
    name="WebReader",
    model="qwen3.5:4b",
    instructions="You can fetch URLs to answer questions.",
    tools=[get_url],
)

Rule: tools are Python functions decorated with @function_tool. The agent decides when to call them. The docstring becomes the tool description — make it clear.

Handing off between agents

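import asyncio
from agents import Agent, Runner, handoff

# Assumes OPENAI_BASE_URL / OPENAI_API_KEY are set as in the minimal example above.
specialist = Agent(
    name="Specialist",
    model="qwen3.5",
    instructions="You handle detailed analysis. Return results clearly.",
)

orchestrator = Agent(
    name="Orchestrator",
    model="claude-haiku-4-5",
    instructions="Route analysis tasks to Specialist. Summarise results.",
    handoffs=[handoff(specialist)],
)

async def main():
    # await is only valid inside a coroutine, so the run is wrapped in main()
    result = await Runner.run(orchestrator, "Analyse this data: ...")
    print(result.final_output)

asyncio.run(main())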

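handoff() is itself a tool the orchestrator can call.  When it calls it, execution transfers to the specialist.  Note that in the Agents SDK a handoff is a transfer rather than a call: the specialist's answer becomes the result of the run unless the specialist is itself given a handoff back to the orchestrator.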

The existing tools you can reuse

gpu_tools.py — for any agent that needs to know about the GPU:

from gpu_tools import vram_status, list_local_models, comfyui_status
agent = Agent(..., tools=[vram_status, list_local_models])

devops_tools.py — for agents that manage containers:

from devops_tools import container_run, container_write_file, container_read_file, http_probe, container_systemctl
agent = Agent(..., tools=[container_run, http_probe])

Part 4 — Practical patterns

Pattern 1: Quick one-shot query

Use make_client() from litellm_client.py directly — no agent overhead:

from litellm_client import make_client, FAST_MODEL

async def ask(question: str) -> str:
    client = make_client()
    resp = await client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

Pattern 2: Task with a deadline / retry limit

result = await Runner.run(agent, task, max_turns=10)

max_turns prevents infinite loops. The team.py orchestrator uses 40 turns because research+code tasks can take many steps.

Pattern 3: Streaming output

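from agents import Runner
from openai.types.responses import ResponseTextDeltaEvent

# Inside an async function: run_streamed() returns at once and is not awaited
result = Runner.run_streamed(agent, task)
async for event in result.stream_events():
    # raw_response_event carries the token deltas coming from the model
    if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
        print(event.data.delta, end="", flush=True)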

Pattern 4: DevOps / automation agent

See setup_tts_stt.py as a reference. The pattern is:
1. Write a detailed task string explaining exactly what the agent should do and verify
2. Give it the right tools (container_run, http_probe, etc.)
3. Set instructions to "act immediately, don't ask permission"
4. Set max_turns=40 for multi-step work

agent = Agent(
    name="DevOps",
    model="claude-haiku-4-5",   # must use haiku — local models can't do tool-calling
    tools=[container_run, container_write_file, http_probe, container_systemctl],
    instructions="Act immediately. Never ask for permission. Verify each step.",
)
result = await Runner.run(agent, TASK, max_turns=40)

Part 5 — Gotchas and tips

Local models can't do structured tool-calling

qwen3.5, qwen2.5-coder:7b, etc. produce good prose but often garble the JSON format needed for handoff() and @function_tool calls. Always use claude-haiku-4-5 as your orchestrator — it's reliable and cheap (Anthropic free tier via OpenRouter).

Only one large model fits in VRAM at a time

The RTX 4070 has 8 GB. If you ask the orchestrator to hand off to a 6.6 GB local model while another 4.7 GB model is loaded, Ollama unloads the first one. There is a ~5–15 second cold-load delay. This is normal.
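The arithmetic behind this is simple enough to check before wiring two local models into the same team.  A rough sketch using the sizes from the "Choosing a model" table; it ignores KV cache and runtime overhead, so real headroom is smaller than this suggests, and fits_together is a hypothetical helper:

```python
VRAM_GB = 8.0  # RTX 4070, 8 GB

# Sizes from the "Choosing a model" table; KV cache and runtime overhead are ignored.
MODEL_SIZE_GB = {
    "qwen3.5:4b": 3.4,
    "qwen2.5-coder:7b": 4.7,
    "qwen3.5": 6.6,
    "qwen3-vl": 6.1,
}


def fits_together(*models: str, vram_gb: float = VRAM_GB) -> bool:
    """True if the listed models could all stay resident in VRAM at once."""
    return sum(MODEL_SIZE_GB[m] for m in models) <= vram_gb


print(fits_together("qwen3.5", "qwen2.5-coder:7b"))     # False: 11.3 GB > 8 GB
print(fits_together("qwen3.5:4b", "qwen2.5-coder:7b"))  # False: 8.1 GB > 8 GB
```

Even the two smallest general-purpose models miss by a whisker, which is why a swap between local specialists nearly always costs a cold load.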

Free cloud models are rate-limited

nemotron-120b and other OpenRouter free models may queue or time out under load. If an agent stalls for >2 minutes with no output, it's usually rate-limiting. Switch to gpt-oss-120b or qwen3-80b as alternatives.

The free model alias changes

openrouter/openrouter/free routes to whatever OpenRouter considers the best free model at that moment. Good for exploration; use a specific model name for reproducible pipelines.

Ollama keep-alive

Models stay in VRAM for 15 minutes after last use (OLLAMA_KEEP_ALIVE=15m on this setup).  If you want to free VRAM immediately:

curl -X POST http://10.140.20.1:11434/api/generate -d '{"model":"qwen3.5","keep_alive":0}'

Part 6 — Agent Team in Open WebUI

The agent team is exposed as a model in Open WebUI via the Pipelines server — a small FastAPI app that sits between Open WebUI and the agent code.

Open WebUI chat
      ↓  (selects "Agent Team" model)
Pipelines server  (host: 10.140.20.1:9099)
      ↓
Agent orchestrator (claude-haiku)
      ↓  handoffs
Specialist agents (local GPU / cloud free)

Architecture files

| File | Purpose |
|---|---|
| agents/pipelines/agent_team.py | The pipeline class; wraps the agent team |
| agents/run_pipelines.sh | Manual start script |
| /etc/systemd/system/owui-pipelines.service | Systemd service (starts on boot) |

Managing the pipelines server

sudo systemctl status owui-pipelines
sudo systemctl restart owui-pipelines
sudo journalctl -u owui-pipelines -f

Connecting to Open WebUI (one-time setup)

  1. Open http://localhost:3001
  2. Top-right avatar → Admin Panel
  3. Settings → Connections → Pipelines
  4. Add:
     • URL: http://10.140.20.1:9099
     • API Key: 0p3n-w3bu!
  5. Click Save; "Agent Team" now appears in the model picker

Using it

Select Agent Team in the model picker and chat normally. Each message is routed by the orchestrator to the right specialist. The full conversation history is passed so the team has context across turns.

The pipelines server API key (0p3n-w3bu!) is the default from the open-webui-pipelines package. Change it in /etc/systemd/system/owui-pipelines.service and update the Open WebUI connection setting to match.

Adding more pipelines

Drop a new .py file with a Pipeline class into agents/pipelines/, then:

sudo systemctl restart owui-pipelines

The new pipeline appears as a model in Open WebUI immediately.
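For orientation, this is roughly the smallest file that qualifies.  It is a sketch based on the class shape used by the example pipelines in the open-webui pipelines project (a Pipeline class with a name and a pipe method); the echo behaviour and the name are placeholders, and agent_team.py is not shown here:

```python
from typing import Generator, Iterator, List, Union


class Pipeline:
    """Smallest useful pipeline: echoes the user's message back."""

    def __init__(self):
        # Shown in the Open WebUI model picker.
        self.name = "Echo Example"  # hypothetical name

    async def on_startup(self):
        # Called when the pipelines server starts; set up clients or models here.
        pass

    async def on_shutdown(self):
        # Called when the pipelines server stops.
        pass

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Generator, Iterator]:
        # Return a string (or yield chunks) as the assistant's reply.
        return f"You said: {user_message}"
```

A real pipeline would replace the body of pipe with a call into the agent code and return or stream its output.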


Quick reference card

# Run agent team
cd /home/user/claude/agents && .venv/bin/python team.py "task"

# Query a model directly
curl http://10.140.20.63:4000/v1/chat/completions \
  -H "Content-Type: application/json" -H "Authorization: Bearer no-key-needed" \
  -d '{"model":"qwen3.5:4b","messages":[{"role":"user","content":"hello"}]}'

# List models
curl -s http://10.140.20.63:4000/v1/models | python3 -m json.tool | grep '"id"'

# Watch LiteLLM traffic
incus exec litellm -- journalctl -u litellm -f

# Check VRAM
curl -s http://10.140.20.1:11434/api/ps | python3 -m json.tool

# Add a model to Ollama
ollama pull <model-name>
# Then add it to /etc/litellm/config.yaml and push + restart

File map

/home/user/claude/agents/
├── team.py            ← entry point — run this
├── litellm_client.py  ← model constants and URLs
├── gpu_tools.py       ← tools: vram_status, list_local_models, comfyui_status
├── devops_tools.py    ← tools: container_run, container_write_file, http_probe, ...
├── setup_tts_stt.py   ← reference: single-purpose DevOps agent
└── .venv/             ← virtualenv (openai-agents, openai)

/etc/litellm/
├── config.yaml        ← model list (edit on host, push to container)
└── secrets.env        ← OPENROUTER_API_KEY