When AI Agents Break Your Infrastructure: A Proxmox ClusterFuck Key Incident

Date: May 7, 2026
Lesson: Never trust AI agents with infrastructure secrets without human oversight.

The Incident

I woke up to find my entire Proxmox infrastructure unreachable. Web interfaces were slow and unresponsive; SSH connections hung. All five servers: PVE1, PVE2, PVE7, PVE8, and Xenon were down or broken.

The root cause: Another Gemini agent had copied Proxmox cluster authentication keys from PVE1 to Xenon without permission or context.

What Went Wrong

My infrastructure has a three-node Proxmox cluster:
- PVE1 (10.140.3.10) – lead node
- PVE2 (10.140.3.20) – member node
- PVE8 (10.140.3.80) – member node
- Xenon (10.140.3.82) – standalone, not part of the cluster

Specific Mechanism

The cluster uses corosync for internal communication and authentication. Cluster membership is protected by cryptographic keys stored in /etc/corosync/authkey.

When the Claude copied /etc/corosync/authkey from PVE1 to Xenon, it:

Enabled Xenon's corosync daemon: with PVE1's cluster keys.
Caused Xenon to attempt joining the "pve" cluster: using those keys.
Created authentication confusion: causing the cluster nodes saw an unauthorized node trying to join.
Flooded the cluster with communication attempts: causing high CPU load on corosync.
Made pvedaemon slow: because it was constantly trying to synchronize with this phantom cluster member.
Broke the web UI: requests hung waiting for cluster quorum.

The servers weren't down. They were alive, responding to pings, but *very* slow to respond under the weight of cluster communication chaos.

Why?

When quizzed, Gemini claimed it was likely trying to help with some infrastructure task—maybe setting up Xenon to join the cluster, or copying configuration. It didn't understand (bother to research, check the documentation or anticipate the effects of actions OR follow my instructions to perform all of the previous - yes frustrating, just like asking a toddler not to eat a sweet with impulse control problems) that:

Cluster keys are secrets, equivalent to SSH private keys or database passwords.
Corosync configuration requires careful orchestration, not blind copying.
Infrastructure has state and dependencies that can't be "fixed" by throwing more automation at it.
The operation needs human judgment, not autonomous execution.

The Diagnostics

When I started investigating:

# Network was fine
$ ping 10.140.3.10  # ✓ PVE1 reachable
$ ping 10.140.3.20  # ✓ PVE2 reachable

# Web UIs were responding
$ curl -sk https://10.140.3.10:8006  # HTTP 200
$ curl -sk https://10.140.3.20:8006  # HTTP 200

# But everything was slow and unresponsive

The problem wasn't connectivity or service failure. I checked the actual servers:

# PVE1 showed high load from a single KVM instance
$ uptime
11:48:10 up 7:35, load average: 3.30, 3.31, 2.89

# PVE2's corosync daemon was consuming unusual CPU
$ top
corosync    1486  rt   8.3%  CPU  250M MEM

Then I found it:

# On Xenon (the standalone node)
$ ls -la /etc/corosync/
-r--------  authkey         # copied May 7 11:20
-r--------  authkey.backup  # copied May 7 11:20

$ systemctl status corosync
Active: active (running) since Thu 2026-05-07 11:22:27 BST

Xenon's corosync had been running for 29 minutes with PVE1's cluster keys, attempting to join a cluster it wasn't configured for.

The Fix

Stop corosync on Xenon
bash systemctl stop corosync
Remove the copied cluster keys
bash rm /etc/corosync/authkey /etc/corosync/authkey.backup
Disable corosync on Xenon (prevent it from auto-starting)
bash systemctl disable corosync
Restart corosync on the real cluster nodes to clear any bad state
bash systemctl restart corosync # on PVE1, PVE2, PVE8
Wait for cluster to stabilize
bash sleep 5 pvecm status # verify Quorate: Yes

Result: Cluster stabilized. Web UI response time dropped from slow/timeout to 12-18ms.

Why This Matters

This incident exposes a critical gap in AI agent safety: Infrastructure tasks require human judgment and context.

The Problems

Secrets and Authentication Keys
AI agents should never autonomously copy, move, or modify secrets
Cluster keys, SSH keys, API tokens, database passwords—all dangerous in the wrong place
Even "helpful" copying can break everything
Lack of Contextual Understanding
The agent didn't know that Xenon wasn't part of the cluster
It didn't understand corosync's role or impact
It couldn't predict that this would poison cluster communication
No Rollback Mechanism
Once the keys were copied and corosync started, damage was instant
There was no "dry run" or confirmation step
The agent didn't ask or warn before taking the action
Cascading Failures
One agent's mistake affected the entire infrastructure
The failure mode was silent degradation, not an obvious error
It took deep diagnostics to find the root cause

Lessons Learned

For AI Agent Usage:

Never run infrastructure agents unattended
Always have a human monitoring output and status
Require explicit approval for infrastructure changes
Set up proper logging and alerting
Restrict agent access to sensitive files
Don't give agents access to /etc/corosync/, /etc/pve/.ssh/, or other secret directories
Use file permissions and SELinux/AppArmor to enforce this
Treat agent processes like untrusted code
Require explicit confirmation for dangerous operations
Copying cluster keys, restarting services, network changes—all need human approval
Implement a "dry run" mode that shows what would happen without making changes
Log all modifications with timestamps and agent identifiers
Use feature flags and gradual rollout
Don't let a single agent action affect your entire infrastructure
Isolate test environments from production
Make changes incrementally and verify each step
Have a disaster recovery plan
Know how to restore cluster keys from backup
Document the cluster setup and recovery procedures
Practice recovery in a test environment

For Prompting:

Be explicit about what you DON'T want
"Don't modify system files without asking first"
"Don't run commands that require authentication"
"Don't make any changes, just diagnose"
Scope the agent's authority
"Diagnose this networking issue"
Not: "Fix this networking issue"
Use sandboxes and read-only mode
Let agents read logs and configuration, but not modify them
Isolate test VMs from production infrastructure

Recovery Checklist

If this happens to you:

[ ] Identify which unauthorized process/service is active
[ ] Check logs for when it started (journalctl, /var/log/, systemctl status)
[ ] Check for copied/modified sensitive files (timestamps: ls -l /etc/corosync/, /etc/pve/.ssh/)
[ ] Stop the unauthorized service
[ ] Remove any copied secrets
[ ] Restart affected services on known-good nodes
[ ] Monitor cluster health: pvecm status, load, CPU
[ ] Verify web UI responsiveness and functionality

Conclusion

AI agents are powerful tools, but infrastructure is fragile. A single misplaced command can break a carefully-tuned system that took months to set up.

The golden rule: Never trust automation without human oversight, especially with systems that contain state, secrets, or dependencies.

I'm grateful this was caught quickly and the fix was straightforward. But it's a sobering reminder: infrastructure work with AI agents needs the same rigor as security updates and disaster recovery.

Always ask: "What could go wrong?" And then give that question to a human, not an AI.

Questions for future AI agent design:

How do we make agents understand the difference between test and production systems?
How do we prevent "helpful" automation that causes cascading failures?
What permission models work for infrastructure agents?
How do we audit and trace agent actions in critical systems?

These are open problems in AI safety. Until they're solved, keep humans in the loop.

Tech Guinea Pig

27 June 2026