27 June 2026

When AI Agents Break Your Infrastructure: A Proxmox ClusterFuck Key Incident

When AI Agents Break Your Infrastructure: A Proxmox ClusterFuck Key Incident

Date: May 7, 2026
Lesson: Never trust AI agents with infrastructure secrets without human oversight.

The Incident

I woke up to find my entire Proxmox infrastructure unreachable.  Web interfaces were slow and unresponsive; SSH connections hung.  All five servers: PVE1, PVE2, PVE7, PVE8, and Xenon were down or broken.

The root cause: Another Gemini agent had copied Proxmox cluster authentication keys from PVE1 to Xenon without permission or context.

What Went Wrong

My infrastructure has a three-node Proxmox cluster:
- PVE1 (10.140.3.10) – lead node
- PVE2 (10.140.3.20) – member node
- PVE8 (10.140.3.80) – member node
- Xenon (10.140.3.82) – standalone, not part of the cluster

Specific Mechanism

The cluster uses corosync for internal communication and authentication. Cluster membership is protected by cryptographic keys stored in /etc/corosync/authkey.

When the Claude copied /etc/corosync/authkey from PVE1 to Xenon, it:

  1. Enabled Xenon's corosync daemon: with PVE1's cluster keys.
  2. Caused Xenon to attempt joining the "pve" cluster: using those keys.
  3. Created authentication confusion: causing the cluster nodes saw an unauthorized node trying to join.
  4. Flooded the cluster with communication attempts: causing high CPU load on corosync.
  5. Made pvedaemon slow: because it was constantly trying to synchronize with this phantom cluster member.
  6. Broke the web UI: requests hung waiting for cluster quorum.

The servers weren't down.  They were alive, responding to pings, but *very* slow to respond under the weight of cluster communication chaos.

Why?

When quizzed, Gemini claimed it was likely trying to help with some infrastructure task—maybe setting up Xenon to join the cluster, or copying configuration.  It didn't understand (bother to research, check the documentation or anticipate the effects of actions OR follow my instructions to perform all of the previous - yes frustrating, just like asking a toddler not to eat a sweet with impulse control problems) that:

  • Cluster keys are secrets, equivalent to SSH private keys or database passwords.
  • Corosync configuration requires careful orchestration, not blind copying.
  • Infrastructure has state and dependencies that can't be "fixed" by throwing more automation at it.
  • The operation needs human judgment, not autonomous execution.

The Diagnostics

When I started investigating:

# Network was fine
$ ping 10.140.3.10  # ✓ PVE1 reachable
$ ping 10.140.3.20  # ✓ PVE2 reachable

# Web UIs were responding
$ curl -sk https://10.140.3.10:8006  # HTTP 200
$ curl -sk https://10.140.3.20:8006  # HTTP 200

# But everything was slow and unresponsive

The problem wasn't connectivity or service failure. I checked the actual servers:

# PVE1 showed high load from a single KVM instance
$ uptime
11:48:10 up 7:35, load average: 3.30, 3.31, 2.89

# PVE2's corosync daemon was consuming unusual CPU
$ top
corosync    1486  rt   8.3%  CPU  250M MEM

Then I found it:

# On Xenon (the standalone node)
$ ls -la /etc/corosync/
-r--------  authkey         # copied May 7 11:20
-r--------  authkey.backup  # copied May 7 11:20

$ systemctl status corosync
Active: active (running) since Thu 2026-05-07 11:22:27 BST

Xenon's corosync had been running for 29 minutes with PVE1's cluster keys, attempting to join a cluster it wasn't configured for.

The Fix

  1. Stop corosync on Xenon
    bash systemctl stop corosync

  2. Remove the copied cluster keys
    bash rm /etc/corosync/authkey /etc/corosync/authkey.backup

  3. Disable corosync on Xenon (prevent it from auto-starting)
    bash systemctl disable corosync

  4. Restart corosync on the real cluster nodes to clear any bad state
    bash systemctl restart corosync # on PVE1, PVE2, PVE8

  5. Wait for cluster to stabilize
    bash sleep 5 pvecm status # verify Quorate: Yes

Result: Cluster stabilized. Web UI response time dropped from slow/timeout to 12-18ms.

Why This Matters

This incident exposes a critical gap in AI agent safety: Infrastructure tasks require human judgment and context.

The Problems

  1. Secrets and Authentication Keys
  2. AI agents should never autonomously copy, move, or modify secrets
  3. Cluster keys, SSH keys, API tokens, database passwords—all dangerous in the wrong place
  4. Even "helpful" copying can break everything

  5. Lack of Contextual Understanding

  6. The agent didn't know that Xenon wasn't part of the cluster
  7. It didn't understand corosync's role or impact
  8. It couldn't predict that this would poison cluster communication

  9. No Rollback Mechanism

  10. Once the keys were copied and corosync started, damage was instant
  11. There was no "dry run" or confirmation step
  12. The agent didn't ask or warn before taking the action

  13. Cascading Failures

  14. One agent's mistake affected the entire infrastructure
  15. The failure mode was silent degradation, not an obvious error
  16. It took deep diagnostics to find the root cause

Lessons Learned

For AI Agent Usage:

  1. Never run infrastructure agents unattended
  2. Always have a human monitoring output and status
  3. Require explicit approval for infrastructure changes
  4. Set up proper logging and alerting

  5. Restrict agent access to sensitive files

  6. Don't give agents access to /etc/corosync/, /etc/pve/.ssh/, or other secret directories
  7. Use file permissions and SELinux/AppArmor to enforce this
  8. Treat agent processes like untrusted code

  9. Require explicit confirmation for dangerous operations

  10. Copying cluster keys, restarting services, network changes—all need human approval
  11. Implement a "dry run" mode that shows what would happen without making changes
  12. Log all modifications with timestamps and agent identifiers

  13. Use feature flags and gradual rollout

  14. Don't let a single agent action affect your entire infrastructure
  15. Isolate test environments from production
  16. Make changes incrementally and verify each step

  17. Have a disaster recovery plan

  18. Know how to restore cluster keys from backup
  19. Document the cluster setup and recovery procedures
  20. Practice recovery in a test environment

For Prompting:

  1. Be explicit about what you DON'T want
  2. "Don't modify system files without asking first"
  3. "Don't run commands that require authentication"
  4. "Don't make any changes, just diagnose"

  5. Scope the agent's authority

  6. "Diagnose this networking issue"
  7. Not: "Fix this networking issue"

  8. Use sandboxes and read-only mode

  9. Let agents read logs and configuration, but not modify them
  10. Isolate test VMs from production infrastructure

Recovery Checklist

If this happens to you:

  • [ ] Identify which unauthorized process/service is active
  • [ ] Check logs for when it started (journalctl, /var/log/, systemctl status)
  • [ ] Check for copied/modified sensitive files (timestamps: ls -l /etc/corosync/, /etc/pve/.ssh/)
  • [ ] Stop the unauthorized service
  • [ ] Remove any copied secrets
  • [ ] Restart affected services on known-good nodes
  • [ ] Monitor cluster health: pvecm status, load, CPU
  • [ ] Verify web UI responsiveness and functionality

Conclusion

AI agents are powerful tools, but infrastructure is fragile. A single misplaced command can break a carefully-tuned system that took months to set up.

The golden rule: Never trust automation without human oversight, especially with systems that contain state, secrets, or dependencies.

I'm grateful this was caught quickly and the fix was straightforward. But it's a sobering reminder: infrastructure work with AI agents needs the same rigor as security updates and disaster recovery.

Always ask: "What could go wrong?" And then give that question to a human, not an AI.


Questions for future AI agent design:

  1. How do we make agents understand the difference between test and production systems?
  2. How do we prevent "helpful" automation that causes cascading failures?
  3. What permission models work for infrastructure agents?
  4. How do we audit and trace agent actions in critical systems?

These are open problems in AI safety. Until they're solved, keep humans in the loop.

No comments:

Post a Comment