07 May 2026

Proxmox Cluster Going Sluggish? Your Offline Node Has a Stale Config

You power on a node that's been offline for a while. Within seconds, the Proxmox web UI starts showing other nodes as dead. Management operations slow to a crawl. Nothing is obviously broken — all the nodes are still pinging — but something is clearly very wrong.

This is the stale corosync config problem, and it's easy to fix once you know what to look for.

What's Happening

Proxmox uses corosync to manage cluster membership. Every config change — adding a node, removing a node, changing votes — increments a config_version in /etc/corosync/corosync.conf. All cluster members must agree on this version.

When a node comes back online after missing several config changes, corosync starts up with an old config_version. The other nodes reject its packets. But corosync doesn't give up — it keeps retrying, flooding the network with rejected authentication attempts. This hammers pvedaemon on every node, causing the web UI to become sluggish and show phantom "dead" nodes even though the cluster itself is technically still quorate.
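Since the mismatch is just a number in a text file, it is easy to check mechanically. A minimal sketch (the sample stanza below is illustrative; on a live node you would run the extraction against /etc/corosync/corosync.conf directly, ideally on every node so the odd one out stands out):

```shell
# Pull config_version out of a corosync.conf-style stanza. On a real
# node, pipe in /etc/corosync/corosync.conf instead of the sample.
extract_version() {
  awk -F': *' '/config_version/ { print $2 }'
}

# Demo with an illustrative totem stanza:
printf 'totem {\n  config_version: 10\n  version: 2\n}\n' | extract_version
```

A node that prints a lower number than its peers is the one booting with a stale config.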

Diagnosis

First, confirm the cluster itself is still healthy from a node you trust:

pvecm status

If you see Quorate: Yes, the cluster is fine — the problem is the misbehaving node, not a genuine quorum loss. Note the Config Version value.

Then SSH into the suspect node and check:

ssh pve7 systemctl status corosync
ssh pve7 grep config_version /etc/corosync/corosync.conf

Here's what it looks like when you've found the culprit:

● corosync.service - Corosync Cluster Engine
     Active: active (running) since Thu 2026-05-07 17:30:55 BST; 8min ago
   Main PID: 1146 (corosync)
     Memory: 155.9M (peak: 171.9M)
        CPU: 11.549s

May 07 17:39:36 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:37 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:38 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Sync members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [TOTEM ] A new membership (3.1a3c) was formed. Members
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Members[1]: 3
May 07 17:39:43 pve7 corosync[1146]:   [MAIN  ] Completed service synchronization, ready to provide service.

  config_version: 10

Three red flags:
- "Packet rejected" — every node is refusing this node's traffic
- "Sync members[1]: 3" — the node has formed a pseudo-cluster containing only itself (nodeid 3)
- config_version: 10 while the live cluster is at version 19 or higher
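A quick way to gauge how hard the node is hammering its peers is to tally the rejections per source address. A sketch against the excerpt above (on a live node you would feed it from the journal, e.g. `journalctl -u corosync --since "10 minutes ago"`, instead of the pasted sample):

```shell
# Count "Packet rejected" lines per peer address. Sample lines are
# taken from the log excerpt above; pipe in journalctl output live.
sample='May 07 17:39:36 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:37 pve7 corosync[1146]:   [KNET  ] rx: Packet rejected from 10.140.3.80:5405
May 07 17:39:43 pve7 corosync[1146]:   [QUORUM] Sync members[1]: 3'

echo "$sample" | grep -o 'Packet rejected from [0-9.]*' | sort | uniq -c
```

On a genuinely flooding node the counts climb into the hundreds within minutes, one bucket per peer that is rejecting it.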

Fix

Step 1 — Stop corosync on the problem node

systemctl stop corosync

Verify it stopped:

systemctl status corosync

Expected output:

○ corosync.service - Corosync Cluster Engine
     Active: inactive (dead) since Thu 2026-05-07 17:40:45 BST; 48s ago

The web UI should recover almost immediately once the flood of rejected packets stops.

Step 2 — Check the corosync directory

ls -la /etc/corosync/

You'll likely see an authkey from the node's prior cluster membership:

drwxr-xr-x  3 root root 4096 Apr 25 17:21 .
-rw-r--r--  1 root root  256 Apr 25 16:14 authkey
-rw-r--r--  1 root root  639 Apr 25 17:21 corosync.conf

Don't manually delete it — pvecm add --force will handle it cleanly.
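If you want a safety net anyway, copying the files aside (rather than deleting them) lets you diff the old and new configs after the rejoin. A sketch, where the backup location and the scratch-directory dry-run are both arbitrary choices, not part of the pvecm procedure:

```shell
# Copy a node's local corosync config aside before a forced rejoin so
# the stale version can be inspected later. Backup path is arbitrary.
backup_corosync() {
  src=$1
  dest="$2/corosync-backup-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" && cp -a "$src"/corosync.conf "$dest"/ && echo "$dest"
}

# On the node itself you would run: backup_corosync /etc/corosync /root
# Demo against a scratch copy so the sketch is safe to dry-run:
tmp=$(mktemp -d)
mkdir -p "$tmp/etc"
echo 'config_version: 10' > "$tmp/etc/corosync.conf"
backup_corosync "$tmp/etc" "$tmp"
```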

Step 3 — Rejoin the cluster

Run this from /tmp (pvecm refuses to run from inside /etc/pve/):

cd /tmp && pvecm add <lead-node-ip> --use_ssh --force

For example:

cd /tmp && pvecm add 10.140.3.10 --use_ssh --force
  • --use_ssh — uses existing SSH key trust instead of the API password prompt
  • --force — overrides warnings about existing config, authkey, and VMs (all expected for a rejoin)

You'll see output like:

detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* cluster config '/etc/pve/corosync.conf' already exists
* this host already contains virtual guests

WARNING : detected error but forced to continue!

copy corosync auth key
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1778172132.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'pve7' to cluster.

Step 4 — Verify

pvecm status

Healthy output looks like:

Cluster information
-------------------
Name:             pve
Config Version:   20
Transport:        knet
Secure auth:      on

Quorum information
------------------
Nodes:            5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          3 10.140.3.10
0x00000002          1 10.140.3.80
0x00000003          1 10.140.3.70 (local)
0x00000004          1 10.140.3.82
0x00000006          1 10.140.3.20

All nodes present, Quorate: Yes, config_version incremented by 2 (one increment per node during the join handshake — this is expected).

Why This Keeps Happening

Corosync is enabled by default and starts automatically on boot. If a node has been offline long enough to miss cluster config changes, it will always boot into this broken state. The node isn't malfunctioning — it's doing exactly what it's designed to do with the config it has. It's just that the config is stale.

Prevention: Before powering on a long-offline Proxmox node, check your current cluster's config_version with pvecm status. If it's significantly ahead of what the returning node last knew, plan for a rejoin rather than assuming it'll come back cleanly.
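That check can be scripted: parse the Config Version line from `pvecm status` on a live node and compare it against the version recorded in the returning node's corosync.conf. A sketch using captured sample lines in place of the live commands:

```shell
# Compare the live cluster's Config Version with the version a
# returning node last recorded. The two sample lines stand in for:
#   pvecm status | grep 'Config Version'              (live node)
#   grep config_version /etc/corosync/corosync.conf   (returning node)
cluster_ver=$(printf 'Config Version:   20\n' | awk '{ print $3 }')
node_ver=$(printf 'config_version: 10\n' | awk -F': *' '{ print $2 }')

if [ "$node_ver" -lt "$cluster_ver" ]; then
  echo "stale: node at $node_ver, cluster at $cluster_ver -- plan a rejoin"
else
  echo "versions match -- node should rejoin cleanly"
fi
```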

Quick Reference

Command                                              Purpose
pvecm status                                         Check cluster health and config version
systemctl status corosync                            Check corosync state on a node
grep config_version /etc/corosync/corosync.conf      Check the node's config version
systemctl stop corosync                              Stop the misbehaving corosync
cd /tmp && pvecm add <ip> --use_ssh --force          Rejoin the cluster
