Bug ID 1305609: Missing cluster heartbeat packets in clusterd process and the blades temporarily leave the cluster

Last Modified: Jun 13, 2024

Affected Product(s):
BIG-IP TMOS (all modules)

Known Affected Versions:
12.0.0, 12.0.0 HF1, 12.1.0 HF1, 12.0.0 HF2, 12.1.0 HF2, 12.0.0 HF3, 12.0.0 HF4, 12.1.1 HF1, 12.1.1 HF2, 12.1.2 HF1, 12.1.2 HF2, 12.1.0, 12.1.1, 12.1.2, 12.1.3, 12.1.3.1, 12.1.3.2, 12.1.3.3, 12.1.3.4, 12.1.3.5, 12.1.3.6, 12.1.3.7, 12.1.4, 12.1.4.1, 12.1.5, 12.1.5.1, 12.1.5.2, 12.1.5.3, 12.1.6, 13.0.0, 13.0.0 HF1, 13.0.0 HF2, 13.0.0 HF3, 13.0.1, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 13.1.1.2, 13.1.1.3, 13.1.1.4, 13.1.1.5, 13.1.3, 13.1.3.1, 13.1.3.2, 13.1.3.3, 13.1.3.4, 13.1.3.5, 13.1.3.6, 13.1.4, 13.1.4.1, 13.1.5, 13.1.5.1, 14.0.0, 14.0.0.1, 14.0.0.2, 14.0.0.3, 14.0.0.4, 14.0.0.5, 14.0.1, 14.0.1.1, 14.1.0, 14.1.0.1, 14.1.0.2, 14.1.0.3, 14.1.0.5, 14.1.0.6, 14.1.2, 14.1.2.1, 14.1.2.2, 14.1.2.3, 14.1.2.4, 14.1.2.5, 14.1.2.6, 14.1.2.7, 14.1.2.8, 14.1.3, 14.1.3.1, 14.1.4, 14.1.4.1, 14.1.4.2, 14.1.4.3, 14.1.4.4, 14.1.4.5, 14.1.4.6, 14.1.5, 14.1.5.1, 14.1.5.2, 14.1.5.3, 14.1.5.4, 14.1.5.6, 15.0.0, 15.0.1, 15.0.1.1, 15.0.1.2, 15.0.1.3, 15.0.1.4, 15.1.0, 15.1.0.1, 15.1.0.2, 15.1.0.3, 15.1.0.4, 15.1.0.5, 15.1.1, 15.1.2, 15.1.2.1, 15.1.3, 15.1.3.1, 15.1.4, 15.1.4.1, 15.1.5, 15.1.5.1, 15.1.6, 15.1.6.1, 15.1.7, 15.1.8, 15.1.8.1, 15.1.8.2, 15.1.9, 15.1.9.1, 15.1.10, 15.1.10.2, 15.1.10.3, 15.1.10.4, 16.0.0, 16.0.0.1, 16.0.1, 16.0.1.1, 16.0.1.2, 16.1.0, 16.1.1, 16.1.2, 16.1.2.1, 16.1.2.2, 16.1.3, 16.1.3.1, 16.1.3.2, 16.1.3.3, 16.1.3.4, 16.1.3.5, 16.1.4, 16.1.4.1, 16.1.4.2, 16.1.4.3, 17.0.0, 17.0.0.1, 17.0.0.2, 17.1.0, 17.1.0.1, 17.1.0.2, 17.1.0.3, 17.1.1, 17.1.1.1, 17.1.1.2, 17.1.1.3

Opened: Jun 08, 2023

Severity: 3-Major

Symptoms

If two or more clusterd processes experience a long HAL timeout communicating with chmand, then any of those clusterd processes will report a lack of cluster heartbeat packets, and one or more blades will leave the cluster.

Here are two example log messages that occur when this issue is encountered.

# Slot 3 marking itself as failed because of a partition event where the heartbeat timeout occurred only on the mgmt_bp interface.
err clusterd[21260]: 013a0004:3: Marking slot 3 SS_FAILED due to partition detected on mgmt_bp from peer 4 to local 3

# Slot 2 marking slot 1 as failed due to a lack of cluster packets from slot 1 on both the mgmt and tmm bp interfaces.
err clusterd[29069]: 013a0004:3: Local slot 2: not getting clusterd pkts from slot 1; timed out on mgmt_bp and tmm_bp after 10 seconds. Marking peer slot 1 SS_FAILED

These messages are not unique to this bug. Other bugs and conditions can also cause clusterd to stop sending or receiving heartbeat packets.
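When reviewing logs for this symptom, the two example messages can be matched with a small script. The following is a minimal sketch in Python, not an F5 tool: the patterns are derived only from the two example lines in this article, the function name classify and the returned field names are hypothetical, and real log lines may differ in prefix or wording. Because these messages are not unique to this bug, a match only indicates that clusterd marked a slot as failed, not that this bug is the cause.

```python
import re

# Patterns built only from the two example messages in this article.
# Real log lines may carry extra prefixes (timestamp, hostname, slot).
PARTITION_RE = re.compile(
    r"clusterd\[\d+\]: 013a0004:3: Marking slot (?P<slot>\d+) SS_FAILED "
    r"due to partition detected on (?P<iface>\S+) "
    r"from peer (?P<peer>\d+) to local (?P<local>\d+)"
)
TIMEOUT_RE = re.compile(
    r"clusterd\[\d+\]: 013a0004:3: Local slot (?P<local>\d+): "
    r"not getting clusterd pkts from slot (?P<peer>\d+); "
    r"timed out on (?P<ifaces>.+?) after (?P<seconds>\d+) seconds\. "
    r"Marking peer slot (?P<failed>\d+) SS_FAILED"
)


def classify(line):
    """Return a dict describing the failure, or None if the line does not match."""
    m = PARTITION_RE.search(line)
    if m:
        return {
            "kind": "partition",
            "failed_slot": int(m["slot"]),
            "interfaces": [m["iface"]],
        }
    m = TIMEOUT_RE.search(line)
    if m:
        return {
            "kind": "timeout",
            "failed_slot": int(m["failed"]),
            "reporting_slot": int(m["local"]),
            "interfaces": m["ifaces"].split(" and "),
            "seconds": int(m["seconds"]),
        }
    return None
```

Feeding each log line to classify and counting results per failed_slot gives a quick view of which blades were marked SS_FAILED and on which backplane interfaces.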

Impact

A blade will temporarily leave the cluster but then re-join unless ID 1273161 or something similar also occurs. If the number of blades leaving the cluster causes the number of online blades to be less than min-up-members, min-up-members-enabled is set to 'yes', and the chassis is Active, a failover will occur.
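The failover condition described above combines three checks. As a rough illustration only, it can be written as a single predicate; the function and parameter names below are hypothetical and merely mirror the settings named in this article, and this sketch does not model how BIG-IP evaluates the condition internally.

```python
def failover_expected(online_blades, min_up_members,
                      min_up_members_enabled, chassis_active):
    """Sketch of the Impact condition: all three checks must hold.

    online_blades          -- blades still in the cluster after some have left
    min_up_members         -- configured min-up-members value
    min_up_members_enabled -- True when min-up-members-enabled is 'yes'
    chassis_active         -- True when the chassis is Active
    """
    return (
        chassis_active
        and min_up_members_enabled
        and online_blades < min_up_members
    )
```

For example, on an Active chassis with min-up-members set to 4 and enabled, two of five blades leaving the cluster leaves three online, so the predicate is true. With only one blade leaving, four remain online and the predicate is false.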

Conditions

1) Multi-blade chassis with a minimum of 5 blades. More blades increase the chances of encountering this bug.
2) A condition that causes long HAL delays between clusterd and chmand. One such condition, specific to 14.1.x and prior, is a full config sync. However, that condition was fixed in 15.1.0 and higher with the changes for ID 721020 and ID 746122.

Workaround

N/A

Fix Information

None

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips