Bug ID 1273161: Secondary blades are unavailable, clusterd is reporting shutdown, and waiting for other blades

Last Modified: Jun 17, 2024

Affected Product(s):
BIG-IP None(all modules)

Known Affected Versions:
13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 13.1.1.2, 13.1.1.3, 13.1.1.4, 13.1.1.5, 13.1.3, 13.1.3.1, 13.1.3.2, 13.1.3.3, 13.1.3.4, 13.1.3.5, 13.1.3.6, 13.1.4, 13.1.4.1, 13.1.5, 13.1.5.1, 14.0.0, 14.0.0.1, 14.0.0.2, 14.0.0.3, 14.0.0.4, 14.0.0.5, 14.0.1, 14.0.1.1, 15.0.0, 15.0.1, 15.0.1.1, 15.0.1.2, 15.0.1.3, 15.0.1.4, 15.1.1, 15.1.2, 15.1.2.1, 15.1.3, 15.1.3.1, 15.1.4, 15.1.4.1, 15.1.5, 15.1.5.1, 15.1.6, 15.1.6.1, 15.1.7, 15.1.8, 15.1.8.1, 15.1.8.2, 15.1.9, 15.1.9.1, 15.1.10, 15.1.10.2, 15.1.10.3, 15.1.10.4, 16.0.0, 16.0.0.1, 16.0.1, 16.0.1.1, 16.0.1.2, 16.1.0, 16.1.1, 16.1.2, 16.1.2.1, 16.1.2.2, 16.1.3, 16.1.3.1, 16.1.3.2, 16.1.3.3, 16.1.3.4, 16.1.3.5, 16.1.4, 16.1.4.1, 16.1.4.2, 16.1.4.3, 17.0.0, 17.0.0.1, 17.0.0.2, 17.1.0, 17.1.0.1, 17.1.0.2, 17.1.0.3, 17.1.1, 17.1.1.1, 17.1.1.2, 17.1.1.3

Opened: Mar 17, 2023

Severity: 3-Major

Symptoms

On a multi-slot chassis, vCMP guest, or F5OS tenant, clusterd can enter a shutdown state, causing some slots to become unavailable. The triggering event is called a partition: clusterd stops receiving heartbeat packets from a slot over the mgmt_bp interface while still receiving them over the tmm_bp interface.

The following error is logged when this occurs:

Mar 17 10:38:28 localhost err clusterd[4732]: 013a0004:3: Marking slot 1 SS_FAILED due to partition detected on mgmt_bp from peer 2 to local 1

After this, clusterd enters a shutdown state and in some cases never recovers. The following example output of tmsh show sys cluster shows clusterd on slot 1 in the shutdown state, waiting for blade 2:

-----------------------------------------
Sys::Cluster: default
-----------------------------------------
Address                 172.0.0.160/23
Alt-Address             ::
Availability            available
State                   enabled
Reason                  Cluster Enabled
Primary Slot ID         2
Primary Selection Time  03/17/23 10:38:30

----------------------------------------------------------------------------------
| Sys::Cluster Members
| ID  Address  Alt-Address  Availability  State    Licensed  HA       Clusterd  Reason
----------------------------------------------------------------------------------
| 1   ::       ::           unknown       enabled  false     unknown  shutdown  ShutDown: default/1 waiting for blade 2
| 2   ::       ::           available     enabled  true      standby  running   Run
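To check saved log text for this condition, the partition message can be matched with a regular expression. The following is a minimal sketch, not an F5-provided tool: the pattern is modeled only on the single log sample in this article, so other releases may word the message differently, and the names PARTITION_RE and find_partition_events are hypothetical.

```python
import re

# Pattern derived from the one sample message in this article (assumption:
# other versions keep the same wording and message ID 013a0004).
PARTITION_RE = re.compile(
    r"clusterd\[\d+\]: 013a0004:3: Marking slot (?P<slot>\d+) SS_FAILED "
    r"due to partition detected on (?P<iface>\S+) "
    r"from peer (?P<peer>\d+) to local (?P<local>\d+)"
)


def find_partition_events(log_text):
    """Return one dict per partition message found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = PARTITION_RE.search(line)
        if match:
            events.append({
                "slot": int(match["slot"]),
                "interface": match["iface"],
                "peer": int(match["peer"]),
                "local": int(match["local"]),
            })
    return events


# The log line quoted in the Symptoms section.
SAMPLE_LOG = (
    "Mar 17 10:38:28 localhost err clusterd[4732]: 013a0004:3: "
    "Marking slot 1 SS_FAILED due to partition detected on mgmt_bp "
    "from peer 2 to local 1"
)

if __name__ == "__main__":
    for event in find_partition_events(SAMPLE_LOG):
        print(event)
```

A match only indicates that a partition was detected; whether clusterd actually remained in the shutdown state still has to be confirmed with tmsh show sys cluster.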

Impact

The unavailable slots/blades will not accept traffic.

Conditions

Multi-slot chassis, vCMP guest, or F5OS tenant. A blade detects a partition: it receives cluster packets over the tmm_bp interface but not over the mgmt_bp interface.

Workaround

Run tmsh show sys cluster to see the primary slot and the status of every slot. For each blade that reports shutdown (or, less commonly, initializing) with a reason of "waiting for blade(s)", restart clusterd on that slot by running bigstart restart clusterd. Do not restart clusterd on the primary slot.
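The slot selection described above can be sketched as a small parser over captured tmsh show sys cluster output. This is an illustration under stated assumptions, not an F5-provided script: it assumes the column layout of the sample in the Symptoms section (ID, Address, Alt-Address, Availability, State, Licensed, HA, Clusterd, Reason), that every column before Reason is a single token, and that the Clusterd column reads "shutdown" or "initializing" for affected blades. The function name slots_needing_restart is hypothetical. The sketch only reports candidate slots; it does not run any command.

```python
import re


def slots_needing_restart(cluster_output):
    """Return IDs of non-primary slots stuck waiting for other blades.

    Raises ValueError if the primary slot cannot be identified, because
    clusterd must not be restarted on the primary slot.
    """
    primary = re.search(r"Primary Slot ID\s+(\d+)", cluster_output)
    if primary is None:
        raise ValueError("primary slot not found in output")
    primary_id = int(primary.group(1))

    candidates = []
    for line in cluster_output.splitlines():
        row = line.strip().lstrip("|").strip()
        # Assumed layout: eight single-token columns, then free-text Reason.
        parts = row.split(None, 8)
        if len(parts) < 9 or not parts[0].isdigit():
            continue  # header, separator, or summary line
        slot_id, clusterd_state, reason = int(parts[0]), parts[7], parts[8]
        stuck = clusterd_state in ("shutdown", "initializing")
        waiting = "waiting for blade" in reason.lower()
        if stuck and waiting and slot_id != primary_id:
            candidates.append(slot_id)
    return candidates


# Output modeled on the example in the Symptoms section.
SAMPLE_OUTPUT = """\
Primary Slot ID         2
| Sys::Cluster Members
| ID  Address  Alt-Address  Availability  State  Licensed  HA  Clusterd  Reason
| 1  ::  ::  unknown  enabled  false  unknown  shutdown  ShutDown: default/1 waiting for blade 2
| 2  ::  ::  available  enabled  true  standby  running  Run
"""

if __name__ == "__main__":
    print(slots_needing_restart(SAMPLE_OUTPUT))
```

Each slot the function returns is a candidate for bigstart restart clusterd, run on that slot. Raising an error when the primary slot is missing from the output is a deliberate choice, so that a truncated capture never leads to a restart on the primary.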

Fix Information

None

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips