Bug ID 1556173: Poor management backplane link performance on system controller failover

Last Modified: Apr 28, 2025

Affected Product(s):
F5OS Velos (all modules)

Known Affected Versions:
F5OS-C 1.6.0, F5OS-C 1.6.1, F5OS-C 1.6.2

Fixed In:
F5OS-C 1.8.0

Opened: Mar 05, 2024

Severity: 3-Major

Symptoms

Chassis management backplane connectivity may be disrupted for 1-5 seconds and, in specific situations, for up to 20 seconds. During this time, tenant instances are unable to communicate with each other over the chassis management backplane.

Impact

Because tenant instances cannot communicate with one another during this period, a link downtime of more than 10 seconds triggers a clusterd timeout on the BIG-IP tenant. If that BIG-IP tenant is active in an HA pair, a failover is triggered and the standby BIG-IP becomes active. Additionally, a sod out-of-band mgmt timeout is triggered for that BIG-IP tenant even if the system controller's management interfaces are configured in a trunk. In some scenarios, this can cause temporary split-brain behavior between BIG-IP tenants in an HA pair. If the downtime is long enough and the tenants are multi-slot, unexpected HA failovers can occur even when a TMM self-IP is configured in the HA mesh.

Conditions

A system controller failover occurs. Rebooting the active system controller may aggravate the symptoms.

Workaround

No workaround, only mitigations.

1. Do not reboot the active system controller. Instead, perform a system controller failover, then reboot the controller that was previously active.

2. To mitigate issues during an unplanned controller failover (for example, one caused by health check failures), increase each BIG-IP tenant's clusterd timeout and/or sod timeout, up to 30 seconds, to reduce erroneous sod and clusterd timeouts. The clusterd timeout can be modified on each BIG-IP with: tmsh modify sys db clusterd.peermembertimeout value <int>. The sod timeout can be modified on each BIG-IP with: tmsh modify sys db failover.nettimeoutsec value <int>.

3. To mitigate issues during planned controller failovers in a maintenance window, it is possible to prevent unwanted failovers between BIG-IP tenants, or split-brain behavior, altogether. One strategy is, for each BIG-IP HA pair, to set failover offline on the BIG-IP device that runs on the chassis where the controller failovers are to be executed. While the BIG-IP device is offline, health checks like the sod and clusterd timeouts will not trigger a failover to offline BIG-IP devices. Once the maintenance window is over, set failover back online on each BIG-IP device. For how to set a BIG-IP traffic-group's device offline, see https://my.f5.com/manage/s/article/K15122.
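The timeout changes in mitigation 2 could be scripted along the following lines. This is a minimal sketch, not an F5-provided procedure: the variable names TIMEOUT_SECS and APPLY and the helper functions run and raise_timeouts are hypothetical, and the value of 30 is simply the upper bound mentioned above. By default the script only prints the tmsh commands, so they can be reviewed before being run on each BIG-IP tenant.

```shell
#!/bin/sh
# Sketch: print (or, with APPLY=1, execute) the tmsh commands from mitigation 2.
# TIMEOUT_SECS and APPLY are hypothetical names chosen for this example.
TIMEOUT_SECS="${TIMEOUT_SECS:-30}"

# Echo the command in dry-run mode; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Raise the clusterd and sod timeouts named in this bug entry.
raise_timeouts() {
    run tmsh modify sys db clusterd.peermembertimeout value "$TIMEOUT_SECS"
    run tmsh modify sys db failover.nettimeoutsec value "$TIMEOUT_SECS"
}

raise_timeouts
```

Running the script unchanged prints the two commands with a value of 30. Setting APPLY=1 on a BIG-IP tenant would execute them, assuming tmsh is available in the shell's PATH.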

Fix Information

System controller failover incurs no chassis management backplane link downtime.

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips