Bug ID 481869: Certain blade failure events may result in a 10+ second delay in failover occurring

Last Modified: Mar 12, 2019

Bug Tracker

Affected Product:  See more info
BIG-IP TMOS(all modules)

Known Affected Versions:
11.4.1, 11.5.1, 11.5.1 HF1, 11.5.1 HF10, 11.5.1 HF11, 11.5.1 HF2, 11.5.1 HF3, 11.5.1 HF4, 11.5.1 HF5, 11.5.1 HF6, 11.5.1 HF7, 11.5.1 HF8, 11.5.1 HF9, 11.5.2, 11.5.2 HF1, 11.5.3, 11.5.3 HF1, 11.5.3 HF2, 11.5.4, 11.5.4 HF1, 11.5.4 HF2, 11.5.4 HF3, 11.5.4 HF4, 11.5.5, 11.5.6, 11.5.7, 11.5.8, 11.5.9, 12.0.0, 12.0.0 HF1, 12.0.0 HF2, 12.0.0 HF3, 12.0.0 HF4

Fixed In:
12.1.0

Opened: Sep 30, 2014
Severity: 4-Minor

Symptoms

For certain blade failures scenarios the HA score on the remaining blades does not update, and thus a failover does not occur, for at least ten seconds. This is because the remaining blades wait for a ten second timeout period before marking the powered-off blade as down.

Impact

The expected failover will not occur for at least ten seconds

Conditions

A blade is powered off via the serial console or the 'bladectl' command, or the blade is physically removed from the chassis, and the chassis is configured in an HA pair where the loss of a blade should result in a failover.

Workaround

There is no workaround for this issue.

Fix Information

The issue has been addressed with two separate changes. The first results in a cluster member being marked down immediately when its blade is physically removed from the chassis. The second is the addition of a DB variable ("Clusterd.PeerMemberTimeout") that allows configuring of the timeout value used to determine when an unresponsive blade has been marked down. This controls how long before an unresponsive cluster member is marked down by its peers. Its default value is ten seconds, and it can be set as low as one second. This can help lower the delay before a failover occurs in the event of other blade power down scenarios, such as when a blade is powered down via the serial console or the 'bladectl' command.

Behavior Change