Bug ID 1026237: Partition high availability (HA) framework can fail to report 'failed' state if the node crashes immediately when becoming 'active'

Last Modified: Sep 22, 2021

Affected Product:
F5OS Velos (all modules)

Known Affected Versions:
1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4

Opened: Jun 16, 2021
Severity: 3-Major

Symptoms

The system controller "show partitions partition" status output shows the partition instances on both controller 1 and controller 2 as "running-active". The partition "show system redundancy" output also reports an incorrect active/active status.

Impact

The system controller and partition CLI/GUI will display both partition instances as "active", even though only one of them is.

Conditions

Normally, the partition instance runs on both system controllers in an Active/Standby configuration. If the Active instance yields or fails, the Standby instance becomes Active. If a partition instance crashes immediately after reporting Active status, it does not update the status reported to the system controller software, so it continues to appear active. The other partition instance detects the crash and reclaims the active role if possible. The observed instance of this problem occurred when the partition volume ran out of space due to a large number of qkviews and tcpdumps.
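
The sequence above can be illustrated with a small Python model. This is only a sketch of the reporting gap described here, not F5OS code: the class and function names (PartitionInstance, SystemControllerView, become_active, peer_takeover) are hypothetical, and the model assumes the system controller simply keeps the last role each instance reported.

```python
from dataclasses import dataclass


@dataclass
class PartitionInstance:
    """Hypothetical model of one partition instance on a system controller."""
    controller: int
    role: str = "standby"
    alive: bool = True


class SystemControllerView:
    """Hypothetical model: stores the last role each instance reported."""

    def __init__(self):
        self.reported = {}

    def report(self, instance):
        self.reported[instance.controller] = instance.role


def become_active(instance, view, crash_after_report=False):
    """Take the active role and report it; optionally crash right afterwards."""
    instance.role = "active"
    view.report(instance)
    if crash_after_report:
        # The instance dies before it can report a 'failed' state,
        # so the last reported role ('active') is left in place.
        instance.alive = False


def peer_takeover(peer, failed, view):
    """The surviving peer detects the crash and reclaims the active role."""
    if not failed.alive and peer.alive:
        peer.role = "active"
        view.report(peer)


view = SystemControllerView()
one = PartitionInstance(controller=1)
two = PartitionInstance(controller=2)
view.report(one)
view.report(two)

become_active(one, view, crash_after_report=True)
peer_takeover(two, one, view)

# What the status output would show versus what is really running.
actually_active = [i.controller for i in (one, two) if i.alive and i.role == "active"]
print("reported:", view.reported)
print("actually active on controller(s):", actually_active)
```

In this model both controllers end up reported as active while only the instance on controller 2 is running, which matches the symptom described above.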

Workaround

To prevent the crash, limit the number of partition qkviews stored on the partition. Once taken, the qkview should be promptly copied to another system and deleted before taking another.

Fix Information

None

Behavior Change