Bug ID 1226465: Persistent alarm for "Fault detected in PSU controller health" due to PSU I2C fault in VELOS PSU controller runtime status

Last Modified: May 29, 2024

Affected Product(s):
F5OS Velos (all modules)

Known Affected Versions:
F5OS-C 1.5.0, F5OS-C 1.5.1

Fixed In:
F5OS-C 1.6.0

Opened: Jan 18, 2023

Severity: 2-Critical

Symptoms

The two PSU controllers in a VELOS 8-slot chassis are redundant, and both have access to a shared I2C bus connected to the four power supplies. The AOM on the active controller selects one of the two PSU controllers for PSU management and directs all PSU accesses through it. When that PSU controller indicates a runtime fault, the AOM fails over to the other PSU controller. The PSU controller runtime status fault is recorded in the event log and asserts an alarm. Because the faulted PSU controller is no longer used for PSU management, it has no opportunity to clear its own reported PSU I2C fault, so the associated alarm remains active indefinitely.

Impact

No system impact is expected because the two PSU controllers are redundant. The active alarm for a VELOS PSU controller runtime status fault may persist indefinitely because that PSU controller is no longer being used for PSU management.

Conditions

When a VELOS PSU controller indicates a runtime status fault, the AOM fails over to the other PSU controller. The PSU controller runtime status fault is recorded in the event log and asserts an alarm.

Workaround

The PSU controller reporting a persistent alarm for a runtime status fault can be reset to clear the alarm. Log in to a controller as the root user and execute either of these two commands at the host prompt.

For PSU controller 1:
docker exec -it platform-hal psf call POST:lop/object/reset-device destSlot=PsuCtrl1 device=Vpc

For PSU controller 2:
docker exec -it platform-hal psf call POST:lop/object/reset-device destSlot=PsuCtrl2 device=Vpc
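
Because the two reset commands differ only in the destSlot value, a small wrapper can reduce the chance of a typo. The following is a minimal sketch, not an F5-provided tool: the function name psu_ctrl_reset_cmd is hypothetical, and the function only prints the command from this workaround so that it can be reviewed before being run at the host prompt.

```shell
#!/bin/sh
# Hypothetical helper (not part of F5OS): prints the reset command from this
# workaround for PSU controller 1 or 2. It does not execute the command.
psu_ctrl_reset_cmd() {
    case "$1" in
        1|2)
            # Only the destSlot value differs between the two controllers.
            printf '%s\n' "docker exec -it platform-hal psf call POST:lop/object/reset-device destSlot=PsuCtrl$1 device=Vpc"
            ;;
        *)
            echo "usage: psu_ctrl_reset_cmd <1|2>" >&2
            return 1
            ;;
    esac
}
```

For example, psu_ctrl_reset_cmd 2 prints the command for PSU controller 2, and any argument other than 1 or 2 is rejected with a usage message.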

Fix Information

Fixes to this issue are available with VELOS PSU controller firmware v2.00.806.0.1 and later:
- Automatically expire a PSU controller's PSU I2C fault runtime status fault after 120 seconds, so that the persistent alarm is cleared
- Increase the PSU I2C fault threshold from 5 to 10 consecutive faults, to reduce the chance of an unnecessary occurrence
- Write a PEL (to both CC-LOPs) on assertion and deassertion of PSU I2C channel runtime status errors, to assist with trouble-shooting similar issues
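
The interaction between the consecutive-fault threshold and the 120-second expiry can be illustrated with a small model. This is only a sketch under stated assumptions and is not the firmware's implementation: the variable and function names are hypothetical, a successful I2C access is assumed to reset the consecutive count, the expiry timer is assumed to start when the fault is asserted, and the echo lines merely stand in for writing a PEL.

```shell
#!/bin/sh
# Illustrative model only; NOT the PSU controller firmware. All names are hypothetical.
FAULT_THRESHOLD=10     # consecutive I2C faults needed to assert the fault (5 before the fix)
FAULT_EXPIRY_SECS=120  # an asserted fault expires after this many seconds

consecutive=0   # current run of consecutive failed I2C accesses
fault_active=0  # 1 while the runtime status fault is asserted
fault_since=0   # time (in seconds) at which the fault was asserted

# record_i2c_access <ok|fail> <now_seconds>
record_i2c_access() {
    if [ "$1" = "ok" ]; then
        consecutive=0   # assumption: any success breaks the consecutive run
        return 0
    fi
    consecutive=$((consecutive + 1))
    if [ "$consecutive" -ge "$FAULT_THRESHOLD" ] && [ "$fault_active" -eq 0 ]; then
        fault_active=1
        fault_since=$2
        echo "PEL: PSU I2C fault asserted at t=$2"      # stand-in for a PEL write
    fi
}

# expire_fault <now_seconds>
expire_fault() {
    if [ "$fault_active" -eq 1 ] && [ $(($1 - fault_since)) -ge "$FAULT_EXPIRY_SECS" ]; then
        fault_active=0
        consecutive=0
        echo "PEL: PSU I2C fault deasserted at t=$1"    # stand-in for a PEL write
    fi
}
```

In this model, nine consecutive failures leave fault_active at 0, the tenth asserts the fault, and a call to expire_fault 120 seconds or more after assertion clears it even though the controller is no longer being accessed for PSU management.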

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips