Bug ID 841469: Application traffic may fail after an internal interface failure on a VIPRION system.

Last Modified: Oct 06, 2020

Bug Tracker

Affected Product:  See more info
BIG-IP All(all modules)

Known Affected Versions:
11.5.0, 11.5.1, 11.5.1 HF1, 11.5.1 HF10, 11.5.1 HF11, 11.5.1 HF2, 11.5.1 HF3, 11.5.1 HF4, 11.5.1 HF5, 11.5.1 HF6, 11.5.1 HF7, 11.5.1 HF8, 11.5.1 HF9, 11.5.10, 11.5.2, 11.5.2 HF1, 11.5.3, 11.5.3 HF1, 11.5.3 HF2, 11.5.4, 11.5.4 HF1, 11.5.4 HF2, 11.5.4 HF3, 11.5.4 HF4, 11.5.5, 11.5.6, 11.5.7, 11.5.8, 11.5.9, 11.6.0, 11.6.0 HF1, 11.6.0 HF2, 11.6.0 HF3, 11.6.0 HF4, 11.6.0 HF5, 11.6.0 HF6, 11.6.0 HF7, 11.6.0 HF8, 11.6.1, 11.6.1 HF1, 11.6.1 HF2, 11.6.2, 11.6.2 HF1, 11.6.3, 11.6.3.1, 11.6.3.2, 11.6.3.3, 11.6.3.4, 11.6.4, 11.6.5, 11.6.5.1, 11.6.5.2, 12.0.0, 12.0.0 HF1, 12.0.0 HF2, 12.0.0 HF3, 12.0.0 HF4, 12.1.0, 12.1.0 HF1, 12.1.0 HF2, 12.1.1, 12.1.1 HF1, 12.1.1 HF2, 12.1.2, 12.1.2 HF1, 12.1.2 HF2, 12.1.3, 12.1.3.1, 12.1.3.2, 12.1.3.3, 12.1.3.4, 12.1.3.5, 12.1.3.6, 12.1.3.7, 12.1.4, 12.1.4.1, 12.1.5, 12.1.5.1, 12.1.5.2, 13.0.0, 13.0.0 HF1, 13.0.0 HF2, 13.0.0 HF3, 13.0.1, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 13.1.1.2, 13.1.1.3, 13.1.1.4, 13.1.1.5, 13.1.3, 13.1.3.1, 13.1.3.2, 13.1.3.3, 14.0.0, 14.0.0.1, 14.0.0.2, 14.0.0.3, 14.0.0.4, 14.0.0.5, 14.0.1, 14.0.1.1, 14.1.0, 14.1.0.1, 14.1.0.2, 14.1.0.3, 14.1.0.5, 14.1.0.6, 14.1.2, 14.1.2.1, 14.1.2.2, 14.1.2.3, 14.1.2.4, 14.1.2.5, 14.1.2.6, 14.1.2.7, 14.1.2.8, 15.0.0, 15.0.1, 15.0.1.1, 15.0.1.2, 15.0.1.3, 15.0.1.4, 15.1.0, 15.1.0.1, 15.1.0.2, 15.1.0.3, 15.1.0.4, 15.1.0.5

Fixed In:
16.0.0, 13.1.3.4

Opened: Oct 21, 2019
Severity: 2-Critical

Symptoms

Blades in a VIPRION system connect with one another over a data backplane and a management backplane. For more information on the manner in which blades interconnect over the data backplane, please refer to K13306: Overview of the manner in which the VIPRION chassis and blades interconnect :: https://support.f5.com/csp/article/K13306. Should an internal interface fail and thus block communication over the data backplane between two distinct blades, an unusual situation arises where different blades compute different CMP states. For example, if on a 4-slot chassis, blades 2 and 3 become disconnected with one another, the following is TMM's computation of which slots are on-line: slot1: slots 1, 2, 3, and 4 on-line (cmp state 0xf / 15) slot2: slots 1, 2, and 4 on-line (cmp state 0xb / 11) slot3: slots 1, 3, and 4 on-line (cmp state 0xd / 13) slot4: slots 1, 2, 3, and 4 on-line (cmp state 0xf / 15) As different slots are effectively operating under different assumptions of the state of the cluster, application traffic does not flow as expected. Some connections time out or are reset. You can run the following command to inspect the CMP state of each slot: clsh 'tmctl -d blade -s cmp_state tmm/cmp' All slots should report the same state, for instance: # clsh 'tmctl -d blade -s cmp_state tmm/cmp' === slot 2 addr 127.3.0.2 color green === cmp_state --------- 15 === slot 3 addr 127.3.0.3 color green === cmp_state --------- 15 === slot 4 addr 127.3.0.4 color green === cmp_state --------- 15 === slot 1 addr 127.3.0.1 color green === cmp_state --------- 15 When this issue occurs, logs similar to the following example can be expected in the /var/log/ltm file: -- info bcm56xxd[4276]: 012c0015:6: Link: 2/5.3 is DOWN -- info bcm56xxd[4296]: 012c0015:6: Link: 3/5.1 is DOWN -- info bcm56xxd[4296]: 012c0012:6: Trunk default member mod 13 port 0 slot 2; CMP state changed from 0xf to 0xd -- info bcm56xxd[4339]: 012c0012:6: Trunk default member mod 13 port 0 slot 2; CMP state changed from 0xf to 0xd -- info bcm56xxd[4214]: 012c0012:6: Trunk default member mod 13 port 0 slot 2; CMP state changed from 0xf to 0xd And a CMP transition will be visible in the /var/log/tmm file similar to the following example: -- notice CDP: PG 2 timed out -- notice CDP: New pending state 0f -> 0b -- notice Immediately transitioning dissaggregator to state 0xb -- notice cmp state: 0xb For more information on troubleshooting VIPRION backplane hardware issues, please refer to K14764: Troubleshooting possible hardware issues on the VIPRION backplane :: https://support.f5.com/csp/article/K14764.

Impact

Application traffic is impacted and fails sporadically due to a mismatch in CMP states between the blades. Failures are likely to manifest as timeouts or resets from the BIG-IP system.

Conditions

This issue arises after a very specific type of hardware failure. The condition is very unlikely to occur and is impossible to predict in advance.

Workaround

F5 recommends the following to minimize the impact of this potential issue: 1) For all highly available configurations (e.g., A/S, A/A, A/A/S, etc.). The BIG-IP system has functionality, in all software versions, to enact a fast failover when the conditions described occur. To ensure this functionality will trigger, the following configuration requirements must be met: a) The mirroring strategy must be set to 'between'. b) A mirroring channel to the next-active unit must be up. c) The min-up-members option must be set to the number of blades in the chassis (e.g., 4 if there are 4 blades in the chassis). Note: It is not required to actually configure connection mirroring on any virtual server; simply choosing the aforementioned strategy and ensuring a channel is up to the next-active unit will suffice. However, note that some configurations will benefit by also configuring connection mirroring on some virtual servers, as that can greatly reduce the number of affected connections during a failover. 2) For 'regular' standalone units. If a VIPRION system is truly standalone (no kind of redundancy whatsoever), there is no applicable failsafe action, as you will want to keep that chassis online even if some traffic is impaired. Ensure suitable monitoring of the system is in place (e.g., remote syslog servers, SNMP traps, etc.), so that a BIG-IP Administrator can react quickly in the unlikely event this issue does occur. 3) For a standalone chassis which belongs to a pool on an upstream load-balancer. If the virtual servers of a standalone VIPRION system are pool members on an upstream load-balancer, it makes sense for the virtual servers to report unavailable (e.g., by resetting all new connection attempts) so that the upstream load-balancer can select different pool members. An Engineering Hotfix can be provided which introduces an enhancement for this particular use-case. A new DB key is made available under the Engineering Hotfix: tmm.cdp.requirematchingstates, which takes values 'enable' and 'disable'. The default is 'disable', which makes the VIPRION system behave as in versions without the enhancement. When set to 'enable', the VIPRION system attempts to detect this failure and, if it does, resets all new connections. This should trigger some monitor failures on the upstream load-balancer and allow it to select different pool members. Please note you should only request the Engineering Hotfix and enable this DB key when this specific use-case applies: a standalone VIPRION system which belongs to a pool on an upstream load-balancer. When the new feature is enabled, the following log messages in the /var/log/ltm file indicate when this begins and stops triggering: -- crit tmm[13733]: 01010366:2: CMP state discrepancy between blades detected, forcing maintenance mode. Unable to relinquish maintenance mode until event clears or feature (tmm.cdp.requirematchingstates) is disabled. -- crit tmm[13262]: 01010367:2: CMP state discrepancy between blades cleared or feature (tmm.cdp.requirematchingstates) disabled, relinquishing maintenance mode.

Fix Information

The system now includes the enhancement for the 'standalone chassis which belongs to a pool' use-case, as discussed under the Workaround section.

Behavior Change