Bug ID 727467: Some iSeries appliances can experience traffic disruption when the high availability (HA) peer is upgraded from 12.1.3 and earlier to 13.1.0 or later.

Last Modified: Sep 13, 2023

Affected Product(s):
BIG-IP All (all modules)

Known Affected Versions:
12.1.3, 12.1.3.1, 12.1.3.2, 12.1.3.3, 12.1.3.4, 12.1.3.5, 12.1.3.6, 12.1.3.7, 12.1.4, 12.1.4.1, 12.1.5, 12.1.5.1, 12.1.5.2, 12.1.5.3, 12.1.6, 13.0.0, 13.0.0 HF1, 13.0.0 HF2, 13.0.0 HF3, 13.0.1, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 14.0.0, 14.0.0.1, 14.0.0.2, 14.0.0.3, 14.0.0.4

Fixed In:
14.1.0, 14.0.0.5, 13.1.1.2

Opened: Jul 10, 2018

Severity: 3-Major

Symptoms

-- CPU core 0 can be seen utilizing 100% CPU.
-- Other even cores may show a 40% increase in CPU usage.
-- Pool monitors are seen flapping in /var/log/ltm.
-- System posts the following messages:
 + In /var/log/ltm:
  - err tmm4[21025]: 01340004:3: high availability (HA) Connection detected dissimilar peer: local npgs 1, remote npgs 1, local npus 8, remote npus 8, local pg 0, remote pg 0, local pu 4, remote pu 0. Connection will be aborted.
 + In /var/log/tmm:
  - notice DAGLIB: Invalid table size 12
  - notice DAG: Failed to consume DAG data
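To check a unit for the log signatures listed above, something like the following sketch may help. It counts lines containing fixed fragments of the three quoted messages. The function names count_matches and count_symptoms are hypothetical, and the search strings are shortened fragments of the messages quoted in this entry, so adjust them if the wording on your system differs.

```shell
#!/bin/bash
# Sketch only: counts the log signatures quoted in the Symptoms section.
# Function names are hypothetical; default log paths are taken from this entry.

# Print the number of lines in a file that contain a fixed string.
# Prints 0 when nothing matches or the file cannot be read.
count_matches() {
    local pattern="$1" file="$2" n
    # -c counts matching lines, -F disables regex interpretation.
    n=$(grep -cF -- "$pattern" "$file" 2>/dev/null || true)
    echo "${n:-0}"
}

# Report one counter per symptom message.
count_symptoms() {
    local ltm_log="${1:-/var/log/ltm}"
    local tmm_log="${2:-/var/log/tmm}"

    echo "dissimilar_peer=$(count_matches 'detected dissimilar peer' "$ltm_log")"
    echo "invalid_table_size=$(count_matches 'DAGLIB: Invalid table size' "$tmm_log")"
    echo "failed_consume_dag=$(count_matches 'DAG: Failed to consume DAG data' "$tmm_log")"
}

# Example (run on the unit being checked):
#   count_symptoms /var/log/ltm /var/log/tmm
```

Non-zero counters only indicate that the messages are present in the logs; they do not by themselves confirm this bug.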

Impact

- High CPU usage.
- Traffic disruption.

Conditions

-- Active unit on a pre-12.1.3.1 release.
-- Standby peer upgraded to a 13.1.0 or later release.
-- Device is an iSeries device (i5600 or later).

Important: This issue may also affect iSeries high availability (HA) peers on the same software version if the devices do not share the same model number.

Note: Although this also occurs when upgrading to 12.1.3.8 and 13.0.x, the issue is not as severe.

Workaround

Minimize impact on affected active devices by keeping the upgraded post-13.1.0 unit offline as long as possible before going directly to Active.

For example, on a 12.1.3 unit to be upgraded (pre-upgrade):
-- Run the following command: tmsh run sys failover offline persist
-- Run the following command: tmsh save sys config
-- Upgrade to 13.1.0.8.
-- Unit comes back up on 13.1.0.8 as 'Forced Offline' and does not communicate with the active unit running 12.1.3 at all.
-- Set up a high availability (HA) group and make sure the 12.1.3 Active unit's high availability (HA) score is lower than that of the 13.1.0.8 unit.
-- To cause the 13.1.0.8 unit to go directly to Active and take over traffic, run the following command on the unit running 13.1.0.8: tmsh run sys failover online

At this point, the 12.1.3 unit starts to show symptoms of this issue. However, because it is no longer processing traffic, there is no cause for concern.
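The command portion of the workaround can be sketched as a small script. This is an illustration under stated assumptions, not an F5-supplied procedure: the function names and the DRY_RUN switch are hypothetical, only the three tmsh commands come from this entry, and the upgrade itself and the HA group setup are left as manual steps between the two functions.

```shell
#!/bin/bash
# Sketch only: wraps the tmsh commands quoted in the Workaround section.
# DRY_RUN (hypothetical switch) defaults to 1, which prints commands instead
# of executing them. Set DRY_RUN=0 on the BIG-IP unit to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on the unit that is about to be upgraded, before the upgrade.
pre_upgrade() {
    run tmsh run sys failover offline persist
    run tmsh save sys config
}

# Run on the upgraded unit once the HA group is configured and the
# unit is ready to go directly to Active.
post_upgrade_takeover() {
    run tmsh run sys failover online
}

# Example:
#   pre_upgrade              # then upgrade and set up the HA group manually
#   post_upgrade_takeover
```

Keeping the default in dry-run mode makes it possible to review the exact commands before anything changes failover state.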

Fix Information

This release introduces a new bigdb variable, DAG.OverrideTableSize. To prevent the issue on an upgraded post-13.1.0 unit, set DAG.OverrideTableSize to 3.

To return the system to typical CPU usage, you must set the db variable and then restart tmm by running the following command: bigstart restart tmm

(Restarting tmm is required for 13.1.1.2 and newer 13.1.1.x releases.)

Note: Because the restart is occurring on the Standby unit, no traffic is disrupted while tmm restarts.
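A minimal sketch of applying the fix on the upgraded Standby unit follows. It assumes the db variable is set with the general tmsh form "tmsh modify sys db <name> value <value>", which this entry does not spell out, and it uses the variable name exactly as written above; the casing may need adjusting on your system. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/bin/bash
# Sketch only: set the db variable named in Fix Information, then restart tmm.
# DRY_RUN (hypothetical switch) defaults to 1 and only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Name and value as given in this entry. Assumption: tmsh accepts the name
# as written here; adjust the casing if tmsh rejects it.
DB_VAR="DAG.OverrideTableSize"
DB_VALUE="3"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_dag_fix() {
    # Assumed syntax: generic tmsh form for modifying a db variable.
    run tmsh modify sys db "$DB_VAR" value "$DB_VALUE"
    # Restart command quoted in this entry; run it on the Standby unit.
    run bigstart restart tmm
}

# Example:
#   apply_dag_fix
```

The restart comes after the db change because, per the text above, the new value only returns CPU usage to typical levels once tmm has been restarted.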

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips