Bug ID 1128765: Data Mover lock-up causes major application traffic impact and tenant deploy failures

Last Modified: Jan 10, 2023

Bug Tracker

Affected Product:
F5OS VELOS (all modules)

Known Affected Versions:
1.3.0, 1.3.1, 1.3.2, 1.5.0

Fixed In:
1.5.1

Opened: Jul 25, 2022
Severity: 2-Critical

Symptoms

Major impact to BIG-IP tenant virtual server traffic. Pool member health monitors fluctuate up and down, or remain down. LACP LAGs may go down. Depending on which Data Mover (DM) is impacted, a subset of the BIG-IP tenant TMMs will no longer transmit packets, and the LACP daemon will be unable to transmit its PDUs.

/var/F5/partition<n>/log/velos.log contains messages like these at the time the problem started:

blade-1(p1) dma-agent[10]: priority="Alert" version=1.0 msgid=0x4201000000000129 msg="Health monitor detected DM Tx Action Completion ring hung." ATSE=0 DM=2 OQS=3
blade-1(p1) dma-agent[10]: priority="Info" version=1.0 msgid=0x4201000000000135 msg="Health monitor DM register dump requested."
blade-1(p1) dma-agent[10]: priority="Info" version=1.0 msgid=0x4201000000000137 msg="Health monitor DM register dump complete." FILE="agent-dump-1666310215.txt"

In the BIG-IP tenant, the tmctl sep_stats table shows high counts (over 10,000) for tx_send_drops2 or tx_send_drops3. In the output below, all of the TMMs with SEP devices on DM 2 are impacted and unable to transmit packets:

# tmctl sep_stats --select=iface,dm,sep,atse_socket,tx_send_drops2,tx_send_drops3
iface   dm  sep  atse_socket  tx_send_drops2  tx_send_drops3
------  --  ---  -----------  --------------  --------------
1/0.1    2    0            0     1180470 <--       80068 <--
1/0.10   2    9            0           0           33046 <--
1/0.11   0   10            0           0               0
1/0.2    0    1            0           0               0
1/0.3    1    2            0           0               0
1/0.4    2    3            0           0           33714 <--
1/0.5    0    4            0           0               0
1/0.6    1    5            0           0               0
1/0.7    2    6            0           0           32980 <--
1/0.8    0    7            0           0               0
1/0.9    1    8            0           0               0

In the F5OS partition CLI, the following command shows a high count of tx-action-ring-full drops. In the output below, DM 2 on blade-1 is impacted:

default-1# show dma-states dma-state state dm-packets dm-packet * 2-3 tx-action-ring-full
                  TX ACTION
NAME     DM  QOS  RING FULL
--------------------------------
blade-1   0   2             0
          0   3             0
          1   2             0
          1   3             0
          2   2   65890377811 <--
          2   3  328664822594 <--
merged    0   2             0
          0   3             0
          1   2             0
          1   3             0
          2   2   65890377811 <--
          2   3  328664822594 <--

After encountering this, subsequent attempts to deploy a tenant may fail until the blade is recovered, because the locked-up Data Mover cannot free the memory it is holding for the impacted tenants.
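When many tenants are involved, the sep_stats check above can be scripted. The following is a minimal, illustrative Python sketch (find_impacted is a hypothetical helper, not an F5 tool); it assumes the column order matches the --select list shown above and treats counters over 10,000 as significant, per the symptom description:

```python
import re

# The article treats tx_send_drops2/tx_send_drops3 over 10,000 as significant.
DROP_THRESHOLD = 10_000

def find_impacted(sep_stats_text):
    """Return (impacted interfaces, DMs they sit on) from pasted sep_stats output."""
    impacted, dms = [], set()
    for line in sep_stats_text.splitlines():
        # Data rows look like: "1/0.1  2  0  0  1180470  80068"
        m = re.match(r"\s*(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", line)
        if not m or "/" not in m.group(1):
            continue  # skip the command line, header, and separator rows
        iface, dm = m.group(1), int(m.group(2))
        drops2, drops3 = int(m.group(5)), int(m.group(6))
        if drops2 > DROP_THRESHOLD or drops3 > DROP_THRESHOLD:
            impacted.append(iface)
            dms.add(dm)
    return impacted, dms
```

If every impacted interface maps to the same DM number, that DM matches the DM=<n> value reported in the dma-agent alert log line.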

Impact

Significant or total loss of application traffic for BIG-IP tenant instances running on the affected blade. This impact could also affect tenant instances on other blades if the LACP LAGs are marked down. Subsequent attempts to launch a new tenant or to stop and then start an existing one may fail.

Conditions

Although the exact conditions are unknown, the problem is more likely to occur when standard virtual servers are configured to mirror traffic to the peer BIG-IP. While L7 connection mirroring increases the risk, it is not a necessary condition.

Workaround

To recover a device, determine which blade is affected by looking at the start of the following dma-agent log message in /var/F5/partition<n>/log/velos.log:

blade-1(p1) dma-agent[10]: priority="Alert" version=1.0 msgid=0x4201000000000129 msg="Health monitor detected DM Tx Action Completion ring hung." ATSE=0 DM=2 OQS=3
^^^^^^^

Then, reboot the blade. This will shut down all tenant instances on the blade. Once the blade boots up, the tenants should run and pass traffic normally.

If the blade cannot be rebooted immediately, it may be possible to mitigate the problem for a multi-slot tenant by disabling the impacted slot to steer traffic to the remaining slots that are still healthy:

# An example of disabling BIG-IP tenant slot 1
tmsh modify sys cluster default members { 1 { disabled } }

Reducing the use of connection mirroring, especially for standard virtual servers, should reduce the likelihood of encountering this issue.
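Identifying the affected blade and DM from velos.log can also be automated. This is a hedged sketch (hung_dms is a hypothetical helper), assuming the dma-agent alert line has exactly the format quoted above, keyed on msgid 0x4201000000000129:

```python
import re

# Match the "DM Tx Action Completion ring hung" alert and capture
# the blade name (line prefix) and the DM number from the trailer.
ALERT_RE = re.compile(
    r'^(?P<blade>blade-\d+)\(p\d+\) dma-agent\[\d+\]:'
    r'.*msgid=0x4201000000000129.*DM=(?P<dm>\d+)'
)

def hung_dms(log_text):
    """Map each blade to the set of DMs reported hung in pasted velos.log text."""
    result = {}
    for line in log_text.splitlines():
        m = ALERT_RE.match(line)
        if m:
            result.setdefault(m.group("blade"), set()).add(int(m.group("dm")))
    return result
```

A non-empty result names the blade to reboot (or, for a multi-slot tenant, the slot to disable with tmsh as shown above).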

Fix Information

This issue is fixed in F5OS 1.5.1.

Behavior Change