Bug ID 507902: Failure and restart of mcpd in secondary blade when cluster is part of a trust domain.

Last Modified: Sep 13, 2023

Affected Product(s):
BIG-IP ASM, FPS (all modules)

Known Affected Versions:
11.5.1 HF1, 11.5.1 HF2, 11.5.1 HF3, 11.5.1 HF4, 11.5.1 HF5, 11.5.1 HF6, 11.5.1 HF7, 11.5.1 HF8, 11.5.1 HF9, 11.5.1 HF10, 11.5.1 HF11, 11.5.2 HF1, 11.5.3 HF1, 11.5.3 HF2, 11.5.4 HF1, 11.5.4 HF2, 11.5.4 HF3, 11.5.4 HF4, 12.1.0 HF1, 12.1.0 HF2, 12.1.1 HF1, 12.1.1 HF2, 12.1.2 HF1, 12.1.2 HF2

Fixed In:
12.0.0, 11.6.0 HF5

Opened: Feb 19, 2015

Severity: 3-Major

Related Article: K16697

Symptoms

The mcpd daemon on a secondary blade reports a failure and is restarted, taking the blade offline so that it does not handle traffic for a few minutes.

Impact

During the mcpd restart, the blade is offline and not handling traffic for a few minutes. There is no impact to traffic handled by the primary blade.

Conditions

A multi-blade device (cluster) is part of a trust domain, and one of the other devices in the trust domain is rebooted. The mcpd failure may occur anywhere from a few minutes to 24 hours after that reboot. The failure should happen only once, and should not recur until the next time a device in the trust domain is rebooted.

Workaround

The mcpd failure is caused by an inconsistency between the primary and secondary blades that arises after a different device in the trust domain is rebooted. The workaround is to check for the inconsistency, and fix it if present, after every reboot of any device in the trust domain. This is not needed when only one of the blades is rebooted.

After any reboot of a device in the trust domain, perform the following actions:

1. Check for inconsistency

On each blade of each cluster in the trust domain, run the following command:

tmsh -c 'list security datasync device-stats /Common/datasync-device-*/*cs-asm-dosl7* table'

You should see an object for each of the devices (clusters) in the trust domain. For example, suppose two multi-blade devices, vcmp1 and vcmp2, are joined in the trust domain, and each has 2 blades. In the good state, the output is:

[root@vcmp1:/S2-green-S:Active:In Sync (Sync Only)] config # tmsh -c 'list security datasync device-stats /Common/datasync-device-*/*cs-asm-dosl7* table'
security datasync device-stats datasync-device-vcmp1.qa.com/datasync-device-vcmp1.qa.com-cs-asm-dosl7-stats { table cs-asm-dosl7 }
security datasync device-stats datasync-device-vcmp2.qa.com/datasync-device-vcmp2.qa.com-cs-asm-dosl7-stats { table cs-asm-dosl7 }

This shows both vcmp1 and vcmp2, so the state is good and no further action is needed on this device.

In the faulty state, however, the secondary blade of vcmp2 shows:

[root@vcmp2:/S2-green-S:Active:In Sync (Sync Only)] config # tmsh -c 'list security datasync device-stats /Common/datasync-device-*/*cs-asm-dosl7* table'
security datasync device-stats datasync-device-vcmp1.qa.com/datasync-device-vcmp1.qa.com-cs-asm-dosl7-stats { table cs-asm-dosl7 }

The vcmp2 device is missing. This means that the state is inconsistent, and an mcpd failure may happen sometime within 24 hours.

2. Fix the inconsistency if needed

To fix the state, force a sync of the datasync device groups from the device that did not show the fault. If vcmp2 had the inconsistency, run the following command on vcmp1:

tmsh modify cm device-group datasync-global-dg devices modify { vcmp1.qa.com { set-sync-leader } }

Wait a few seconds, then run:

tmsh modify cm device-group datasync-device-vcmp1.qa.com-dg devices modify { vcmp1.qa.com { set-sync-leader } }
tmsh modify cm device-group datasync-device-vcmp2.qa.com-dg devices modify { vcmp1.qa.com { set-sync-leader } }

Wait a few more seconds, then check the state again using the command from step 1:

tmsh -c 'list security datasync device-stats /Common/datasync-device-*/*cs-asm-dosl7* table'

All blades should now be in the good state.

Repeat steps 1 and 2 on each blade of each cluster in the trust domain whenever a device is rebooted.
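
The check and fix described in this workaround can be partly scripted. The following is a minimal sketch in Python, not an F5-provided tool. It assumes you have already captured the output of the tmsh list command as text and that you know the expected device names in the trust domain. The function names (devices_in_output, missing_devices, sync_commands) are hypothetical helpers introduced here for illustration. The sketch only parses text and builds the command strings shown above; it does not run tmsh or insert the waits between commands.

```python
import re

# Matches stats object names of the form shown in the example output, e.g.
#   datasync-device-vcmp1.qa.com/datasync-device-vcmp1.qa.com-cs-asm-dosl7-stats
# The device-name group excludes '*' so that an echoed command line
# containing the wildcard pattern is not mistaken for an object.
STATS_RE = re.compile(
    r"security datasync device-stats\s+"
    r"(?:/Common/)?datasync-device-(?P<dev>[^/\s*]+)/\S*cs-asm-dosl7\S*"
)


def devices_in_output(tmsh_output):
    """Return the set of device names that have a cs-asm-dosl7 stats object."""
    return {m.group("dev") for m in STATS_RE.finditer(tmsh_output)}


def missing_devices(tmsh_output, expected):
    """Return the expected devices that are absent from the output, sorted."""
    return sorted(set(expected) - devices_in_output(tmsh_output))


def sync_commands(leader, devices):
    """Build the set-sync-leader commands from step 2, to be run on `leader`.

    The global device group comes first, followed by one per-device group.
    The workaround says to wait a few seconds after the first command.
    """
    groups = ["datasync-global-dg"]
    groups += [f"datasync-device-{d}-dg" for d in devices]
    return [
        f"tmsh modify cm device-group {g} devices modify "
        f"{{ {leader} {{ set-sync-leader }} }}"
        for g in groups
    ]


if __name__ == "__main__":
    expected = ["vcmp1.qa.com", "vcmp2.qa.com"]
    # Output as seen on the faulty secondary blade in the example.
    faulty = (
        "security datasync device-stats datasync-device-vcmp1.qa.com/"
        "datasync-device-vcmp1.qa.com-cs-asm-dosl7-stats { table cs-asm-dosl7 }"
    )
    missing = missing_devices(faulty, expected)
    if missing:
        print("Inconsistent state, missing:", ", ".join(missing))
        for cmd in sync_commands("vcmp1.qa.com", expected):
            print(cmd)
```

In this sketch an empty result from missing_devices corresponds to the good state in step 1. Any other result indicates that the sync commands from step 2 should be run on a device that did not show the fault.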

Fix Information

The mcpd daemon on a secondary blade in a cluster no longer fails and restarts when the cluster is part of a trust domain and one of the other devices in the trust domain is rebooted.

Behavior Change
