Bug ID 642923: MCP misses its heartbeat (and is killed by sod) if there are a large number of file objects on the system

Last Modified: Jul 13, 2024

Affected Product(s):
BIG-IP All (all modules)

Known Affected Versions:
11.0.0, 11.1.0, 11.2.0, 11.2.1, 11.3.0, 11.4.0, 11.4.1, 11.5.0, 11.5.1, 11.5.1 HF1, 11.5.1 HF2, 11.5.1 HF3, 11.5.1 HF4, 11.5.1 HF5, 11.5.1 HF6, 11.5.1 HF7, 11.5.1 HF8, 11.5.1 HF9, 11.5.1 HF10, 11.5.1 HF11, 11.5.2, 11.5.2 HF1, 11.5.3, 11.5.3 HF1, 11.5.3 HF2, 11.5.4, 11.5.4 HF1, 11.5.4 HF2, 11.5.4 HF3, 11.5.4 HF4, 11.5.5, 11.5.6, 11.5.7, 11.5.8, 11.5.9, 11.5.10, 11.6.0, 11.6.0 HF1, 11.6.0 HF2, 11.6.0 HF3, 11.6.0 HF4, 11.6.0 HF5, 11.6.0 HF6, 11.6.0 HF7, 11.6.0 HF8, 11.6.1, 11.6.1 HF1, 11.6.1 HF2, 11.6.2, 11.6.2 HF1, 11.6.3, 11.6.3.1, 11.6.3.2, 11.6.3.3, 11.6.3.4, 11.6.4, 11.6.5, 11.6.5.1, 11.6.5.2, 11.6.5.3, 12.0.0, 12.0.0 HF1, 12.0.0 HF2, 12.0.0 HF3, 12.0.0 HF4, 12.1.0, 12.1.0 HF1, 12.1.0 HF2, 12.1.1, 12.1.1 HF1, 12.1.1 HF2, 12.1.2, 12.1.2 HF1, 12.1.2 HF2, 12.1.3, 12.1.3.1, 12.1.3.2, 12.1.3.3, 12.1.3.4, 12.1.3.5, 12.1.3.6, 12.1.3.7, 13.0.0, 13.0.0 HF1, 13.0.0 HF2, 13.0.0 HF3, 13.0.1, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1

Fixed In:
14.0.0, 13.1.1.2, 12.1.4

Opened: Feb 02, 2017

Severity: 3-Major

Related Article: K01951295

Symptoms

MCP may time out and be killed by the sod watchdog, causing mcpd to restart.

Impact

mcpd restarts, which takes the system offline while services restart.

Conditions

Certain operations, under certain conditions and on certain platforms, may take longer to complete than the mcpd heartbeat timeout (300 seconds). When that happens, the system considers mcpd unresponsive and kills it before it has finished its task, resulting in this issue.

This issue may manifest in a number of ways. For example, the default mcpd heartbeat timeout might be reached when loading a configuration file with a large number* of file objects configured, such as SSL certificates and keys, data-groups, APM customizations, EPSEC file updates, external monitors, or other data present in the filestore (/config/filestore).

*Note: 3,000 is a rough estimate of the number of filestore objects that might cause this issue to occur. The actual number depends on the operations mcpd is performing, the performance of the hardware, the speed of disk access, and other potential factors.
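As a rough way to see whether a system is near the estimate quoted above, the number of files under /config/filestore can be counted from the shell. The following is a minimal sketch, not an F5-provided tool: the function names and the THRESHOLD variable are hypothetical, and it assumes that each regular file under the filestore corresponds to one filestore object, which may not hold exactly.

```shell
#!/bin/sh
# Hypothetical helper: count regular files under a filestore directory.
# Defaults to /config/filestore, the path named in this article.
# Assumption: one regular file roughly equals one filestore object.
count_filestore_objects() {
    dir="${1:-/config/filestore}"
    find "$dir" -type f 2>/dev/null | wc -l | tr -d ' '
}

# The ~3,000 figure is the article's rough estimate, not a hard limit.
THRESHOLD=3000

# Hypothetical helper: print a one-line verdict for the given directory.
check_filestore() {
    count=$(count_filestore_objects "$1")
    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "WARN: $count files (above ~$THRESHOLD)"
    else
        echo "OK: $count files"
    fi
}
```

Calling check_filestore with no argument inspects /config/filestore. A count well above the threshold only suggests that the system may be exposed to this issue, since hardware and disk speed also play a part.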

Workaround

To prevent the issue from occurring, you can temporarily disable the heartbeat timeout using the following tmsh command:

modify sys daemon-ha mcpd heartbeat disable

Important: Disabling the heartbeat timer means that, should the mcpd process legitimately become unresponsive, the system will not automatically restart mcpd to recover.

Note: If you have a large number of objects (more than 3,000) in the filestore and are able to reduce this number by deleting their related configuration objects, you may be able to work around the issue.

To determine the specific cause of the issue, you can open a support case with F5 to inspect the resulting mcpd core file.
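Because the workaround is meant to be temporary, it can help to wrap the long-running operation so that the heartbeat is re-enabled afterwards. The sketch below is illustrative only: the function name and the TMSH variable are hypothetical, and the `heartbeat enable` form is assumed to be the counterpart of the `heartbeat disable` command given above. Verify both against the tmsh reference for your version before relying on them.

```shell
#!/bin/sh
# TMSH can be overridden (for example TMSH=echo) to preview the commands
# without changing the system. Hypothetical convenience variable.
TMSH="${TMSH:-tmsh}"

# Hypothetical wrapper: run a command with the mcpd heartbeat disabled,
# then re-enable the heartbeat whether or not the command succeeded.
run_with_heartbeat_disabled() {
    $TMSH modify sys daemon-ha mcpd heartbeat disable || return 1
    "$@"
    status=$?
    # Assumption: "enable" reverses the "disable" shown in the workaround.
    $TMSH modify sys daemon-ha mcpd heartbeat enable
    return "$status"
}
```

For example, `run_with_heartbeat_disabled tmsh load sys config` would load the configuration while the heartbeat is off. While the heartbeat is disabled, the caution above still applies: an unresponsive mcpd is not restarted automatically.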

Fix Information

A possible case where mcpd goes too long without updating the heartbeat has been fixed by replacing one algorithm with a more efficient one.

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips