Bug ID 1025089: Pool members marked DOWN by database monitor under heavy load and/or unstable connections

Last Modified: Jan 20, 2023

Affected Product:
BIG-IP GTM, LTM (all modules)

Known Affected Versions:
12.1.0, 12.1.0 HF1, 12.1.0 HF2, 12.1.1, 12.1.1 HF1, 12.1.1 HF2, 12.1.2, 12.1.2 HF1, 12.1.2 HF2, 12.1.3, 12.1.3.1, 12.1.3.2, 12.1.3.3, 12.1.3.4, 12.1.3.5, 12.1.3.6, 12.1.3.7, 12.1.4, 12.1.4.1, 12.1.5, 12.1.5.1, 12.1.5.2, 12.1.5.3, 12.1.6, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 13.1.1.2, 13.1.1.3, 13.1.1.4, 13.1.1.5, 13.1.3, 13.1.3.1, 13.1.3.2, 13.1.3.3, 13.1.3.4, 13.1.3.5, 13.1.3.6, 13.1.4, 13.1.4.1, 13.1.5, 13.1.5.1, 14.1.0, 14.1.0.1, 14.1.0.2, 14.1.0.3, 14.1.0.5, 14.1.0.6, 14.1.2, 14.1.2.1, 14.1.2.2, 14.1.2.3, 14.1.2.4, 14.1.2.5, 14.1.2.6, 14.1.2.7, 14.1.2.8, 14.1.3, 14.1.3.1, 14.1.4, 14.1.4.1, 14.1.4.2, 14.1.4.3, 14.1.4.4, 14.1.4.5, 14.1.4.6, 14.1.5, 14.1.5.1, 14.1.5.2, 14.1.5.3, 15.0.0, 15.0.1, 15.0.1.1, 15.0.1.2, 15.0.1.3, 15.0.1.4, 15.1.0, 15.1.0.1, 15.1.0.2, 15.1.0.3, 15.1.0.4, 15.1.0.5, 15.1.1, 15.1.2, 15.1.2.1, 15.1.3, 15.1.3.1, 15.1.4, 15.1.4.1, 15.1.5, 15.1.5.1, 15.1.6, 15.1.6.1, 15.1.7, 15.1.8, 15.1.8.1, 16.0.0, 16.0.0.1, 16.0.1, 16.0.1.1, 16.0.1.2, 16.1.0, 16.1.1, 16.1.2, 16.1.2.1, 16.1.2.2, 16.1.3, 16.1.3.1, 16.1.3.2, 16.1.3.3, 17.0.0, 17.0.0.1, 17.0.0.2

Opened: Jun 11, 2021
Severity: 3-Major

Symptoms

BIG-IP database monitors (mssql, mysql, oracle, postgresql) may exhibit one of the following symptoms:

- Under heavy, sustained load, the database monitoring subsystem may become unresponsive, causing pool members to be marked DOWN and eventually causing the database monitoring daemon (DBDaemon) to restart unexpectedly.
- If the network connection to a monitored database server is unstable (experiences intermittent interruptions, drops, or latency), pool members may be marked DOWN as the result of a momentary loss of connectivity. This is more likely to occur when a database monitor is used to monitor a GTM pool member instead of an LTM pool member, due to differences between how monitors are configured for GTM versus LTM.

Impact

- High CPU utilization is observed on control plane cores.
- The database monitoring daemon (DBDaemon) may restart unexpectedly, causing GTM or LTM pool members monitored by a database monitor to be marked DOWN temporarily.
- GTM or LTM pool members monitored by a database monitor may be marked DOWN temporarily if the network connection to the database server is dropped or times out.

Conditions

These symptoms may occur under the following conditions:

- The database monitoring subsystem may become unresponsive, and the database monitoring daemon (DBDaemon) may restart unexpectedly, when one or more of the following apply:
  - A large number of LTM or GTM pool members are monitored by database monitors.
  - Database monitors use short polling intervals ("interval" of 10 seconds or less).
  - GTM pool members are monitored by database monitors with a short "probe-timeout" value (10 seconds or less).
- GTM pool members may be marked DOWN after a single interrupted connection if they are monitored by a database monitor configured with a short "probe-timeout" value (10 seconds or less) and with "ignore-down-response" set to "disabled" (the default).
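The conditions above can be expressed as a simple configuration check. This is a minimal Python sketch, not an F5 tool: `DbMonitor` and `risk_flags` are hypothetical names, the fields are a hand-entered summary of a monitor's settings rather than output from any BIG-IP API, and the "large number of pool members" condition is deliberately not checked because the bug gives no threshold for it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold from the Conditions section: 10 seconds or less counts as "short".
SHORT_SECONDS = 10


@dataclass
class DbMonitor:
    """Hypothetical, hand-entered summary of one database monitor's settings."""
    name: str
    module: str                          # "ltm" or "gtm"
    interval: int                        # seconds between probes
    probe_timeout: Optional[int] = None  # GTM only
    ignore_down_response: bool = False   # GTM only; False mirrors the "disabled" default


def risk_flags(mon: DbMonitor) -> List[str]:
    """Return which of the bug's conditions this monitor matches (empty list if none)."""
    flags = []
    if mon.interval <= SHORT_SECONDS:
        flags.append("short interval")
    if mon.module == "gtm" and mon.probe_timeout is not None and mon.probe_timeout <= SHORT_SECONDS:
        flags.append("short probe-timeout")
        # A short probe-timeout combined with ignore-down-response disabled
        # lets one interrupted connection mark the GTM member DOWN.
        if not mon.ignore_down_response:
            flags.append("single interrupted connection can mark member DOWN")
    return flags


if __name__ == "__main__":
    print(risk_flags(DbMonitor("ltm_mssql", "ltm", interval=5)))
    # ['short interval']
    print(risk_flags(DbMonitor("gtm_mysql", "gtm", interval=30, probe_timeout=5)))
    # ['short probe-timeout', 'single interrupted connection can mark member DOWN']
```

An empty result only means none of the listed thresholds are crossed; a monitor covering many pool members can still be affected.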

Workaround

Perform one of the following actions:

- Configure the database (mssql, mysql, oracle, postgresql) monitor with a "count" value of "1". This prevents caching or reuse of network connections to the database server between probes, so there is no cached connection to time out or be dropped. However, the overhead of establishing a network connection to the database server is incurred on every probe, which results in generally higher (but more consistent) CPU usage by the database monitoring daemon (DBDaemon).
- Configure the database monitor's "interval" and "timeout" values (for an LTM monitor), or its "interval", "timeout", "probe-attempts", "probe-interval", and "probe-timeout" values (for a GTM monitor), so that multiple failed monitor probes are required before the monitored member is marked DOWN, with time values of 10 seconds or greater.

Note: A restart of bigd (and consequently of DBDaemon) might be necessary to clear any currently stale or stuck database connections.
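The trade-offs behind the two workarounds can be illustrated with a little arithmetic. The Python sketch below is illustrative only and rests on stated assumptions: `connections_opened` assumes that "count" is the number of probes served by one connection before it is closed, with "0" meaning the connection is kept open; `probes_before_down` approximates LTM behavior (a member is marked DOWN when no probe succeeds within "timeout") by the number of "interval" periods that fit in that window; and `suggest_ltm_timeout` follows the widely used (3 * interval) + 1 guideline. All function names are hypothetical, and GTM probe settings are not modeled.

```python
import math

MIN_SECONDS = 10  # minimum time value named in the Workaround section


def connections_opened(probes: int, count: int) -> int:
    """Illustrative model of the database monitor "count" setting.

    Assumption: count == 0 keeps one connection open for all probes,
    and count == N closes the connection after every N probes.
    """
    if probes <= 0:
        return 0
    if count == 0:
        return 1
    return math.ceil(probes / count)


def probes_before_down(interval: int, timeout: int) -> int:
    """Approximate number of consecutive LTM probes that can fail inside the timeout window."""
    return timeout // interval


def suggest_ltm_timeout(interval: int, failed_probes: int = 3) -> int:
    """Timeout tolerating `failed_probes` misses, per the (n * interval) + 1 pattern."""
    return failed_probes * interval + 1


def meets_ltm_workaround(interval: int, timeout: int, min_failed_probes: int = 2) -> bool:
    """True if both values are at least 10 s and more than one probe must fail before DOWN."""
    return (
        interval >= MIN_SECONDS
        and timeout >= MIN_SECONDS
        and probes_before_down(interval, timeout) >= min_failed_probes
    )


if __name__ == "__main__":
    print(connections_opened(360, 0))     # 1   (connection cached across all probes)
    print(connections_opened(360, 1))     # 360 (new connection on every probe)
    print(suggest_ltm_timeout(10))        # 31
    print(meets_ltm_workaround(10, 31))   # True
    print(meets_ltm_workaround(5, 16))    # False: interval is below 10 seconds
```

Under this model, "count" set to "1" turns one long-lived connection into one connection per probe, which is where the higher but steadier DBDaemon CPU cost comes from; the timeout check shows why a single dropped probe no longer marks the member DOWN.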

Fix Information

None

Behavior Change