Bug ID 517053: bigd detection and logging of load and overload

Last Modified: Sep 13, 2023

Affected Product(s):
BIG-IP LTM (all modules)

Known Affected Versions:
11.6.0, 12.1.0 HF1, 12.1.0 HF2, 12.1.1 HF1, 12.1.1 HF2, 12.1.2 HF1, 12.1.2 HF2

Fixed In:
12.0.0, 11.6.1

Opened: Apr 08, 2015

Severity: 3-Major

Symptoms

When BIG-IP is configured with a very large number of monitor instances (multiple thousands) probing at relatively short intervals, BIG-IP may not be able to keep up with its servicing load. One indication is pool members flapping (being marked down and then up) even though they had no actual connectivity problems.

Impact

When overloaded, bigd is unable to probe consistently, which may result in erratic or unpredictable pool member up/down behavior.

Conditions

A heavy monitor instance probe rate (monitor instance probes per second).

Workaround

The main ways to mitigate overload issues are to reduce the number of monitor instances, to increase the probe interval so that probing occurs less often, and/or to switch monitored pool members/nodes to simpler, lower-overhead monitors (e.g. ICMP instead of HTTP, or HTTP instead of HTTPS).
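As a rough illustration of how the first two mitigations affect the load, the aggregate probe rate can be estimated as the sum of instance count divided by probe interval for each monitor. The following is a minimal sketch, not part of BIG-IP: the function name probe_rate and the instance counts and intervals are hypothetical, and the estimate ignores the per-probe cost that differs between monitor types (ICMP, HTTP, HTTPS).

```python
def probe_rate(monitors):
    """Estimate aggregate probes per second.

    monitors: iterable of (instance_count, interval_seconds) pairs.
    Hypothetical helper for illustration only.
    """
    return sum(count / interval for count, interval in monitors)


# Illustrative numbers, not taken from any real configuration.
before = probe_rate([(4000, 5)])    # 4000 instances probed every 5 s
after = probe_rate([(4000, 10)])    # same instances, interval doubled
print(before, after)                # 800.0 400.0
```

Under this estimate, doubling the interval or halving the number of instances each halve the probe rate.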

Fix Information

This particular fix does not change the problem or mitigation steps. Rather, it helps detect when overloading has occurred. When it has been determined that overloading has occurred, a message will be logged to /var/log/ltm to indicate this.

By default, the overload message will be triggered if the main 1/10 second (100 ms) loop takes, on average, more than 150 ms to service. This overload threshold value can be adjusted with the new Bigd.Overload.Latency sys db variable. The variable indicates the latency, in ms, at which servicing the 100 ms main loop is considered overload.

In addition, main loop latency logging has been added to /var/log/bigdlog. The latency information will be logged every 15 seconds. The main loop latency information will be logged whenever Bigd.Debug is enabled, or if the new sys db variable Bigd.Debug.TimingStats is enabled. The new Bigd.Debug.TimingStats variable allows the main loop latency stats to be emitted even if other debug information, which can be quite verbose, is suppressed.

The main loop latency information has the following fields:

insts, avg-5m mean-5m stddev5, avg-1m mean-1m stddev1

insts: # of active monitor instances being monitored
avg-5m: weighted decaying average loop latency over 5 minutes
mean-5m: mean average loop latency over 5 minutes
stddev5: standard deviation of loop latency over 5 minutes
avg-1m: weighted decaying average loop latency over 1 minute
mean-1m: mean average loop latency over 1 minute
stddev1: standard deviation of loop latency over 1 minute

Once again, these average/mean values are measuring the 100 ms service loop, which under normal circumstances should always complete in close to 100 ms. When the value rises above 100 ms, that means we are not able to service all our monitor instances in a timely fashion.
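To make the logged statistics more concrete, the sketch below shows one way a mean, a standard deviation, and a weighted decaying average of loop latency could be tracked over a time window and compared against the 150 ms default threshold. This is an assumption-laden illustration, not bigd's actual implementation: the class LoopLatencyStats, the exponential smoothing formula, the choice of smoothing factor, and the use of the windowed mean for the overload check are all hypothetical.

```python
import math
from collections import deque

LOOP_PERIOD_MS = 100.0        # nominal main loop period (from the text above)
DEFAULT_OVERLOAD_MS = 150.0   # default overload threshold (from the text above)


class LoopLatencyStats:
    """Hypothetical tracker of main loop latency over a sliding window."""

    def __init__(self, window_seconds, alpha=None):
        # Number of nominal 100 ms loop iterations that fit in the window.
        self.capacity = int(window_seconds * 1000 / LOOP_PERIOD_MS)
        self.samples = deque(maxlen=self.capacity)
        # Smoothing factor for the decaying average; this default is an assumption.
        self.alpha = alpha if alpha is not None else 2.0 / (self.capacity + 1)
        self.decaying_avg = None

    def record(self, latency_ms):
        """Add one loop latency sample, in milliseconds."""
        self.samples.append(latency_ms)
        if self.decaying_avg is None:
            self.decaying_avg = latency_ms
        else:
            # Exponentially weighted update: recent samples count more.
            self.decaying_avg += self.alpha * (latency_ms - self.decaying_avg)

    def mean(self):
        """Plain arithmetic mean of the samples in the window."""
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def stddev(self):
        """Population standard deviation of the samples in the window."""
        if not self.samples:
            return 0.0
        m = self.mean()
        return math.sqrt(sum((x - m) ** 2 for x in self.samples) / len(self.samples))

    def overloaded(self, threshold_ms=DEFAULT_OVERLOAD_MS):
        """True if the windowed mean latency exceeds the threshold."""
        return self.mean() > threshold_ms


one_minute = LoopLatencyStats(window_seconds=60)
for _ in range(600):
    one_minute.record(100.0)      # healthy: every loop completes in 100 ms
print(one_minute.mean(), one_minute.stddev(), one_minute.overloaded())
```

In this sketch a loop that consistently completes in 100 ms yields a mean of 100 ms and is not flagged, while a sustained mean above the threshold would be flagged as overload.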

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips