Bug ID 509346: Intermittent or complete SSL handshake failure with netHSM keys

Last Modified: Nov 07, 2022

Affected Product:
BIG-IP LTM (all modules)

Known Affected Versions:
11.5.1, 11.5.1 HF1, 11.5.1 HF2, 11.5.1 HF3, 11.5.1 HF4, 11.5.1 HF5, 11.5.1 HF6, 11.5.1 HF7, 11.5.1 HF8, 11.5.1 HF9, 11.5.1 HF10, 11.5.1 HF11, 11.5.2, 11.5.2 HF1, 11.5.3, 11.5.3 HF1, 11.5.3 HF2, 11.5.4, 11.5.4 HF1, 11.5.4 HF2, 11.5.4 HF3, 11.5.4 HF4, 11.5.5, 11.5.6, 11.5.7, 11.5.8, 11.5.9, 11.5.10, 11.6.0, 11.6.0 HF1, 11.6.0 HF2, 11.6.0 HF3, 11.6.0 HF4, 11.6.0 HF5

Fixed In:
12.0.0, 11.6.2, 11.6.0 HF6

Opened: Feb 25, 2015
Severity: 2-Critical
Related Article:
K01051264

Symptoms

1) When the network HSM (netHSM) takes too long to respond, TMM is considered down. On a chassis, this triggers failover to other blades. Because all blades share the same netHSM, those blades might quickly fail as well, and if they do, all TMM traffic stops. netHSM delays or failures can have many causes. On appliances and VE, the result might be intermittent or complete SSL handshake failure, depending on the reliability of the netHSM connection.
2) When memory consumption is high because of a heavy configuration, restarting PKCS11d might cause the PKCS11d service to malfunction. This might appear as intermittent or complete SSL handshake failures, depending on each TMM's memory usage.
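The cascade described in symptom 1 can be illustrated with a small simulation. This is only a sketch of the reasoning, not BIG-IP code: the `Blade` class, the `simulate_failover` function, and the `failover_on_timeout` flag are hypothetical names, and the latency and timeout values are made up. It assumes every blade observes the same latency because all blades share one netHSM.

```python
from dataclasses import dataclass


@dataclass
class Blade:
    """Hypothetical stand-in for one blade in a chassis."""
    slot: int
    enabled: bool = True


def simulate_failover(blades, hsm_latency_s, timeout_s, failover_on_timeout=True):
    """Walk the blades in order, as traffic would fail over from one to the next.

    All blades talk to the same netHSM, so each one sees the same latency.
    Returns the slots that are still enabled afterwards.
    """
    for blade in blades:
        timed_out = hsm_latency_s > timeout_s
        if timed_out and failover_on_timeout:
            # The slow response is treated as a blade failure, so the blade
            # is disabled and the next blade takes over.
            blade.enabled = False
            continue
        # Either the netHSM answered in time or a timeout is not treated as
        # a blade failure, so this blade keeps serving.
        break
    return [b.slot for b in blades if b.enabled]


chassis = [Blade(slot) for slot in range(1, 5)]
print(simulate_failover(chassis, hsm_latency_s=5.0, timeout_s=1.0))
```

Because the slow netHSM is shared, failing over does not help: each blade that takes over hits the same timeout, and the chassis ends up with no enabled blades.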

Impact

1) All blades in the chassis are put into "disabled" mode, causing all SSL handshakes to fail.
2) The PKCS11d service malfunctions, which might appear as intermittent or complete SSL handshake failures.

Conditions

This affects all platforms: chassis, appliance, and VE.
1) The netHSM responds slowly or fails.
2) Memory usage is high because of heavy configuration or provisioning, and PKCS11d is then restarted.

Workaround

1) Restart the chassis to clear the state.
2) Reboot the system.

Fix Information

1) The timeout trigger for failover is now disabled when a netHSM is used. netHSM-related SSL failures can still occur for many reasons, but with this fix they no longer cause all blades to be disabled.
2) The system now resets shared memory queues at creation to avoid potential memory corruption.
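To illustrate why resetting a shared memory queue at creation matters (fix 2), here is a minimal sketch. It is not the BIG-IP implementation: the queue layout, the `SharedQueue` class, and the `reset_on_create` flag are hypothetical, and a plain `bytearray` stands in for a shared memory region that outlives a service restart.

```python
import struct

# Hypothetical layout: a header with head and tail indices, then fixed-size slots.
HEADER = struct.Struct("<II")
SLOT = struct.Struct("<Q")
NUM_SLOTS = 4
REGION_SIZE = HEADER.size + SLOT.size * NUM_SLOTS


class SharedQueue:
    """Ring queue living inside a caller-supplied memory region."""

    def __init__(self, region, reset_on_create):
        self.region = region
        if reset_on_create:
            # Zero the header and slots so no state left behind by a
            # previous owner can be misread as valid queue contents.
            region[:] = bytes(len(region))

    def _indices(self):
        return HEADER.unpack_from(self.region, 0)

    def pending(self):
        head, tail = self._indices()
        return (tail - head) % NUM_SLOTS

    def push(self, value):
        head, tail = self._indices()
        if (tail + 1) % NUM_SLOTS == head:
            raise OverflowError("queue full")
        SLOT.pack_into(self.region, HEADER.size + tail * SLOT.size, value)
        HEADER.pack_into(self.region, 0, head, (tail + 1) % NUM_SLOTS)

    def pop(self):
        head, tail = self._indices()
        if head == tail:
            raise IndexError("queue empty")
        (value,) = SLOT.unpack_from(self.region, HEADER.size + head * SLOT.size)
        HEADER.pack_into(self.region, 0, (head + 1) % NUM_SLOTS, tail)
        return value


# The region survives the "restart"; only the queue object is recreated.
region = bytearray(REGION_SIZE)
before_restart = SharedQueue(region, reset_on_create=True)
before_restart.push(111)
before_restart.push(222)

without_reset = SharedQueue(region, reset_on_create=False)
stale_entries = without_reset.pending()   # leftover requests look like live ones

with_reset = SharedQueue(region, reset_on_create=True)
clean_entries = with_reset.pending()      # queue starts empty
```

Without the reset, the recreated queue inherits the old head and tail indices and treats leftover slots as pending work. Zeroing the region at creation makes the starting state independent of whatever the previous owner left behind.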

Behavior Change