Bug ID 619958: Thales HSM in HA with Failure can cause key creation delay for over 1 minute

Last Modified: Jul 12, 2023

Affected Product(s):
BIG-IP GTM, LTM (all modules)

Known Affected Versions:
13.0.0, 13.0.0 HF1, 13.0.0 HF2, 13.0.0 HF3, 13.0.1, 13.1.0, 13.1.0.1, 13.1.0.2, 13.1.0.3, 13.1.0.4, 13.1.0.5, 13.1.0.6, 13.1.0.7, 13.1.0.8, 13.1.1, 13.1.1.2, 13.1.1.3, 13.1.1.4, 13.1.1.5, 13.1.3, 13.1.3.1, 13.1.3.2, 13.1.3.3, 13.1.3.4, 13.1.3.5, 13.1.3.6, 13.1.4, 13.1.4.1, 13.1.5, 13.1.5.1

Opened: Oct 01, 2016

Severity: 3-Major

Symptoms

When an HSM goes offline in an HA HSM configuration, the switchover to the other HSM does not occur immediately. The switchover occurs only after the failover timeout expires, and SSL handshakes fail in the meantime.

Impact

SSL handshakes fail between when the HSM goes down and when the failover timeout occurs.

Conditions

There is a disruption to a Thales HSM that is configured in HA with at least one other HSM.

Workaround

To limit the time during which SSL connections fail, lower the relevant settings in the [server_settings] section of the /opt/nfast/kmdata/config/config file. The Thales User Guide has a detailed explanation of what each setting does. Thales recommends using the default values in a production setup unless there is a solid reason to modify these settings.

The relevant settings are:

-- connect_retry: The number of seconds to wait before retrying a remote connection to a client hardserver. The default is 10.
-- connect_broken: The number of seconds of inactivity allowed before a connection to a client hardserver is declared broken. The default is 90.
-- connect_keepalive: The number of seconds between keepalive packets for remote connections to a client hardserver. The default is 10.
-- connect_command_block: When a netHSM has failed, the number of seconds the hardserver waits before failing commands directed to that netHSM with a NetworkError message. For commands to have a chance of succeeding after a netHSM has failed, this value should be greater than that of connect_retry. If it is set to 0, commands to a netHSM fail with NetworkError immediately, as soon as the netHSM fails. The default is 35.

A slightly tighter setting than the default settings looks similar to the following:

[server_settings]
connect_retry=3
connect_keepalive=4
connect_broken=10
connect_command_block=15

A very tight setting looks similar to the following:

[server_settings]
connect_retry=1
connect_keepalive=10
connect_broken=1
connect_command_block=0

Note: The very tight settings can cause a module to be marked as failed when there is just a short network glitch from which it may well recover.
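
Before applying changed values, it may help to sanity-check them against the guidance above. The following is a minimal sketch, not a Thales or F5 tool: the function names read_server_settings and check_settings are hypothetical, and the parser assumes the section consists of plain key=value lines with optional # comments, which may not match every detail of the real config file. The defaults are the ones quoted in this workaround.

```python
# Defaults as stated in the workaround text (seconds).
DEFAULTS = {
    "connect_retry": 10,
    "connect_keepalive": 10,
    "connect_broken": 90,
    "connect_command_block": 35,
}


def read_server_settings(text):
    """Return the four timeout settings found under [server_settings].

    Hypothetical helper: assumes simple key=value lines and # comments.
    Settings that are not present keep the documented default.
    """
    settings = dict(DEFAULTS)
    in_section = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            in_section = line == "[server_settings]"
            continue
        if in_section and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key in settings:
                settings[key] = int(value)
    return settings


def check_settings(settings):
    """Return warnings derived from the connect_command_block description."""
    warnings = []
    block = settings["connect_command_block"]
    if block == 0:
        warnings.append(
            "connect_command_block=0: commands fail with NetworkError "
            "as soon as the netHSM fails"
        )
    elif block <= settings["connect_retry"]:
        warnings.append(
            "connect_command_block should be greater than connect_retry "
            "for commands to have a chance of succeeding"
        )
    return warnings


if __name__ == "__main__":
    relaxed = """
    [server_settings]
    connect_retry=3
    connect_keepalive=4
    connect_broken=10
    connect_command_block=15
    """
    parsed = read_server_settings(relaxed)
    print(parsed)
    print(check_settings(parsed))
```

The check only encodes the relationship stated for connect_command_block and connect_retry. It does not estimate how long handshakes will fail, because this article does not give a formula for that.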

Fix Information

None

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips