Bug ID 1173061: etcd database may be corrupted in certain failure scenarios

Last Modified: May 29, 2024

Affected Product(s):
F5OS Velos (all modules)

Fixed In:
F5OS-C 1.6.0, F5OS-C 1.5.1

Opened: Oct 12, 2022

Severity: 2-Critical


The output of /etc/etcd/dump_etcd.sh might show that the etcd instance native to system controller #1 or #2 does not come up after an upgrade. Messages like the following might appear for the .3.51 or .3.52 node:

failed to check the health of member 25fa6669d235caa6 on Get dial tcp connect: connection refused
member 25fa6669d235caa6 is unreachable: [] are all unreachable

This can cause a longer OpenShift outage if the system controller containing the healthy instance is rebooted, and a complete outage if the system controller containing the healthy instance is lost.
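When the script output has been captured to a file or variable, the affected member IDs can be picked out programmatically. The following is a minimal sketch, not an F5-provided tool: the function name `unreachable_members` is hypothetical, and the patterns assume only the two message shapes quoted in this report, so output from a real system may need adjusted expressions.

```python
import re

# Assumption: these patterns cover only the two message shapes quoted in this
# bug report; real dump_etcd.sh output may contain other wording.
_PATTERNS = (
    re.compile(r"failed to check the health of member ([0-9a-f]+)"),
    re.compile(r"member ([0-9a-f]+) is unreachable"),
)


def unreachable_members(output: str) -> list[str]:
    """Return the sorted, de-duplicated member IDs reported as unhealthy."""
    found = set()
    for line in output.splitlines():
        for pattern in _PATTERNS:
            match = pattern.search(line)
            if match:
                found.add(match.group(1))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "failed to check the health of member 25fa6669d235caa6 on Get dial tcp "
        "connect: connection refused\n"
        "member 25fa6669d235caa6 is unreachable: [] are all unreachable\n"
    )
    print(unreachable_members(sample))
```

An empty result means that neither message shape was found in the captured text. It does not prove that the etcd cluster is healthy.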


The local etcd instance on the affected system controller will not work correctly, compromising the high availability (HA) of the OpenShift cluster. The cluster will continue to work correctly while both system controllers are up.


This can happen if both system controllers are rebooted at the same time.


The only workaround is to rebuild the OpenShift cluster by running "touch /var/omd/CLUSTER_REINSTALL" from the shell as root on the active system controller. This will cause all running tenants to be taken down during the cluster reinstall, which takes 90+ minutes.
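Because the workaround takes all tenants down, it may be worth guarding the marker-file step behind an explicit confirmation. The following is a minimal sketch under stated assumptions: the helper `request_cluster_reinstall` and its `confirm` parameter are hypothetical, and only the marker path comes from this report. It does the same thing as the `touch` command and would still need to run as root on the active system controller.

```python
import tempfile
from pathlib import Path

# Marker path taken from the workaround text in this bug report.
DEFAULT_MARKER = Path("/var/omd/CLUSTER_REINSTALL")


def request_cluster_reinstall(marker: Path = DEFAULT_MARKER,
                              confirm: bool = False) -> Path:
    """Create the reinstall marker file, but only when explicitly confirmed."""
    if not confirm:
        # Hypothetical safety guard: the reinstall takes down all running tenants.
        raise RuntimeError("refusing to create marker; pass confirm=True")
    marker.touch()  # same effect as `touch <marker>`: creates the file if absent
    return marker


if __name__ == "__main__":
    # Demonstration against a temporary directory, not the real marker path.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "CLUSTER_REINSTALL"
        print(request_cluster_reinstall(demo, confirm=True).exists())
```

The demonstration writes to a temporary directory so that running the file as-is does not trigger a reinstall.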

Fix Information

This is fixed in F5OS-C 1.5.1 and later. With this fix, the impacted etcd instance is recovered automatically, restoring full high availability support in etcd.

Behavior Change
