Bug ID 1076705: Etcd instance might not start correctly after upgrade

Last Modified: Jul 12, 2023

Affected Product(s):
F5OS Install/Upgrade, VELOS (all modules)

Fixed In:
F5OS-C 1.5.0

Opened: Jan 28, 2022

Severity: 2-Critical


Symptoms

The etcd instance native to system controller #1 or #2 might not come up after an upgrade. The output of /etc/etcd/dump_etcd.sh shows errors similar to the following for the .3.51 or .3.52 node:

failed to check the health of member 25fa6669d235caa6 on Get dial tcp connect: connection refused
member 25fa6669d235caa6 is unreachable: [] are all unreachable

This can cause a longer OpenShift outage if the system controller containing the healthy etcd instance is rebooted, and a complete outage if the system controller containing the healthy instance is lost.
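A quick way to spot this condition is to count unreachable members in the script's output. The snippet below is a minimal sketch, not an official diagnostic: the member ID and message format are taken from the sample output above, and a captured sample line stands in for the live script output so the sketch runs anywhere.

```shell
# Hypothetical check for unreachable etcd members.
# On a live system controller you would run the article's script instead:
#   /etc/etcd/dump_etcd.sh | grep -c 'is unreachable'
sample='member 25fa6669d235caa6 is unreachable: [] are all unreachable'

# grep -c counts the lines that match; 0 means the local instance looks healthy
unreachable=$(printf '%s\n' "$sample" | grep -c 'is unreachable')
echo "unreachable members: $unreachable"
```

A nonzero count for the .3.51 or .3.52 node indicates the affected instance described in this article.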


Impact

The local etcd instance on the affected system controller does not work correctly, compromising the high availability (HA) of the OpenShift cluster. The cluster continues to work correctly while both system controllers are up.


Conditions

This is caused by an earlier mount failure of the DRBD file system, which corrupts the etcd instance on the standby system controller. This occurs very infrequently.


Workaround

Rebuild the OpenShift cluster by running "touch /var/omd/CLUSTER_REINSTALL" from the CLI on the active system controller. This takes down all running tenants during the cluster reinstall, which takes 50 minutes or more. Once the cluster rebuild is complete, disable and re-enable all chassis partitions, and cycle all tenants to the provisioned state and back to deployed to ensure they have restarted correctly after the cluster rebuild.
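The rebuild is triggered by a marker file, as shown in the command above. The sketch below illustrates that mechanism only; it uses a temporary directory as a stand-in for /var/omd so it is safe to run anywhere, whereas the real marker must be created at /var/omd/CLUSTER_REINSTALL on the active system controller (and will take tenants down).

```shell
# Illustration of the marker-file trigger (stand-in path, NOT the live system).
dir=$(mktemp -d)                 # temporary stand-in for /var/omd
marker="$dir/CLUSTER_REINSTALL"

touch "$marker"                  # on the real system, this requests the rebuild
status=$([ -f "$marker" ] && echo "reinstall marker present")
echo "$status"

rm -rf "$dir"                    # clean up the stand-in directory
```

On the real system there is no confirmation output; the rebuild proceeds once the marker file exists, so create it only when a 50+ minute tenant outage is acceptable.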

Fix Information

This is fixed in F5OS-C 1.5.0 and later.

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips