Bug ID 1349257: Rolling software upgrade is stuck with one system controller in an "in-progress" state, and a "No such file or directory" error in sw-mgmt.debug

Last Modified: May 29, 2024

Affected Product(s):
F5OS, F5OS-C, Install/Upgrade, VELOS (all modules)

Known Affected Versions:
F5OS-C 1.5.0, F5OS-C 1.5.1, F5OS-C 1.6.0, F5OS-C 1.6.1

Fixed In:
F5OS-C 1.6.2

Opened: Sep 07, 2023

Severity: 1-Blocking

Related Article: K000137531

Symptoms

During a rolling upgrade of VELOS system controller software, one controller completes the installation process, but the other remains stuck in an "in-progress" state and is not reachable on its management IP.

1. One of the two system controllers is "stuck" and largely inaccessible after a rolling upgrade:
   a. You cannot connect to the system controller's management IP.
   b. You cannot connect to the system controller as root from the active system controller (for example, "ssh controller-#"). The controller should still be accessible over the "ccpeer" link.
   c. Platform services are not running.

2. When you access the stuck system controller (via console or a connection over the "ccpeer" link):
   a. Some subset of the files in /var/docker/config/ are broken symlinks (env_var, env_var.patch, platform.yml, platform.patch.yml).
   b. /var/log/sw-mgmt.debug contains a log message similar to the following, with the error "No such file or directory":

      19-Oct-23 14:55:34 - ERROR: sw-mgmt: priority=error msgid=0x3501000000000153 msg=Unexpected error importing controller services 1.6.1-19136: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
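The broken-symlink symptom can be checked quickly from a shell on the stuck controller. The following is a minimal sketch, not an F5-provided tool: the function name list_broken_symlinks is hypothetical, and it assumes only a POSIX shell. It prints every entry in a directory that is a symlink whose target does not resolve.

```shell
#!/bin/sh
# Hypothetical helper (not part of F5OS): print broken symlinks in a directory.
# $1 = directory to inspect, for example /var/docker/config
list_broken_symlinks() {
    dir=$1
    for entry in "$dir"/*; do
        # -L is true for a symlink; -e follows the link and fails
        # when the target is missing, so both together mean "broken".
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            echo "$entry"
        fi
    done
}
```

On an affected controller, running list_broken_symlinks /var/docker/config would be expected to list some of env_var, env_var.patch, platform.yml, and platform.patch.yml; no output suggests the symlinks are intact.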

Impact

The upgrade process is stuck, and one controller remains inoperative.

Conditions

Performing a rolling system update of VELOS system controller software.

Workaround

To avoid running into this issue during an upgrade, do one of the following:

1. Perform an out-of-service upgrade, rather than a rolling upgrade. Refer to https://techdocs.f5.com/en-us/velos-1-5-0/velos-systems-installation-upgrade/title-install-upgrade-software.html for more information.

2. Add a systemd drop-in file for the sw-mgmt service on each controller, by logging into each controller as root and doing the following:

   a. Create a systemd drop-in file for the sw-mgmt service by running the following commands:

      mkdir /etc/systemd/system/sw-mgmt.service.d/
      echo -e '[Unit]\nWants=docker.service\nAfter=docker.service' > /etc/systemd/system/sw-mgmt.service.d/deps.conf
      cat /etc/systemd/system/sw-mgmt.service.d/deps.conf

      The output of displaying the file should look like this:

      [root@controller-2 ~]# cat /etc/systemd/system/sw-mgmt.service.d/deps.conf
      [Unit]
      Wants=docker.service
      After=docker.service
      [root@controller-2 ~]#

   b. Activate the modified configuration:

      systemctl daemon-reload

   c. Verify that the service is in a functioning state, and now has an explicit dependency on docker:

      systemctl status -l sw-mgmt
      systemctl list-dependencies sw-mgmt | grep docker

To remove the workaround, run the following on each system controller individually:

   a. Log into the system controller as root.

   b. Rename the systemd drop-in file to have a ".disabled" extension:

      mv /etc/systemd/system/sw-mgmt.service.d/deps.conf /etc/systemd/system/sw-mgmt.service.d/deps.conf.disabled

   c. Reload systemd:

      systemctl daemon-reload

If a system has encountered this problem:

1. Log into the active system controller via SSH as 'root'.

2. SSH to the offline system controller over the internal 'ccpeer' link. By default, if a chassis uses RFC6598 IP addressing, the IP addresses of the system controllers on this network will be:

      controller-1: 100.65.7.51
      controller-2: 100.65.7.52

   The IP address of the peer controller on the ccpeer link can be found by running this command:

      echo "peer controller: $(ifconfig ccpeer | grep -Po '(?<=inet )([^.]+\.){3}')$(( 53 - $(grep Slot /etc/PLATFORM | cut -d':' -f2 | tr -d ' ')))"

3. Verify that both system controllers have the same set of software images present, by comparing the output of "ls /var/import/staging/*.iso" on both system controllers.

4. On the offline system controller, stop the sw-mgmt service:

      systemctl stop sw-mgmt

5. On the offline system controller, make a backup copy of import.json:

      cp /var/import/import.json ~/import.json.bak

6. On the offline system controller, copy import.json from the working controller over the ccpeer link.

   If controller-1 is working, and controller-2 is offline:

      scp 100.65.7.51:/var/import/import.json /var/import/import.json

   If controller-1 is offline, and controller-2 is working:

      scp 100.65.7.52:/var/import/import.json /var/import/import.json

7. On the offline system controller, start the sw-mgmt service:

      systemctl start sw-mgmt

8. Wait about 5 to 10 minutes (you can monitor progress by tailing /var/log/sw-mgmt.debug), and then run this command to list the controller services versions that the sw-mgmt service has imported:

      echo list cc_iso | nc -U /var/sw-mgmt.unix

9. If that works as expected, reboot the offline system controller:

      reboot

After the system controller reboots, it should progress further in the installation process. If there are pending firmware upgrades, the system controller may reboot automatically again to complete those upgrades.
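The peer-address one-liner in step 2 is dense, so here is the same arithmetic spelled out as a minimal sketch. It is not an F5-provided tool: the function name peer_ccpeer_ip is hypothetical, and it takes the local ccpeer address and slot number as arguments instead of reading them from ifconfig and /etc/PLATFORM. The logic mirrors the original command: keep the first three octets of the local ccpeer address and set the last octet to 53 minus the local slot number, so slot 1 yields .52 and slot 2 yields .51.

```shell
#!/bin/sh
# Hypothetical helper (not part of F5OS): derive the peer controller's
# ccpeer address from the local ccpeer address and the local slot number.
# $1 = local ccpeer IPv4 address, $2 = local controller slot (1 or 2)
peer_ccpeer_ip() {
    local_ip=$1
    slot=$2
    case "$slot" in
        1|2) ;;
        *) echo "slot must be 1 or 2" >&2; return 1 ;;
    esac
    # Strip the last octet, keeping the network part (for example 100.65.7).
    prefix=${local_ip%.*}
    # Controllers use .51 and .52, so the peer's last octet is 53 - slot.
    echo "${prefix}.$((53 - slot))"
}
```

For example, peer_ccpeer_ip 100.65.7.51 1 prints 100.65.7.52, which matches the default controller-2 address listed in step 2.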

Fix Information

None

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips