Bug ID 1188141: Tenant launch gets stuck due to un-initialization of VFs under one or more PF

Last Modified: Jul 26, 2024

Affected Product(s):
F5OS F5OS(all modules)

Known Affected Versions:
F5OS-A 1.3.1, F5OS-A 1.3.2

Fixed In:
F5OS-A 1.4.0

Opened: Nov 09, 2022

Severity: 3-Major

Symptoms

On r2x00/r4x00-based systems, tenant launch gets stuck with the following error in the ConfD tenant status leaf:

    "error adding container to network \"sriov-net3-tenant1\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bus/pci/devices/0000:ec:00.7/physfn/net: no such file or directory"

The VFs (SR-IOV Virtual Functions) are not listed under a PF (SR-IOV Physical Function) when you run the following command:

    ip link show <PF>

<PF> can be `x557_1`, `x557_2`, `x557_3`, `x557_4`, `sfp_5`, `sfp_6`, `sfp_7`, or `sfp_8`.

For example, the faulty PF (x557_4 in this case) has no VFs listed, whereas the healthy PF (x557_1 in this case) does:

    # ip link show x557_4
    18: x557_4: <NO-CARRIER,BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
        link/ether 14:a9:d0:01:56:8a brd ff:ff:ff:ff:ff:ff

    # ip link show x557_1
    15: x557_1: <NO-CARRIER,BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
        link/ether 14:a9:d0:01:56:87 brd ff:ff:ff:ff:ff:ff
        vf 2 MAC 00:00:00:00:11:02, spoof checking on, link-state auto, trust off
        vf 2 MAC 00:00:00:00:11:02, spoof checking on, link-state auto, trust off
        vf 2 MAC 00:00:00:00:11:02, spoof checking on, link-state auto, trust off
        vf 2 MAC 00:00:00:00:11:02, spoof checking on, link-state auto, trust off
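The healthy-versus-faulty comparison above can be reduced to a count of VF lines per PF. The following is a minimal sketch, not an F5-provided tool: the helper name `count_vfs` is hypothetical, and it assumes VF entries in `ip link show` output start with `vf <index>` after optional indentation, as in the sample above.

```shell
# Hypothetical helper: count the VF entries in `ip link show <PF>` output read from stdin.
# Assumes each VF appears on its own line beginning with "vf <index>".
count_vfs() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that exit status.
    grep -cE '^[[:space:]]*vf [0-9]+' || true
}
```

For example, `ip link show x557_4 | count_vfs` could be run for each PF name in the list; a count of 0 would match the faulty PF shown above.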

Impact

Tenant launch fails, and you cannot connect to the tenant through its console or over its management connection.

Conditions

On r4x00 or r2x00-based systems:

1. The ConfD tenant status leaf reports "LoadConf(): failed to get VF information".
2. The VFs were not created under one or more PFs.
3. One of the entries "x557_1", "x557_2", "x557_3", "x557_4", "sfp_5", "sfp_6", "sfp_7", "sfp_8" is missing from the "/sys/class/net" directory.

For example, when x557_4 is the faulty PF (SR-IOV Physical Function), `/sys/class/net` does not list x557_4:

    [root@appliance-1 ~]# ls /sys/class/net/x557_4
    ls: cannot access /sys/class/net/x557_4: No such file or directory
    [root@appliance-1 ~]#
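The third condition can be checked for all eight PF names in one pass. This is a minimal sketch under stated assumptions: the function name `missing_pfs` is hypothetical, and the optional directory argument exists only so the check can be tried against a scratch directory; by default it looks in `/sys/class/net`.

```shell
# Hypothetical helper: print each expected PF name that is absent from the net directory.
# $1 is the directory to inspect (defaults to /sys/class/net).
missing_pfs() {
    mp_dir="${1:-/sys/class/net}"
    for mp_pf in x557_1 x557_2 x557_3 x557_4 sfp_5 sfp_6 sfp_7 sfp_8; do
        # Report the PF only when no entry with that name exists.
        [ -e "$mp_dir/$mp_pf" ] || echo "$mp_pf"
    done
}
```

Calling `missing_pfs` with no argument prints nothing when every PF is present; any name it prints corresponds to a PF in the faulty state described above.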

Workaround

Workaround #1
===============
1. Move the tenant(s)' running-state in ConfD to provisioned.
2. Once the missing PF from the list above appears in "/sys/class/net", run the "/usr/omd/scripts/config_ice_vfs.sh" script.
3. Run "kubectl rollout restart daemonset kube-sriov-device-plugin-amd64 -n kube-system".
4. Move the tenant(s)' running-state in ConfD to deployed.

Workaround #2 (only when the second step takes too long)
==================================================
1. If the PF still has not appeared in "/sys/class/net" after 20 minutes of waiting at the second step of Workaround #1, reboot the host to trigger device probing.
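Steps 2 and 3 of Workaround #1, together with the 20-minute limit from Workaround #2, could be scripted roughly as follows. This is only a sketch: the function names `wait_for_pf` and `recover_pf`, the 10-second polling interval, and the argument layout are assumptions, while the script path and the kubectl command are taken from the steps above. The ConfD running-state changes (steps 1 and 4) are not covered and still have to be made separately.

```shell
# Hypothetical helper: wait until a PF appears in the net directory.
# $1 PF name, $2 timeout in seconds (default 1200, i.e. 20 minutes),
# $3 directory to watch (default /sys/class/net), $4 polling interval in seconds (assumed 10).
wait_for_pf() {
    wf_pf="$1"
    wf_timeout="${2:-1200}"
    wf_dir="${3:-/sys/class/net}"
    wf_interval="${4:-10}"
    wf_waited=0
    while [ ! -e "$wf_dir/$wf_pf" ]; do
        if [ "$wf_waited" -ge "$wf_timeout" ]; then
            return 1    # PF never appeared within the timeout
        fi
        sleep "$wf_interval"
        wf_waited=$((wf_waited + wf_interval))
    done
    return 0
}

# Hypothetical wrapper for steps 2 and 3 of Workaround #1.
# $1 is the name of the missing PF.
recover_pf() {
    if wait_for_pf "$1"; then
        /usr/omd/scripts/config_ice_vfs.sh &&
            kubectl rollout restart daemonset kube-sriov-device-plugin-amd64 -n kube-system
    else
        echo "PF $1 not detected within the timeout; see Workaround #2 (host reboot)" >&2
        return 1
    fi
}
```

With the tenants already moved to provisioned, `recover_pf x557_4` would wait for the PF and then run the two commands; a non-zero return indicates that the reboot described in Workaround #2 is the remaining option.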

Fix Information

The workarounds should fix the tenants' statuses and move them to a running state.

Behavior Change

Guides & references

K10134038: F5 Bug Tracker Filter Names and Tips