While working with a client recently, we encountered an issue that we thought was worth sharing, as others may be experiencing the same problem. Several ESXi hosts at version 6.7, build 13644319, could no longer migrate guest virtual machines (VMs) using vMotion. We also found that we could not connect over SSH with PuTTY, even after starting the SSH service on the host. The cause turned out to be a full /var partition. Rebooting the host running the least critical guest VMs temporarily restored vMotion and PuTTY access, but only until the /var partition filled up again. This was not a desirable resolution. We needed to first migrate the VMs to other hosts, and then prevent the /var partition from filling up again.
Migrating the guests
The issue presented itself after approximately two weeks of uptime on a host. Since we could not connect with PuTTY, we had to use the host console. We found that /var/log/EMU/mili/mili2d.log was continually growing. The log file contained a repeating error, “OneConnect Adapter not found”, which referred to the elxnet drivers. After clearing the log file, the /var partition was still full.
- https://kb.vmware.com/s/article/1029661, The /var/log partition or /var appears to be full even after removing files to free disk space
- We found that the log file was still held open by the hostd process, so we restarted hostd:
- /etc/init.d/hostd restart
- We could now successfully use vMotion to migrate the VMs and avoid a guest outage, but we were still unable to connect to the host with PuTTY. We rebooted the host, and everything worked normally again. However, this was only a temporary fix: over time, the log would fill up the /var partition again.
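To track down which file is consuming the space, listing the largest files under a directory such as /var/log is a useful first step. The following is a minimal sketch rather than the exact commands we ran: `largest_files` is a hypothetical helper name, and it assumes the shell provides `find`, `du`, `sort`, and `head`, which may not behave identically on every ESXi build.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest regular files under a directory,
# biggest first, one per line as "<size in KB><TAB><path>".
largest_files() {
    dir="$1"
    count="${2:-5}"   # default to the top 5 if no count is given
    # du -k reports each file's disk usage in KB; sort -rn puts the largest first.
    find "$dir" -type f -exec du -k {} \; | sort -rn | head -n "$count"
}
```

For example, `largest_files /var/log 5` would list the five largest files under /var/log, which in our case would have pointed at mili2d.log.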
Resolving the root cause of the runaway log file
At the time of troubleshooting, the hosts were only one build behind, but we could not apply patches to bring them up to the latest build. A compatibility issue with a third-party backup and replication solution meant the hosts had to temporarily remain on their current build. In any case, we did not find anything in the latest release notes relating to this issue or to the drivers noted in the log.
- The Emulex drivers were current at the time, and nothing newer was posted on the host hardware manufacturer’s downloads website or by VMware.
- Since these hosts do not have any Emulex hardware installed, the elxnet drivers are not needed, so we simply removed them with the following commands and rebooted. We no longer have the issue of the continually growing mili2d.log file filling up the /var partition.
esxcli software vib remove --vibname elxnet
esxcli software vib remove --vibname elxiscsi
esxcli software vib remove --vibname elx-esx-libelxima.so
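Before removing anything, it can help to confirm which of these VIBs are actually installed on a given host. This is a minimal sketch, assuming that `esxcli software vib list` prints the VIB name in the first column and that `awk` is available in the shell. `list_elx_vibs` is a hypothetical helper name, and the `elx` prefix match is based only on the three VIB names above.

```shell
#!/bin/sh
# Hypothetical helper: read a VIB listing on stdin and print the names
# (first column) that begin with "elx".
list_elx_vibs() {
    awk '$1 ~ /^elx/ { print $1 }'
}
```

On a host, this would be used as `esxcli software vib list | list_elx_vibs`. A similar check of the `lspci` output for Emulex devices is a reasonable way to confirm the hardware is absent before removing the drivers.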
A full /var partition on an ESXi host will cause problems such as the vMotion and SSH failures described above. Keeping free space on the /var partition will help keep the hosts in your cluster running well.
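One way to catch a runaway log before it fills the partition is a simple size check. The sketch below is illustrative only and was not part of our fix: `check_log_size` is a hypothetical helper name, the limit is whatever you choose, and it assumes a POSIX shell with `wc` available.

```shell
#!/bin/sh
# Hypothetical helper: warn when a file grows past a size limit (in KB).
# Returns 0 if the file is within the limit or does not exist, 1 otherwise.
check_log_size() {
    file="$1"
    limit_kb="$2"
    [ -f "$file" ] || return 0
    # wc -c gives the size in bytes; integer division converts to KB.
    size_kb=$(( $(wc -c < "$file") / 1024 ))
    if [ "$size_kb" -gt "$limit_kb" ]; then
        echo "WARNING: $file is ${size_kb} KB (limit ${limit_kb} KB)"
        return 1
    fi
    return 0
}
```

For example, `check_log_size /var/log/EMU/mili/mili2d.log 10240` would warn once the log passes 10 MB. That threshold is an arbitrary example value, not a recommendation.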