Don’t waste reboots in vSAN!

This post focuses on vSAN, and more specifically on situations where a host in a vSAN cluster becomes “not responding”.

A host can stop responding for many reasons, and a call to GSS is often necessary. If the host stays unresponsive, a reboot may be required as the last step, and most of the time a root cause analysis will be expected afterwards.

Usually, the regular ESXi logs are not enough to diagnose the issue.

Why an NMI?

An NMI, or Non-Maskable Interrupt, triggers the famous “Purple Screen Of Death” (PSOD) on the ESXi host. As a consequence, a core dump is generated, which VMware can then analyze to establish a root cause.

NMI PSOD

Warning: An NMI should be the last step. Do not send the NMI if no troubleshooting steps have been tried.

How to generate an NMI?

Every vendor should offer the option in its management interface, such as iLO or iDRAC. Please refer to the vendor’s documentation.

How to grab it in the logs?

When you gather the logs on the ESXi host with vm-support, a script automatically grabs the dump and adds it to the vm-support tgz. As the dump can be quite large, please make sure that you have enough free space in the directory where you generate the support bundle.
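The free-space check can be scripted before the bundle is generated. The following is only a minimal sketch: the function names, the expected bundle size and the safety factor are illustrative assumptions, not values published by VMware, and the datastore path in the usage note is hypothetical.

```python
import shutil


def free_bytes(path):
    """Return the free space, in bytes, of the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def has_room_for_bundle(path, expected_bundle_bytes, safety_factor=2.0):
    """Check that `path` can hold a support bundle of the expected size.

    `expected_bundle_bytes` and `safety_factor` are assumptions chosen by the
    operator; the real bundle size depends on the host and on the dump.
    """
    needed = int(expected_bundle_bytes * safety_factor)
    return free_bytes(path) >= needed
```

As a usage example, `has_room_for_bundle("/vmfs/volumes/datastore1", 8 * 1024**3)` would check a hypothetical datastore for an assumed 8 GiB bundle. If the check passes, the same directory can be given to vm-support as its working directory (the `-w` option, to be confirmed against the vm-support help on your ESXi version).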

Last but not least, the host must have a diagnostic partition configured; otherwise no dump will be collected, leading to a wasted reboot.
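This can be verified before sending the NMI. On the host, `esxcli system coredump partition get` reports the active and configured diagnostic partition. The sketch below parses that output. The helper names are hypothetical, and the `Active:` / `Configured:` line layout is an assumption about the command output that should be checked on your ESXi version. Hosts that dump to a file or over the network are not covered here.

```python
import subprocess


def parse_coredump_partition(output):
    """Turn 'Key: value' lines into a dict with lower-cased keys."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only: device names contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def has_active_dump_partition(output):
    """True when the 'Active' field names a partition (assumed layout)."""
    return bool(parse_coredump_partition(output).get("active"))


def read_coredump_partition():
    """Run esxcli on the ESXi host itself and return its raw text output."""
    result = subprocess.run(
        ["esxcli", "system", "coredump", "partition", "get"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

On a host, `has_active_dump_partition(read_coredump_partition())` returning False would mean that an NMI is likely to produce the wasted reboot described above.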