This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Preventing File System Corruption from Halting Boot Up of SLES in Azure

When you create a Linux VM in Azure, you don’t get to know the “root” user password.
By default, if a Linux VM detects journaled file system corruption at boot if will go into recovery mode, requiring the root password to be able to fix it.
Without the root password, the only other way to fix the issue is copying the O/S disk, mounting on another VM and fixing the issue.
If you don’t have Azure Boot Diagnostics enabled, you might not even know what the problem is! The VM will just appear to not boot.

In this post I show a simple way to prevent Debian based Linux distributions (I use SLES) from failing boot up due to file system corruption. Our example is an XFS file system just like in my previous post.
XFS is journaled and will check the integrity on mounting. If there are problems with the file system then Linux will fail to mount it, which will cause the O/S boot up process to stall.

In a production system, you can imagine the scenario where a simple restart of a VM causes an hour long downtime (or longer).

NOTE: In my scenario there is no Linux device encryption, which could make the job or repair even harder, and all the more important to prevent boot failure.

Preventing Boot Failure

To prevent our corrupt XFS file system from halting boot, we just need to add 1 single option to the mount options in file /etc/fstab.
We use the “nofail” option.

We could just go and write this straight out to the fstab file and expect it to work.
However, we can test it first to make sure that it is:

  • supported on your version/distribution of Linux.
  • supported for your file system type (mine is XFS).

We could use the “-f” (fake) mount option to the “mount” command, but in testing I cannot get this to actually show an error when it is passed an invalid mount point option.
Instead, let’s actually mount the file system to check if “nofail” is accepted.

As the root user (or with sudo) get the current mount options for your file system (the one you will be applying “nofail” to):

grep BIG /etc/fstab

/dev/volTMP/lvTMP1 /BIGSTRIPEDDISK xfs defaults 0 0

I can see that my /BIGSTRIPEDDISK is mounted from a volume group and has the “defaults” mount options. Yours may be different.
We can now create a new mount point location and temporarily mount the file system adding the “nofail” option to test it is accepted (adjust the mount options using your current mount point settings):

mkdir /mnt/tempmount
mount -o defaults,nofail /dev/volTMP/lvTMP1 /mnt/tempmount

If you got an error or warning, then the file system type or your Linux distribution does not support the use of “nofail”. Maybe check the man page for an equivalent option (“man mount”).

If you didn’t get an error, then you know that you can successfully apply the “nofail” option to the end of the options column (column number 4) in the fstab:

vi /etc/fstab
/dev/volTMP/lvTMP1 /BIGSTRIPEDDISK xfs defaults,nofail 0 0

Once applied, it is recommended that you always verify boot related changes, by taking some downtime to restart the machine. There is nothing worse than applying a change and not testing it.

With “nofail” in place, the next time the O/S boots and the file system is mounted if there are issues with the integrity or even if the device is missing, the O/S will move forward in the boot process and ignore the error.
There is obviously a small consequence of this, file systems may not be mounted after a boot has completed.
It is possible to mitigate this problem with monitoring (scripts that monitor file system free space, for example) or other checks after boot.
Of course there is also a second option to all of this, set the root user password on new VMs and store in your secure password location. You can use a 16 character random string like those generated from a password manager.
You will also need to ensure that you can use the Azure Serial Console to get to the VM command line, because in some configurations, security practices can indirectly prevent this.

Add Your Comment

* Indicates Required Field

Your email address will not be published.