This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Is my GCP hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In an effort to really drag this topic out (it’s now a trilogy), I’ve taken my previous Azure specific post and also the AWS specific post and decided to do some further research into whether the same is true in Google Cloud Platform (a.k.a GCP).

Previously

(If I was writing this like a true screenwriter, it would get shorter and faster each recap).

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s recap what this is (honestly, please see my Azure post for details, it’s quite technical), and how it impacts a GCP virtual machine running SUSE Linux Enterprise Server 12 (SLES 12), which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

Essentially, GRUB2 does not check/validate some of the input data that it reads.
By carefully crafting input data that overflows the memory area holding it, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device”.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one.

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It can be used to load the main operating system of a computer.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” (BIOS) and UEFI boot, which adds the optional “Secure Boot” feature.
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.

With Secure Boot, certificates are used to secure the boot process chain.
This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!
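
Whether any of this matters depends on how the machine was actually booted. As a minimal sketch, assuming a Linux guest where the kernel exposes /sys/firmware/efi only after a UEFI boot, and where the mokutil tool may or may not be installed, something like the following reports the boot mode and the Secure Boot state. The function names and the optional root argument are my own illustration, not something from the vendor documentation.

```shell
#!/usr/bin/env bash
# Report how this Linux machine was booted.
# Assumption: /sys/firmware/efi exists only when the kernel was booted via UEFI.
# Assumption: mokutil, if installed, can report the Secure Boot state.

boot_mode() {
    # $1 is an optional alternative root directory (hypothetical, used for testing only)
    local root="${1:-}"
    if [ -d "${root}/sys/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "Legacy BIOS"
    fi
}

secure_boot_state() {
    if command -v mokutil >/dev/null 2>&1; then
        # Prints the Secure Boot state; falls back to "unknown" if the query fails
        mokutil --sb-state 2>/dev/null || echo "unknown"
    else
        echo "unknown (mokutil not installed)"
    fi
}

echo "Boot mode:   $(boot_mode)"
echo "Secure Boot: $(secure_boot_state)"
```

If the boot mode comes back as Legacy BIOS, then Secure Boot cannot be in use on that machine.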

But the Attackers Need Root?

Yes, the vulnerability is triggered through a GRUB2 configuration text file (grub.cfg) owned by the root user. Additional text added to the file can cause the buffer overflow.
A malicious bootloader is then hard to remove: anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus.

NOTE: The flaw also exists if you use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

The vulnerable GRUB2 software needs patching.
To be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

How does this impact GCP VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since GCP virtual machines are hosted in a KVM based hypervisor, the “firmware” is actually software.

Whilst looking for details on “Secure Boot” in GCP virtual machines, we come across the Google Compute Engine’s “Shielded VM” option.
You can read about it in detail here.
In brief, in GCP a Shielded VM is deployed using a pre-defined set of Google specific guest operating systems:

As noted above, the documentation specifically mentions that the “firmware” underpinning the virtual machine contains Google’s Certificate Authority (CA) certificate, as the root of the trust chain.
This is important because the Eclypsium description of the vulnerability is specifically citing a problem with the Microsoft CA.
What this means is that Google actually decide on the trust chain themselves and can probably more rapidly adjust the firmware with a new CA certificate.
To reiterate, this is specific to Google specific VM images that you deploy as a Shielded VM.

Another point worth noting is that when creating a Shielded VM, you can enable the vTPM (virtual trusted platform module), which allows integrity monitoring of the boot process. Any change to the boot process and a validation alert is triggered. Whilst this would not prevent compromise, it would at least alert an administrator.
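
To see which of these Shielded VM options are switched on for an existing instance, a sketch along the following lines could be used. It assumes the gcloud CLI is installed and authenticated; the function name, instance name and zone are placeholders of mine.

```shell
#!/usr/bin/env bash
# Show the Shielded VM settings of one Compute Engine instance.
# Assumptions: the gcloud CLI is installed and authenticated; the instance
# name and zone passed in are placeholders for your own values.

shielded_vm_config() {
    local instance="$1" zone="$2"
    # shieldedInstanceConfig holds enableSecureBoot, enableVtpm and
    # enableIntegrityMonitoring for the instance.
    gcloud compute instances describe "$instance" \
        --zone "$zone" \
        --format="yaml(shieldedInstanceConfig)"
}

# Example (hypothetical instance and zone):
# shielded_vm_config my-sles12-vm europe-west2-a
```

An instance that was not created as a Shielded VM would be expected to show no Secure Boot setting enabled here.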

Reading the Google infrastructure security document, we find that just like AWS, Google have designed and are implementing their own security chip called Titan, on the physical hosts. This is used to ensure that physical hosts boot securely, but it is not clear if this chip is used in any way for Shielded VMs booted on the physical host.

If we delve further into the GCP documentation we find that we also have the option to create a custom image for deployment into a Shielded VM.
See the documentation on how to create a custom Shielded VM image:

The above states that you can create your own Secure Boot capable VM image for deployment in GCP as a Shielded VM.
If we read further down that page under section “Default certificates“, we find a slight difference compared to the Google “curated” images:

The above is telling us, by default the standard Microsoft CA certificates are used for the Secure Boot setup of VMs created using a custom image (remember non-custom Secure Boot images use Google’s root CA) in GCP.
When it says “default values”, right now, they are the only values because of a small note further up the page:

OK, so you can only use the defaults for now. The same compromised defaults that will need fixing. 🤷‍♂️

What do we think needs to happen once Google create the ability to replace the certificates?
From reading those previously mentioned documents, I would guess that to rebuild the certificate database used during the creation of the custom Shielded VM image, you are going to need to re-create the VM image and then re-deploy a VM from that image!

The question remains, is SLES 12 supported as a Shielded VM guest-OS on GCP?
According to the Shielded VM page here, it is not by default. You will need to therefore create your own image:

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on GCP virtual machines, using public images, then by default you will not be using the Shielded VM instances, so there is no point patching to fix a vulnerability by which you are not affected.
You are only introducing more risk by patching.

If, however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

On a final closing point, you could be running a custom SLES image deployed in GCP as a Shielded VM. An image that your company has built and which uses Secure Boot. You would be wise to contact your cloud administrators to ensure that they are preparing for a VM rebuild and subsequent patching required to ensure that Secure Boot remains secure.

Is my AWS hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In an effort to spin this story out a little further, I’ve taken my previous Azure specific post and decided to do some further research into whether the same is true in Amazon Web Services (a.k.a AWS).

Previously

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s recap what this is (honestly, please see my other post for details, it’s quite technical), and how it impacts an AWS virtual machine running SUSE Linux Enterprise Server 12 (SLES 12), which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

Essentially, GRUB2 does not check/validate some of the input data that it reads.
By carefully crafting input data that overflows the memory area holding it, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device”.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one.

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It can be used to load the main operating system of a computer.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” (BIOS) and UEFI boot, which adds the optional “Secure Boot” feature.
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.

With Secure Boot, certificates are used to secure the boot process chain.
This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the Attackers Need Root?

Yes, the vulnerability is triggered through a GRUB2 configuration text file (grub.cfg) owned by the root user. Additional text added to the file can cause the buffer overflow.
A malicious bootloader is then hard to remove: anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus.

NOTE: The flaw also exists if you use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

The vulnerable GRUB2 software needs patching.
To be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

How does this impact AWS VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since AWS virtual machines are hosted in a KVM based hypervisor, the “firmware” is actually software.

Whilst looking for details on “Secure Boot” in AWS virtual machines, there is absolutely no mention of it being supported for Linux.
If we dig into the VM import/export documents here on the AWS docs site, we find:

So the above states that for VMs imported/exported, “UEFI/EFI boot partitions are supported only for Windows boot volumes with VHDX as the image format. Otherwise, a VM’s boot volume must use Master Boot Record (MBR) partitions.”
The words “…only for Windows…” are the key part of this.
Because if we scan just a little further down the page, it says that the UEFI boot partitions are actually “supported” for Windows, by being converted to MBR (not Secure Boot compatible):

I feel we can surmise that AWS does not support running Linux VMs with Secure Boot.
Apart from this little gem of information here.
This slide shows that the launch of the AWS Graviton2 chip enables ARM based Linux distributions to support Secure Boot.
We can read the Amazon EC2 User Guide here (updated August 28, 2020), to find that SLES 15 is the only SUSE Linux that supports ARM CPUs on AWS:

So we know that Secure Boot is not available in AWS on any of the SLES x86 operating systems, and SLES 12 on ARM is not supported on Graviton based CPUs.

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on AWS virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability by which you are not affected.
You are only introducing more risk by patching.

If, however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES 12 on AWS, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in AWS on what is known as “Bare Metal” (“High Memory Instances”, a.k.a “*.metal”). These are physical machines using the Nitro based hypervisor. So whilst EC2 Virtual Machines can’t use Secure Boot, these “Bare Metal” machines may well do so through the use of the Nitro Security Chip (see a good deep dive here). You would be wise to contact your AWS account representative to establish if they will be patching the firmware.

Is my Azure hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s look at what this is and how it impacts a Microsoft Azure virtual machine running SUSE Linux Enterprise Server 12 (SLES 12), which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

It is a “Classic Buffer Overflow” vulnerability in the GRUB2 bootloader for versions prior to 2.06.
Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
The input data causes an overflow of the holding memory area into adjacent memory areas.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device”.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one, boot into it and then have further capability to effectively set up camp (a backdoor) on the server.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It is used to load the main operating system of a computer.
Usually on Linux virtual machines, GRUB is used to load Linux. It is possible to install GRUB on machines that then boot into Windows.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” (BIOS) and UEFI boot, which adds the optional “Secure Boot” feature.
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.
This is clearly quite insecure, since any program could put itself at the designated location and then be executed at boot up.

With Secure Boot, certificates are used to secure the boot process chain.
As with any certificate based process, at the top (root) level there needs to exist a certificate which is valid for many years and is ultimately trusted – the Certificate Authority (CA).
The next levels in the chain trust that CA certificate implicitly and if any point in the chain is compromised, then the trust is broken and will need re-establishing with new certificates.
Depending which level of the chain is compromised, will dictate the amount of effort needed to fix it.

This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the Attackers Need Root?

Yes, the vulnerability is triggered through a GRUB2 configuration text file (grub.cfg) owned by the root user. Additional text added to the file can cause the buffer overflow.
Once the attacker has used malware to instigate the overflow, and installed a malicious bootloader, they then have a backdoor to the server, which would be executed every time the server is rebooted.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

NOTE: The flaw also exists if you use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

First off, the vulnerable GRUB2 software needs patching; this is quite easy and will require a reboot of the Linux O/S.
The problem with patching just GRUB2, is that it is still possible for an attacker with root to re-install a vulnerable version of GRUB2 and then use that vulnerable version to compromise the system further.
Remember, the chain of trust is still trusting that vulnerable version of GRUB2.
Therefore, to be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

The key point is that you *could* start patching, but if it were me, I would be tempted to wait until the SUSE “shim” has been updated with the new chain certificate, patch GRUB2 and then update the “dbx”.
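
Before deciding when to patch, it helps to know what is currently installed and what SUSE is offering. The following is only a sketch: it assumes a SLES system where the packages are named grub2 and shim, and a registered update repository whose patches carry CVE numbers. The function names are mine.

```shell
#!/usr/bin/env bash
# Inspect the boot chain packages and any pending patches for BootHole.
# Assumptions: a SLES system where the packages are named grub2 and shim,
# and a registered update repository that tags patches with CVE numbers.

# Show the installed versions of the bootloader and the shim.
boot_package_versions() {
    rpm -q grub2 shim
}

# List the patches that are still needed for a given CVE (defaults to BootHole).
pending_cve_patches() {
    local cve="${1:-CVE-2020-10713}"
    zypper list-patches --cve="$cve"
}

# Example:
# boot_package_versions
# pending_cve_patches CVE-2020-10713
```

This only lists what is available; it does not install anything, so it is safe to run ahead of a patching window.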

How does this impact Azure VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since Microsoft Azure is using the Hyper-V hypervisor, the “firmware” is actually software in Hyper-V.
See here, which says: “Secure Boot or UEFI firmware isn’t required on the physical Hyper-V host. Hyper-V provides virtual firmware to virtual machines that is independent of what’s on the Hyper-V host.”

So the above would indicate that the Virtual Machine contains the necessary code from Hyper-V.
I would imagine that this is included at VM creation time.

If we dig into the VM details a little bit here on the Microsoft sites, we find:

So the above states that “…generation 2 VMs in Azure do not support Secure Boot…”.
The words “…in Azure…” are the key part of this.

OK, then how about Hyper-V in general (on-premises):

The above states “To Secure Boot generation 2 Linux virtual machines, you need to choose the UEFI CA Secure Boot template when you create the virtual machine.”
BUT this is for Hyper-V in general, not for Azure virtual machines.

So we know that Secure Boot is not available in Azure on any of the generation 1 or generation 2 VMs (as of writing there are only 2).

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on Azure virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability by which you are not affected.
You are only introducing more risk by patching.

If, however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES on Azure, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in Azure on what is known as “HANA Large Instances” (HLI). These are physical machines. So whilst Virtual Machines can’t use Secure Boot, these physical machines may well do so. You would be wise to contact your Microsoft account representative to establish if they will be patching the firmware.

SAP ASE Error – Process No Longer Found After Startup

This post is about a strange issue I was hitting during the configuration of SAP LaMa 3.0 to start/stop a SAP ABAP 7.52 system (with Kernel 7.53) that was running with a SAP ASE 16.0 database.

During the LaMa start task, the task would fail with an error message: “ASE process no longer found after startup. (fault code: 127)”.

When I logged directly onto the SAP server Linux host, I could see that the database had indeed started up, eventually.
So what was causing the failure?

The Investigation

At first I thought this was related to the Kernel, but having checked the versions of the Kernel components, I found that they were the same as another SAP system that was starting up perfectly fine using the exact same LaMa system.

The next check I did was to turn on tracing on the hostagent itself. This is a simple task of putting the trace value to “3” in the host_profile of the hostagent and restarting it:

service/trace = 3

The trace output is shown in a number of different trace files in the work directory of the hostagent but the trace file we were interested in is called dev_sapdbctrl.

The developer trace file for the sapdbctrl binary executable is important, because the sapdbctrl binary is executed by SAP hostagent (saphostexec) to perform the database start. If you observe the contents of the sapdbctrl trace output, you will see that it loads the Sybase specific shared library which contains the required code to start/stop the ASE database.

The same sapdbctrl also contains the ability to load the required libraries for other database systems.
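
As a small sketch of the tracing steps above, the following checks which trace level is set and pulls the error lines out of the sapdbctrl trace. It assumes the default hostagent locations /usr/sap/hostctrl/exe/host_profile and /usr/sap/hostctrl/work, which may differ on your installation; the function and variable names are my own.

```shell
#!/usr/bin/env bash
# Read the hostagent trace level and pick errors out of the sapdbctrl trace.
# Assumptions: default hostagent paths (override the two variables if yours differ).
# After editing host_profile, the hostagent can be restarted with:
#   /usr/sap/hostctrl/exe/saphostexec -restart

HOST_PROFILE="${HOST_PROFILE:-/usr/sap/hostctrl/exe/host_profile}"
TRACE_FILE="${TRACE_FILE:-/usr/sap/hostctrl/work/dev_sapdbctrl}"

# Print the value of service/trace, or "not set" when the line is absent.
hostagent_trace_level() {
    local value
    value="$(sed -n 's|^[[:space:]]*service/trace[[:space:]]*=[[:space:]]*||p' "$HOST_PROFILE" 2>/dev/null | head -n 1)"
    echo "${value:-not set}"
}

# Print only the lines mentioning an error from the sapdbctrl trace file.
sapdbctrl_errors() {
    grep -i 'error' "$TRACE_FILE"
}

# Example:
# hostagent_trace_level
# sapdbctrl_errors
```

Remember to set the trace level back afterwards, since level 3 produces a lot of output.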

As a side note, it is still not known to me, how the Sybase shared library comes to exist in the hostagent executable directory. When SAP ASE is patched, this library must also be patched, otherwise how does the hostagent stay in-step with the ASE database that it needs to talk with?

Once tracing was turned on, I shut the SAP ASE instance down again and then used SAP LaMa to initiate the SAP system start once again.
Sure enough, the LaMa start task failed again.

Looking in the trace file dev_sapdbctrl I could see the same error message that I was seeing in SAP LaMa:

Error: Command execution failed. : ASE process no longer found after startup. 
(fault code: 127) Operation ID: 000D3A3862631EEAAEDDA232BE624001
----- Log messages ---- 
Info: saphostcontrol: Executing StartDatabase 
Error: sapdbctrl: ASE process no longer found after startup. 
Error: saphostcontrol: StartDatabase failed (sapdbctrl exit code = 1)

This was great. It confirmed that SAP LaMa was just seeing the symptom of some other issue, since LaMa just calls the hostagent to do the start.

Now I knew the hostagent was seeing the error, I tried using the hostagent directly to perform the start, using the following:

/usr/sap/hostctrl/exe/saphostctrl -debug -function StartDatabase -dbname <SID> -dbtype syb -dbhost <the-ASE-host>

NOTE: The hostagent “-debug” command line option puts out the same information without the need for the hostagent tracing to be turned on in the host_profile.
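
When repeating this test several times, it is handy to capture the exit code and how long the call took. A minimal sketch, reusing the same saphostctrl call as above; the wrapper function and the overridable SAPHOSTCTRL variable are conveniences of mine, not part of the hostagent.

```shell
#!/usr/bin/env bash
# Run the hostagent database start and report exit code and elapsed time.
# Assumption: SAPHOSTCTRL can be overridden; that is only a convenience of this sketch.

SAPHOSTCTRL="${SAPHOSTCTRL:-/usr/sap/hostctrl/exe/saphostctrl}"

timed_start_database() {
    local sid="$1" dbhost="$2"
    local started rc
    started="$(date +%s)"
    "$SAPHOSTCTRL" -function StartDatabase -dbname "$sid" -dbtype syb -dbhost "$dbhost"
    rc=$?
    # Whole seconds are precise enough to spot a timeout-sized delay
    echo "exit code: ${rc}, elapsed: $(( $(date +%s) - started ))s"
    return "$rc"
}

# Example (placeholders for your own SID and host):
# timed_start_database <SID> <the-ASE-host>
```

A failure that always arrives after roughly the same number of seconds is a hint that some timeout is involved.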

Once again, the start process failed and the same error message was present in the dev_sapdbctrl trace file.

This was now really strange.
I decided that the next course of action was to start the process of raising the issue with SAP via an incident.
If you suspect that something could take a while to fix, then it’s always best to raise it with SAP early and continue to look at the issue in parallel.

Continuing the Diagnosis

While the SAP incident was in progress, I continued the process of trying to self-diagnose the issue further.
I tried a couple more things such as:

  • Starting and stopping SAP ASE manually using stopdb/startdb commands.
  • Restarting the whole server (this step has a place in every troubleshooting process, eventually).
  • Checking the server patch level.
  • Checking the environment of the Linux user, the shell, the profile files, the O/S limits applied.
  • Checking what happens if McAfee anti-virus was disabled (I’ve seen the ePO blocking processes before).

Eventually exhaustion set in and I decided to give the SAP support processor time to get back to me with some hints.

Some Sleep

I spend a lot of time solving SAP problems. A lot of time.
Something doesn’t work according to the docs, something did work but has stopped working, something has never worked well…
It builds up in your mind and you carry this stuff around in your head.
Subconsciously you think about these problems.

Then, at about 3am when you can’t get back to sleep, you have a revelation.
The hostagent is forking the process to start the database as the syb<sid> Linux user (it uses “su”), from the root user (hostagent runs as the root user).

Linux Domain Users

The revelation I had regarding the forking of the user was just the trigger I needed to make me consider the way the Linux authentication was set up on this specific server with the problem ASE startup.

I remembered at the beginning of the project that I had hit an issue with the SSSD Linux daemon, which is responsible for interfacing between Linux and Microsoft Active Directory. At that time, the issue was causing the hostagent to hang when operations were executed which required a switch to another Linux user.
This previous issue was actually a hostagent issue that was fixed in a later hostagent patch. During that particular investigation, I requested that the Linux team re-configure the SSSD daemon to be more efficient with its Active Directory traversals, when it was looking to see if the desired user account was local to Linux or if it was a domain account.

With this previous issue in mind, I checked the SSSD configuration on the problem server. This is a simple conf file in /etc/sssd.

The Solution

After all the troubleshooting, the raising of the incident, the sleeping, I had finally got to the solution.

After checking the SSSD daemon configuration file /etc/sssd/sssd.conf, I could clearly see that there was one entry missing compared to the other servers that didn’t experience the SAP ASE start error.

The parameter: “subdomain_enumerate = none” was missing.
Looking at the manual page for SSSD it would seem that without this parameter there is additional overhead during any Active Directory traversal.

I set the parameter accordingly in the /etc/sssd/sssd.conf file and restarted the SSSD daemon using:

service sssd restart
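
To compare servers quickly, a small check like the one below can print the value of a parameter from the SSSD configuration file. This is a sketch of mine: it assumes a simple “key = value” line that is not commented out, and it needs to run as root because sssd.conf is normally readable by root only.

```shell
#!/usr/bin/env bash
# Print the value of one parameter from the SSSD configuration file.
# Assumption: the parameter appears as an uncommented "key = value" line.
# Prints nothing if the parameter is missing.

sssd_param() {
    local key="$1" file="${2:-/etc/sssd/sssd.conf}"
    # Strip everything up to and including the "=", print the first match only
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1
}

# Example (run as root):
# sssd_param subdomain_enumerate
```

On the problem server this would have printed nothing, whereas the healthy servers would have printed “none”.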

Then I retried the start of the database using the hostagent command shown previously.
It worked!
I then retried with SAP LaMa and that also now started ASE without error messages.

Root Cause

What seems to have been happening is that some sort of internal pre-set timeout in the sapdbctrl binary was being hit, at which point sapdbctrl just gave up and threw the error that I was seeing. This left the ASE database to continue and start (the process had been initiated), but to the hostagent it looked like it had failed to start.
By adding the “subdomain_enumerate = none” parameter, any “delay” caused by an inappropriate call to Active Directory was massively reduced and subsequent start activities were successful.
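
One way to put a number on that “delay” is to time how long a user lookup takes, since the lookup goes through the same name service path as the switch to the syb<sid> user. This is a rough sketch of mine, assuming GNU date (for nanosecond timestamps) and the standard id command.

```shell
#!/usr/bin/env bash
# Measure how long it takes to resolve a Linux user, in milliseconds.
# Assumptions: GNU date (supports %N) and the standard id command.

user_lookup_ms() {
    local user="$1" start end
    start="$(date +%s%N)"
    # id resolves the user and its groups through the name service (and so through SSSD)
    id "$user" >/dev/null 2>&1
    end="$(date +%s%N)"
    echo $(( (end - start) / 1000000 ))
}

# Example (replace with your own syb<sid> user):
# user_lookup_ms syb<sid>
```

Running it before and after the SSSD change, and on a server that does not have the problem, gives a simple comparison. Bear in mind that SSSD caches results, so the first lookup after a restart is the interesting one.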

Azure Disk Cache Settings for an SAP Database on Linux

One of your go-live tasks, once you have built a VM in Azure, should be to ensure that the Azure disk cache settings on the Linux VM data disks are set correctly in accordance with the Microsoft recommended settings.
In this post I explain the disk cache options and how they apply to SAP and especially to SAP databases such as SAP ASE and SAP HANA, to ensure you get optimum performance.

What Are the Azure Disk Cache Settings?

In Microsoft Azure you can configure different disk cache settings on data disks that are attached to a VM.
NOTE: You do not need to consider changing the O/S root disk cache settings, as by default they are applied as per the Azure recommendations.

Only specific VMs and specific disks (Standard or Premium Storage) have the ability to use caching.
If you use Azure Standard storage, the cache is provided by local disks on the physical server hosting your Linux VM.
If you use Azure Premium storage, the cache is provided by a combination of RAM and local SSD on the physical server hosting your Linux VM.

There are 3 different Azure disk cache settings:

  • None
  • ReadOnly (or “read-only”)
  • ReadWrite (or “read/write”)

The cache settings can influence the performance and also the consistency of the data written to the Azure storage service where your data disks are stored.

Cache Setting: None

By specifying “None” as the cache setting, no caching is used and a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache, will be read from the storage service.

Cache Setting: ReadOnly

By specifying “ReadOnly” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache will be served from the read cache on the underlying physical machine where possible, and only read from the storage service if the data is not in that cache.

Cache Setting: ReadWrite

By specifying “ReadWrite” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the cache on the underlying physical machine.
All read operations for data not already in the VM O/S file system cache will be served from the cache on the underlying physical machine where possible, and only read from the storage service if the data is not in that cache.
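
The three settings can be summarised as a small lookup. The Python sketch below is purely illustrative: the names CACHE_BEHAVIOUR, write_ack_point, first_read_source and write_survives_host_cache_loss are made up for this post and are not part of any Azure SDK. It simply encodes the behaviour described above.

```python
# Illustrative model of the three Azure disk cache settings described above.
# All names here are hypothetical; this is not an Azure API.

CACHE_BEHAVIOUR = {
    # setting: (where a write is acknowledged, where a read is tried first)
    "None": ("storage service", "storage service"),
    "ReadOnly": ("storage service", "host cache"),
    "ReadWrite": ("host cache", "host cache"),
}


def write_ack_point(setting: str) -> str:
    """Return where a write is confirmed as completed for a cache setting."""
    return CACHE_BEHAVIOUR[setting][0]


def first_read_source(setting: str) -> str:
    """Return where a read (not already in the O/S file system cache) is tried first."""
    return CACHE_BEHAVIOUR[setting][1]


def write_survives_host_cache_loss(setting: str) -> bool:
    """True if an acknowledged write is already in the storage service."""
    return write_ack_point(setting) == "storage service"


if __name__ == "__main__":
    for setting in CACHE_BEHAVIOUR:
        print(setting, "| write ack:", write_ack_point(setting),
              "| first read:", first_read_source(setting))
```

The last function is the one that matters for databases: only “ReadWrite” acknowledges a write before it has reached the storage service.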

Where Do We Configure the Disk Cache Settings?

The disk cache settings are configured in Azure against the VM (in the Disks settings), since the disk cache is both physical host and VM series dependent. It is *not* configured against the disk resource itself, as explained in my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash
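
To illustrate that the setting lives on the VM, here is a minimal Python sketch that reads the data disk cache settings out of a VM definition in JSON form, such as you might save from “az vm show”. The sample JSON is heavily trimmed and the VM and disk names are invented. The sketch assumes the storageProfile.dataDisks layout with the lun, name, caching and writeAcceleratorEnabled fields, and the function name is my own.

```python
import json

# Trimmed, invented sample of a VM definition; real output has many more fields.
SAMPLE_VM_JSON = """
{
  "name": "examplevm01",
  "storageProfile": {
    "dataDisks": [
      {"lun": 1, "name": "examplevm01-log01", "caching": "None", "writeAcceleratorEnabled": true},
      {"lun": 0, "name": "examplevm01-data01", "caching": "ReadOnly", "writeAcceleratorEnabled": false}
    ]
  }
}
"""


def list_data_disk_caching(vm_json: str) -> list[tuple[int, str, str, bool]]:
    """Return (lun, name, caching, write accelerator) per data disk, sorted by LUN."""
    vm = json.loads(vm_json)
    disks = vm.get("storageProfile", {}).get("dataDisks", [])
    rows = [
        (
            d["lun"],
            d["name"],
            # Assumption: a missing "caching" field is treated as "None".
            d.get("caching", "None"),
            bool(d.get("writeAcceleratorEnabled")),
        )
        for d in disks
    ]
    return sorted(rows)


if __name__ == "__main__":
    for lun, name, caching, wa in list_data_disk_caching(SAMPLE_VM_JSON):
        print(f"LUN {lun}: {name} caching={caching} writeAccelerator={wa}")
```

Note that the cache setting is read from the VM's own storage profile entry for each attached disk, not from the disk resource.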

Any Recommendations for Disk Cache Settings?

There are specific recommendations for Azure disk cache settings, especially when running SAP and especially when running databases like SAP ASE or SAP HANA.

In general, the rules are:

Disk Usage | Azure Disk Cache Setting
Root O/S disk (/) | ReadWrite – ALWAYS!
HANA Shared | ReadOnly
ASE Home (/sybase/<SID>) | ReadOnly
Database Data | HANA=None, ASE=ReadOnly
Database Log | None
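
If you want to validate a landscape against these rules, the table can be expressed as a lookup. The following is a minimal Python sketch, assuming you tag each data disk yourself with a usage label (“shared”, “home”, “data” or “log”). Those labels, the disk names and the function name are hypothetical and are not something Azure provides. The root O/S disk is left out because its setting is applied by default.

```python
# Recommended cache settings from the table above, keyed by (database, disk usage).
# The usage labels are hypothetical tags you would assign to each disk yourself.
RECOMMENDED = {
    ("HANA", "shared"): "ReadOnly",
    ("HANA", "data"): "None",
    ("HANA", "log"): "None",
    ("ASE", "home"): "ReadOnly",
    ("ASE", "data"): "ReadOnly",
    ("ASE", "log"): "None",
}


def find_mismatches(database: str, disks: list[dict]) -> list[tuple[str, str, str]]:
    """Return (disk name, actual setting, expected setting) for each mismatch."""
    mismatches = []
    for disk in disks:
        expected = RECOMMENDED[(database, disk["usage"])]
        if disk["caching"] != expected:
            mismatches.append((disk["name"], disk["caching"], expected))
    return mismatches


if __name__ == "__main__":
    # Invented example: an ASE VM where the log disk was left on ReadOnly.
    ase_disks = [
        {"name": "ase-home", "usage": "home", "caching": "ReadOnly"},
        {"name": "ase-data01", "usage": "data", "caching": "ReadOnly"},
        {"name": "ase-log01", "usage": "log", "caching": "ReadOnly"},
    ]
    for name, actual, expected in find_mismatches("ASE", ase_disks):
        print(f"{name}: caching is {actual}, recommended {expected}")
```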

The above settings for SAP ASE have been obtained from SAP note 2367194 (SQL Server is the same as ASE) and from the general deployment guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general
The use of write caching on the ASE home is optional. You could choose ReadOnly, which would help protect the ASE config file in a very specific scenario. It is envisaged that when using ASE 16.0 with SRS/HADR you would have a separate data disk for the Replication Server data (I’ll talk about this in another post).

The above settings for HANA have been taken from the updated guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage which is designed to meet the KPIs mentioned in SAP note 2762990.

The reason for not using a write cache every time is that an issue at the physical host level, affecting the cache, could cause the application (e.g. the database) to think it has committed data when that data has not actually been written to disk. This is not good for databases, especially if the issue affects the transaction/redo log area. Data loss could occur.

It’s worth noting that this cache “issue” has always been true of every caching technology ever created on which databases run. Storage tech vendors try to mitigate this by putting batteries into the storage appliances, but since the write cache in Azure is at the physical host level, there’s just no guarantee that when the VM O/S thinks the write operation has committed to disk, it has actually been written to disk.

How About Write Accelerator?

There are specific Azure VM series (currently the M-series) that support something known as “Write Accelerator”.
This is an extra VM level setting for Premium Storage disks attached to M-series VMs.

Enabling the Write Accelerator setting is a Microsoft requirement for production SAP HANA transaction log disks on M-Series VMs. This setting enables the Azure VM to meet the SAP HANA key performance indicators in note 2762990. Azure Write Accelerator is designed to provide lower latency write times on Premium Storage.

You should ensure that the Write Accelerator setting is enabled where appropriate, for your HANA database transaction log disks. You can check if it is enabled following my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

I’ve tried my best to find more detailed information on how the Write Accelerator feature is actually provided, but unfortunately it seems very elusive. Robert Boban (of Microsoft) commented on a LinkedIn post here: “It is special caching impl. for M-Series VM to fulfill SAP HANA req. for <1ms latency between VM and storage layer.”

Check the IOPS

Once you have configured your disks and the cache settings, you should ensure that you test the IOPS achieved using the Microsoft recommended process.
You can follow similar steps as my previous post: Recreating SAP ASE Database I/O Workload using Fio on Azure

As mentioned in other places in the Microsoft documentation and in SAP notes such as 2367194, you need to choose the correct size and series of VM, so that the VM maximum IOPS is aligned with the intended number of data disks and their combined IOPS maximum. Otherwise you could hit the VM maximum IOPS before you get anywhere near the disk IOPS maximum.
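
The arithmetic behind that warning is simple enough to sketch in Python. The function names are my own and the figures in the example are illustrative only (not taken from any specific VM or disk SKU), so check the published limits for your VM size and disk tier. The sketch also assumes the I/O is spread evenly across the disks, for example by striping.

```python
def effective_iops_ceiling(vm_max_iops: int, disk_max_iops: list[int]) -> int:
    """Return the lower of the VM IOPS limit and the combined disk IOPS limit."""
    return min(vm_max_iops, sum(disk_max_iops))


def limiting_factor(vm_max_iops: int, disk_max_iops: list[int]) -> str:
    """Say which limit is reached first: the VM or the disks."""
    return "VM" if vm_max_iops < sum(disk_max_iops) else "disks"


if __name__ == "__main__":
    # Illustrative figures only: four data disks of 5000 IOPS each
    # attached to a VM that is limited to 12800 IOPS.
    vm_limit = 12800
    disks = [5000, 5000, 5000, 5000]
    print("Combined disk IOPS:", sum(disks))
    print("Effective ceiling:", effective_iops_ceiling(vm_limit, disks))
    print("Limited by:", limiting_factor(vm_limit, disks))
```

In this made-up case the disks could deliver more than the VM allows, so adding further disks would gain nothing without moving to a larger VM size.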

Enable Accelerated Networking

Since the storage is itself connected to your VM via the network, you should ensure that Accelerated Networking is enabled in your VM’s Network Settings.

Checking Cache Settings Directly on the VM

As per my previous post Checking Azure Disk Cache Settings on a Linux VM in Shell, you can actually check the Azure disk cache settings on the VM itself. You can do it manually, or write a script (better option for whole landscape validation).

Summary:

I discussed the two types of storage (standard or premium) that offer disk caching, plus where in Azure you need to change the setting.
The table provided a list of cache settings for both SAP ASE and SAP HANA databases and their data disk areas, based on available best-practices.

I mentioned Write Accelerator for HANA transaction log disks and ensuring that you enable Accelerated Networking.
Also provided was a link to my previous post about running a check of IOPS for your data disks, as recommended by Microsoft as part of your go-live checks.

A final mention was made of another post of mine, with a great way of checking the disk cache settings across the VMs in the landscape.

Useful Links:

Windows File Cache

https://docs.microsoft.com/en-us/azure/virtual-machines/linux/premium-storage-performance

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/how-to-enable-write-accelerator

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage#production-storage-solution-with-azure-write-accelerator-for-azure-m-series-virtual-machines

https://petri.com/digging-into-azure-vm-disk-performance-features

https://techcommunity.microsoft.com/t5/running-sap-applications-on-the/sap-on-azure-general-update-march-2019/ba-p/377456

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage

SAP Note 2762990 – How to interpret the report of HWCCT File System Test

SAP Note 2367194 – Use of Azure Premium SSD Storage for SAP DBMS Instance