This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Is my GCP hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In an effort to really drag this topic out (it’s now a trilogy), I’ve taken my previous Azure specific post and also the AWS specific post and decided to do some further research into whether the same is true in Google Cloud Platform (a.k.a GCP).

Previously

(If I was writing this like a true screenwriter, it would get shorter and faster each recap).

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s recap what this is (honestly, please see my Azure post for details, it’s quite technical), and how it impacts a GCP virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device“.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one.

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It can be used to load the main operating system of a computer.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” and “Secure Boot” (a.k.a UEFI boot).
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.

With Secure Boot, certificates are used to secure the boot process chain.
This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the attackers Need Root?

Yes, the vulnerability is in a GRUB2 configuration text file owned by the root user. Additional text added to the file can cause the buffer overflow.
Anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus.

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

The vulnerable GRUB2 software needs patching.
To be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

How does this impact GCP VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since GCP virtual machines are hosted in a KVM based hypervisor, the “firmware” is actually software.

Whilst looking for details on “Secure Boot” in GCP virtual machines, we come across the Google Compute Engine’s “Shielded VM” option.
You can read about it in detail here.
In brief, in GCP a Shielded VM is deployed using a pre-defined set of Google specific guest operating systems:

As noted above, the documentation specifically mentions that the “firmware” underpinning the virtual machine contains Google’s Certificate Authority (CA) certificate, as the root of the trust chain.
This is important because the Eclypsium description of the vulnerability is specifically citing a problem with the Microsoft CA.
What this means is that Google actually decide on the trust chain themselves and can probably more rapidly adjust the firmware with a new CA certificate.
To reiterate, this is specific to Google specific VM images that you deploy as a Shielded VM.

Another point worth noting is that when creating a Shielded VM, you can enable the vTPM (virtual trusted platform module), which allows integrity monitoring of the boot process. Any change to the boot process and a validation alert is triggered. Whilst this would not prevent compromise, it would at least alert an administrator.

Reading the Google infrastructure security document, we find that just like AWS, Google have designed and are implementing their own security chip called Titan, on the physical hosts. This is used to ensure that physical hosts boot securely, but it is not clear if this chip is used in anyway for Shielded VMs booted on the physical host.

If we delve further into the GCP documentation we find that we also have the option to create a custom image for deployment into a Shielded VM.
See the documentation on how to create a custom Shielded VM image:

The above states that you can create your own Secure Boot capable VM image for deployment in GCP as a Shielded VM.
If we read further down that page under section “Default certificates“, we find a slight difference compared to the Google “curated” images:

The above is telling us, by default the standard Microsoft CA certificates are used for the Secure Boot setup of VMs created using a custom image (remember non-custom Secure Boot images use Google’s root CA) in GCP.
When it says “default values”, right now, they are the only values because of a small note further up the page:

OK, so you can only use the defaults for now. The same compromised defaults that will need fixing. 🤷‍♂️

What do we think needs to happen once Google create the ability to replace the certificates?
From reading those previously mentioned documents, I would guess that to rebuild the certificate database used during the creation of the custom Shielded VM image, you are going to need to re-create the VM image and then re-deploy a VM from that image!

The question remains, is SLES 12 supported as a Shielded VM guest-OS on GCP?
According to the Shielded VM page here, it is not by default. You will need to therefore create your own image:

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on GCP virtual machines, using public images, then by default you will not being using the Shielded VM instances, so there is no point patching to fix a vulnerability for which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

On a final closing point, you could be running a custom SLES image deployed in GCP as a Shielded VM. An image that your company has built and which uses Secure Boot. You would be wise to contact your cloud administrators to ensure that they are preparing for a VM rebuild and subsequent patching required to ensure that Secure Boot remains secure.

Useful Links:

Is my AWS hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In an effort to spin this story out a little further, I’ve taken my previous Azure specific post and decided to do some further research into whether the same is true in Amazon Web Services (a.k.a AWS).

Previously

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s recap what this is (honestly, please see my other post for details, it’s quite technical), and how it impacts an AWS virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device“.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one.

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It can be used to load the main operating system of a computer.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” and “Secure Boot” (a.k.a UEFI boot).
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.

With Secure Boot, certificates are used to secure the boot process chain.
This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the attackers Need Root?

Yes, the vulnerability is in a GRUB2 configuration text file owned by the root user. Additional text added to the file can cause the buffer overflow.
Anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus.

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

The vulnerable GRUB2 software needs patching.
To be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

How does this impact AWS VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since AWS virtual machines are hosted in a KVM based hypervisor, the “firmware” is actually software.

Whilst looking for details on “Secure Boot” in AWS virtual machines, there is absolutely no mention of it being supported for Linux.
If we dig into the the VM import/export documents here on the AWS docs site, we find:

So the above states that for VMs imported/exported, “UEFI/EFI boot partitions are supported only for Windows boot volumes with VHDX as the image format. Otherwise, a VM’s boot volume must use Master Boot Record (MBR) partitions.“.
The words “…only for Windows…” are the key part of this.
Because if we scan just a little further down the page, it says that the UEFI boot partitions are actually “supported” for Windows, by being converted to MBR (not Secure Boot compatible):

I feel we can surmise that AWS does not support running Linux VMs with Secure Boot.
Apart from this little gem of information here.
This slide shows that the launch of the AWS Graviton2 chip enables ARM based Linux distributions to support Secure Boot.
We can read the Amazon EC2 User Guide here (updated August 28, 2020), to find that SLES 15 is the only SUSE Linux that supports ARM cpus on AWS:

So we know that Secure Boot is not available in AWS on any of the SLES x86 operating systems, and SLES 12 on ARM is not supported on Graviton based cpus.

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on AWS virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability for which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES 12 on AWS, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in AWS on what is known as “Bare Metal” (“High Memory Instances” or a.k.a “*.metal”). These are physical machines using the Nitro based hyper-visor. So whilst EC2 Virtual Machines can’t use Secure Boot, these “Bare Metal” machines may well do so through the use of the Nitro Security Chip (see a good deep dive here). You would be wise to contact your AWS account representative to establish if they will be patching the firmware.

Useful Links:

Is my Azure hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s look at what this is and how it impacts a Microsoft Azure virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

It is a “Classic Buffer Overflow” vulnerability in the GRUB2 bootloader for versions prior to 2.06.
Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
The input data causes an overflow of the holding memory area into adjacent memory areas.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device“.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one, boot into it and then have further capability to effectively set up camp (a backdoor) on the server.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It is used to load the main operating system of a computer.
Usually on Linux virtual machines, GRUB is used to load Linux. It is possible to install GRUB on machines that then boot into Windows.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” and “Secure Boot” (a.k.a UEFI boot).
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.
This is clearly quite insecure, since any program could put itself at the designated location and then be executed at boot up.

With Secure Boot, certificates are used to secure the boot process chain.
As with any certificate based process, at the top (root) level there needs to exist a certificate which is valid for many years and is ultimately trusted – the Certificate Authority (CA).
The next levels in the chain trust that CA certificate implicitly and if any point in the chain is compromised, then the trust is broken and will need re-establishing with new certificates.
Depending which level of the chain is compromised, will dictate the amount of effort needed to fix it.

This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the attackers Need Root?

Yes, the vulnerability is in a GRUB2 configuration text file owned by the root user. Additional text added to the file can cause the buffer overflow.
Once the attacker has used malware to instigate the overflow, and installed a malicious bootloader, they then have a backdoor to the server, which would be executed every time the server is rebooted.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

First off, the vulnerable GRUB2 software needs patching; this is quite easy and will require a reboot of the Linux O/S.
The problem with patching just GRUB2, is that it is still possible for an attacker with root to re-install a vulnerable version of GRUB2 and then use that vulnerable version to compromise the system further.
Remember, the chain of trust is still trusting that vulnerable version of GRUB2.
Therefore, to be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

They key point is that you *could* start patching, but if it were me, I would be tempted to wait until the SUSE “shim” has been updated with the new chain certificate, patch GRUB2 and then update the “dbx”.

How does this impact Azure VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since Microsoft Azure is using the Hyper-V hypervisor, the “firmware” is actually software in Hyper-v.
See here, which says: “Secure Boot or UEFI firmware isn’t required on the physical Hyper-V host. Hyper-V provides virtual firmware to virtual machines that is independent of what’s on the Hyper-V host.

So the above would indicate that the Virtual Machine contains the necessary code from Hyper-V.
I would imagine that this is included at VM creation time.

If we dig into the VM details a little bit here on the Microsoft sites, we find:

So the above states that “…generation 2 VMs in Azure do not support Secure Boot…“.
The words “…in Azure…” are the key part of this.

OK, then how about Hyper-V in general (on-premise):

The above states “To Secure Boot generation 2 Linux virtual machines, you need to choose the UEFI CA Secure Boot template when you create the virtual machine.“.
BUT this is for Hyper-V in general, not for Azure virtual machines.

So we know that Secure Boot is not available in Azure on any of the generation 1 or generation 2 VMs (as of writing there are only 2).

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on Azure virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability for which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES on Azure, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in Azure on what is known as “HANA Large Instances” (HLI). These are physical machines. So whilst Virtual Machines can’t use Secure Boot, these physical machines may well do so. You would be wise to contact your Microsoft account representative to establish if they will be patching the firmware.

Useful Links:

Checking Azure Disk Cache Settings on a Linux VM in Shell

In a previous blog post, I ended the post by showing how you can use the Azure Enhanced Monitoring for Linux to obtain the disk cache settings.
Except, as we found, it doesn’t easily allow you to relate the Linux O/S disk device names and volume groups, to the Azure data disk names.

You can read the previous post here: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

In this short post, I pick up where I left off and outline a method that will allow you to correlate the O/S volume group name, with the Linux O/S disk devices and correlate those Linux disk devices with the Azure data disk names, and finally, the Azure data disks with their disk cache settings.

Using the method I will show you, you will see how easily you can verify that the disk cache settings are consistent for all disks that make up a single volume group (very important), and also be able to easily associate those volume groups with the type of usage of the underlying Azure disks (e.g. is it for database data, logs or executable binaries).

1. Check If AEM Is Installed

Our first step is to check if the Azure Enhanced Monitoring for Linux (AEM) extension is installed on the Azure VM.
This extension is required, for your VM to be supported by SAP.

We use standard Linux command line to check for the extension on the VM:

ls -1 /var/lib/waagent/Microsoft.OSTCExtensions.AzureEnhancedMonitorForLinux-*/config/0.settings

The listing should return at least 1 file called “0.settings”.
If you don’t have this and you don’t have a directory starting with “Microsoft.OSTCExtensions.AzureEnhancedMonitorForLinux-“, then you don’t have AEM and you should get it installed following standard Microsoft documentation.

2. Get the Number of Disks Known to AEM

We need to know how many disks AEM knows about:

grep -c 'disk;Caching;' /var/lib/AzureEnhancedMonitor/PerfCounters

3. Get the Number of SCSI Disks Known to Linux

We need to know how many disks Linux knows about (we exclude the root disk /dev/sda):

lsscsi --size --size | grep -cv '/dev/sda'

4. Compare Disk Counts

Compare the disks quantity from AEM and from Linux.  They should be the same.  This is the number of data disks attached to the VM.

If you have a lower number from the AEM PerfCounters file, then you may be suffering the effects of an Azure bug in the AEM extension which is unable to handle more than 9 data disks.
Do you have more than 9 data disks?

At this point if you do not have matching numbers, then you will not be able to continue, as the AEM output is vital in the next steps.

Mapping Disks to the Cache Settings

Once we know our AEM PerfCounters file contains all our data disks, we are now ready to map the physical volumes (on our disk devices) to the cache settings. On the Linux VM:

pvs -o "pv_name,vg_name" --separator=' ' --noheadings

Your output should be a list of disks and their volume groups like so (based on our diagram earlier in the post):

/dev/sdc vg_data
/dev/sdd vg_data

Next we look for a line in the AEM PerfCounters file that contains that disk device name, to get the cache setting:

awk -F';' '/;disk;Caching;/ { sub(/\/dev\//,"",$4); printf "/dev/%s %s\n", tolower($4), tolower($6) }' /var/lib/AzureEnhancedMonitor/PerfCounters

The output will be the Linux disk device name and the Azure data disk cache setting:

/dev/sdc none
/dev/sdd none

For each line of disks from the cache setting, we can now see what volume group it belongs to.
Example: /dev/sdc is vg_data and the disk in Azure has a cache setting of “none”.

If there are multiple disks in the volume group, they all must have the same cache setting applied!

Finally, we look for the device name in the PerfCounters file again, to get the name of the Azure disk:

NOTE: Below is looking specifically for “sdc”.

awk -F';' '/;Phys. Disc to Storage Mapping;sdc;/ { print $6 }' /var/lib/AzureEnhancedMonitor/PerfCounters

The output will be like so:

None sapserver01-datadisk1
None sapserver01-datadisk2

We can ignore the first column output (“None”) in the above, it’s not needed.

Summary

If you package the AEM disk count check and the subsequent AEM PerfCounters AWK scripts into one neat script with the required loops, then you can get the output similar to this, in one call:

/dev/sdd none vg_data sapserver01-datadisk2
/dev/sdc none vg_data sapserver01-datadisk1
/dev/sda readwrite

Based on the above output, I can see that my vg_data volume group disks (sdc & sdd) all have the correct setting for Azure data disk caching in Azure for a HANA database data disk location.

Taking a step further, if you have intelligently named your volume group names, you then also check in your script, the cache setting based on the name of the volume group to determine if it is correct, or not.
You can then embed this validation script into a “custom validation” within SAP LaMa and it will alert you automatically if your VM disk cache settings are not correct.

You may be wondering, why not do all this from the Azure Portal?
Well, the answer to that is that you don’t know what Linux VM volume groups those Azure disks are used by, unless you have tagged them or named them intelligently in Azure.

Add Second IP to Loopback for SAP ASCS on SUSE Linux 12 in Azure

Back in 2018, I helped design a Highly Available architecture for the SAP Netweaver ASCS (contains the message server and enqueue server).
This was to be implemented on SUSE Linux 12 in the Microsoft Azure cloud, and therefore it was sitting behind an Azure Internal Load Balancer (ILB).

At this point in time, the Microsoft Reference Architecture for SAP on Azure, was not really well established and the documentation was dispersed throughout the Microsoft site.

Browsing LinkedIn posts, I could see Microsoft appeared to be hiring SAP guys 10 to the dozen. Maybe they recognised that some of their existing SAP on Azure architecture design documentation was either too light on details, or just wrong. Who knows for sure, but I do know that since 2018, the documentation from Microsoft has been much, much better with regards to SAP content and specifically the reference architecture for SAP on Azure.
It now all exists in a nice single location and contains all the steps and details needed. Plus we have “Embrace”, the official tie-up between Microsoft and SAP, which puts a nice frame around the picture.

What Was Up With Our HA Design?

Going back to 2018, the ASCS HA design we created was suffering a number of small niggles.
The basic architecture pattern had 2 Azure VMs, of which either one could run the ASCS. Both VMs sit in the same ILB back-end pool fronted by the ILB with a dedicated DNS hostname and IP address.

One of the VMs runs the ERS (Enqueue Replication Server) for replication of the enqueue shared-memory table. Yes, this is pre-ENSA2 (see 2630416), which will change things going forward!

We experienced issues with the ILB timeout settings killing traffic to the enqueue server. These were resolved by scanning the Microsoft Documents in great detail and applying the required SAP level parameters and adjusting the ILB settings.

We also found that we were missing some of the required O/S level settings which are usually applied by saptune. This explains why the documentation was sparse around these parameters, they are usually already set.
Finding those key pieces of information was possible, it just wasn’t very clear back then.

We also had issues with local SAP utility commands like “ensmon” for diagnosing issues with the enqueue server process. These just would not work from the active ASCS VM, because the front-end for the ASCS was now the ILB.

Why Ensmon Didn’t Work

You see, when a VM is sitting behind an ILB, it is not able to send traffic to the ILB of which it is an active back-end member. This facet of Microsoft’s ILB design, is there to prevent a feedback loop.

This same issue also caused SAP LaMa landscape detection issues, because LaMa could not see which VM host the SAP ASCS was running on.
LaMa usually detects which VM the ASCS hostname’s IP address was bound onto. But the IP address for the ASCS was on the ILB, not on the back-end VMs. This left LaMa not knowing which host was running the ASCS.

As part of the current Microsoft documentation, if you are using a Pacemaker cluster, the document here tells you how to setup your VMs and ILB for use with the cluster.
The documentation doesn’t tell you how the cluster software is allowing the local SAP utility commands like “ensmon” to continue working when the VM on which the ASCS is currently running, is behind an ILB.

How Can We Make Ensmon & Other Tools Work?

This is where it gets interesting.

We are going to explore how we can achieve the same level of usability, behind an ILB, without the cluster software. Ie. with just 2 VMs and the ASCS installed locally on each (real basic).

I’m not saying this is going to provide the same level of HA or anything like that, just showing you how you can get this working at a basic level.

Here is our setup:

  • Each VM has just 1 NIC and 1 primary IP address.
  • The ILB health probe is checking tcp/3600 (the message server).
  • We have “HA” enabled on the ILB.
  • The ASCS is currently active on VM1, but can also be started on VM2 when needed.
  • Although not shown, the ERS is installed directly on VM2 only.
  • This is a failover cluster (Active-Passive) with regards to the ASCS, and no auto-failback is anticipated. A nick name I like is the “chuck it over the fence” option.

Let’s imagine we try to start the “ensmon” tool on the VM1 server.
It will say that it is not able to connect to the Enqueue server on 10.x.x.x (via the ILB).
This is because the VM1 server is not allowed to talk to itself via the ILB of which the VM is an active back-end pool member.
Instead, we need to make the VM1 server believe that it is talking to the ILB, but make it talk instead, to itself.

If we say that VM1 has an IP address of 10.0.0.2 and a hostname of myvm1, then we can make VM1 talk to itself instead of the ILB IP address, by adding an entry into the Linux /etc/hosts file like so:

10.0.0.2  myvm1.corp.net  myvm1  myascs.corp.net  myascs

This will cause any hostname resolution that calls the standard Linux function “gethostbyname” for the ILB hostname myascs.corp.net, to find the IP address of VM1.
You can confirm this with a “ping” or a “host” command.

You should note that the above assumes that your /etc/resolv.conf is set such that /etc/hosts (known as “files” in /etc/resolv.conf) is used before DNS is checked. This is the correct setup in the majority of cases.

The reason for adding both the existing VM1 hostname and the ILB hostname, is because if we don’t, SUSE Linux will change it’s hostname to “myascs” during the boot phase. This is caused by the cloud-netconfig initialisation.

You can find out more about cloud-netconfig in: SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

The above change to the /etc/hosts file is sufficient to allow our “ensmon” tool to connect to the Equeue server process, since the communication will now go directly internal over the internal VM network interface itself, and not via the ILB.

That Works But Here’s a Better Way

In some cases, the above solution is adequate.

You may not be running this ASCS VM with Azure Site Recovery (ASR) to a Disaster Recovery (DR) Region, where server IP addresses may change on invocation of a DR scenario.

If you are using ASR, and you know that the VM IP address could change in a DR scenario, then hard coding IP addresses into the /etc/hosts file is not really the best way forward.

In this case, we need to use another solution that does not involve editing the /etc/hosts file on the VM1 server.

We can employ another technique, we can bind the IP address of the ILB hostname, onto the loopback device of the VM1 server.

Loop-a-say-what?

Everyone knows the loopback device.
It’s usually represented by the IP address 127.0.0.1.
You want to test the network stack on a Linux server, you can ping 127.0.0.1 and it will usually return an ICMP response.

The loopback device is only accessible on the local server, it can’t be accessed from outside the server over the network in any way.

We can add different IP addresses to the loopback device if we want to. They will not be routable over the network and will not be addressable from any other host on the network.

If we imagine that the ILB hostname myascs.corp.net has an IP address of 10.0.0.5, we can add the IP address to the local loopback device using the “ip-address” utility command as follows:

NOTE: This must be done as the root Linux user.

> ip address add 10.0.0.5/32 dev lo scope host

You will then be able to show the device addresses:

> ip address show dev lo

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 10.0.0.5/32 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever

Now we can ping the 10.0.0.5 address:

> ping 10.0.0.5

PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.030 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=5 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=6 ttl=64 time=0.043 ms

How do we tell that this is not routing out to the ILB itself?

  1. The ping response time is extremely quick at 0.03 milliseconds, so it must be just routing through the local TCP/IP stack on this VM.
  2. Remember, the VM is actively part of a back-end pool of the ILB, so it cannot talk back to itself!

If you remove the bound IP address from the loopback device:

> ip address del 10.0.0.5/32 dev lo

Now re-execute the ping:

> ping 10.0.0.5

PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
--- 10.0.0.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1016ms

All packets are lost.
So we have proven that adding the IP to the local loopback device, is what made the ping work.

Permanent Solution Reached?

Our more permanent solution is to add the IP address to the local loopback device, instead of the /etc/hosts file.

But wait! The IP address may change in a DR scenario.
So we will need an initial step to use DNS to lookup the IP address.
We can use the “host” command (or many others) to achieve this as follows:

> host myascs.corp.net

myascs.corp.net has address 10.0.0.5

The output of the “host” command could be used during a Linux boot script or some other means (another blog post for this 😉 ) to add the IP to the local loopback device.

Summary

We found that in a basic HA setup of the SAP ASCS instance, using two VMs behind an Azure ILB, we were not able to access certain tools directly on the VM running the ASCS.

We also found that landscape management tools like SAP LaMa failed to correctly identify which host the ASCS instance was running on.

By adding the ILB IP address to the local loopback device, we are able to use our “ensmon” and other utilities for SAP ASCS administration.

It also has the great effect of letting SAP LaMa detect the ILB IP address as being bound onto the VM1 server, which means that LaMa host discovery and validation works.

Finally, we discussed how the IP address could be detected, in case it has changed after a DR failover.

If you’re not already, you will eventually be wondering, “how can I automate this whole IP adding process, so that SAP LaMa knows where the ASCS is at any one time?”, well that is all part of knowing how to automate your SAP landscape operations and a deep understanding of how the various SAP agents interact during the start-up and shutdown of a standard Netweaver stack.
Some of this is discussed in a post here: How an Azure hosted SAP LaMa Controlled SAP System Starts Up