GCP Archives » Musings of an IT Implementor

Cluster Config Issue for SAP ERS Instance

Running SAP Netweaver A(SCS)/ERS in a Pacemaker cluster in Azure, AWS or GCP or potentially even on-premise?

Be aware, there is a configuration issue in the current version of the Microsoft, AWS, GCP and SUSE documentation for the Pacemaker cluster configuration (on SLES with non-native SystemD Startup Framework) for the SAP Enqueue Replication Server (ERS) instance primitive in an ENSA1 (classic Enqueue) architecture.

Having double checked with both the SAP and SUSE documentation (I don’t have access to check RHEL) I believe that the SAP certified SUSE cluster design is correct, but that the instructions to configure it are not inline with the SAP recommendations.

In this post I explain the issue, what the impact is and how to potentially correct it.

NOTE: This is for SLES systems that are not using the new “Native Startup Framework for SystemD” Services for SAP, see here.

Don’t put your SAP system at risk, get a big coffee and let me explain below.

SAP ASCS and High Availability

The Highly Available (HA) cluster configuration for the SAP ABAP Central Services (ASCS) instance is critical to successful operation of the SAP system, with the SAP Enqueue (EN) process being the guardian of the SAP application level logical locks (locks on business objects) and the SAP Enqueue Replication Server (ERS) instance being the guarantor for the EN when a cluster failover occurs.

In a two-node HA SAP ASCS/ERS setup, the EN process is on the first server node and the ERS instance is on the second node.
The EN proccess (within the ASCS instance) is replicating to the ERS instance (or rather, the ERS is pulling from the EN process on a loop).

If the EN process goes down, the surrounding cluster is usually configured to fail over the ASCS instance to the secondary cluster node where the ERS instance is running. The EN process will then take ownership of the replica in-memory lock table:

What is an ideal Architecture design?

In an ideal design, according to the SUSE documentation here.
(again, I’m sure RHEL is similar but if someone can verify that would be great), the ASCS and ERS instances are installed on cluster controlled disk storage on both cluster nodes in the cluster:

We mount the /sapmnt (and potentially /usr/sap/SID/SYS) file system from NFS storage, but these file systems are *not* cluster controlled file systems.
The above design ensures that the ASCS and ERS instances have access to their “local” file system areas before they are started by the cluster. In the event of a failover from node A to node B, the cluster ensures the relevant file system area is present before starting the SAP instance.

We can confirm this by looking at the SAP official filesystem layout design for an HA system here:

What is Microsoft’s Design in Azure?

Let’s look at the cluster design on Microsoft’s documentation here:

It clearly shows that /usr/sap/SID/ASCS and /usr/sap/SID/ERS is being stored on the HA NFS storage.
So this matches with the SAP design.

What is Amazon’s design in AWS?

If we look at the documentation provided by Amazon here:

We can see that they are using EFS (Elastic File Storage) for the ASCS and ERS instance locations, which is mounted using the NFS protocol.
So the design is the same as Microsoft’s and again this matches the SAP design.

What is Google’s design in GCP?

If we look at the documentation provided by Google, the diagram doesn’t show clearly how the filesystems are provided for the ASCS and ERS, so we have to look further into the configuration step here:

The above shows the preparation of the NFS storage.

Later in the process we see in the cluster configuration that the ASCS and ERS file systems are cluster controlled:

and

The above is going to mount /usr/sap/SID/ASCS## or /usr/sap/SID/ERS## and they will be cluster controlled.
Therefore the GCP design is also matching the SAP, Azure and AWS designs.

Where is the Configuration Issue?

So far we have:

Understood that /sapmnt is not a cluster controlled file system.
established that Azure, AWS, GCP and the SUSE documentation are in alignment regarding the cluster controlled file systems for the ASCS and the ERS.

Now we need to pay closer attention to the cluster configuration of the SAP instances themselves.

The Pacemaker cluster configuration of a SAP instance involves 3 (or more) different cluster resources: Networking, Storage and SAP instance. With the “SAP Instance” resource being the actual running SAP software process(es).

Within the cluster the “SAP Instance” Resource Adapter (RA) is actually called “SAPInstance” and in the cluster configuration is takes a number of parameters specific to the SAP instance that it is controlling.
One of these parameters is called “START_PROFILE” which should point to the SAP instance profile.

The SAP instance profile file is an executable script (on Linux) that contains all the required commands and settings to start (and stop) the SAP instance in the correct way and also contains required configuration for the instance once it has started. It is needed by the SAP Instance Agent (sapstartsrv) and the executable that is the actual SAP instance (ASCS binaries: msg_server and enserver, ERS binary: enrepserver).
Without the profile file, the SAP Instance Agent cannot operate a SAP instance and the process that is the instance itself is unable to read any required parameter settings.

Usually, the SAP Instance Agent knows where to find the instance profile because at server startup, the sapinit script (triggered through either systemd unit-file or the old Sys-V start scripts) will execute the entries in the file /usr/sap/sapservices.
These entries call the SAP Instance Agent for each SAP instance and they pass the location of the start profile.

Here’s a diagram from a prior blog post which shows what I mean:

In our two-node cluster setup example, after a node is started, we will see 2 SAP Instance Agents running, one for ASCS and one for ERS. This happens no matter what the cluster configuration is. The instance agents are always started and they always start with the profile file specific in the /usr/sap/sapservices file.
NOTE: This changes in the latest cluster setup in SLES 15, which is a pure SystemD controlled SAP Instance Agent start process.

The /usr/sap/sapservices file is created at installation time. So it contains the entries that the SAP Software Provisioning Manager has created.
The ASCS instance entry in the sapeservices file, looks like so:

LD_LIBRARY_PATH=/usr/sap/SID/ASCS01/exe:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH; /usr/sap/SID/ASCS01/exe/sapstartsrv pf=/usr/sap/SID/SYS/profile/SID_ASCS01_myhost -D -u sidadm

But the ERS instance entry looks slightly different:

LD_LIBRARY_PATH=/usr/sap/SID/ERS11/exe:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH; /usr/sap/SID/ERS11/exe/sapstartsrv pf=/usr/sap/SID/ERS11/profile/SID_ERS11_myhost -D -u sidadm

If we compare the “pf=” (profile) parameter entry between ASCS and ERS after installation, you will notice the specified ERS instance profile location is not using the location accessible under the link /usr/sap/SID/SYS.
Since we have already investigated the file system layout, we know that the “SYS” location only contains links, which point to other locations. In this case, the ASCS is looking for sub-directory “profile”, which is a link to directory /sapmnt/SID/profile.
The ERS on the other hand, is using a local copy of the profile.

This is something that usually would go unnoticed, but the ERS must be using a local copy of the profile for a reason?
Looking at SAP notes, we find SAP note 2954193, which explains that an ERS instance in an ENSA1 architecture should be started using a local instance profile file:

Important part: “this configuration must not be changed“.
Very interesting. It doesn’t go into any further detail, but from what we have understood about the criticality of the SAP ASCS and ERS instances, we have to assume that SAP have tested a specific failure scenario (maybe failure of sapmnt) and deemed it necessary to ensure that the ERS instance profile is always available.
I can’t think of any other reason (maybe you can?).

The next question, how does that ERS profile get created in that local “profile” location? It’s not something the other instances do.
After some testing it would appear that the “.lst” file in the sapmnt location is used by the SAP Instance Agent to determine which files to copy at instance startup:

It is important to notice that the DEFAULT.PFL is also copied by the above process.
Make sure you don’t go removing that “.lst” file from “/sapmnt/SID/profile”, otherwise those local profiles will be outdated!

To summarise in a nice diagram, this setup is BAD:

This is GOOD:

What about sapservices?

When we discussed the start process of the server, we just mentioned that the SAP Instance Agent is always started from the /usr/sap/sapervices file settings. We also noted how in the /usr/sap/sapservices file, the settings for the ERS profile file location are correct.
So why would the cluster affect the profile location of the ERS at all?
It’s a good question, and the answer is not a simple explanation because it requires a specific circumstance to happen in the lifecycle of the cluster.

Here’s the specific circumstance:

Cluster starts, the ERS Instance Agent was already running and so it has the correct profile.
We can run “ps -ef | grep ERS” and we would see the “er” process has the correct “pf=/path-to-profile” and correctly pointing to the local copy of the profile.
If the ERS instance somehow is now terminated (example: “rm /tmp/.sapstream50023”) then the cluster will restart the whole SAP Instance Agent of the ERS (without a cluster failover).
At this point, the cluster starts the ERS Instance Agent with the wrong profile location, and the “er” binary now inherits this when it starts. This will be inplace until the next complete shutdown of the ERS Instance Agent.

As you can see, it’s not an easy situation to detect, because from an outside perspective, the ERS died and was successfully restarted.
Except it was restarted with the incorrect profile location.
If a subsequent failure happens to the sapmnt file system, this would render the ERS at risk (we don’t know the exact risk because it is not mentioned in the referenced SAP note that we noted earlier).
What is more, the ERS instance is not monitorable using SAP Solution Manager (out-of-the-box), you would need to create your own monitoring element for it.

Which Documentation has this Issue?

Now we know there is a difference required for the ERS instance profile location, we need to go back and look at how the cluster configurations have been performed, because of “this configuration must not be changed”!

Let’s look first at the SUSE documentation here:

Alright, the above would seem to show that “/sapmnt” is used for the ERS.
That’s not good as it doesn’t comply with the SAP note.

How about the Microsoft documentation for Azure:

No that’s also using /sapmnt for the ERS profile location. That’s not right either.

Looking at the AWS document now:

Still /sapmnt for the ERS.

Finally, let’s look at the GCP document:

This one is a little harder, but essentially, the proposed variable “PATH_TO_PROFILE” looks like it is bound to the same one as the ASCS instance defined just above it, so I’m going to say, it’s going to be “/sapmnt” because when you try and change it on one, it forces the same on the other:

We can say that all documentation available for the main hyperscalers, provides an incorrect configuration of the cluster, which could cause the ERS to operate in a way that is strongly not recommended by SAP.

Should we correct the issue and How can we correct the issue?

I have reported my finding to both Microsoft and SUSE, so I would expect them to validate.
However, in the past when providing such feedback, the relevant SAP note has been updated to exclude or invalidate the information altogether, rather than instigating the effort of fixing or adjusting any incorrect configuration documentation.
That’s just the way it is and it’s not my product, so I have no say in the solution, I can only report on what I know is correct at the time.

If you would like to correct the issue using the information known at this point in time, then the steps to be taken to validate that the ERS is operating and configured in the correct way are provided in a high-level below:

Check the cluster configuration for the ERS instance to ensure it is using the local copy of the instance profile.
Check the current profile location used by the running ERS Instance Agent (on both nodes in the cluster).
Double check someone has not adjusted the /usr/sap/sapservices file incorrectly (on both nodes in the cluster).
Check that the instance profile “.lst” file exists in the /sapmnt/SID/profile directory, so that the ERS Instance Agent can copy the latest versions of the profile and the DEFAULT.PFL to the local directory location when it next starts.
Check for any differences between the current local profile files and the files in the /sapmnt/SID/profile directory and consider executing the “sapcpe” process manually.

Thanks for reading.

Is my GCP hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In an effort to really drag this topic out (it’s now a trilogy), I’ve taken my previous Azure specific post and also the AWS specific post and decided to do some further research into whether the same is true in Google Cloud Platform (a.k.a GCP).

Previously

(If I was writing this like a true screenwriter, it would get shorter and faster each recap).

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s recap what this is (honestly, please see my Azure post for details, it’s quite technical), and how it impacts a GCP virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device“.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one.

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It can be used to load the main operating system of a computer.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” and “Secure Boot” (a.k.a UEFI boot).
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.

With Secure Boot, certificates are used to secure the boot process chain.
This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the attackers Need Root?

Yes, the vulnerability is in a GRUB2 configuration text file owned by the root user. Additional text added to the file can cause the buffer overflow.
Anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus.

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

The vulnerable GRUB2 software needs patching.
To be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

How does this impact GCP VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since GCP virtual machines are hosted in a KVM based hypervisor, the “firmware” is actually software.

Whilst looking for details on “Secure Boot” in GCP virtual machines, we come across the Google Compute Engine’s “Shielded VM” option.
You can read about it in detail here.
In brief, in GCP a Shielded VM is deployed using a pre-defined set of Google specific guest operating systems:

As noted above, the documentation specifically mentions that the “firmware” underpinning the virtual machine contains Google’s Certificate Authority (CA) certificate, as the root of the trust chain.
This is important because the Eclypsium description of the vulnerability is specifically citing a problem with the Microsoft CA.
What this means is that Google actually decide on the trust chain themselves and can probably more rapidly adjust the firmware with a new CA certificate.
To reiterate, this is specific to Google specific VM images that you deploy as a Shielded VM.

Another point worth noting is that when creating a Shielded VM, you can enable the vTPM (virtual trusted platform module), which allows integrity monitoring of the boot process. Any change to the boot process and a validation alert is triggered. Whilst this would not prevent compromise, it would at least alert an administrator.

Reading the Google infrastructure security document, we find that just like AWS, Google have designed and are implementing their own security chip called Titan, on the physical hosts. This is used to ensure that physical hosts boot securely, but it is not clear if this chip is used in anyway for Shielded VMs booted on the physical host.

If we delve further into the GCP documentation we find that we also have the option to create a custom image for deployment into a Shielded VM.
See the documentation on how to create a custom Shielded VM image:

The above states that you can create your own Secure Boot capable VM image for deployment in GCP as a Shielded VM.
If we read further down that page under section “Default certificates“, we find a slight difference compared to the Google “curated” images:

The above is telling us, by default the standard Microsoft CA certificates are used for the Secure Boot setup of VMs created using a custom image (remember non-custom Secure Boot images use Google’s root CA) in GCP.
When it says “default values”, right now, they are the only values because of a small note further up the page:

OK, so you can only use the defaults for now. The same compromised defaults that will need fixing. 🤷‍♂️

What do we think needs to happen once Google create the ability to replace the certificates?
From reading those previously mentioned documents, I would guess that to rebuild the certificate database used during the creation of the custom Shielded VM image, you are going to need to re-create the VM image and then re-deploy a VM from that image!

The question remains, is SLES 12 supported as a Shielded VM guest-OS on GCP?
According to the Shielded VM page here, it is not by default. You will need to therefore create your own image:

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on GCP virtual machines, using public images, then by default you will not being using the Shielded VM instances, so there is no point patching to fix a vulnerability for which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

On a final closing point, you could be running a custom SLES image deployed in GCP as a Shielded VM. An image that your company has built and which uses Secure Boot. You would be wise to contact your cloud administrators to ensure that they are preparing for a VM rebuild and subsequent patching required to ensure that Secure Boot remains secure.

Useful Links:

NIST CVE-2020-10713: nvd.nist.gov/vuln/detail/CVE-2020-10713
Eclypsium BootHole: eclypsium.com/2020/07/29/theres-a-hole-in-the-boot/
GCP Shielded VM: https://cloud.google.com/shielded-vm/
GCP Shielded VM: Your ticket to guarding against rootkits and exfiltration: https://cloud.google.com/blog/products/identity-security/shielded-vm-your-ticket-to-guarding-against-rootkits-and-exfiltration
GCP Shielded VM documentation: https://cloud.google.com/security/shielded-cloud/shielded-vm
GCP Blog Announcing Shielded VM: https://cloud.google.com/blog/products/identity-security/security-simplified-making-shielded-vm-default-compute-engine
GCP Infrastructure Security Design: https://cloud.google.com/security/infrastructure/design
GCP Titan Security Chip: https://cloud.google.com/blog/products/gcp/titan-in-depth-security-in-plaintext
GCP KVM Hypervisor Hardening: https://cloud.google.com/blog/products/gcp/7-ways-we-harden-our-kvm-hypervisor-at-google-cloud-security-in-plaintext

Category: GCP