This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Add Second IP to Loopback for SAP ASCS on SUSE Linux 12 in Azure

Back in 2018, I helped design a Highly Available architecture for the SAP Netweaver ASCS (which contains the message server and enqueue server).
This was to be implemented on SUSE Linux 12 in the Microsoft Azure cloud, and therefore it sat behind an Azure Internal Load Balancer (ILB).

At this point in time, the Microsoft Reference Architecture for SAP on Azure was not really well established and the documentation was dispersed throughout the Microsoft site.

Browsing LinkedIn posts, I could see Microsoft appeared to be hiring SAP people at quite a rate. Maybe they recognised that some of their existing SAP on Azure architecture design documentation was either too light on detail, or just wrong. Who knows for sure, but I do know that since 2018, the documentation from Microsoft has become much, much better with regards to SAP content and specifically the reference architecture for SAP on Azure.
It now all exists in a nice single location and contains all the steps and details needed. Plus we have “Embrace”, the official tie-up between Microsoft and SAP, which puts a nice frame around the picture.

What Was Up With Our HA Design?

Going back to 2018, the ASCS HA design we created was suffering from a number of small niggles.
The basic architecture pattern had 2 Azure VMs, either of which could run the ASCS. Both VMs sat in the same back-end pool, fronted by the ILB with a dedicated DNS hostname and IP address.

One of the VMs runs the ERS (Enqueue Replication Server) for replication of the enqueue shared-memory table. Yes, this is pre-ENSA2 (see SAP Note 2630416), which will change things going forward!

We experienced issues with the ILB timeout settings killing traffic to the enqueue server. These were resolved by scanning the Microsoft documentation in great detail, applying the required SAP level parameters and adjusting the ILB settings.

We also found that we were missing some of the required O/S level settings which are usually applied by saptune. This explains why the documentation was sparse around these parameters: they are usually already set.
Finding those key pieces of information was possible; it just wasn’t very clear back then.
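For illustration only (these are assumptions on my part — verify the exact parameters and values against the current Microsoft documentation and saptune), the kind of settings involved look like this:

# ILB: raise the idle timeout on the load balancing rules (e.g. towards the 30 minute maximum).
# ASCS instance profile: keep enqueue network connections alive:
enque/encni/set_so_keepalive = true
# O/S level (normally handled by saptune): refresh idle TCP connections well before the ILB idle timeout:
net.ipv4.tcp_keepalive_time = 300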

We also had issues with local SAP utility commands like “ensmon” for diagnosing issues with the enqueue server process. These just would not work from the active ASCS VM, because the front-end for the ASCS was now the ILB.

Why Ensmon Didn’t Work

You see, when a VM is sitting behind an ILB, it is not able to send traffic to the ILB of which it is an active back-end member. This facet of Microsoft’s ILB design is there to prevent a feedback loop.

This same issue also caused SAP LaMa landscape detection problems, because LaMa could not see which VM host the SAP ASCS was running on.
LaMa usually detects the ASCS by finding the VM onto which the ASCS hostname’s IP address is bound. But the IP address for the ASCS was on the ILB, not on the back-end VMs, which left LaMa not knowing which host was running the ASCS.

As part of the current Microsoft documentation, if you are using a Pacemaker cluster, the document here tells you how to set up your VMs and ILB for use with the cluster.
The documentation doesn’t tell you how the cluster software allows the local SAP utility commands like “ensmon” to continue working when the VM on which the ASCS is currently running is behind an ILB.

How Can We Make Ensmon & Other Tools Work?

This is where it gets interesting.

We are going to explore how we can achieve the same level of usability, behind an ILB, without the cluster software, i.e. with just 2 VMs and the ASCS installed locally on each (real basic).

I’m not saying this is going to provide the same level of HA or anything like that, just showing you how you can get this working at a basic level.

Here is our setup:

  • Each VM has just 1 NIC and 1 primary IP address.
  • The ILB health probe is checking tcp/3600 (the message server); a quick check of this port is shown just after the list.
  • We have “HA Ports” enabled on the ILB.
  • The ASCS is currently active on VM1, but can also be started on VM2 when needed.
  • Although not shown, the ERS is installed directly on VM2 only.
  • This is a failover cluster (Active-Passive) with regards to the ASCS, and no auto-failback is anticipated. A nickname I like is the “chuck it over the fence” option.
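As a quick sanity check on the active node, you can confirm that the message server really is listening on the probe port (this assumes ASCS instance number 00, which gives message server port 3600; adjust for your instance number):

> ss -ltn | grep ':3600'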

Let’s imagine we try to start the “ensmon” tool on the VM1 server.
It will say that it is not able to connect to the Enqueue server on 10.x.x.x (via the ILB).
This is because the VM1 server is not allowed to talk to itself via the ILB of which the VM is an active back-end pool member.
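You can see the same symptom with a simple TCP test from VM1 towards the ILB front-end (the IP address and port below are hypothetical examples; the message server port depends on your instance number):

> timeout 5 bash -c 'cat < /dev/null > /dev/tcp/10.0.0.5/3600' && echo connected || echo no response

Run from VM1, an active back-end pool member, this reports “no response”; run from a host outside the back-end pool, it connects.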
Instead, we need to make the VM1 server believe that it is talking to the ILB, but actually make it talk to itself.

If we say that VM1 has an IP address of 10.0.0.2 and a hostname of myvm1, then we can make VM1 talk to itself instead of the ILB IP address, by adding an entry into the Linux /etc/hosts file like so:

10.0.0.2  myvm1.corp.net  myvm1  myascs.corp.net  myascs

This will cause any hostname resolution that calls the standard Linux function “gethostbyname” for the ILB hostname myascs.corp.net to return the IP address of VM1.
You can confirm this with a “ping” or “getent hosts” command (note that “host” and “nslookup” query DNS directly and will bypass /etc/hosts).
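For example, using the hypothetical names above, both of the following should now report 10.0.0.2:

> getent hosts myascs.corp.net
> ping -c1 myascs.corp.net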

You should note that the above assumes that your /etc/nsswitch.conf is set such that /etc/hosts (the “files” source in /etc/nsswitch.conf) is consulted before DNS. This is the default setup in the majority of cases.
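A quick way to check the lookup order (you should see “files” listed before “dns”):

> grep '^hosts:' /etc/nsswitch.conf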

The reason for adding both the existing VM1 hostname and the ILB hostname is that, if we don’t, SUSE Linux will change its hostname to “myascs” during the boot phase. This is caused by the cloud-netconfig initialisation.

You can find out more about cloud-netconfig in: SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

The above change to the /etc/hosts file is sufficient to allow our “ensmon” tool to connect to the Enqueue server process, since the communication now stays local to the VM’s own network stack and does not go via the ILB.

That Works But Here’s a Better Way

In some cases, the above solution is adequate.

For example, you may not be replicating this ASCS VM with Azure Site Recovery (ASR) to a Disaster Recovery (DR) region, where server IP addresses may change when DR is invoked.

If you are using ASR, and you know that the VM IP address could change in a DR scenario, then hard-coding IP addresses into the /etc/hosts file is not really the best way forward.

In this case, we need to use another solution that does not involve editing the /etc/hosts file on the VM1 server.

We can employ another technique: binding the IP address of the ILB hostname onto the loopback device of the VM1 server.

Loop-a-say-what?

Everyone knows the loopback device.
It’s usually represented by the IP address 127.0.0.1.
If you want to test the network stack on a Linux server, you can ping 127.0.0.1 and it will usually return an ICMP response.

The loopback device is only accessible on the local server; it can’t be accessed from outside the server over the network in any way.

We can add different IP addresses to the loopback device if we want to. They will not be routable over the network and will not be addressable from any other host on the network.

If we imagine that the ILB hostname myascs.corp.net has an IP address of 10.0.0.5, we can add the IP address to the local loopback device using the “ip address” command as follows:

NOTE: This must be done as the root Linux user.

> ip address add 10.0.0.5/32 dev lo scope host

You will then be able to show the device addresses:

> ip address show dev lo

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 10.0.0.5/32 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever

Now we can ping the 10.0.0.5 address:

> ping 10.0.0.5

PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.030 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=5 ttl=64 time=0.031 ms
64 bytes from 10.0.0.5: icmp_seq=6 ttl=64 time=0.043 ms

How do we tell that this is not routing out to the ILB itself?

  1. The ping response time is extremely quick at 0.03 milliseconds, so it must be just routing through the local TCP/IP stack on this VM.
  2. Remember, the VM is an active member of the ILB back-end pool, so it cannot talk to itself via the ILB!

If you remove the bound IP address from the loopback device:

> ip address del 10.0.0.5/32 dev lo

Now re-execute the ping:

> ping 10.0.0.5

PING 10.0.0.5 (10.0.0.5) 56(84) bytes of data.
--- 10.0.0.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1016ms

All packets are lost.
So we have proven that adding the IP to the local loopback device is what made the ping work.

Permanent Solution Reached?

Our more permanent solution is to add the IP address to the local loopback device, rather than adding an entry to the /etc/hosts file.

But wait! The IP address may change in a DR scenario.
So we will need an initial step to use DNS to lookup the IP address.
We can use the “host” command (or many others) to achieve this as follows:

> host myascs.corp.net

myascs.corp.net has address 10.0.0.5

The output of the “host” command could be used in a Linux boot script or some other mechanism (another blog post for this 😉) to add the IP to the local loopback device.
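A minimal sketch of what such a script could look like (assuming the hypothetical ILB hostname used above and that “host” returns a single A record):

#!/bin/bash
# Look up the current ILB front-end IP via DNS and bind it to the loopback device.
ILB_HOST="myascs.corp.net"
ILB_IP=$(host "${ILB_HOST}" | awk '/has address/ { print $4; exit }')
if [ -n "${ILB_IP}" ] && ! ip address show dev lo | grep -qF "${ILB_IP}/32"; then
  ip address add "${ILB_IP}/32" dev lo scope host
fi

This would need to run as root early in the boot sequence (for example from a systemd unit), before the SAP instance agents start.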

Summary

We found that in a basic HA setup of the SAP ASCS instance, using two VMs behind an Azure ILB, we were not able to access certain tools directly on the VM running the ASCS.

We also found that landscape management tools like SAP LaMa failed to correctly identify which host the ASCS instance was running on.

By adding the ILB IP address to the local loopback device, we are able to use our “ensmon” and other utilities for SAP ASCS administration.

It also has the great effect of letting SAP LaMa detect the ILB IP address as being bound onto the VM1 server, which means that LaMa host discovery and validation works.

Finally, we discussed how the IP address could be detected, in case it has changed after a DR failover.

If you’re not already, you will eventually be wondering: “How can I automate this whole IP adding process, so that SAP LaMa knows where the ASCS is at any one time?”. That is all part of knowing how to automate your SAP landscape operations, and it requires a deep understanding of how the various SAP agents interact during the start-up and shutdown of a standard Netweaver stack.
Some of this is discussed in a post here: How an Azure hosted SAP LaMa Controlled SAP System Starts Up

How an Azure hosted SAP LaMa Controlled SAP System Starts Up

We all know how this works in the old “pre-cloud” world.
To start a SAP system on a physical server which is currently shut down, you would:

  1. Power on the physical host (through whatever means, ILO or a button or other).
  2. You log into the host as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

We then moved from physical, to virtual:

  1. Power on the hypervisor physical host (through whatever means, ILO or a button or other).
  2. Power on the VM.
  3. You log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

Now we have cloud:

  1. Power on the VM (through cloud control software e.g. Azure Portal).
  2. You log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.
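As a concrete (hypothetical) example, for a system with SID AB1 and instance number 00, step #2 would look something like this:

su - ab1adm
sapcontrol -nr 00 -function StartSystem
sapcontrol -nr 00 -function GetSystemInstanceList

The second sapcontrol call simply confirms that the instances have reached the “GREEN” state.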

How Does SAP LaMa Work In this Context?

SAP LaMa has the ability to perform both steps #1 and #2 in the last list.
Here’s a diagrammatic overview that I hope accurately shows the interaction between the Azure, Linux and SAP layers:

In the above diagram we can see that SAP LaMa is “cloud aware” and uses the Azure APIs to start and stop Azure VMs.

Once the VM is started, the usual Linux-level startup process takes over to start the services.

There is one caveat with the above and that is regarding SUSE cloud-netconfig. Find out more here: SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

Why the Hostagent is Critical for LaMa

The SAP Hostagent is used as the marker point in the VM start-up process, at which point SAP LaMa knows for sure that the VM is up and running.
Before the Hostagent responds, SAP LaMa can only query the status of the VM from the Azure APIs, which basically say “starting” or “running”.

There are a number of monitoring capabilities inside SAP LaMa, but the Hostagent is the critical one.
From the Hostagent, the SAP instance agents can be unregistered and re-registered.
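A quick way to check that the Hostagent is up and to see which instances it has registered (using the standard default install path):

/usr/sap/hostctrl/exe/saphostexec -status
/usr/sap/hostctrl/exe/saphostctrl -function ListInstances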

When setting up your SAP LaMa landscape, the SAP Hostagent is critical.  If you get it right, with automated deployment, SSL setup, common configuration, custom descriptors/operations, then you can automate almost anything in your SAP landscape.  

You need to constantly patch the SAP Hostagent to ensure that it remains compatible with other elements of your SAP landscape.  For example, to be able to patch the SAP ASE database, you need the Hostagent.  It’s also used for the BALDR (metrics gathering) inside SAP ASE from ASE 16.3 onwards.

SAP LaMa & SAP Landscape Orchestration

The SAP LaMa tool is the nice front-end and scheduler onto which you can apply your automation requirements via the Hostagent and SAP instance agents.

Not only does it provide orchestration capability, but it can also validate your landscape according to predefined built-in checks (such as Kernel component version consistency) and even run custom validations defined by you.

Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

As part of a SAP HANA deployment, there are a set of recommendations around the Azure VM disk caching settings and the use of the Azure VM WriteAccelerator.
These features should be applied to the SAP HANA database data volume and log volume disks to ensure optimum performance of the database I/O operations.

This post is not about the cache settings, but about how it’s possible to gather the required information about the current settings across your landscape.

There are 3 main methods available to an infrastructure person to see the current Azure VM disk cache settings.
I will discuss these methods below.

1, Using the Azure Portal

You can use the Azure Portal to locate the VM you are interested in, then check its disks and look at each disk’s cache setting.
You can only see the disk cache settings under the VM view inside the Azure Portal.

While slightly counter-intuitive (you would expect to see the same under the “Disks” view), this is because the disk cache feature is provided by the VM onto which the disks are bound, and is therefore tied to the VM view.

2, Using the Azure CLI

You can use the Azure CLI (from Bash or PowerShell) to find the disks and get the settings.

This is by far the most common approach for anyone managing a large estate. It uses the existing Azure API layers and the Azure CLI to query your Azure subscription, return the data in JSON format and parse it.
The actual query is written in JMESPATH (https://jmespath.org/) and is similar to XPath (for XML).

A couple of sample queries in BASH (my favourite shell):

List all VM names:

az vm list --query [].name -o table

List VM names, power state, VM size, O/S, resource group, and data disk details:

az vm list --show-details --query '[].{name:name, state:powerState, OS:storageProfile.osDisk.osType, Type:hardwareProfile.vmSize, rg:resourceGroup, diskName:storageProfile.dataDisks.name, diskLUN:storageProfile.dataDisks.lun, diskCaching:storageProfile.dataDisks.caching, diskSizeG:storageProfile.dataDisks.diskSizeGb, WAEnabled:storageProfile.dataDisks.writeAcceleratorEnabled }' -o table

List all VMs with names ending d01 or d02 or d03, then pull out the data disk details and whether the WriteAccelerator is enabled:

az vm list --query "[?ends_with(name,'d01')||ends_with(name,'d02')||ends_with(name,'d03')]|[].storageProfile.dataDisks[].[lun,name,caching,diskSizeGb,writeAcceleratorEnabled]" -o tsv

To execute the above, simply launch the Cloud Shell and select “Bash” in the Azure Portal:

Then paste in the query and hit return:

3, A Most Obscure Method.

Since SAP requires you to have the “Enhanced Monitoring for Linux” (OEM) agent extension installed, you can obtain the disk details directly on each VM.

For Linux VMs, the OEM extension creates a special text file of performance counters, which is read by Saposcol (remember that?) and used by SAP diagnostic agents, ABAP stacks and other tools.
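If you want to peek at the raw counter lines first, a simple grep works (same file path as used by the awk below):

grep ';disk;Caching;' /var/lib/AzureEnhancedMonitor/PerfCounters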

Using a simple piece of awk scripting, we can pull out the disk cache settings from the file like so:

awk -F';' '/;disk;Caching;/ { sub(/\/dev\//,"",$4); printf "/dev/%s %s\n", tolower($4), tolower($6) }' /var/lib/AzureEnhancedMonitor/PerfCounters

There’s a lot more information in the text file (/var/lib/AzureEnhancedMonitor/PerfCounters), and in my later post, Checking Azure Disk Cache Settings on a Linux VM in Shell, I show how you can pull out the complete mapping between Linux disk devices, disk volume groups, Azure disk names and the disk caching settings.


HowTo: Find the Datacentre Region and Physical Host of your Azure Windows VM

With VMs hosted in Azure, you need a fine balance between protection from hardware failure on the underlying Azure platform and performance from having the tiers of your SAP application physically close together.

For this very purpose, Microsoft introduced Proximity Placement Groups (PPGs) to allow an administrator to ensure that specific tiers (e.g. application and database) are located close together, potentially even in the same server rack.
The PPGs also affect the location of the storage assigned to the VMs, although the storage infrastructure is actually transparent to administrators.

The PPGs still allow Azure to honor the Availability Sets, Fault Domains and Update Domains.

In this post, I show a method of finding the physical hostname of your Windows VM which could be part of a check before/after implementing a PPG.
NOTE: PPGs should be created at the time a VM is created, and assigned to the “lead” system of the rarest size. For example, an M-series VM is rare, so this should be the lead system when creating the PPG. This will anchor the other VMs to this M-series VM’s location.

The previous blog post shows how to do this for a Linux VM.
On a Windows VM in Azure, as any Windows user with access to the registry, you can use the following to see the name of the physical host on which your VM is running:

reg query "HKEY_LOCAL_MACHINE/Software/Microsoft/Virtual Machine/Guest/Parameters" /v PhysicalHostName

Example output:
PhysicalHostName    REG_SZ      DUB012345678910

In this case, we take the first 3 characters, “DUB”, to mean Dublin, which is in the North Europe Azure region (another example output might start with the “AMS” prefix, for Amsterdam).
The remaining characters consist of the rack and physical hostname.

If you have 2 VMs in the same rack on the same physical host, then you will have minimal latency for networking between them.
Conversely, if you have 2 VMs on the same physical host, you are open to HA issues.

Therefore, for SAP, you need a good balance of distance.

You should expect to see SAP S/4HANA application servers and HANA DBs placed in the same Proximity Placement Group, within the same rack, even potentially on the same host (provided you have availability sets across the tiers, you will be safe).

HowTo: Find the Datacentre Region and Physical Host of your Azure VM

With VMs hosted in Azure, you need a fine balance between protection from hardware failure on the underlying Azure platform and performance from having the tiers of your SAP application physically close together.

For this very purpose, Microsoft introduced Proximity Placement Groups (PPGs) to allow an administrator to ensure that specific tiers (e.g. application and database) are located close together, potentially even in the same server rack.
The PPGs also affect the location of the storage assigned to the VMs, although the storage infrastructure is actually transparent to administrators.

The PPGs still allow Azure to honor the Availability Sets, Fault Domains and Update Domains.

In this post, I show a method of finding the physical hostname of your Linux VM which could be part of a check before/after implementing a PPG.
NOTE: PPGs should be created at the time a VM is created, and assigned to the “lead” system of the rarest size. For example, an M-series VM is rare, so this should be the lead system when creating the PPG. This will anchor the other VMs to this M-series VM’s location.

A separate post shows how to do this for a Windows VM.
On a Linux VM in Azure, as any Linux user, you can use the following to see the name of the physical host on which your VM is running:

awk -F 'H' '{ sub(/ostName/,"",$2); print $2 }' /var/lib/hyperv/.kvp_pool_3

Example output: DUB012345678910

In this case, we take the first 3 characters, “DUB”, to mean Dublin, which is in the North Europe Azure region.
The remaining characters consist of the rack and physical hostname.

If you have 2 VMs in the same rack on the same physical host, then you will have minimal latency for networking between them.

Conversely, if you have 2 VMs on the same physical host, you are open to HA issues.

Therefore, you need a good balance for SAP.
You should expect to see SAP S/4HANA application servers and HANA DBs in the same Proximity Placement Group, within the same rack, even potentially on the same host (provided you have availability sets across the tiers, you will be safe).

Update: 23-Apr-2020
To get the above script output into a bash variable (the output contains hidden characters), we can use the following:

awk '{ gsub(/[^[:print:]]/,""); split($0,a,"H"); sub(/ostName/,"",a[2]); print a[2] }' /var/lib/hyperv/.kvp_pool_3

Update: 05-Oct-2020
I have since found another location where the above information is available.
Depending on your Linux O/S, you may also find the physical server name in the network scripts as follows:

grep BOOTSERVERNAME /var/run/netconfig/eth0/netconfig0

The above will return something like:
BOOTSERVERNAME='AMS072nnnnnnnnn'
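To get just the host name from that output (assuming the value is wrapped in single quotes as shown), something like this works:

grep BOOTSERVERNAME /var/run/netconfig/eth0/netconfig0 | cut -d"'" -f2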