This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

How an Azure hosted SAP LaMa Controlled SAP System Starts Up

We all know how this works in the old “pre-cloud” world.
To start a SAP system on a physical server which is currently shutdown, you would:

  1. Power on the physical host (through whatever means, ILO or a button or other).
  2. Log into the host as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

We then moved from physical, to virtual:

  1. Power on the hypervisor physical host (through whatever means, ILO or a button or other).
  2. Power on the VM.
  3. Log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

Now we have cloud:

  1. Power on the VM (through cloud control software e.g. Azure Portal).
  2. Log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.
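
As a rough sketch, the two cloud steps above could be scripted as follows. The resource group, VM name, user and instance number in the usage example are hypothetical placeholders. The sketch assumes the Azure CLI is installed and logged in, and that sapcontrol is on the PATH of a non-interactive ssh session. The DRY_RUN switch is my own addition so that the sequence can be inspected without touching anything.

```shell
#!/bin/sh
# Sketch only: start an Azure VM, then start the SAP system on it.

DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_sap_on_azure <resource_group> <vm_name> <sidadm_user> <instance_nr>
start_sap_on_azure() {
    rg=$1; vm=$2; sidadm=$3; inst=$4
    # Step 1: power on the VM through the Azure control plane.
    run az vm start --resource-group "$rg" --name "$vm" || return 1
    # Step 2: start the SAP system as the <sid>adm user on the VM.
    run ssh "$sidadm@$vm" "sapcontrol -nr $inst -function StartSystem"
}
```

For example, "start_sap_on_azure my-rg sapvm01 s4hadm 00" prints the two commands in order; setting DRY_RUN=0 would actually execute them.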

How Does SAP LaMa Work In this Context?

SAP LaMa has the ability to perform both steps #1 and #2 in the last list.
Here’s a diagrammatic overview that I hope accurately shows the interaction between the Azure, Linux and SAP layers:

In the above diagram we can see that SAP LaMa is “cloud aware” and uses the Azure APIs to start and stop Azure VMs.

Once the VM is started, the usual Linux-level startup process takes over to start the services.

There is one caveat with the above and that is regarding SUSE cloud-netconfig. Find out more here: SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

Why the Hostagent is Critical for LaMa

The SAP Hostagent is used as the marker point in the VM start-up process, at which point SAP LaMa knows for sure that the VM is up and running.
Before the Hostagent responds, SAP LaMa can only query the status of the VM from the Azure APIs, which basically say “starting” or “running”.
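
To illustrate the idea of waiting for a marker point, here is a minimal sketch of a generic retry loop. It is not how SAP LaMa itself is implemented. The wait_until helper is a name I have made up, and the probe in the usage note (a TCP check against the standard Hostagent HTTP port 1128 on a hypothetical host) is only a stand-in for a real Hostagent call.

```shell
#!/bin/sh
# Sketch only: retry a probe command until it succeeds or attempts run out.
# wait_until <max_attempts> <delay_seconds> <command...>
wait_until() {
    attempts=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # A successful probe means the agent answered: treat the VM as up.
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        # Pause between attempts, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1    # the probe never answered
}
```

A possible call would be "wait_until 60 5 nc -z sapvm01 1128", which tries for roughly five minutes before giving up.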

There are a number of monitoring capabilities inside SAP LaMa, but the Hostagent is the critical one.
From the Hostagent, the SAP instance agents can be unregistered and re-registered.

When setting up your SAP LaMa landscape, the SAP Hostagent is critical. If you get it right, with automated deployment, SSL setup, common configuration and custom descriptors/operations, then you can automate almost anything in your SAP landscape.

You need to constantly patch the SAP Hostagent to ensure that it remains compatible with other elements of your SAP landscape.  For example, to be able to patch the SAP ASE database, you need the Hostagent.  It’s also used for the BALDR (metrics gathering) inside SAP ASE from ASE 16.3 onwards.

SAP LaMa & SAP Landscape Orchestration

The SAP LaMa tool is the nice front-end and scheduler onto which you can apply your automation requirements via the Hostagent and SAP instance agents.

Not only does it provide orchestration capability, but it can also validate your landscape according to predefined in-built checks (such as Kernel component version consistency) and even validate against custom validations built and defined by you.

Making saptune Actually Work & Patching to v2

Having recently spent some time analysing the performance of a HANA database system, I got down to the depths of Linux device I/O performance on an Azure hosted VM.

There was no reason to suspect any issue, because during the implementation of the VM image build process, we had followed all the relevant SAP notes.
In our case, on SUSE Linux Enterprise Server for SAP Applications 12, we were explicitly following SAP Note 1275776 “Linux: Preparing SLES for SAP environments”.
Inside that SAP note, you go through the process of understanding the difference between sapconf and saptune, and then actually configure saptune (since it comes automatically with the “for SAP” versions of SLES 12).

Once configured, saptune should apply all the best practices that are encompassed in a number of SAP notes including SAP Note 2205917 “SAP HANA DB: Recommended OS settings for SLES 12 / SLES for SAP Applications 12”, which is itself needed during the HANA DB installation preparation work.
If you follow the note, there are a number of O/S adjustments required for HANA, which can be applied either manually or (as recommended) automatically via saptune, provided the correct saptune profile is selected.

As part of our configuration, we had applied saptune solution profile S4HANA-DBSERVER (also noted in the SUSE documentation for SAP HANA).
This is applied using the standard:

saptune solution apply S4HANA-DBSERVER

You don’t get a lot of feedback from the saptune execution, but the fact that there are no errors (normally) indicates that it has done what was requested.
You can check it has applied the profile by executing:

saptune solution list

The item that is starred in the returned list is the profile that has been applied.
That’s it.
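
If you want to script that check, a minimal sketch could look like the one below. The applied_solution helper is a made-up name, and the parsing assumes the list output marks the applied entry with a leading star followed by the solution name, which may differ between saptune versions.

```shell
#!/bin/sh
# Sketch only: read `saptune solution list` output on stdin and print the
# name of any entry whose first field is (or begins with) a star.
applied_solution() {
    awk '
        $1 == "*"   { print $2; next }          # "*  NAME - notes..."
        $1 ~ /^\*./ { print substr($1, 2) }     # "*NAME - notes..."
    '
}
```

Usage would be along the lines of "saptune solution list | applied_solution", comparing the result with the profile you expect.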

As part of my troubleshooting I even took the trouble of running the publicly available script sapconf_saptune_check (see here: https://github.com/scmschmidt/sapconf_saptune_check/blob/master/sapconf_saptune_check ), which just confirmed that saptune was indeed active/enabled and had a valid profile configured.

Back to the task of checking out the performance issue, and you can probably see where this is going now.
On investigation of the actual saptune profile contents, it was possible to see that a large majority of the O/S changes had not been applied.
Specifically, we were not seeing the NOOP scheduler selected for the HANA disk devices.

By executing either of the following, you can check the currently selected scheduler:

grep -H '.*' /sys/block/s??/queue/scheduler

or

cat /sys/block/s??/queue/scheduler

The selected scheduler will be in square brackets.
In my case, I was seeing “[cfq]” for all devices. Not good and not the recommendation from SAP and SUSE.
This setting should be automatically adjusted by the tuned daemon.
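
For a build-time check, the bracketed value can be pulled out per device. The following is a minimal sketch; the function names are my own, and the s?? device glob is simply the same one used in the commands above.

```shell
#!/bin/sh
# Sketch only: report the active I/O scheduler for each matching block device.

# Read a scheduler line such as "noop deadline [cfq]" on stdin and print
# only the name inside the square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Print "<device> <active scheduler>" for every readable scheduler file.
report_schedulers() {
    for f in /sys/block/s??/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(active_scheduler < "$f")"
    done
}
```

Running report_schedulers on the VM gives one line per device, which is easy to compare against the expected scheduler in a check script.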

Looking at my version of saptune, I could see it was version 1.1.7 (from the output of the execution of the sapconf_saptune_check script).

Reading some of the recent blog posts from Soeren Schmidt here: https://blogs.sap.com/2019/05/03/a-new-saptune-is-knocking-on-your-door/
I could see that version 2 of saptune was now released.

Downloading the newer version (not installing directly!), reverting the old solution profile, installing the new saptune version and finally re-applying the same profile, confirmed that saptune was the culprit.

The new saptune2 fixed the issue, immediately activating a number of critical O/S changes, including the NOOP scheduler setting on each device.

The moral of the story is that, as well as following the SAP processes, you still need to validate that the tooling has actually done what it says it has done.
The new saptune2 has been incorporated into our build process, plus the configuration check scripts will be specifically checking for it.
However, since the upgrade from saptune1 to saptune2 could cause issues if it just blindly re-applied the “new” profile settings, SAP have made saptune follow a backwards compatible upgrade process, whereby the O/S settings are retained as they were before the upgrade was executed.

Therefore, as per the links in SAP Note 2816790 “Differences between sapconf and saptune”, the upgrade process for an already applied profile is to revert it prior to the saptune upgrade, then apply the upgrade, then re-apply the profile.
This could therefore not just be rolled out via our standard SLES patching routine. We had to develop an automated script that would specifically pre-patch saptune to saptune2 using the correct procedure, before we embarked on the next SLES patching round.
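
The revert, upgrade, re-apply order can be sketched as below. This is not our actual pre-patch script. It assumes the new saptune package is delivered through a zypper repository, and the DRY_RUN switch and function names are my own additions so the sequence can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch only: upgrade saptune while a solution profile is already applied.

DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# upgrade_saptune <solution_name>
upgrade_saptune() {
    solution=$1
    # 1. Revert the applied profile while the old saptune is still installed.
    run saptune solution revert "$solution" || return 1
    # 2. Upgrade the package (assumes a repository providing the new version).
    run zypper --non-interactive update saptune || return 1
    # 3. Re-apply the same profile with the new saptune.
    run saptune solution apply "$solution"
}
```

For example, "upgrade_saptune S4HANA-DBSERVER" prints the three steps in order; set DRY_RUN=0 to execute them.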

As a post-note, you should make yourself familiar with the coming changes to the SLES scheduler settings, with the introduction of the NONE scheduler (see the links below for the relevant blog post).

Useful notes/links:
https://www.suse.com/c/sles-1112-os-tuning-optimisation-guide-part-1/
https://blogs.sap.com/2019/06/25/sapconf-versus-saptune-in-more-detail/
https://blogs.sap.com/2019/05/03/a-new-saptune-is-knocking-on-your-door/
https://www.suse.com/c/noop-now-named-none/

HowTo: Find the Datacentre Region and Physical Host of your Azure VM

With VMs hosted in Azure you need a fine balance between protection from hardware failure on the underlying Azure platform and the performance gained from having the tiers of your SAP application physically close together.

For this very purpose, Microsoft introduced Proximity Placement Groups (PPGs) to allow an administrator to ensure that specific tiers (e.g. application and database) are located close together, potentially even in the same server rack.
The PPGs also affect the location of the storage assigned to the VMs, although the storage infrastructure is actually transparent to administrators.

The PPGs still allow Azure to honor the Availability Sets, Fault Domains and Update Domains.

In this post, I show a method of finding the physical hostname of your Linux VM which could be part of a check before/after implementing a PPG.
NOTE: PPGs should be created at the time a VM is created, and assigned to the “lead” system of the rarest size. For example, an M-series VM is rare, so this should be the lead system when creating the PPG. This will anchor the other VMs to this M-series VM’s location.

A separate post shows how to do this for a Windows VM.
On a Linux VM in Azure, as any Linux user, you can use the following to see the name of the physical host on which your VM is running:

awk -F 'H' '{ sub(/ostName/,"",$2); print $2 }' /var/lib/hyperv/.kvp_pool_3

Example output: DUB012345678910

In this case, we take the first 3 chars to be “Dublin”, which is in the North Europe Azure region.
The remaining characters consist of the rack and physical hostname.
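
Splitting the name into those two parts is straightforward. A minimal sketch follows; the function names are my own, and the 3-character prefix is just the interpretation described above, not a documented Azure naming rule.

```shell
#!/bin/sh
# Sketch only: split a physical host name into its assumed location prefix
# (first 3 characters) and the remaining rack/host part.
host_prefix() {
    printf '%s\n' "$1" | cut -c1-3
}

host_suffix() {
    printf '%s\n' "$1" | cut -c4-
}
```

Comparing the output for two VMs gives a quick indication of whether they share a location prefix or even the same physical host.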

If you have 2 VMs in the same rack on the same physical host, then you will have minimal latency for networking between them.

Conversely, if you have 2 VMs on the same physical host, a failure of that host affects both of them, so you are open to HA issues.

Therefore, you need a good balance for SAP.
You should expect to see SAP S/4HANA application servers and HANA DBs in the same Proximity Placement Groups, within the same rack, even potentially on the same host (providing you have availability sets across the tiers you will be safe).

Update: 23-Apr-2020
The output of the above command contains hidden (non-printable) characters, so to get it into a bash variable we can strip them first using the following:

awk '{ gsub(/[^[:print:]]/,""); split ($0,a,"H"); sub(/ostName/,"",a[2]); print a[2]}' /var/lib/hyperv/.kvp_pool_3
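
Wrapped in a function, the same idea makes the variable assignment explicit. This is a sketch of an alternative form rather than the command above: it uses tr to drop the non-printable bytes before awk sees them, and physical_host_from_kvp is a name I have made up.

```shell
#!/bin/sh
# Sketch only: print the physical host name stored in a Hyper-V KVP pool file.
# physical_host_from_kvp <path_to_kvp_pool_file>
physical_host_from_kvp() {
    # Drop every non-printable byte (the NUL padding), then take the text
    # between the first and second "H" and remove the "ostName" key remnant.
    tr -cd '[:print:]' < "$1" |
        awk '{ split($0, a, "H"); sub(/ostName/, "", a[2]); print a[2] }'
}
```

On the VM this would be used as: phys_host=$(physical_host_from_kvp /var/lib/hyperv/.kvp_pool_3). Like the original command, it relies on the host name itself not containing a capital H.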

Update: 05-Oct-2020
I have since found that there is another location where the above information can also be found.
Depending on your Linux O/S, you may also find the physical server name in the network scripts as follows:

grep BOOTSERVERNAME /var/run/netconfig/eth0/netconfig0

The above will return something like:
BOOTSERVERNAME='AMS072nnnnnnnnn'

SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

What is SUSE Cloud-Netconfig:
Within the SUSE SLES 12 (and openSUSE) operating system lies a piece of functionality called Cloud-Netconfig.
It is provided as part of the System/Management group of packages.

The Cloud-Netconfig software consists of a set of shell functions and init scripts that are responsible for control of the network interfaces on the SUSE VM when running inside of a cloud framework such as Microsoft Azure.
The core code is part of the SUSE-Enceladus project (code & documents for use with public cloud) and hosted on GitHub here: https://github.com/SUSE-Enceladus/cloud-netconfig.
Cloud-Netconfig requires the sysconfig-netconfig package, as it essentially provides a netconfig module.
Upon installation, the Cloud-Netconfig module is prepended to the front of the netconfig module list like this: NETCONFIG_MODULES_ORDER="cloud-netconfig dns-resolver dns-bind dns-dnsmasq nis ntp-runtime".

What Cloud-Netconfig does:
As with every public cloud platform, a deployed VM is allocated and booted with the configuration for the networking provided by the cloud platform, outside of the VM.
In order to provide the usual networking devices and modules inside the VM with the required configuration information, the VM must know about its environment and be able to make a call out to the cloud platform.
This is where Cloud-Netconfig does its work.
The Cloud-Netconfig code will be called at boot time from the standard SUSE Linux init process (systemd).
It has the ability to detect the cloud platform that it is running within and make the necessary calls to obtain the networking configuration.
Once it has the configuration, this is persisted into the usual network configuration files inside the /sysconfig/network/scripts and /netconfig.d/cloud-netconfig locations.
The configuration files are then used by the wicked service to adjust the networking configuration of the VM accordingly.

What information does Cloud-Netconfig obtain:
Cloud-Netconfig has the ability to influence the following aspects of networking inside the VM.
– DHCP.
– DNS.
– IPv4.
– IPv6.
– Hostname.
– MAC address.

All of the above information is obtained and can be persisted and updated accordingly.

What is the impact of changing the networking configuration of a VM in Azure Portal:
Changing the configuration of the SUSE VM within Azure (for example: changing the DNS server list), will trigger an update inside the VM via the Cloud-Netconfig module.
This happens because Cloud-Netconfig is able to poll the Azure VM Instance metadata service (see my previous blog post on the Azure VM Instance metadata service).
If the information has changed since the last poll, then the networking changes are instigated.
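
The poll-and-compare idea can be illustrated with a small sketch. This is not the Cloud-Netconfig code itself. The metadata_changed helper and the state file path are hypothetical, while the endpoint in the usage note is the standard Azure Instance Metadata Service address.

```shell
#!/bin/sh
# Sketch only: detect whether metadata read on stdin differs from the
# previous poll, using a checksum kept in a state file.
# metadata_changed <state_file>   returns 0 if changed, 1 if unchanged
metadata_changed() {
    state=$1
    new=$(cksum)                        # checksum of the data on stdin
    old=$(cat "$state" 2>/dev/null)     # empty on the very first poll
    if [ "$new" = "$old" ]; then
        return 1
    fi
    printf '%s\n' "$new" > "$state"     # remember this poll for next time
    return 0
}
```

On an Azure VM a poll could then look like: curl -s --noproxy '*' -H Metadata:true "http://169.254.169.254/metadata/instance/network?api-version=2021-02-01" | metadata_changed /tmp/imds-network.state && echo "network metadata changed".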

What happens if a network interface is to remain static:
If you wish for Cloud-Netconfig to not manage a networking interface, then there exists the capability to disable management by Cloud-Netconfig.
Simply adjust the network configuration file in /etc/sysconfig/network and set the variable CLOUD_NETCONFIG_MANAGE=no.
This will prevent future adjustments to this network interface.
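
If this needs to be rolled out by script, a minimal sketch of an idempotent edit is shown below. The function name is my own, and the ifcfg-eth0 file in the usage note is an assumption about which interface file applies on your VM.

```shell
#!/bin/sh
# Sketch only: set CLOUD_NETCONFIG_MANAGE in a config file, replacing an
# existing entry or appending one if none exists.
# set_cloud_netconfig_manage <config_file> <yes|no>
set_cloud_netconfig_manage() {
    file=$1; value=$2
    if grep -q '^CLOUD_NETCONFIG_MANAGE=' "$file"; then
        # Rewrite the existing line via a temporary file.
        sed "s/^CLOUD_NETCONFIG_MANAGE=.*/CLOUD_NETCONFIG_MANAGE=\"$value\"/" \
            "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        printf 'CLOUD_NETCONFIG_MANAGE="%s"\n' "$value" >> "$file"
    fi
}
```

For example: set_cloud_netconfig_manage /etc/sysconfig/network/ifcfg-eth0 no (run as root, and take a backup of the file first).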

How does Cloud-Netconfig interact with Wicked:
SUSE SLES 12 uses the Wicked network manager.
The Cloud-Netconfig scripts adjust the network configuration files in the locations /sysconfig/network/scripts. The changes are then detected by Wicked, which makes the necessary adjustments (e.g. interfaces brought online, IP addresses assigned or DNS server lists updated).
As soon as the network configuration files have been written by Cloud-Netconfig, this is where the interaction ends.
From this point the usual netconfig services take over (wicked and nanny – for detecting the carrier on the interface).

What happens in the event of a VM primary IP address change:
If the primary IP address of the VM is adjusted in Azure, then the same process as before takes place.
The interface is brought down and then brought back up again by wicked.
This means that in an Azure Site Recovery replicated VM, should you activate the replica, the VM will boot and Cloud-Netconfig will automatically adjust the network configuration to that provided by Azure, even though this VM only contained the config for the previous hosting location (region or zone).
This significantly speeds up your failover process during a DR situation.

Are there any issues with this dynamic network config capability:
Yes, I have seen a number of issues.
In SLES 12 SP3 I have seen issues whereby a delay in the provision of the Azure VM Instance metadata during the boot cycle has caused the VM to lose sight of any secondary IP addresses assigned to the VM in Azure.
On tracing, the problem seemed to originate from a slowness in the full startup of the Azure Linux agent, possibly due to boot diagnostics being enabled. We are still waiting on a SLES patch for this fix.

I have also seen a “problem” whereby an incorrect entry inside the /etc/hosts file can cause the reconfiguration of the VM’s hostname.
Quite surprising. This caused other issues in our custom SAP deployment scripts, as the hostname was being relied on to follow a specific intelligent naming convention, when instead it was being changed to a temporary hostname used for resolution during an installation of SAP using the Software Provisioning Manager.

How can I debug the Cloud-Netconfig scripts:
According to the manuals, debug logging can be enabled through the standard DEBUG="yes" and WICKED_DEBUG="all" variables in config file /etc/sysconfig/network/config.
However, casting an eye over the scripts and functions inside of the Cloud-Netconfig module, these settings don’t seem to be picked up, so sufficient logging is not produced, especially around the polling of the Azure VM Instance metadata service.
I found that when debugging I had to actually resort to adjusting the function script functions.cloud-netconfig.

Additional information:
https://www.suse.com/c/multi-nic-cloud-netconfig-ec2-azure/
https://www.suse.com/documentation/sles-12/singlehtml/book_sle_admin/book_sle_admin.html
https://github.com/SUSE-Enceladus/cloud-netconfig
https://www.suse.com/media/presentation/wicked.pdf
https://github.com/openSUSE/wicked