How an Azure hosted SAP LaMa Controlled SAP System Starts Up

We all know how this works in the old “pre-cloud” world.
To start a SAP system on a physical server which is currently shut down, you would:

1. Power on the physical host (through whatever means, ILO or a button or other).
2. Log into the host as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

We then moved from physical, to virtual:

1. Power on the hypervisor physical host (through whatever means, ILO or a button or other).
2. Power on the VM.
3. Log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

Now we have cloud:

1. Power on the VM (through cloud control software e.g. Azure Portal).
2. Log into the VM as “<sid>adm” and run: “sapcontrol -nr ## -function StartSystem”.

How Does SAP LaMa Work In this Context?

SAP LaMa has the ability to perform both steps #1 and #2 in the last list.
Here’s a diagrammatic overview that I hope shows accurately the interaction between the Azure, Linux and SAP layers:

In the above diagram we can see that SAP LaMa is “cloud aware” and uses the Azure APIs to start and stop Azure VMs.

Once the VM is started, the usual Linux-level startup process takes over to start the services.

There is one caveat with the above and that is regarding SUSE cloud-netconfig. Find out more here: SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

Why the Hostagent is Critical for LaMa

The SAP Hostagent is used as the marker point in the VM start-up process, at which point SAP LaMa knows for sure that the VM is up and running.
Before the Hostagent responds, SAP LaMa can only query the status of the VM from the Azure APIs, which basically say “starting” or “running”.
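
The "wait until the Hostagent responds" idea can be illustrated with a minimal bash sketch. This is not how SAP LaMa implements it internally; it simply retries a probe until it succeeds or a timeout expires. The function names and the host name sapvm01 are hypothetical, and the probe assumes the Hostagent listens on its default HTTP port 1128.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or a timeout (in seconds) expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    while true; do
        if "$@"; then
            return 0            # the probe succeeded
        fi
        if (( SECONDS >= deadline )); then
            return 1            # gave up waiting
        fi
        sleep "$interval"
    done
}

# Probe (assumption: a plain TCP connect is a good enough liveness check).
# Succeeds once a TCP connection to host:port can be opened within 3 seconds,
# using bash's built-in /dev/tcp pseudo-device.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (hypothetical host name):
#   wait_until 600 10 port_open sapvm01 1128 && echo "Hostagent is reachable"
```

Separating the retry loop from the probe means the same loop can be reused with any other check, for example a call to the Azure CLI or to sapcontrol.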

There are a number of monitoring capabilities inside SAP LaMa, but the Hostagent is the critical one.
From the Hostagent, the SAP instance agents can be unregistered and re-registered.

When setting up your SAP LaMa landscape, the SAP Hostagent is critical. If you get it right, with automated deployment, SSL setup, common configuration, custom descriptors/operations, then you can automate almost anything in your SAP landscape.

You need to constantly patch the SAP Hostagent to ensure that it remains compatible with other elements of your SAP landscape. For example, to be able to patch the SAP ASE database, you need the Hostagent. It’s also used for the BALDR (metrics gathering) inside SAP ASE from ASE 16.3 onwards.

SAP LaMa & SAP Landscape Orchestration

The SAP LaMa tool is the nice front-end and scheduler onto which you can apply your automation requirements via the Hostagent and SAP instance agents.

Not only does it provide orchestration capability, but it can also validate your landscape according to predefined in-built checks (such as Kernel component version consistency) and even validate against custom validations built and defined by you.

Making saptune Actually Work & Patching to v2

Having recently spent some time analysing the performance of a HANA database system, I got down to the depths of Linux device I/O performance on an Azure hosted VM.

There was no reason to suspect any issue, because during the implementation of the VM image build process, we had followed all the relevant SAP notes.
In our case, on SUSE Linux Enterprise Server (SLES) for SAP Applications 12, we were explicitly following SAP Note 1275776 “Linux: Preparing SLES for SAP environments”.
Inside that SAP note, you go through the process of understanding the difference between sapconf and saptune, plus actually configure saptune (since it comes automatically with the “for SAP” versions of SLES 12).

Once configured, saptune should apply all the best practices that are encompassed in a number of SAP notes including SAP Note 2205917 “SAP HANA DB: Recommended OS settings for SLES 12 / SLES for SAP Applications 12”, which is itself needed during the HANA DB installation preparation work.
If you follow the note, there are a number of required O/S adjustments that are needed for HANA, which can be either applied manually, or (as recommended) automatically via saptune, provided the correct saptune profile is selected.

As part of our configuration, we had applied saptune solution profile S4HANA-DBSERVER (also noted in the SUSE documentation for SAP HANA).
This is applied using the standard:

saptune solution apply S4HANA-DBSERVER

You don’t get a lot of feedback from the saptune execution, but the absence of errors (normally) indicates that it has done what was requested.
You can check it has applied the profile by executing:

saptune solution list

The item that is starred in the returned list is the profile that has been applied.
That’s it.

As part of my troubleshooting I even took the trouble of running the publicly available script sapconf_saptune_check (see here: https://github.com/scmschmidt/sapconf_saptune_check/blob/master/sapconf_saptune_check ), which just confirmed that saptune was indeed active/enabled and had a valid profile configured.

Back to the task of checking out the performance issue, and you can probably see where this is going now.
On investigation of the actual saptune profile contents, it was possible to see that a large majority of O/S changes had not been applied.
Specifically, we were not seeing the NOOP scheduler selected for the HANA disk devices.

By executing either of the following, you can check the currently selected scheduler:

grep -H '.*' /sys/block/s??/queue/scheduler

or

cat /sys/block/s??/queue/scheduler

The selected scheduler will be in square brackets.
In my case, I was seeing “[cfq]” for all devices. Not good and not the recommendation from SAP and SUSE.
This setting should be automatically adjusted by the tuned daemon.
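
A check like this is easy to script so that it can be part of a build validation. The following is a minimal sketch, not part of saptune or any SUSE tooling; the function names are hypothetical and it assumes the HANA disks appear as sd* devices and that "noop" is the expected scheduler.

```shell
#!/usr/bin/env bash
# Print the active I/O scheduler from a line such as "noop deadline [cfq]".
# The kernel marks the active scheduler with square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' <<< "$1"
}

# Report every sd* block device whose active scheduler differs from the
# expected one (first argument, default "noop"). Returns non-zero on mismatch.
check_schedulers() {
    local expected=${1:-noop} rc=0 file dev current
    for file in /sys/block/sd*/queue/scheduler; do
        [ -r "$file" ] || continue
        dev=${file#/sys/block/}; dev=${dev%%/*}
        current=$(active_scheduler "$(cat "$file")")
        if [ "$current" != "$expected" ]; then
            echo "MISMATCH: $dev uses '$current', expected '$expected'"
            rc=1
        fi
    done
    return $rc
}

# Example: check_schedulers noop || echo "saptune settings not active"
```

Passing the expected scheduler as an argument keeps the check usable on newer kernels, where the equivalent scheduler is named "none".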

Looking at my version of saptune, I could see it was version 1.1.7 (from the output of the execution of the sapconf_saptune_check script).

Reading some of the recent blog posts from Soeren Schmidt here: https://blogs.sap.com/2019/05/03/a-new-saptune-is-knocking-on-your-door/
I could see that version 2 of saptune was now released.

Downloading the newer version (not installing directly!), reverting the old solution profile, installing the new saptune version and finally re-applying the same profile, confirmed that saptune was the culprit.

The new saptune2 fixed the issue, immediately activating a number of critical O/S changes, including the NOOP scheduler setting on each device.

The moral of the story is that, as well as following the SAP processes, you still need to validate that the tooling has actually done what it says it has done.
The new saptune2 has been incorporated into our build process, plus the configuration check scripts will be specifically checking for it.
However, since the upgrade from saptune1 to saptune2 could cause issues if it just blindly re-applied the “new” profile settings, SUSE has made saptune follow a backwards compatible upgrade process, whereby the O/S settings are retained as they were before the upgrade was executed.

Therefore, as per the SAP Note 2816790 “Differences between sapconf and saptune” links, the upgrade process for an already applied profile is to revert it prior to the saptune upgrade, then apply the upgrade, then re-apply the profile.
This could therefore not just be rolled out via our standard SLES patching routine. We had to develop an automated script that would specifically pre-patch saptune to saptune2 using the correct procedure, before we embarked on the next SLES patching round.
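
The revert, upgrade, re-apply order described above could be sketched as follows. This is not the actual script we developed; it is a minimal illustration that assumes saptune v2 is available from the configured zypper repositories, and the run and upgrade_saptune function names are hypothetical. By default it only prints the commands (dry run).

```shell
#!/usr/bin/env bash
# Sketch of the revert -> upgrade -> re-apply sequence for saptune.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_saptune() {
    local solution=$1
    run saptune solution revert "$solution" || return 1      # undo the old profile first
    run zypper --non-interactive update saptune || return 1  # assumption: repo provides v2
    run saptune solution apply "$solution" || return 1       # re-apply under the new version
}

# Example: upgrade_saptune S4HANA-DBSERVER
```

Running with DRY_RUN=0 as root would execute the commands for real; stopping at the first failure avoids re-applying a profile on top of a failed upgrade.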

As a post-note, you should make yourself familiar with the coming changes to the SLES scheduler settings, with the introduction of the NONE scheduler (see the links below for the blog post).

Useful notes/links:
https://www.suse.com/c/sles-1112-os-tuning-optimisation-guide-part-1/
https://blogs.sap.com/2019/06/25/sapconf-versus-saptune-in-more-detail/
https://blogs.sap.com/2019/05/03/a-new-saptune-is-knocking-on-your-door/
https://www.suse.com/c/noop-now-named-none/

HowTo: Find the Datacentre Region and Physical Host of your Azure VM

With VMs hosted in Azure you need a fine balance between protection from hardware failure on the underlying Azure platform, and the performance gained from having the tiers of your SAP application physically close together.

For this very purpose, Microsoft introduced Proximity Placement Groups (PPGs) to allow an administrator to ensure that specific tiers (e.g. application and database) are located close together, potentially even in the same server rack.
The PPGs also affect the location of the storage assigned to the VMs, although the storage infrastructure is actually transparent to administrators.

The PPGs still allow Azure to honor the Availability Sets, Fault Domains and Update Domains.

In this post, I show a method of finding the physical hostname of your Linux VM which could be part of a check before/after implementing a PPG.
NOTE: PPGs should be created at the time a VM is created, and assigned to the “lead” system of the rarest size. For example, an M-series VM is rare, so this should be the lead system when creating the PPG. This will anchor the other VMs to this M-series VM’s location.

A separate post shows how to do this for a Windows VM.
On a Linux VM in Azure, as any Linux user, you can use the following to see the name of the physical host on which your VM is running:

awk -F 'H' '{ sub(/ostName/,"",$2); print $2 }' /var/lib/hyperv/.kvp_pool_3

Example output: DUB012345678910

In this case, we take the first 3 characters to be “Dublin”, which is in the North Europe Azure region.
The remaining characters consist of the rack and physical hostname.

If you have 2 VMs in the same rack on the same physical host, then you will have minimal latency for networking between them.

Conversely, if you have 2 VMs on the same physical host, you are open to HA issues, since a failure of that one host affects both VMs.

Therefore, you need a good balance for SAP.
You should expect to see SAP S/4HANA application servers and HANA DBs in the same Proximity Placement Groups, within the same rack, even potentially on the same host (providing you have availability sets across the tiers you will be safe).

Update: 23-Apr-2020
The output of the above command contains hidden (non-printable) characters, so to get it cleanly into a bash variable we can use the following:

awk '{ gsub(/[^[:print:]]/,""); split ($0,a,"H"); sub(/ostName/,"",a[2]); print a[2]}' /var/lib/hyperv/.kvp_pool_3
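
Wrapped into small functions, the same command can feed a before/after PPG check. This is a minimal sketch; the function names are hypothetical, and treating the first three characters as the datacentre code is the interpretation described above rather than a documented format.

```shell
#!/usr/bin/env bash
# Read the physical host name of this VM from the Hyper-V KVP pool file,
# dropping the non-printable characters (same awk logic as above).
physical_host() {
    awk '{ gsub(/[^[:print:]]/,""); split($0,a,"H"); sub(/ostName/,"",a[2]); print a[2] }' \
        "${1:-/var/lib/hyperv/.kvp_pool_3}"
}

# First three characters: taken to be the datacentre code (e.g. DUB).
host_datacentre() { printf '%s\n' "${1:0:3}"; }

# Succeeds when two physical host names are non-empty and identical,
# i.e. both VMs report the same physical server.
same_physical_host() { [ -n "$1" ] && [ "$1" = "$2" ]; }

# Example:
#   PHYS_HOST=$(physical_host)
#   echo "Datacentre: $(host_datacentre "$PHYS_HOST")"
```

Collecting the value from each VM (for example over ssh) and comparing the results with same_physical_host shows whether two tiers ended up on the same server.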

Update: 05-Oct-2020
I have since found that there is another location where the above information can also be found.
Depending on your Linux O/S, you may also find the physical server name in the network scripts as follows:

grep BOOTSERVERNAME /var/run/netconfig/eth0/netconfig0

The above will return something like:
BOOTSERVERNAME='AMS072nnnnnnnnn'

SUSE Cloud-Netconfig and Azure VMs – Dynamic Network Configuration

What is SUSE Cloud-Netconfig:
Within the SUSE SLES 12 (and OpenSUSE) operating system, lies a piece of functionality called Cloud-Netconfig.
It is provided as part of the System/Management group of packages.

The Cloud-Netconfig software consists of a set of shell functions and init scripts that are responsible for control of the network interfaces on the SUSE VM when running inside of a cloud framework such as Microsoft Azure.
The core code is part of the SUSE-Enceladus project (code & documents for use with public cloud) and hosted on GitHub here: https://github.com/SUSE-Enceladus/cloud-netconfig.
Cloud-Netconfig requires the sysconfig-netconfig package, as it essentially provides a netconfig module.
Upon installation, the Cloud-Netconfig module is prepended to the front of the netconfig module list like this: NETCONFIG_MODULES_ORDER="cloud-netconfig dns-resolver dns-bind dns-dnsmasq nis ntp-runtime".

What Cloud-Netconfig does:
As with every public cloud platform, a deployed VM is allocated and booted with the configuration for the networking provided by the cloud platform, outside of the VM.
In order to provide the usual networking devices and modules inside the VM with the required configuration information, the VM must know about its environment and be able to make a call out to the cloud platform.
This is where Cloud-Netconfig does its work.
The Cloud-Netconfig code will be called at boot time from the standard SUSE Linux init process (systemd).
It has the ability to detect the cloud platform that it is running within and make the necessary calls to obtain the networking configuration.
Once it has the configuration, this is persisted into the usual network configuration files inside the /sysconfig/network/scripts and /netconfig.d/cloud-netconfig locations.
The configuration files are then used by the wicked service to adjust the networking configuration of the VM accordingly.

What information does Cloud-Netconfig obtain:
Cloud-Netconfig has the ability to influence the following aspects of networking inside the VM.
– DHCP.
– DNS.
– IPv4.
– IPv6.
– Hostname.
– MAC address.

All of the above information is obtained and can be persisted and updated accordingly.

What is the impact of changing the networking configuration of a VM in Azure Portal:
Changing the configuration of the SUSE VM within Azure (for example: changing the DNS server list), will trigger an update inside the VM via the Cloud-Netconfig module.
This happens because Cloud-Netconfig is able to poll the Azure VM Instance metadata service (see my previous blog post on the Azure VM Instance metadata service).
If the information has changed since the last poll, then the networking changes are instigated.

What happens if a network interface is to remain static:
If you wish for Cloud-Netconfig to not manage a networking interface, then there exists the capability to disable management by Cloud-Netconfig.
Simply adjust the network configuration file for that interface in /etc/sysconfig/network and set the variable CLOUD_NETCONFIG_MANAGE=no.
This will prevent future adjustments to this network interface.
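
As a minimal sketch, the setting can be applied idempotently with a small helper. The function name is hypothetical, and the example assumes the interface is eth1 with its configuration in a file named ifcfg-eth1.

```shell
#!/usr/bin/env bash
# Set CLOUD_NETCONFIG_MANAGE='no' in an interface configuration file.
# Replaces an existing assignment, or appends one if the variable is absent.
disable_cloud_netconfig() {
    local cfg=$1 tmp
    if grep -q '^CLOUD_NETCONFIG_MANAGE=' "$cfg"; then
        tmp=$(mktemp) || return 1
        sed "s/^CLOUD_NETCONFIG_MANAGE=.*/CLOUD_NETCONFIG_MANAGE='no'/" "$cfg" > "$tmp" \
            && cat "$tmp" > "$cfg"      # overwrite in place, keeping file permissions
        rm -f "$tmp"
    else
        printf "CLOUD_NETCONFIG_MANAGE='no'\n" >> "$cfg"
    fi
}

# Example (assumed interface and file name):
#   disable_cloud_netconfig /etc/sysconfig/network/ifcfg-eth1
```

Running it twice leaves a single assignment in the file, which makes it safe to include in a repeatable build script.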

How does Cloud-Netconfig interact with Wicked:
SUSE SLES 12 uses the Wicked network manager.
The Cloud-Netconfig scripts adjust the network configuration files in the locations /sysconfig/network/scripts which are then detected by Wicked and the necessary adjustments made (e.g. interfaces brought online, IP addresses assigned or DNS server lists updated).
As soon as the network configuration files have been written by Cloud-Netconfig, this is where the interaction ends.
From this point the usual netconfig services take over (wicked and nanny – for detecting the carrier on the interface).

What happens in the event of a VM primary IP address change:
If the primary IP address of the VM is adjusted in Azure, then the same process as before takes place.
The interface is brought down and then brought back up again by wicked.
This means that in an Azure Site Recovery replicated VM, should you activate the replica, the VM will boot and Cloud-Netconfig will automatically adjust the network configuration to that provided by Azure, even though this VM only contained the config for the previous hosting location (region or zone).
This significantly speeds up your failover process during a DR situation.

Are there any issues with this dynamic network config capability:
Yes, I have seen a number of issues.
In SLES 12 SP3 I have seen issues whereby a delay in the provision of the Azure VM Instance metadata during the boot cycle has caused the VM to lose sight of any secondary IP addresses assigned to the VM in Azure.
On tracing, the problem seemed to originate from a slowness in the full startup of the Azure Linux agent, possibly due to boot diagnostics being enabled. We are still waiting on a SLES patch for this fix.

I have also seen a “problem” whereby an incorrect entry inside the /etc/hosts file can cause the reconfiguration of the VM’s hostname.
Quite surprising. This caused issues in other custom SAP deployment scripts, which relied on the hostname following a specific intelligent naming convention; instead, it was being changed to a temporary hostname used for resolution during an installation of SAP using the Software Provisioning Manager.

How can I debug the Cloud-Netconfig scripts:
According to the manuals, debug logging can be enabled through the standard DEBUG="yes" and WICKED_DEBUG="all" variables in config file /etc/sysconfig/network/config.
However, casting an eye over the scripts and functions inside of the Cloud-Netconfig module, these settings don’t seem to be picked up, so sufficient logging is not produced, especially around the polling of the Azure VM Instance metadata service.
I found that when debugging I had to actually resort to adjusting the function script functions.cloud-netconfig.

Additional information:
https://www.suse.com/c/multi-nic-cloud-netconfig-ec2-azure/
https://www.suse.com/documentation/sles-12/singlehtml/book_sle_admin/book_sle_admin.html
https://github.com/SUSE-Enceladus/cloud-netconfig
https://www.suse.com/media/presentation/wicked.pdf
https://github.com/openSUSE/wicked

Understand and Use the Azure Instance Metadata Service with SAP

In the below post, we will explore the Azure Instance Metadata service and how we can make use of the service when deploying our SAP landscape.

What Is the Azure Instance Metadata Service?

The Azure Metadata Service is a locally accessed (on each VM deployed in Azure), REST enabled, API versioned HTTP service endpoint that provides a gateway to the Azure “fabric” hosting your VMs.

New features are added through new versions of the API, accessed through the URI and by appending the required version as a querystring parameter.

What Can You Do With the Azure Instance Metadata Service?

A simple example would be to query the service to show the current VM size (Azure VM Size) from within the VM itself, without needing access to the Azure Portal or any Azure authorisation (e.g. Service Principals).

How Can You Query the Azure Instance Metadata Service?

Depending on whether you’re using Linux or Windows as your VM operating system, you can call the REST API for the Azure Instance Metadata Service using something similar to the following in Linux:

curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2019-06-01"

or in PowerShell 6.0+ on Windows (includes -NoProxy):

Invoke-RestMethod -Headers @{"Metadata"="true"} -Method GET -NoProxy -Uri "http://169.254.169.254/metadata/instance?api-version=2019-06-01"

or, for PowerShell versions before 6.0 (excludes -NoProxy):

Invoke-RestMethod -Headers @{"Metadata"="true"} -Method GET -Uri "http://169.254.169.254/metadata/instance?api-version=2019-06-01"

This will return a JSON string which, among other things, will contain the current VM size.
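
If only one value is needed, such as the VM size, the JSON parsing can be avoided by requesting a single leaf of the metadata tree as text. The following is a minimal sketch; the function and variable names are hypothetical, and it assumes the compute/vmSize path is available in the API version used above.

```shell
#!/usr/bin/env bash
IMDS_BASE="http://169.254.169.254/metadata/instance"
IMDS_API_VERSION="2019-06-01"

# Build the URL for a single metadata value, e.g. "compute/vmSize".
imds_url() {
    printf '%s/%s?api-version=%s&format=text\n' "$IMDS_BASE" "$1" "$IMDS_API_VERSION"
}

# Fetch one value as plain text; this only works from inside an Azure VM.
imds_get() {
    curl -s -H Metadata:true --noproxy "*" "$(imds_url "$1")"
}

# Example: VM_SIZE=$(imds_get compute/vmSize)
```

Keeping the API version in one variable makes it simple to move every call to a newer version later.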

You can use the querystring parameter “format=text” to get a raw text response:

169.254.169.254/metadata/instance?api-version=2017-08-01&format=text

For more information on the API options and returned data use the following links for Windows or Linux VMs:

What Is Providing the 169.254.x.x Address?

The Azure Instance Metadata service is provided by the WAAGENT. This (in Linux) is a daemon service and in Windows is a Windows Service installed during the VM build process when a VM is built using the Azure Resource Manager (not the Classic Azure VM build process).

The agent is a set of python routines. These python routines are visible on GitHub here: https://github.com/Azure/WALinuxAgent
The agent is not required to be installed inside VMs hosted in Azure but it is used by a multitude of Azure features.

If you analyse the agent log files (see /var/log/waagent.log in Linux), you will see that the agent is in constant communication with Azure APIs over HTTP (and HTTPS).

Can I Disable the Azure Instance Metadata Service?

Yes, you can disable it (see here: https://github.com/Azure/WALinuxAgent/wiki/VMs-without-WALinuxAgent), but without the agent running, you will not be able to run the Azure Enhanced Monitoring for Linux (AEM) plugin which is required in a production SAP system, because of the required use of Premium disks (see SAP note 2191498).
The Azure Instance Metadata service will auto-start with the VM.

There are noted downsides to having the agent running (documented here: https://raymii.org/s/blog/Linux_on_Microsoft_Azure_Disable_this_built_in_root_access_backdoor.html) but as mentioned, for SAP support, you need Azure Enhanced Monitoring (for Linux) which is a plugin for this agent.

Is the Azure Instance Metadata Service Used by SAP?

Yes, although indirectly.
The SAP Hostagent (7.21) is able to query the metadata service statistics of the guest VM.
The statistics are recorded into local file system files by the Azure Enhanced Monitoring for Linux agent plugin (also listed on GitHub under here: https://github.com/Azure/azure-linux-extensions/tree/master/AzureEnhancedMonitor).

The AEM plugin is a basic set of Python routines for the recording of the Azure disk and CPU statistics into designated flat text files (in Linux see /var/lib/AzureEnhancedMonitor/PerfCounters), and these files are then consumed by the SAP Hostagent.

As you may know, the Hostagent includes the SAPOSCOL (SAP O/S Collector) binary executable, which is the actual process within the SAP Hostagent delivered binaries, responsible for digesting the AEM statistics.
It makes the statistical information available in a shared memory segment, which can be accessed by a SAP Netweaver stack (in fact you can access it manually also by using the SAPOSCOL interactive command line).

In SAP Netweaver (AS ABAP) you can use transaction ST06 to access this SAPOSCOL information, where you will see a summary page for the O/S details (including the Azure provided details) plus a historical report of statistical data, all obtained from the SAPOSCOL memory segment.

Is the Azure Instance Metadata Service ReadOnly?

Yes, all of the data is readonly.
However, there is one area that you can influence using an HTTP POST as outlined in the information provided here:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events

As you will see the ScheduledEvents API doesn’t really give you any control of the VM, as it’s more of a notification provider that gives you fair warning and allows you time to perform some provisional processing prior to a scheduled event execution.
It’s not used by the SAP Hostagent as far as I can determine.

How Can We Utilise the Azure Instance Metadata Service During SAP Deployment Projects?

During deployments of SAP into Microsoft Azure, I have found it very useful to script access to the Azure Instance Metadata service to form part of a basic configuration check of VMs.

As an example, a Custom Operation can be defined in SAP LaMa (SAP Landscape Manager) which can be executed across all known SAP Hostagents and can return the information back into SAP LaMa as part of a Custom Validation execution (see more about SAP LaMa Custom Validation here: https://blogs.sap.com/2018/05/14/how-to-use-sap-landscape-management-custom-validations).

This then provides you with an easy SAP level reporting capability to see what size of VMs you’re running in your landscape and the configuration of such items like Azure disk cache settings (an important topic for HANA databases!).
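
The core of such a configuration check can be very small. The sketch below is not a SAP LaMa custom operation definition; it only shows the comparison logic that such an operation might call. The function name and the example VM sizes are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Compare the VM size reported by the metadata service against a list of
# sizes that are allowed for this system. Returns non-zero if not allowed.
# Usage: check_vm_size <actual_size> <allowed_size> [<allowed_size>...]
check_vm_size() {
    local actual=$1 allowed
    shift
    for allowed in "$@"; do
        if [ "$actual" = "$allowed" ]; then
            echo "OK: VM size $actual is allowed"
            return 0
        fi
    done
    echo "FAIL: VM size '$actual' is not one of: $*"
    return 1
}

# Example (assumed sizes, with the actual size obtained from the metadata service):
#   check_vm_size "$VM_SIZE" Standard_M64s Standard_M128s
```

Returning a non-zero exit code on failure lets the calling tool treat the check as failed without parsing the message text.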

What is /usr/sbin/azuremetadata ?

In distributions of SUSE Linux (including OpenSUSE), a command-line binary executable exists which calls the Azure Instance Metadata service.

It has a fixed set of command line options and can be used to retrieve a subset of the data that can be queried using “curl” or “wget”.

If you need only the barest, quickest method of calling the Azure Instance Metadata service, then this binary executable will probably suffice.

This executable is also used by other SUSE features, so it is unlikely that it will be deprecated, however, it may not use the latest version of the API.

What Is the Latest Version of the Azure Instance Metadata Service API?

If you look at the two URLs provided previously for Windows and Linux, you will notice they contain a section called “Versioning” on the pages which details the currently supported versions of the API.

Are There Any Issues With the Azure Instance Metadata Service?

Yes, I’ve seen a couple of issues.
The service is relied upon in various areas of SUSE Linux cloud-netconfig to provide the VM with IP address details at boot time.
If this integration fails or is slow, your Linux VM may not have all IP addresses after boot (only the primary IP).

Sometimes (quite a lot of times) you will notice timeout errors in the agent log file as it tries to talk to Azure APIs.
Apparently this is normal and noted in a few forum posts in places. However, it means that the agent is obviously “stalling” while it experiences this “timeout”. Therefore I would argue that it is not ideal.
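
To get a feel for how often this happens on a given VM, the log can simply be scanned. This is a minimal sketch with a hypothetical function name; the exact wording of the agent's timeout messages is an assumption, so the pattern just matches "timeout", "time out" or "timed out" in any case.

```shell
#!/usr/bin/env bash
# Count log lines that mention a timeout (case-insensitive).
# Defaults to the Linux agent log location mentioned earlier.
count_timeouts() {
    local log=${1:-/var/log/waagent.log}
    grep -ciE 'time(d)? ?out' "$log"
}

# Example: echo "Timeout lines: $(count_timeouts)"
```

Note that grep exits with a non-zero status when the count is 0, which is worth remembering if the function is used in a script running with "set -e".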