This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Best Disk Topology for SAP ASE Databases on Azure

Maybe you are considering migration of on-premise SAP ASE databases to Microsoft Azure, or you may be considering migrating from your existing database vendor to SAP ASE on Azure.
Either way, you will benefit from understanding a good, practical disk topology for SAP ASE on Azure.

In this post, I show how you can optimise use of the SAP ASE, Linux and Azure technical layers to provide a balanced approach to disk use, considering both performance and disk (ASE device) management.

The Different Layers

In an ASE on Linux on Azure (IaaS) setup, you have the following layers:

  • Azure Storage Services
  • Azure Data Disk Cache Settings
  • Linux Physical Disks
  • Linux Logical Volumes
  • Linux File Systems
  • ASE Database Data Devices
  • ASE Instance

Each layer has different options around tuning and setup, which I will highlight below.

Azure Storage Services

Starting at the bottom of the diagram we need to consider the Azure Disk Storage that we wish to use.
There are 2 design considerations here:

  • size of disk space required.
  • performance of disk device.

For performance, you are more than likely tied by the SAP requirements for running SAP on Azure.
Currently these require a minimum of Premium SSD storage, since it provides a guaranteed SLA. However, as of June 2020, Standard SSD was also given an SLA by Microsoft, potentially paving the way for cheaper disk (when certified by SAP) provided that it meets your SLA expectations.

Generally, the size of disk determines the performance (IOPS and MBps), but this can also be influenced by the quantity of data disk devices.
For example, by using 2 data disks striped together you can double the available IOPS. The IOPS are an important factor for databases, especially on high throughput database systems.

When considering multiple data disks, you also need to remember that each VM has limitations. There is a VM level IOPS limit, a VM level throughput limit (megabytes per second) plus a limit to the number of data disks that can be attached. These limit values are different for different Azure VM types.

Also remember that in Linux, each disk device has its own set of queues and buffers. Making use of multiple Linux disk devices (which translates directly to the number of Azure data disks) usually means better performance.

Essentials:

  • Choose minimum of Premium SSD (until Standard SSD is supported by SAP).
  • Spread database space requirements over multiple data disks.
  • Be aware of the VM level limits.

Azure Data Disk Cache Settings

Correct configuration of the Azure data disk cache settings on the Azure VM can help with performance and is an easy step to complete.
I have already documented the best practice Azure Disk Cache settings for ASE on Azure in a previous post.

Essentials:

  • Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.

Use LVM For Managing Disks

Always use a logical volume manager, instead of formatting the Linux physical disk devices directly.
This allows the most flexibility for growing, shrinking and striping the disks for size and performance.

You should stripe the data logical volumes with a minimum of 2 physical disks and a maximum stripe size of 128KB (test it!). This fits within the window of testing that Microsoft have performed in order to achieve the designated IOPS for the underlying disk. It’s also the maximum size that ASE will read at. Depending on your DB read/write profile, you may choose a smaller stripe size such as 64KB, but it depends on the amount of Large I/O and pre-fetch. When reading the Microsoft documents, consider ASE to be the same as MS SQL Server (they are are from the same code lineage).

Stripe the transaction log logical volume(s) with a smaller stripe size, maybe start at 32KB and go lower but test it (remember HANA is 2KB stripe size for log volumes, but HANA uses Azure WriteAccelerator).

Essentials:

  • Use LVM to create volume groups and logical volumes.
  • Stipe the data logical volumes with (max) 128KB stripe size & test it.

Use XFS File System

You can essentially choose to use your preferred file system format; there are no restrictions – see note 405827.
However, if you already run or are planning to run HANA databases in your landscape, then choosing XFS for ASE will make your landscape architecture simpler, because HANA is recommended to run on an XFS file system (when on local disk) on Linux; again see SAP note 405827.

Where possible you will need to explicitly disable any Linux file system write barrier caching, because Azure will be handling the caching for you.
In SUSE Linux this is the “nobarrier” setting on the mount options of the XFS partition and for EXT4 partitions it is option “barrier=0”.

Essentials:

  • Choose disk file system wisely.
  • Disable write barriers.

Correctly Partition ASE

For SAP ASE, you should segregate the disk partitions of the database to avoid certain system specific databases or logging areas, from filling other disk locations and causing a general database system crash.

If you are using database replication (maybe SAP Replication Server a.k.a HADR for ASE), then you will have additional replication queue disk requirements, which should also be segregated.

A simple but flexible example layout is:

Volume
Group
Logical
Volume
Mount PointDescription
vg_aselv_ase<SID>/sybase/<SID>For ASE binaries
vg_sapdatalv_sapdata<SID>_1./sapdata_1One for each ASE device for SAP SID database.
vg_saploglv_saplog<SID>_1./saplog_1One for each log device for SAP SID database.
vg_asedatalv_asesec<SID>./sybsecurityASE security database.
lv_asesyst<SID>./sybsystemASE system databases (master, sybmgmtdb).
lv_saptemp<SID>./saptempThe SAP SID temp database.
lv_asetemp<SID>./sybtempThe ASE temp database.
lv_asediag<SID>./sapdiagThe ASE saptools database.
vg_asehadrlv_repdata<SID>./repdataThe HADR queue location.
vg_backupslv_backups<SID>./backupsDisk backup location.

The above will allow each disk partition usage type to be separately expanded, but more importantly, it allows specific Azure data disk cache settings to be applied to the right locations.
For instance, you can use read-write caching on the vg_ase volume group disks, because that location is only for storing binaries, text logs and config files for the ASE instance. The vg_asedata contains all the small ASE system databases, which will not use too much space, but could still benefit from read caching on the data disks.

TIP: Depending on the size of your database, you may decide to also separate the saptemp database into its own volume group. If you use HADR you may benefit from doing this.

You may not need the backups disk area if you are using a backup utility, but you may benefit from a scratch area of disk for system copies or emergency dumps.

You should choose a good naming standard for volume groups and logical volumes, because this will help you during the check phase, where you can script the checking of disk partitioning and cache settings.

Essentials:

  • Segregate disk partitions correctly.
  • Use a good naming standard for volume groups and LVs.
  • Remember the underlying cache settings on those affected disks.

Add Whole New ASE Devices

Follow the usual SAP ASE database practices of adding additional ASE data devices on additional file system partitions sapdata_2, sapdata_3 etc.
Do not be tempted to constantly (or automatically) expand the ASE device on sapdata_1 by adding new disks, you will find this difficult to maintain because striped logical volumes need at least 2 disks in the stripe set.
It will get complicated and is not easy to shrink back from this.

When you add new disks to an existing volume group and then expand an existing lv_sapdata<SID>_n logical volume, it is not as clean as adding a whole new logical volume (e.g. lv_sapdata<SID>_n+1) and then adding a whole new ASE data device.
The old problem of shrinking data devices is more easily solved by being able to drop a whole ASE device, instead of trying to shrink one.

NOTE: The Microsoft notes suggest enabling automatic DB expansion, but on Azure I don’t think it makes sense from a DB administration perspective.
Yes, by adding a new ASE device, as data ages you may end up with “hot” devices, but you can always move specific devices around and add more underlying disks and re-stripe etc. Keep the layout flexible.

Essentials:

  • Add new disks to new logical volumes (sapdata_n+1).
  • Add big whole new ASE devices to the new LVs.

Summary:

We’ve been through each of the layers in detail and now we can summarise as follows:

  • Choose a minimum of Premium SSD.
  • Spread database space requirements over multiple data disks.
  • Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.
  • Use LVM to create volume groups and logical volumes.
  • Stipe the logical volumes with (max) 128KB stripe size & test it.
  • Choose disk file system wisely.
  • Disable write barriers.
  • Segregate disk partitions correctly.
  • Use a good naming standard for volume groups (and LVs).
  • Remember the underlying cache settings on those affected disks.
  • Add new disks to new logical volumes (sapdata_n).
  • Add big whole new ASE devices to the new LVs.

Useful Links:

Is my Azure hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s look at what this is and how it impacts a Microsoft Azure virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

It is a “Classic Buffer Overflow” vulnerability in the GRUB2 bootloader for versions prior to 2.06.
Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
The input data causes an overflow of the holding memory area into adjacent memory areas.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this) “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device“.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one, boot into it and then have further capability to effectively set up camp (a backdoor) on the server.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It is used to load the main operating system of a computer.
Usually on Linux virtual machines, GRUB is used to load Linux. It is possible to install GRUB on machines that then boot into Windows.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” and “Secure Boot” (a.k.a UEFI boot).
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.
This is clearly quite insecure, since any program could put itself at the designated location and then be executed at boot up.

With Secure Boot, certificates are used to secure the boot process chain.
As with any certificate based process, at the top (root) level there needs to exist a certificate which is valid for many years and is ultimately trusted – the Certificate Authority (CA).
The next levels in the chain trust that CA certificate implicitly and if any point in the chain is compromised, then the trust is broken and will need re-establishing with new certificates.
Depending which level of the chain is compromised, will dictate the amount of effort needed to fix it.

This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the attackers Need Root?

Yes, the vulnerability is in a GRUB2 configuration text file owned by the root user. Additional text added to the file can cause the buffer overflow.
Once the attacker has used malware to instigate the overflow, and installed a malicious bootloader, they then have a backdoor to the server, which would be executed every time the server is rebooted.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

First off, the vulnerable GRUB2 software needs patching; this is quite easy and will require a reboot of the Linux O/S.
The problem with patching just GRUB2, is that it is still possible for an attacker with root to re-install a vulnerable version of GRUB2 and then use that vulnerable version to compromise the system further.
Remember, the chain of trust is still trusting that vulnerable version of GRUB2.
Therefore, to be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

They key point is that you *could* start patching, but if it were me, I would be tempted to wait until the SUSE “shim” has been updated with the new chain certificate, patch GRUB2 and then update the “dbx”.

How does this impact Azure VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since Microsoft Azure is using the Hyper-V hypervisor, the “firmware” is actually software in Hyper-v.
See here, which says: “Secure Boot or UEFI firmware isn’t required on the physical Hyper-V host. Hyper-V provides virtual firmware to virtual machines that is independent of what’s on the Hyper-V host.

So the above would indicate that the Virtual Machine contains the necessary code from Hyper-V.
I would imagine that this is included at VM creation time.

If we dig into the VM details a little bit here on the Microsoft sites, we find:

So the above states that “…generation 2 VMs in Azure do not support Secure Boot…“.
The words “…in Azure…” are the key part of this.

OK, then how about Hyper-V in general (on-premise):

The above states “To Secure Boot generation 2 Linux virtual machines, you need to choose the UEFI CA Secure Boot template when you create the virtual machine.“.
BUT this is for Hyper-V in general, not for Azure virtual machines.

So we know that Secure Boot is not available in Azure on any of the generation 1 or generation 2 VMs (as of writing there are only 2).

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on Azure virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability for which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES on Azure, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in Azure on what is known as “HANA Large Instances” (HLI). These are physical machines. So whilst Virtual Machines can’t use Secure Boot, these physical machines may well do so. You would be wise to contact your Microsoft account representative to establish if they will be patching the firmware.

Useful Links:

Azure Front Door in a SAP Context

In April 2019, Microsoft announced the general availability of the Azure Front Door service.
The highlight of this service is layer 7 (HTTP/S) load balancing.
In this post I want to briefly explore how Azure Front Door could sit in an example SAP landscape.

But We Have Azure Application Gateway…

Yes, while the Azure Front Door service does provide similar capabilities with regards to load balancing an HTTP/s based back-end service, the similarities end when we start to consider multi-regional distribution of services. That is, multiple Azure regions actively servicing global clients.

Azure Application Gateway

The Azure Application Gateway service is the go-to service for HTTP/S load balancing for your Azure hosted HTTP/S IaaS or Container based services that are contained within a region.

Event for some, limited, SAP uses, the Azure Application Gateway may be sufficient, but you really need an experienced SAP Solution Architect to help you plan your SAP landscape architecture at this point. The consequences of doing it wrong, could cause you to completely re-implement a new architecture pattern in your landscape and, of course, additional cost.

… and SAP Web Dispatcher

I have discussed the features of the SAP Web Dispatcher before.
The need for a SAP Web Dispatcher in a SAP landscape is clear and even more appropriate in a cloud deployment of SAP.
Just like Azure Application Gateway, the SAP Web Dispatcher’s context should be limited to a single region. This is especially true because it is IaaS, which means the VMs on which the Web Dispatcher is deployed, are themselves bound to a specific region.

However, what is not clear is how disparate Web Dispatcher systems (i.e. different SAPGLOBALHOST values) can be used in different regions to correctly load balance. This is not the same as a single system with different instances in different regions!

How It All Hangs Together

If we go back to the purpose of this post, I wanted to show how Azure Front Door could be used within the context of a SAP system deployment in Azure.

To help convey the idea, I’ve put together a simple diagram:

In the above diagram, you can see that the Azure Front Door service is used to balance inbound requests from a customer booking system, across multiple Azure regions, directly from the internet. This means that Azure Front Door is most definitely suited as a global customer facing load balancer.
An example scenario is a 2 (or more) region architecture with primary region and disaster recovery region. If the primary region for our customer booking system is unavailable, a DR could be invoked and customers could be routed to the DR region, allowing customer bookings to be taken.

In the diagram, traffic routed from Azure Front Door, is then (for the sake of example) routed through Azure Application Gateway. This is just for example, but in reality it’s not really needed. It could be that you have a real mixture of SAP and non-SAP in some converged sub-domain, and it may be easier to load balance this mix of URLs at this level.
The main point at this point is, you are committed to returning data from a single region.

In our example diagram, the Azure Application Gateway then routes traffic to the SAP Web Dispatcher, which load balances the traffic over the back-end SAP ECC system available application servers using the ABAP stack message server (a feature that is not easily replicated in any other load balancer).

Where Does Azure Traffic Manager Sit?

The Azure Traffic Manager service is a DNS based routing and distribution service. If your company is a multinational conglomerate with a latency sensitive web based customer service, then the Azure Traffic Manager can be used to route customers to their nearest region, where you have your web service hosted and where they can potentially get the speedist and most appropriate content.
If you have only 2 or 3 regions, do not have latency issues and have no need to provide region specific content, then Azure Front Door is probably what you need.

Summary:

I’ve tried to show how the Azure Front Door service can provide your internet sourced, customer entry point into your multi-region web service.
The diagram I’ve provided hopefully shows how Azure Front Door can be distinguished from other similar technologies in a SAP landscape including how Azure Application Gateway could also be in the mix (although rare).
Finally I discuss how Azure Traffic Manager may not always be appropriate for load distribution.

Useful Links

Azure Disk Cache Settings for an SAP Database on Linux

One of your go-live tasks once you have built a VM in Azure, should be to ensure that the Azure disk cache settings on the Linux VM data disks, are set correctly in accordance with the Microsoft recommended settings.
In this post I explain the disk cache options and how they apply to SAP and especially to SAP databases such as SAP ASE and SAP HANA, to ensure you get optimum performance.

What Are the Azure Disk Cache Settings?

In Microsoft Azure you can configure different disk cache settings on data disks that are attached to a VM.
NOTE: You do not need to consider changing the O/S root disk cache settings, as by default they are applied as per the Azure recommendations.

Only specific VMs and specific disks (Standard or Premium Storage) have the ability to use caching.
If you use Azure Standard storage, the cache is provided by local disks on the physical server hosting your Linux VM.
If you use Azure Premium storage, the cache is provided by a combination of RAM and local SSD on the physical server hosting your Linux VM.

There are 3 different Azure disk cache settings:

  • None
  • ReadOnly (or “read-only”)
  • ReadWrite (or “read/write”)

The cache settings can influence the performance and also the consistency of the data written to the Azure storage service where your data disks are stored.

Cache Setting: None

By specifying “None” as the cache setting, no caching is used and a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache, will be read from the storage service.

Cache Setting: ReadOnly

By specifying “ReadOnly” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache, will be read from the read cache on the underlying physical machine, before being read from the storage service.

Cache Setting: ReadWrite

By specifying “ReadWrite” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the cache on the underlying physical machine.
All read operations for data not already in the VM O/S file system cache, will be read from the read cache on the underlying physical machine, before being read from the storage service.

Where Do We Configure the Disk Cache Settings?

The disk cache settings are configured in Azure against the VM (in the Disks settings), since the disk cache is both physical host and VM series dependent. It is *not* configured against the disk resource itself, as explained in my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

Any Recommendations for Disk Cache Settings?

There are specific recommendations for Azure disk cache settings, especially when running SAP and especially when running databases like SAP ASE or SAP HANA.

In general, the rules are:

Disk UsageAzure Disk Cache Setting
Root O/S disk (/)ReadWrite – ALWAYS!
HANA SharedReadOnly
ASE Home
(/sybase/<SID>)
ReadOnly
Database DataHANA=None, ASE=ReadOnly
Database LogNone

The above settings for SAP ASE have been obtained from SAP note 2367194 (SQL Server is same as ASE) and from the general deployment guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general
The use of write caching on the ASE home is optional, you could choose ReadOnly, it would help protect the ASE config file in a very specific scenario. It is envisaged that using ASE 16.0 with SRS/HADR you would have a separate data disk for the Replication Server data (I’ll talk about this in another post).

The above settings for HANA have been taken from the updated guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage which is designed to meet the KPIs mentioned in SAP note 2762990.

The reason for not using a write cache every time, is because an issue at the physical host level, affecting the cache, could cause the application (e.g database) to think it has committed data, when it actually isn’t written to disk. This is not good for databases, especially if the issue affects the transaction/redo log area. Data loss could occur.

It’s worth noting that this cache “issue” has always been true of every caching technology ever created, on which databases run. Storage tech vendors try to mitigate this by putting batteries into the storage appliances, but since the write cache in Azure is at the physical host level, there’s just no guarantee that when the VM O/S thinks the write operation has committed to disk, that it has actually been written to disk.

How About Write Accelerator?

There are specific Azure VM series (M-series at current) that support something known as “Write Accelerator”.
This is an extra VM level setting for Premium Storage disks attached to M-series VMs.

Enabling the Write Accelerator setting is a requirement by Microsoft for production SAP HANA transaction log disks on M-Series VMs. This setting ebales the Azure VM to meet the SAP HANA key performance indicators in note 2762990. Azure Write Accelerator is designed to provide lower latency write times on Premium Storage.

You should ensure that the Write Accelerator setting is enabled where appropriate, for your HANA database transaction log disks. You can check if it is enabled following my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

I’ve tried my best to find more detailed information on how the Write Accelerator feature is actually provided, but unfortunately it seems very elusive. Robert Boban (of Microsoft) commented on a LinkedIn post here: “It is special caching impl. for M-Series VM to fulfill SAP HANA req. for <1ms latency between VM and storage layer.“.

Check the IOPS

Once you have configured your disks and the cache settings, you should ensure that you test the IOPS achieved using the Microsoft recommended process.
You can follow similar steps as my previous post: Recreating SAP ASE Database I/O Workload using Fio on Azure

As mentioned in other places in the Microsoft documentation and SAP notes such as 2367194, you need to ensure that you choose the correct size and series of VM to ensure that you align the required VM maximum IOPS with the intended amount of data disks and their potential IOPS maximum. Otherwise you could hit the VM max IOPS before touching the disk IOPS maximum.

Enable Accelerated Networking

Since the storage is itself connected to your VM via the network, you should ensure that Accelerator Networking is enabled in your VMs Network Settings:

Checking Cache Settings Directly on the VM

As per my previous post Checking Azure Disk Cache Settings on a Linux VM in Shell, you can actually check the Azure disk cache settings on the VM itself. You can do it manually, or write a script (better option for whole landscape validation).

Summary:

I discussed the two types of storage (standard or premium) that offer disk caching, plus where in Azure you need to change the setting.
The table provided a list of cache settings for both SAP ASE and SAP HANA databases and their data disk areas, based on available best-practices.

I mentioned Write Accelerator for HANA transaction log disks and ensuring that you enable Accelerated Networking.
Also provided was a link to my previous post about running a check of IOPS for your data disks, as recommended by Microsoft as part of your go-live checks.

A final mention was made another post of mine, with a great way of checking the disk cache settings across the VMs in the landscape.

Useful Links:

Windows File Cache

https://docs.microsoft.com/en-us/azure/virtual-machines/linux/premium-storage-performance

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/how-to-enable-write-accelerator

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage#production-storage-solution-with-azure-write-accelerator-for-azure-m-series-virtual-machines

https://petri.com/digging-into-azure-vm-disk-performance-features

https://techcommunity.microsoft.com/t5/running-sap-applications-on-the/sap-on-azure-general-update-march-2019/ba-p/377456

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage

SAP Note 2762990 – How to interpret the report of HWCCT File System Test

SAP Note 2367194 – Use of Azure Premium SSD Storage for SAP DBMS Instance

Checking Azure Disk Cache Settings on a Linux VM in Shell

In a previous blog post, I ended the post by showing how you can use the Azure Enhanced Monitoring for Linux to obtain the disk cache settings.
Except, as we found, it doesn’t easily allow you to relate the Linux O/S disk device names and volume groups, to the Azure data disk names.

You can read the previous post here: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

In this short post, I pick up where I left off and outline a method that will allow you to correlate the O/S volume group name, with the Linux O/S disk devices and correlate those Linux disk devices with the Azure data disk names, and finally, the Azure data disks with their disk cache settings.

Using the method I will show you, you will see how easily you can verify that the disk cache settings are consistent for all disks that make up a single volume group (very important), and also be able to easily associate those volume groups with the type of usage of the underlying Azure disks (e.g. is it for database data, logs or executable binaries).

1. Check If AEM Is Installed

Our first step is to check if the Azure Enhanced Monitoring for Linux (AEM) extension is installed on the Azure VM.
This extension is required, for your VM to be supported by SAP.

We use standard Linux command line to check for the extension on the VM:

ls -1 /var/lib/waagent/Microsoft.OSTCExtensions.AzureEnhancedMonitorForLinux-*/config/0.settings

The listing should return at least 1 file called “0.settings”.
If you don’t have this and you don’t have a directory starting with “Microsoft.OSTCExtensions.AzureEnhancedMonitorForLinux-“, then you don’t have AEM and you should get it installed following standard Microsoft documentation.

2. Get the Number of Disks Known to AEM

We need to know how many disks AEM knows about:

grep -c 'disk;Caching;' /var/lib/AzureEnhancedMonitor/PerfCounters

3. Get the Number of SCSI Disks Known to Linux

We need to know how many disks Linux knows about (we exclude the root disk /dev/sda):

lsscsi --size --size | grep -cv '/dev/sda'

4. Compare Disk Counts

Compare the disks quantity from AEM and from Linux.  They should be the same.  This is the number of data disks attached to the VM.

If you have a lower number from the AEM PerfCounters file, then you may be suffering the effects of an Azure bug in the AEM extension which is unable to handle more than 9 data disks.
Do you have more than 9 data disks?

At this point if you do not have matching numbers, then you will not be able to continue, as the AEM output is vital in the next steps.

Mapping Disks to the Cache Settings

Once we know our AEM PerfCounters file contains all our data disks, we are now ready to map the physical volumes (on our disk devices) to the cache settings. On the Linux VM:

pvs -o "pv_name,vg_name" --separator=' ' --noheadings

Your output should be a list of disks and their volume groups like so (based on our diagram earlier in the post):

/dev/sdc vg_data
/dev/sdd vg_data

Next we look for a line in the AEM PerfCounters file that contains that disk device name, to get the cache setting:

awk -F';' '/;disk;Caching;/ { sub(/\/dev\//,"",$4); printf "/dev/%s %s\n", tolower($4), tolower($6) }' /var/lib/AzureEnhancedMonitor/PerfCounters

The output will be the Linux disk device name and the Azure data disk cache setting:

/dev/sdc none
/dev/sdd none

For each line of disks from the cache setting, we can now see what volume group it belongs to.
Example: /dev/sdc is vg_data and the disk in Azure has a cache setting of “none”.

If there are multiple disks in the volume group, they all must have the same cache setting applied!

Finally, we look for the device name in the PerfCounters file again, to get the name of the Azure disk:

NOTE: Below is looking specifically for “sdc”.

awk -F';' '/;Phys. Disc to Storage Mapping;sdc;/ { print $6 }' /var/lib/AzureEnhancedMonitor/PerfCounters

The output will be like so:

None sapserver01-datadisk1
None sapserver01-datadisk2

We can ignore the first column output (“None”) in the above, it’s not needed.

Summary

If you package the AEM disk count check and the subsequent AEM PerfCounters AWK scripts into one neat script with the required loops, then you can get the output similar to this, in one call:

/dev/sdd none vg_data sapserver01-datadisk2
/dev/sdc none vg_data sapserver01-datadisk1
/dev/sda readwrite

Based on the above output, I can see that my vg_data volume group disks (sdc & sdd) all have the correct setting for Azure data disk caching in Azure for a HANA database data disk location.

Taking a step further, if you have intelligently named your volume group names, you then also check in your script, the cache setting based on the name of the volume group to determine if it is correct, or not.
You can then embed this validation script into a “custom validation” within SAP LaMa and it will alert you automatically if your VM disk cache settings are not correct.

You may be wondering, why not do all this from the Azure Portal?
Well, the answer to that is that you don’t know what Linux VM volume groups those Azure disks are used by, unless you have tagged them or named them intelligently in Azure.