This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Best Disk Topology for SAP ASE Databases on Azure

Maybe you are considering migration of on-premise SAP ASE databases to Microsoft Azure, or you may be considering migrating from your existing database vendor to SAP ASE on Azure.
Either way, you will benefit from understanding a good, practical disk topology for SAP ASE on Azure.

In this post, I show how you can optimise use of the SAP ASE, Linux and Azure technical layers to provide a balanced approach to disk use, considering both performance and disk (ASE device) management.

The Different Layers

In an ASE on Linux on Azure (IaaS) setup, you have the following layers:

  • Azure Storage Services
  • Azure Data Disk Cache Settings
  • Linux Physical Disks
  • Linux Logical Volumes
  • Linux File Systems
  • ASE Database Data Devices
  • ASE Instance

Each layer has different options around tuning and setup, which I will highlight below.

Azure Storage Services

Starting at the bottom of the stack, we need to consider which Azure Disk Storage we wish to use.
There are 2 design considerations here:

  • size of disk space required.
  • performance of disk device.

For performance, you are more than likely tied by the SAP requirements for running SAP on Azure.
Currently these require a minimum of Premium SSD storage, since it provides a guaranteed SLA. However, as of June 2020, Standard SSD was also given an SLA by Microsoft, potentially paving the way for cheaper disk (when certified by SAP) provided that it meets your SLA expectations.

Generally, the size of disk determines the performance (IOPS and MBps), but this can also be influenced by the quantity of data disk devices.
For example, by using 2 data disks striped together you can double the available IOPS. The IOPS are an important factor for databases, especially on high throughput database systems.

When considering multiple data disks, you also need to remember that each VM has limitations. There is a VM level IOPS limit, a VM level throughput limit (megabytes per second) plus a limit to the number of data disks that can be attached. These limit values are different for different Azure VM types.
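
To make the interplay between the number of disks and the VM level limit concrete, here is a minimal sketch. The 5000 IOPS per disk and the 20000 IOPS VM cap are illustrative values I have assumed, not figures for a particular disk or VM size, so substitute the real numbers for your own setup.

```shell
#!/bin/sh
# Estimate the IOPS available to a striped set of Azure data disks.
# Arguments: disk count, per-disk IOPS, VM-level IOPS cap.
# All three are values you look up yourself; nothing is queried from Azure.
effective_iops() {
    disks=$1; per_disk=$2; vm_cap=$3
    total=$(( disks * per_disk ))
    # The VM-level limit wins when the disks could collectively deliver more.
    if [ "$total" -gt "$vm_cap" ]; then
        total=$vm_cap
    fi
    echo "$total"
}

# Two disks striped together double the per-disk figure (assumed values).
effective_iops 2 5000 20000
# Six of the same disks would exceed the assumed VM cap, so the cap applies.
effective_iops 6 5000 20000
```

With these assumed inputs, the first call prints 10000 and the second prints 20000, which is why adding ever more disks eventually stops helping.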

Also remember that in Linux, each disk device has its own set of queues and buffers. Making use of multiple Linux disk devices (which translates directly to the number of Azure data disks) usually means better performance.

Essentials:

  • Choose minimum of Premium SSD (until Standard SSD is supported by SAP).
  • Spread database space requirements over multiple data disks.
  • Be aware of the VM level limits.

Azure Data Disk Cache Settings

Correct configuration of the Azure data disk cache settings on the Azure VM can help with performance and is an easy step to complete.
I have already documented the best practice Azure Disk Cache settings for ASE on Azure in a previous post.

Essentials:

  • Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.

Use LVM For Managing Disks

Always use a logical volume manager, instead of formatting the Linux physical disk devices directly.
This allows the most flexibility for growing, shrinking and striping the disks for size and performance.

You should stripe the data logical volumes over a minimum of 2 physical disks, with a maximum stripe size of 128KB (test it!). This fits within the window of testing that Microsoft have performed in order to achieve the designated IOPS for the underlying disk. It’s also the maximum size that ASE will read at. Depending on your DB read/write profile, you may choose a smaller stripe size such as 64KB, but it depends on the amount of Large I/O and pre-fetch. When reading the Microsoft documents, consider ASE to be the same as MS SQL Server (they are from the same code lineage).

Stripe the transaction log logical volume(s) with a smaller stripe size, maybe start at 32KB and go lower but test it (remember HANA is 2KB stripe size for log volumes, but HANA uses Azure WriteAccelerator).
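
As a rough sketch of what that looks like with the LVM tools, the helper below only prints the commands so you can review them before running anything. The SID (AS1), the device names and the choice of "-l 100%FREE" for sizing are my assumptions for illustration; the volume group and logical volume names follow the naming used later in this post.

```shell
#!/bin/sh
# Print (not run) the LVM commands for a striped logical volume.
# Usage: striped_lv_cmds VG LV STRIPE_SIZE_KB DISK...
striped_lv_cmds() {
    vg=$1; lv=$2; stripe_kb=$3
    shift 3
    stripes=$#    # one stripe per physical disk passed in
    echo "pvcreate $*"
    echo "vgcreate $vg $*"
    # -i = number of stripes, -I = stripe size, -l 100%FREE = use the whole VG
    echo "lvcreate -i $stripes -I ${stripe_kb}k -l 100%FREE -n $lv $vg"
}

# Data volume: 2 disks, 128KB stripe size (hypothetical device names).
striped_lv_cmds vg_sapdata lv_sapdataAS1_1 128 /dev/sdg /dev/sdh
# Log volume: 2 disks, starting at a 32KB stripe size.
striped_lv_cmds vg_saplog lv_saplogAS1_1 32 /dev/sdi /dev/sdj
```

Once created, "lvs --segments" should show the stripe count and stripe size that were actually applied, which is worth checking before you start testing.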

Essentials:

  • Use LVM to create volume groups and logical volumes.
  • Stripe the data logical volumes with (max) 128KB stripe size & test it.

Use XFS File System

You can essentially choose to use your preferred file system format; there are no restrictions – see note 405827.
However, if you already run or are planning to run HANA databases in your landscape, then choosing XFS for ASE will make your landscape architecture simpler, because HANA is recommended to run on an XFS file system (when on local disk) on Linux; again see SAP note 405827.

Where possible you will need to explicitly disable any Linux file system write barrier caching, because Azure will be handling the caching for you.
In SUSE Linux this is the “nobarrier” setting on the mount options of the XFS partition and for EXT4 partitions it is option “barrier=0”.
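
A quick way to confirm the setting has taken effect is to inspect the active mount options. The sketch below is a small helper of my own that checks an option string for either of the two settings named above. As far as I know, newer kernels no longer accept "nobarrier" on XFS, so treat this as relevant to the SLES 12 era described here.

```shell
#!/bin/sh
# Report whether a comma-separated mount option string disables write barriers.
# Recognises "nobarrier" and "barrier=0", the two settings mentioned in the text.
barriers_disabled() {
    case ",$1," in
        *,nobarrier,*|*,barrier=0,*) echo yes ;;
        *) echo no ;;
    esac
}

barriers_disabled "rw,noatime,nobarrier"    # XFS style
barriers_disabled "rw,relatime,barrier=0"   # EXT4 style
barriers_disabled "rw,relatime"             # barriers still enabled
```

On a live system you could feed it the real options, for example: barriers_disabled "$(findmnt -no OPTIONS /sybase/AS1/sapdata_1)" (the mount point here is hypothetical).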

Essentials:

  • Choose disk file system wisely.
  • Disable write barriers.

Correctly Partition ASE

For SAP ASE, you should segregate the disk partitions of the database, so that certain system specific databases or logging areas cannot fill other disk locations and cause a general database system crash.

If you are using database replication (maybe SAP Replication Server a.k.a HADR for ASE), then you will have additional replication queue disk requirements, which should also be segregated.

A simple but flexible example layout is:

Volume Group   Logical Volume       Mount Point      Description
vg_ase         lv_ase<SID>          /sybase/<SID>    For ASE binaries.
vg_sapdata     lv_sapdata<SID>_1    ./sapdata_1      One for each ASE device for SAP SID database.
vg_saplog      lv_saplog<SID>_1     ./saplog_1       One for each log device for SAP SID database.
vg_asedata     lv_asesec<SID>       ./sybsecurity    ASE security database.
vg_asedata     lv_asesyst<SID>      ./sybsystem      ASE system databases (master, sybmgmtdb).
vg_asedata     lv_saptemp<SID>      ./saptemp        The SAP SID temp database.
vg_asedata     lv_asetemp<SID>      ./sybtemp        The ASE temp database.
vg_asedata     lv_asediag<SID>      ./sapdiag        The ASE saptools database.
vg_asehadr     lv_repdata<SID>      ./repdata        The HADR queue location.
vg_backups     lv_backups<SID>      ./backups        Disk backup location.

The above will allow each disk partition usage type to be separately expanded, but more importantly, it allows specific Azure data disk cache settings to be applied to the right locations.
For instance, you can use read-write caching on the vg_ase volume group disks, because that location is only for storing binaries, text logs and config files for the ASE instance. The vg_asedata contains all the small ASE system databases, which will not use too much space, but could still benefit from read caching on the data disks.

TIP: Depending on the size of your database, you may decide to also separate the saptemp database into its own volume group. If you use HADR you may benefit from doing this.

You may not need the backups disk area if you are using a backup utility, but you may benefit from a scratch area of disk for system copies or emergency dumps.

You should choose a good naming standard for volume groups and logical volumes, because this will help you during the check phase, where you can script the checking of disk partitioning and cache settings.
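
As a small sketch of the kind of check this enables, the helper below reads "volume group, logical volume" pairs and flags anything that does not follow a vg_ / lv_ prefix convention. The prefix rule is simply my reading of the example layout above, so adjust the patterns to your own standard.

```shell
#!/bin/sh
# Read "VG LV" pairs (one per line) and print any pair that breaks the
# vg_* / lv_* naming convention. Exit status is non-zero if any are found.
check_names() {
    awk '$1 !~ /^vg_/ || $2 !~ /^lv_/ { print "BAD: " $1 "/" $2; bad = 1 }
         END { exit bad }'
}

# Sample input: the second pair deliberately breaks the convention.
printf '%s\n' 'vg_sapdata lv_sapdataAS1_1' 'datavg lvol0' \
    | check_names || echo "naming check failed"
```

On a real server the input could come straight from LVM, for example: lvs --noheadings -o vg_name,lv_name | check_names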

Essentials:

  • Segregate disk partitions correctly.
  • Use a good naming standard for volume groups and LVs.
  • Remember the underlying cache settings on those affected disks.

Add Whole New ASE Devices

Follow the usual SAP ASE database practices of adding additional ASE data devices on additional file system partitions sapdata_2, sapdata_3 etc.
Do not be tempted to constantly (or automatically) expand the ASE device on sapdata_1 by adding new disks. You will find this difficult to maintain, because striped logical volumes need at least 2 disks in the stripe set.
It will get complicated, and it is not easy to shrink back from this.

When you add new disks to an existing volume group and then expand an existing lv_sapdata<SID>_n logical volume, it is not as clean as adding a whole new logical volume (e.g. lv_sapdata<SID>_n+1) and then adding a whole new ASE data device.
The old problem of shrinking data devices is more easily solved by being able to drop a whole ASE device, instead of trying to shrink one.

NOTE: The Microsoft notes suggest enabling automatic DB expansion, but on Azure I don’t think it makes sense from a DB administration perspective.
Yes, by adding a new ASE device, as data ages you may end up with “hot” devices, but you can always move specific devices around and add more underlying disks and re-stripe etc. Keep the layout flexible.

Essentials:

  • Add new disks to new logical volumes (sapdata_n+1).
  • Add big whole new ASE devices to the new LVs.

Summary:

We’ve been through each of the layers in detail and now we can summarise as follows:

  • Choose a minimum of Premium SSD.
  • Spread database space requirements over multiple data disks.
  • Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.
  • Use LVM to create volume groups and logical volumes.
  • Stripe the logical volumes with (max) 128KB stripe size & test it.
  • Choose disk file system wisely.
  • Disable write barriers.
  • Segregate disk partitions correctly.
  • Use a good naming standard for volume groups (and LVs).
  • Remember the underlying cache settings on those affected disks.
  • Add new disks to new logical volumes (sapdata_n+1).
  • Add big whole new ASE devices to the new LVs.


Is my Azure hosted SLES 12 Linux VM Affected by the BootHole Vulnerability

In July 2020, a GRUB2 bootloader vulnerability was discovered which could allow attackers to replace the bootloader on a machine which has Secure Boot turned on.
The vulnerability is designated CVE-2020-10713 and is rated 8.2 HIGH on the CVSS (see here).

Let’s look at what this is and how it impacts a Microsoft Azure virtual machine running SUSE Enterprise Linux 12, which is commonly used to run SAP systems such as SAP HANA or other SAP products.

What is the Vulnerability?

It is a “Classic Buffer Overflow” vulnerability in the GRUB2 bootloader for versions prior to 2.06.
Essentially, some evil input data can be entered into some part of the GRUB2 program binaries, which is not checked/validated.
The input data causes an overflow of the holding memory area into adjacent memory areas.
By carefully crafting the data that is the overflow, it is possible to cause a specifically targeted memory area to be overwritten.

As described by Eclypsium here (the security company that detected this): “Attackers exploiting this vulnerability can install persistent and stealthy bootkits or malicious bootloaders that could give them near-total control over the victim device”.

Essentially, the vulnerability allows an attacker with root privileges to replace the bootloader with a malicious one, boot into it and then have further capability to effectively set up camp (a backdoor) on the server.
This backdoor would be hard to remove because the bootloader is one of the first things to be booted (anti-virus can’t remove the bootloader if the bootloader boots first and “adjusts” the anti-virus).

What is GRUB2?

GRUB2 is v2 of the GRand Unified Bootloader (see here for the manual).
It is used to load the main operating system of a computer.
Usually on Linux virtual machines, GRUB is used to load Linux. It is possible to install GRUB on machines that then boot into Windows.

What is Secure Boot?

There are commonly two boot methods: “Legacy Boot” (BIOS) and UEFI boot, which can optionally enforce “Secure Boot”.
Until Secure Boot was invented, the bootloader would sit in a designated location on the hard disk and would be executed by the computer BIOS to start the chain of processes for the computer start up.
This is clearly quite insecure, since any program could put itself at the designated location and then be executed at boot up.

With Secure Boot, certificates are used to secure the boot process chain.
As with any certificate based process, at the top (root) level there needs to exist a certificate which is valid for many years and is ultimately trusted – the Certificate Authority (CA).
The next levels in the chain trust that CA certificate implicitly and if any point in the chain is compromised, then the trust is broken and will need re-establishing with new certificates.
Depending which level of the chain is compromised, will dictate the amount of effort needed to fix it.

This BootHole vulnerability means a new CA certificate needs to be implemented in every machine that uses Secure Boot!

But the Attackers Need Root?

Yes, the vulnerability is triggered through the GRUB2 configuration text file (grub.cfg), which is owned by the root user. Additional text added to the file can cause the buffer overflow.
Once the attacker has used malware to instigate the overflow, and installed a malicious bootloader, they then have a backdoor to the server, which would be executed every time the server is rebooted.

NOTE: The flaw also exists if you also use the network boot capability (PXE boot).

What is the Patch?

Due to the complexity of the problem (did you read the prior Eclypsium link?), it needs more than one piece of software to be patched and in different layers of the boot chain.

First off, the vulnerable GRUB2 software needs patching; this is quite easy and will require a reboot of the Linux O/S.
The problem with patching just GRUB2, is that it is still possible for an attacker with root to re-install a vulnerable version of GRUB2 and then use that vulnerable version to compromise the system further.
Remember, the chain of trust is still trusting that vulnerable version of GRUB2.
Therefore, to be able to stop the vulnerable version of GRUB2 being re-installed and used, three things need to happen:

  1. The O/S vendor (SUSE) needs to adjust their code (known as the “shim”) so that it no longer trusts the vulnerable version of GRUB2. Again, this is a software patch from the O/S vendor (SUSE) which will need a reboot.
  2. Since someone with root could simply re-install O/S vendor code (the “shim”) that trusts the vulnerable version of GRUB2, the adjusted O/S vendor code will need signing and trusting by the certificates further up the chain.
  3. The revocation list of Secure Boot needs to be adjusted to prevent the vulnerable version of the O/S vendor code (“shim”) from being called during boot. (This is known as the “dbx” (exclusion database), which will need updating with a firmware update).

What is SUSE doing about it?

There needs to be a multi-pronged patching process because SUSE also found some additional bugs during their analysis.

You can see the SUSE page on CVE-2020-10713 here, which includes the mention of the additional bugs.

The key point is that you *could* start patching, but if it were me, I would be tempted to wait until the SUSE “shim” has been updated with the new chain certificate, then patch GRUB2 and then update the “dbx”.

How does this impact Azure VMs?

In the previous paragraphs we found that a firmware update is needed to update the “dbx” exclusion database.
Since Microsoft Azure is using the Hyper-V hypervisor, the “firmware” is actually software in Hyper-V.
See here, which says: “Secure Boot or UEFI firmware isn’t required on the physical Hyper-V host. Hyper-V provides virtual firmware to virtual machines that is independent of what’s on the Hyper-V host.”

So the above would indicate that the Virtual Machine contains the necessary code from Hyper-V.
I would imagine that this is included at VM creation time.

If we dig into the VM details a little bit here on the Microsoft sites, we find:

So the above states that “…generation 2 VMs in Azure do not support Secure Boot…”.
The words “…in Azure…” are the key part of this.

OK, then how about Hyper-V in general (on-premise):

The above states “To Secure Boot generation 2 Linux virtual machines, you need to choose the UEFI CA Secure Boot template when you create the virtual machine.”
BUT this is for Hyper-V in general, not for Azure virtual machines.

So we know that Secure Boot is not available in Azure on any of the generation 1 or generation 2 VMs (as of writing there are only 2).
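
If you would rather confirm this from inside the VM than rely on documentation alone, a minimal check is to see whether Linux was booted through UEFI at all, since Secure Boot cannot be active on a legacy BIOS boot. The function name below is my own; the /sys/firmware/efi directory is the standard location the kernel exposes on UEFI-booted systems.

```shell
#!/bin/sh
# Report how the running Linux system was booted.
# The kernel only exposes the efi directory when it was started via UEFI.
# An alternative directory can be passed in, which is handy for testing.
boot_mode() {
    efi_dir=${1:-/sys/firmware/efi}
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "Legacy BIOS"
    fi
}

boot_mode
```

If this reports UEFI and the mokutil package is installed, "mokutil --sb-state" should then tell you whether Secure Boot is enabled or disabled.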

Summary:

The BootHole vulnerability is far reaching and will impact many, many devices (servers, laptops, IoT devices, TVs, fridges, cars?).
However, only those devices that actually *use* Secure Boot will truly be impacted, since the devices not using Secure Boot do not need to be patched (it’s fruitless).

If you run SLES 12 on Azure virtual machines, you cannot possibly use Secure Boot, so there is no point patching to fix a vulnerability by which you are not affected.
You are only introducing more risk by patching.

If however, you do decide to patch (even if you don’t need to) then follow the advice from SUSE and patch to fix GRUB2, the “shim” and the other vulnerabilities that were found.

If you are running SLES on Azure, then there is no specific order of patching, because you do not use Secure Boot, so there is no possibility of breaking the trust chain that doesn’t exist.

On a final closing point, you could be running a HANA system in Azure on what is known as “HANA Large Instances” (HLI). These are physical machines. So whilst Virtual Machines can’t use Secure Boot, these physical machines may well do so. You would be wise to contact your Microsoft account representative to establish if they will be patching the firmware.


Recreating SAP ASE Database I/O Workload using Fio on Azure

After you have deployed SAP databases onto Azure hosted virtual machines, you may find that sometimes you don’t get the performance you were expecting.

How can this be? It’s guaranteed isn’t it?
Well, the answer is, as with everything, sometimes it just doesn’t work that way and there are a large number of factors involved.
Even the Microsoft consultants I’ve spoken with have a check point for customers to confirm at the VM level, that they are seeing the IOPS that they are expecting to see.
Especially when deploying high performance applications such as SAP HANA in Azure.
I can’t comment on the reasons why performance may not be as expected, although I do have my own theories.

Let’s look at how we can simply simulate an SAP ASE 16.0 SP03 database I/O operation, so that we can run a reasonably representative and repetitive test, without the need for ASE to even be installed.
Remember, your specific workload could be different due to the design of your database, type and size of transactions and other factors.
What I’m really trying to show here, is how you can use an approximation to provide a simple test that is repetitive and doesn’t need ASE installing.

Microsoft have their own page dedicated to running I/O tests in Azure, and they document the use of the Fio tool for this process.
Read further detail about Fio here: https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks

Since you may need to show your I/O results to your local Microsoft representative, I would recommend you use the tool that Microsoft are familiar with, and not some other tool. This should help speed up any fault resolution process.

NOTE: The IOPS will not hit the maximum achievable, because in our test, the page/block size is too high for this. Microsoft’s quoted Azure disk values are achievable only with random read, 8KB page sizes, multiple threads/jobs and a queue depth of 256 (see here: https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks).

In SAP ASE 16.0 SP03 (this is the version I had to hand) on a SUSE Linux 12.3 server, imagine we run a SQL operation like “SELECT * FROM MYTABLE WHERE COL2='X'” which in our example causes an execution path that performs a table scan of the table MYTABLE.
The table scan results in an asynchronous sequential read of the single database data file (data device) on the VM disk which is an LVM logical volume striped over 3 physical disks that make up the one volume group.

We are going to assume that you have saptune installed and configured correctly for SAP ASE, so we will not touch on the Linux configuration.
One thing to note, is that our assumption includes that the Linux file system hosting the database devices is configured to permit direct I/O (avoiding the Linux filesystem cache). This helps with the test configuration setup.

SAP ASE will try to optimise our SQL operation if ASE has been configured correctly, and use a read-ahead algorithm with large I/O pages of up to 128KB. But even with the right ASE configuration, the use of 128KB pages is not always possible, for example if the table is in some way fragmented.
As part of our testing we will assume that 128KB pages are not being used. We will instead use 16KB, which is the smallest I/O that ASE performs on a server with a 16KB page size, as used for SAP systems (worst case scenario).
We will also assume that our SQL statement results in exactly 1GB of data to be read from the disk each time.
This is highly unlikely in a tuned ASE instance, due to the database datacache. However, we will assume this instance is not tuned and under slight load, causing the datacache to have re-used the memory pages between tests.

If we look at the help page for the Fio tool, it’s a fairly hefty read.
Let’s start by translating some of the notations used to something we can appreciate with regards to our test scenario:

Fio Config Item          Our Test Values/Setup
I/O type                 = sequential read
Blocksize                = 16KB
I/O size                 = 1024m (amount of data)
I/O engine               = asynch I/O, direct (unbuffered)
I/O depth                = 2048 (disk queue depth)
Target file/device       = /sybase/AS1/sapdata/AS1_data_001.dat
Threads/processes/jobs   = 1
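
It helps to know how many individual I/O requests those values translate to, because Fio reports that count as "issued rwts: total=" in its output. A minimal sketch of the arithmetic (the function name is mine):

```shell
#!/bin/sh
# Number of I/O requests needed to read SIZE_MIB of data in BLOCK_KIB blocks.
io_count() {
    size_mib=$1; block_kib=$2
    echo $(( size_mib * 1024 * 1024 / (block_kib * 1024) ))
}

io_count 1024 16     # 1GB read in 16KB blocks
io_count 1024 128    # 1GB read in 128KB blocks
```

This prints 65536 and 8192 respectively, so moving from 16KB to 128KB blocks cuts the number of requests by a factor of eight for the same amount of data.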

We can see that from the list above, the queue depth is the only thing that we are not sure on.
The actual value can be determined by querying the Linux disk devices; in essence, it represents how much I/O can be queued for a specific disk device.
In checking my setup, I can see that I have 2048 defined on SLES 12 SP3.
More information on queue depth in Azure can be found here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage-performance#queue-depth

On SLES you can check the queue depth using the lsscsi command with the Long, Long, Long format (-lll):

lsscsi -lll

[5:0:0:4] disk Msft Virtual Disk 1.0 /dev/sdd
device_blocked=0
iocounterbits=32
iodone_cnt=0x2053eea
ioerr_cnt=0x0
iorequest_cnt=0x2053eea
queue_depth=2048
queue_type=simple
scsi_level=6
state=running
timeout=300
type=0

An alternative way to check is to output the content of the /proc/scsi/sg/devices file and look at the values in the 7th column:

cat /proc/scsi/sg/devices

2 0 0 0 0 1 2048 1 1
3 0 1 0 0 1 2048 0 1
5 0 0 0 0 1 2048 0 1
5 0 0 4 0 1 2048 0 1
5 0 0 2 0 1 2048 0 1
5 0 0 1 0 1 2048 0 1
5 0 0 3 0 1 2048 0 1
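
If you want that 7th column labelled by device address rather than reading it by eye, a small awk wrapper does the job. This is just a convenience sketch of my own; it relies on the first four columns being host, channel, id and lun, which together match the [5:0:0:4] style address that lsscsi shows.

```shell
#!/bin/sh
# Print "host:channel:id:lun queue_depth" for each SCSI device.
# Reads /proc/scsi/sg/devices by default; another file can be passed for testing.
sg_queue_depths() {
    awk '{ print $1 ":" $2 ":" $3 ":" $4, $7 }' "${1:-/proc/scsi/sg/devices}"
}

# Only attempt the real file where the sg driver exposes it.
if [ -r /proc/scsi/sg/devices ]; then
    sg_queue_depths
fi
```

For a single disk, the same value should also be readable from sysfs, for example with: cat /sys/block/sdd/device/queue_depth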

For the target file (source file in our read test case), we can either use an existing data device file (if ASE is installed and database exists), or we could create a new data file containing zeros, of 1GB in size.

Using “dd” you can quickly create a 1GB file full of zeros:

dd if=/dev/zero of=/sybase/AS1/sapdata/AS1_data_001.dat bs=1024 count=1048576

1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.4592 s, 166 MB/s

We will be using only 1 job/thread in Fio to perform the I/O test.
Generally in ASE 16.0 SP03, the number of “disk tasks” is configured using “sp_configure” and visible in the configuration file.
The configured value is usually 1 in a default installation and very rarely needs adjusting.

See here: https://help.sap.com/viewer/379424e5820941d0b14683dd3a992d5c/16.0.3.5/en-US/a778c8d8bc2b10149f11a28571f24818.html

Once we’re happy with the above settings, we just need to apply them to the Fio command line as follows:

fio --name=global --readonly --rw=read --direct=1 --bs=16k --size=1024m --iodepth=2048 --filename=/sybase/AS1/sapdata/AS1_data_001.dat --numjobs=1 --name=job1

You will see the output of Fio on the screen as it performs the I/O work.
In testing, the amount of clock time that Fio takes to perform the work is reflective of the performance of the I/O subsystem.
In extremely fast cases, you will need to look at the statistics that have been output to the screen.

The Microsoft documentation and examples show running very lengthy operations on Fio, to ensure that the disk caches are populated properly.
In my experience, I’ve never had the liberty to explain to the customer that they just need to do the same operation for 30 minutes, over and over and it will be much better. I prefer to run this test cold and see what I get as a possible worst-case.

job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=109MiB/s][r=6950 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=87654: Tue Jan 14 06:36:01 2020
read: IOPS=6524, BW=102MiB/s (107MB/s)(1024MiB/10044msec)
clat (usec): min=49, max=12223, avg=148.22, stdev=228.29
lat (usec): min=49, max=12223, avg=148.81, stdev=228.39
clat percentiles (usec):
| 1.00th=[ 61], 5.00th=[ 67], 10.00th=[ 70], 20.00th=[ 75],
| 30.00th=[ 81], 40.00th=[ 88], 50.00th=[ 96], 60.00th=[ 108],
| 70.00th=[ 125], 80.00th=[ 159], 90.00th=[ 322], 95.00th=[ 412],
| 99.00th=[ 644], 99.50th=[ 848], 99.90th=[ 3097], 99.95th=[ 5145],
| 99.99th=[ 7963]
bw ( KiB/s): min=64576, max=131712, per=99.98%, avg=104379.00, stdev=21363.19, samples=20
iops : min= 4036, max= 8232, avg=6523.65, stdev=1335.24, samples=20
lat (usec) : 50=0.01%, 100=54.55%, 250=32.72%, 500=10.48%, 750=1.59%
lat (usec) : 1000=0.31%
lat (msec) : 2=0.20%, 4=0.07%, 10=0.07%, 20=0.01%
cpu : usr=6.25%, sys=20.35%, ctx=65541, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=65536,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

Run status group 0 (all jobs):
READ: bw=102MiB/s (107MB/s), 102MiB/s-102MiB/s (107MB/s-107MB/s), io=1024MiB (1074MB), run=10044-10044msec

Disk stats (read/write):
dm-8: ios=64233/2, merge=0/0, ticks=7416/8, in_queue=7436, util=74.54%, aggrios=21845/0, aggrmerge=0/0, aggrticks=2580/2, aggrin_queue=2581, aggrutil=25.78%
sdg: ios=21844/0, merge=0/0, ticks=2616/0, in_queue=2616, util=25.78%
sdh: ios=21844/1, merge=0/0, ticks=2600/4, in_queue=2600, util=25.63%
sdi: ios=21848/1, merge=0/0, ticks=2524/4, in_queue=2528, util=24.92%
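
One thing worth noticing in the output above is that the first line reports ioengine=psync and the "IO depths" line shows 1=100.0%. The command line did not name an I/O engine, so Fio used its default synchronous engine, and a synchronous engine only ever has one request in flight regardless of the iodepth setting. If you want the test to issue asynchronous I/O, the engine needs to be requested explicitly. The sketch below writes an equivalent job file with ioengine=libaio added; the job file name and location are my own choice, and I have not re-run the test with it, so expect different numbers from those shown in this post.

```shell
#!/bin/sh
# Write a Fio job file for the 16KB sequential read test, with the
# asynchronous libaio engine requested explicitly.
job_file=${TMPDIR:-/tmp}/ase_seqread.fio

cat > "$job_file" <<'EOF'
[global]
ioengine=libaio
direct=1
iodepth=2048
bs=16k
size=1024m
numjobs=1

[job1]
rw=read
filename=/sybase/AS1/sapdata/AS1_data_001.dat
EOF

echo "wrote $job_file"
```

You would then run it with: fio --readonly "$job_file"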

The lines of significance to you will be:

– Line: IOPS.

Shows the min, max and average IOPS that were obtained during the execution. This should roughly correspond to the IOPS expected for the type of Azure disk on which your source data file is located. Remember that if you have a striped file system under a logical volume manager, then you should expect to see more IOPS because you have more disks.

NOTE: The IOPS will not hit the maximum achievable, because our page/block size is too high for this. The Azure disk values are achievable only with random read, 8KB page sizes, multiple threads/jobs and a queue depth of 256 (https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks).

– Lines: “lat (usec)” and “lat (msec)”.

These are the proportions of latency in micro and milliseconds respectively.
If you have high percentages in the millisecond ranges, then you may have an issue. You would not expect this for the type of disks you would want to be running an SAP ASE database on.
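
To put a single number on "high percentages in the millisecond ranges", you can add up the buckets on the "lat (msec)" line. The helper below is a sketch of my own that reads Fio output on standard input; it only counts tokens that end in a percent sign, so the min/max/avg latency lines are ignored.

```shell
#!/bin/sh
# Sum the percentage buckets on Fio's "lat (msec)" distribution lines.
msec_share() {
    awk '/^ *lat \(msec\)/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /=[0-9.]+%/) { split($i, kv, "="); sum += kv[2] + 0 }
         }
         END { printf "%.2f\n", sum }'
}

# The two latency distribution lines from the 16KB run above.
printf '%s\n' \
    'lat (usec) : 1000=0.31%' \
    'lat (msec) : 2=0.20%, 4=0.07%, 10=0.07%, 20=0.01%' \
    | msec_share
```

For the 16KB run this prints 0.35, meaning roughly a third of one percent of requests took a millisecond or longer.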

In my example above, I am using 3x P40 Premium Storage SSD disks.
You can tell it is a striped logical volume setup, because the very last 3 lines of output show my 3 Linux disk device names (sdg, sdh and sdi) which sit under my volume group.

You can use the useful links here to determine what you should be seeing on your setup:

NOTE: If you are running SAP on the ASE database, then you will more than likely be using Premium Storage (it’s the only option supported by SAP) and it will be Azure Managed (not un-managed).

Let’s look at the same Fio output using a 128KB page size (like ASE would if it was using large I/O).
We use the same command line but just change the “--bs” parameter to 128KB:

fio --name=global --readonly --rw=read --direct=1 --bs=128k --size=1024m --iodepth=2048 --filename=/sybase/AS1/sapdata/AS1_data_001.dat --numjobs=1 --name=job1

job1: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=128MiB/s][r=1021 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=93539: Tue Jan 14 06:54:48 2020
read: IOPS=1025, BW=128MiB/s (134MB/s)(1024MiB/7987msec)
clat (usec): min=90, max=46843, avg=971.48, stdev=5784.85
lat (usec): min=91, max=46844, avg=972.04, stdev=5784.84
clat percentiles (usec):
| 1.00th=[ 101], 5.00th=[ 109], 10.00th=[ 113], 20.00th=[ 119],
| 30.00th=[ 124], 40.00th=[ 130], 50.00th=[ 137], 60.00th=[ 145],
| 70.00th=[ 157], 80.00th=[ 176], 90.00th=[ 210], 95.00th=[ 273],
| 99.00th=[42206], 99.50th=[42730], 99.90th=[43254], 99.95th=[43254],
| 99.99th=[46924]
bw ( KiB/s): min=130299, max=143616, per=100.00%, avg=131413.00, stdev=3376.53, samples=15
iops : min= 1017, max= 1122, avg=1026.60, stdev=26.40, samples=15
lat (usec) : 100=0.87%, 250=93.13%, 500=3.26%, 750=0.43%, 1000=0.13%
lat (msec) : 2=0.18%, 4=0.01%, 10=0.04%, 50=1.95%
cpu : usr=0.55%, sys=4.12%, ctx=8194, majf=0, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

Run status group 0 (all jobs):
READ: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=1024MiB (1074MB), run=7987-7987msec

Disk stats (read/write):
dm-8: ios=8059/0, merge=0/0, ticks=7604/0, in_queue=7640, util=95.82%, aggrios=5461/0, aggrmerge=0/0, aggrticks=5114/0, aggrin_queue=5114, aggrutil=91.44%
sdg: ios=5461/0, merge=0/0, ticks=564/0, in_queue=564, util=6.96%
sdh: ios=5461/0, merge=0/0, ticks=7376/0, in_queue=7376, util=91.08%
sdi: ios=5462/0, merge=0/0, ticks=7404/0, in_queue=7404, util=91.44%

You can see that we actually got a lower IOPS value, but we returned all the data quicker and got a higher throughput.
This follows from how IOPS and throughput interact: throughput is roughly the IOPS multiplied by the page/block size, so a higher page/block size means we can potentially read more data in each I/O request.
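
You can sanity check the two runs against each other with a quick calculation of IOPS multiplied by block size. A minimal sketch, using the IOPS figures reported by Fio in the two outputs above (the function name is mine):

```shell
#!/bin/sh
# Throughput in MiB/s for a given IOPS figure and block size in KiB.
throughput_mib() {
    awk -v iops="$1" -v bs_kib="$2" 'BEGIN { printf "%.1f\n", iops * bs_kib / 1024 }'
}

throughput_mib 6524 16     # 16KB run
throughput_mib 1025 128    # 128KB run
```

This prints 101.9 and 128.1, which lines up with the 102MiB/s and 128MiB/s that Fio reported for the two runs.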

Some of the performance randomness now becomes apparent, with the inconsistency of the “util” for each disk device. However, there is a note on the Fio webpage about how this metric (util) is not necessarily reliable.

You should note that, although we are doing a simulated direct I/O (unbuffered) operation at the Linux level, outside of Linux at the Azure level, there could be caching (data disk caching, which is actually cached on the underlying Azure physical host).

You can check your current setup directly in Azure or at the Linux level, by reading through my previous post on how to do this easily.

https://www.it-implementor.co.uk/2019/12/17/listing-azure-vm-datadisks-and-cache-settings-using-azure-portal-jmespath-bash/

Now for the final test.
Can we get the IOPS that we should be getting for our current setup and disks?

Following the Microsoft documentation, we create the fioread.ini and execute it (note that it needs 120GB of disk space: 4 reader jobs x 30GB):

cat <<EOF > /tmp/fioread.ini
[global]
size=30g
direct=1
iodepth=256
ioengine=libaio
bs=8k

 

[reader1]
rw=randread
directory=/sybase/AS1/sapdata/

[reader2]
rw=randread
directory=/sybase/AS1/sapdata/

[reader3]
rw=randread
directory=/sybase/AS1/sapdata/

[reader4]
rw=randread
directory=/sybase/AS1/sapdata/
EOF
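
Because the job file above lays out four 30GB files, it may be worth checking the free space in the target directory first. The helper below is only a sketch: has_space is a hypothetical name, and it relies on the POSIX "df -Pk" output format, where the fourth column of the second line is the available space in KiB.

```shell
#!/bin/sh
# has_space DIR REQUIRED_GIB
# Succeeds if the filesystem holding DIR has at least REQUIRED_GIB free.
has_space() {
    dir=$1
    required_kib=$(( $2 * 1024 * 1024 ))
    # df -P prints one line per filesystem; -k reports sizes in KiB.
    avail_kib=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')
    # An empty value means df could not read the path at all.
    [ -n "$avail_kib" ] && [ "$avail_kib" -ge "$required_kib" ]
}

# 4 reader jobs x 30GB each, as defined in the job file.
if has_space /sybase/AS1/sapdata 120; then
    echo "enough free space for the fio test files"
else
    echo "not enough free space (or path not found), do not run the test"
fi
```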

fio --runtime 30 /tmp/fioread.ini
reader1: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader2: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader3: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader4: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
fio-3.10
Starting 4 processes
reader1: Laying out IO file (1 file / 30720MiB)
reader2: Laying out IO file (1 file / 30720MiB)
reader3: Laying out IO file (1 file / 30720MiB)
reader4: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [r(4)][100.0%][r=128MiB/s][r=16.3k IOPS][eta 00m:00s]
reader1: (groupid=0, jobs=1): err= 0: pid=120284: Tue Jan 14 08:16:38 2020
read: IOPS=4250, BW=33.2MiB/s (34.8MB/s)(998MiB/30067msec)
slat (usec): min=3, max=7518, avg=10.06, stdev=43.39
clat (usec): min=180, max=156683, avg=60208.81, stdev=32909.11
lat (usec): min=196, max=156689, avg=60219.59, stdev=32908.61
clat percentiles (usec):
| 1.00th=[ 1549], 5.00th=[ 3294], 10.00th=[ 4883], 20.00th=[ 45351],
| 30.00th=[ 47973], 40.00th=[ 49021], 50.00th=[ 51643], 60.00th=[ 54789],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[143655], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=25168, max=46800, per=26.07%, avg=34003.88, stdev=4398.09, samples=60
iops : min= 3146, max= 5850, avg=4250.45, stdev=549.78, samples=60
lat (usec) : 250=0.01%, 500=0.02%, 750=0.12%, 1000=0.28%
lat (msec) : 2=1.35%, 4=5.69%, 10=5.72%, 20=1.15%, 50=30.21%
lat (msec) : 100=45.60%, 250=9.86%
cpu : usr=1.29%, sys=5.58%, ctx=6247, majf=0, minf=523
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=127794,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader2: (groupid=0, jobs=1): err= 0: pid=120285: Tue Jan 14 08:16:38 2020
read: IOPS=4183, BW=32.7MiB/s (34.3MB/s)(983MiB/30067msec)
slat (usec): min=3, max=8447, avg= 9.92, stdev=54.73
clat (usec): min=194, max=154937, avg=61163.27, stdev=32365.78
lat (usec): min=217, max=154945, avg=61173.85, stdev=32365.26
clat percentiles (usec):
| 1.00th=[ 1778], 5.00th=[ 3294], 10.00th=[ 5145], 20.00th=[ 46400],
| 30.00th=[ 47973], 40.00th=[ 49546], 50.00th=[ 52167], 60.00th=[ 55313],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[111674], 99.50th=[145753], 99.90th=[147850], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=26816, max=43104, per=25.67%, avg=33474.27, stdev=3881.96, samples=60
iops : min= 3352, max= 5388, avg=4184.27, stdev=485.26, samples=60
lat (usec) : 250=0.01%, 500=0.03%, 750=0.08%, 1000=0.15%
lat (msec) : 2=1.02%, 4=6.31%, 10=5.05%, 20=1.12%, 50=27.79%
lat (msec) : 100=49.09%, 250=9.37%
cpu : usr=1.14%, sys=5.53%, ctx=6362, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=125800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader3: (groupid=0, jobs=1): err= 0: pid=120286: Tue Jan 14 08:16:38 2020
read: IOPS=3919, BW=30.6MiB/s (32.1MB/s)(921MiB/30066msec)
slat (usec): min=3, max=12886, avg= 9.40, stdev=56.68
clat (usec): min=276, max=151726, avg=65256.88, stdev=31578.48
lat (usec): min=283, max=151733, avg=65266.86, stdev=31578.73
clat percentiles (usec):
| 1.00th=[ 1958], 5.00th=[ 3884], 10.00th=[ 10421], 20.00th=[ 47449],
| 30.00th=[ 49021], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 65274],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[139461], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=21344, max=42960, per=24.04%, avg=31354.32, stdev=5530.77, samples=60
iops : min= 2668, max= 5370, avg=3919.27, stdev=691.34, samples=60
lat (usec) : 500=0.01%, 750=0.05%, 1000=0.12%
lat (msec) : 2=0.92%, 4=4.15%, 10=4.59%, 20=0.59%, 50=25.92%
lat (msec) : 100=53.48%, 250=10.18%
cpu : usr=0.96%, sys=5.22%, ctx=7986, majf=0, minf=521
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=117853,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader4: (groupid=0, jobs=1): err= 0: pid=120287: Tue Jan 14 08:16:38 2020
read: IOPS=3955, BW=30.9MiB/s (32.4MB/s)(928MiB/30020msec)
slat (usec): min=3, max=9635, avg= 9.57, stdev=52.03
clat (usec): min=163, max=151463, avg=64699.59, stdev=32233.21
lat (usec): min=176, max=151468, avg=64709.90, stdev=32232.66
clat percentiles (usec):
| 1.00th=[ 1729], 5.00th=[ 3720], 10.00th=[ 7832], 20.00th=[ 46924],
| 30.00th=[ 48497], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 87557],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[102237],
| 99.00th=[109577], 99.50th=[143655], 99.90th=[147850], 99.95th=[147850],
| 99.99th=[147850]
bw ( KiB/s): min=21488, max=46320, per=24.22%, avg=31592.63, stdev=4760.10, samples=60
iops : min= 2686, max= 5790, avg=3949.05, stdev=595.03, samples=60
lat (usec) : 250=0.02%, 500=0.07%, 750=0.07%, 1000=0.09%
lat (msec) : 2=1.31%, 4=4.04%, 10=5.13%, 20=1.28%, 50=24.76%
lat (msec) : 100=52.89%, 250=10.35%
cpu : usr=1.06%, sys=5.21%, ctx=8226, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=118743,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=127MiB/s (134MB/s), 30.6MiB/s-33.2MiB/s (32.1MB/s-34.8MB/s), io=3830MiB (4016MB), run=30020-30067msec

Disk stats (read/write):
dm-8: ios=490190/1, merge=0/0, ticks=30440168/64, in_queue=30570784, util=99.79%, aggrios=163396/0, aggrmerge=0/0, aggrticks=10170760/21, aggrin_queue=10172817, aggrutil=99.60%
sdg: ios=162989/1, merge=0/0, ticks=10134108/64, in_queue=10135484, util=99.59%
sdh: ios=163379/0, merge=0/0, ticks=10175316/0, in_queue=10177440, util=99.60%
sdi: ios=163822/0, merge=0/0, ticks=10202856/0, in_queue=10205528, util=99.59%

throughput = [IOPS] * [block size]
example: 3000 IOPS * 8 (8KB) = 24000KB/s (24MB/s)

From our output, we can see how the IOPS and blocksize affect the throughput calculation:
16,300 (IOPS total) * 8 (8KB) = 130400KB/s (approx. 127MiB/s, in line with the 127MiB/s shown in the run status above)
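
The 16,300 figure is simply the four reader jobs added together. As a small sketch (sum_iops and throughput_kib are names invented here, and the parsing assumes fio prints plain integers after "IOPS=", which it may not do for larger values such as "16.3k"), the same sum and calculation can be reproduced from the "read:" lines of the output:

```shell
#!/bin/sh
# sum_iops: read fio output on stdin and total the per-job "read: IOPS=" values.
sum_iops() {
    awk -F'[=,]' '/read: IOPS=/ { total += $2 } END { print total + 0 }'
}

# throughput_kib IOPS BLOCK_KIB: throughput = IOPS * block size, in KiB/s.
throughput_kib() {
    echo $(( $1 * $2 ))
}

# The four "read:" lines copied from the fio output above.
total=$(sum_iops <<'EOF'
read: IOPS=4250, BW=33.2MiB/s (34.8MB/s)(998MiB/30067msec)
read: IOPS=4183, BW=32.7MiB/s (34.3MB/s)(983MiB/30067msec)
read: IOPS=3919, BW=30.6MiB/s (32.1MB/s)(921MiB/30066msec)
read: IOPS=3955, BW=30.9MiB/s (32.4MB/s)(928MiB/30020msec)
EOF
)
kib=$(throughput_kib "$total" 8)
echo "total IOPS=$total throughput=${kib}KiB/s (~$(( kib / 1024 ))MiB/s)"
```

This prints total IOPS=16307 throughput=130456KiB/s (~127MiB/s).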

The simple answer is no: we don’t get the IOPS that we expect for our P40 disks. Further investigation is required. 🙁
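
To put a number on the shortfall, here is a hedged sketch of the expectation. It assumes three striped disks (sdg, sdh and sdi in the disk stats), 7,500 IOPS per P40 disk as published by Microsoft, and an optional VM-level IOPS cap. The cap is left at 0 because the VM size is not stated here. The function names are invented for the example.

```shell
#!/bin/sh
# expected_iops DISKS PER_DISK_IOPS VM_CAP
# Striped disks add their IOPS together, but the VM-level limit still applies.
# A VM_CAP of 0 means "no cap known".
expected_iops() {
    total=$(( $1 * $2 ))
    if [ "$3" -gt 0 ] && [ "$total" -gt "$3" ]; then
        total=$3
    fi
    echo "$total"
}

# percent_of MEASURED EXPECTED: whole-number percentage, rounded down.
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

disks=3         # sdg, sdh and sdi in the disk stats above
per_disk=7500   # assumption: published IOPS figure for a single P40 disk
vm_cap=0        # assumption: set this to the uncached IOPS limit of your VM size

want=$(expected_iops "$disks" "$per_disk" "$vm_cap")
echo "expected=$want measured=16300 ($(percent_of 16300 "$want")%)"
```

Under those assumptions it prints expected=22500 measured=16300 (72%). If the VM-level cap turns out to be lower than the combined disk figure, the cap is the number to compare against instead.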

Korn Shell vs PowerShell and the New AZ Module

Do you know Korn and are thinking about learning PowerShell?

Look at this:

function What-am-I {
   echo "Korn or powershell?"
}

What-am-I
echo $?

Looks like Korn, but it also looks like PowerShell.
In actual fact, it executes in both Korn shell and PowerShell.

There’s a slight difference in the output from “$?” because PowerShell will output “True” and Korn will output “0”.
Not much in it really. That is just another reason Linux people are feeling the Microsoft love right now.
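
On the Korn side, “$?” is the numeric exit status of the last command, so it can carry more than a yes/no answer. The sketch below (with made-up function names) shows this. In PowerShell “$?” stays a Boolean, and the numeric exit code of a native program is kept separately in $LASTEXITCODE.

```shell
#!/bin/sh
# Two throwaway functions: one succeeds, one fails with a specific status.
succeeds() { return 0; }
fails()    { return 3; }

# $? is read immediately after each call, before anything else overwrites it.
succeeds && echo "after succeeds: $?"
fails    || echo "after fails: $?"
```

This prints “after succeeds: 0” followed by “after fails: 3”.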

Plus, as recently highlighted by a Microsoft blog post, “az”, the name of the Azure CLI which allows you to interact with Azure APIs and functions, will now also be the name of the new PowerShell module that performs the same operations and replaces “AzureRM”.

It makes sense for Microsoft to harmonise the two names.
It could save them an awful lot of documentation because currently they have to write examples for both the “az” CLI and the PowerShell cmdlets for each new Azure feature/function.

Recovery From: Operation start is not allowed on VM since the VM is generalized – Linux

Scenario: In Azure you had a Linux virtual machine.  You clicked the “Capture” button in the Azure Portal view of your Linux virtual machine, and now you are unable to start the virtual machine because you get the Azure error: “Operation ‘start’ is not allowed on VM ‘abcd’ since the VM is generalized.”

What this error/information prompt is telling you is that the “Capture” button actually creates a generic image of your virtual machine, which means it is effectively a template that can be used to create a new VM.
Because the process that is applied to your original VM modifies it, it is now unable to boot up normally.  The process is often referred to as “sysprep”, after the Windows tool of that name.

Can you recover your original VM?  No.  It’s not possible to recover it properly using the Azure Portal capabilities.  You could do it if you downloaded the disk image, but there’s no need.
Plus, there is no telling what changes have been made to the O/S that might affect your applications that have been installed.

It’s possible for you to create a new VM from your captured image, or even to use your old VM’s O/S disk to create a new VM.
However, both of the above mean you will have a new VM.  Like I said, who knows what changes could have been introduced from the sysprep process.  Maybe it’s better to rebuild…

Because the disk files are still present you can rescue your data and look at the original O/S disk files.
Here’s how I did it.

I’m starting from the point of holding my head in my hands after clicking “Capture“!
The next steps I took were:

– Delete your original VM (just the VM).  The disk files will remain, but at least you can create a new VM of the same name (I liked the original name).

– Create a new Linux VM, same as you did for the one you’ve just lost.
Use the same install image if possible.

– Within the properties of your new VM, go to the “Disks” area.

– Click to add a new data disk.
We will then be able to attach the existing O/S disk to the virtual machine (you will need to find it in the list).
You can add other data disks from the old VM if you need to.

Once your disks are attached to your new Linux VM, you just need to mount them up.
For my specific scenario, I could see that the root partition “/” on my new Linux VM was of type “ext4” (check the output of the ‘df -hT’ command, as the “-T” option adds the file system type column).
This means that my old VM’s root partition format would have also been ext4.
Therefore I just needed to find and mount the new disk in the O/S of my new VM.

As root on the new Linux VM find the last disk device added:

# ls -ltr /dev/sd*

The last line is your old VM disk.  Mine was device /dev/sdc and specifically, I needed partition 2 (the whole disk), so I would choose /dev/sdc2.

Mount the disk:

# mkdir /old_vm
# mount -t ext4 /dev/sdc2 /old_vm

I could then access the disk and copy any required files/settings:

# cd /old_vm

Once completed, I unmounted the old O/S disk in the new Linux VM:

# umount /old_vm
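
The Linux-side steps above can be pulled together into one small script. This is only a sketch under a few assumptions: the helper names run and rescue_mount are invented, the “lsblk -f” call is added to confirm the file system type rather than infer it, and the read-only mount option is a cautious default that the original steps did not use. With DRY_RUN set, the script only prints the commands it would run.

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rescue_mount DEVICE MOUNTPOINT
# Shows the file system on the old partition, then mounts it read-only.
rescue_mount() {
    run lsblk -f "$1"                     # confirm the type (ext4 in this scenario)
    run mkdir -p "$2"
    run mount -o ro -t ext4 "$1" "$2"     # read-only: nothing on the old disk is changed
}

DRY_RUN=1   # print the commands only; clear this to run them for real (as root)
rescue_mount /dev/sdc2 /old_vm
# ... copy any required files/settings out of /old_vm here ...
run umount /old_vm
```

Adjust /dev/sdc2 and /old_vm to match your own device and mount point before running it for real.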

Then, back in the Azure Portal in the disks area of the new VM (in Edit mode), I detached the old disk:

 

Once those disks are not owned by a VM anymore (you can see in the properties for the specific disk), then it’s safe to delete them.