This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

SAP Instance Agent as Mini-Web Server

In this post, I will show you how a little known feature of the SAP Instance Agent (available on every “modern” SAP system) can be used to serve files via HTTP and HTTPS.

You may be thinking, “This is basic stuff, I know this already“, in which case, you may only be interested in the very last paragraph of this post 😉

Why Not Just Use a Real Web Server?

During SAP projects there always comes a point where software needs to be distributed throughout the landscape, where end-users need access to a predefined set of software, or where scripts need to be centralised and downloadable onto multiple target servers.
Essentially, you need a common file distribution point.

Some projects have large budgets and some have small budgets. This post is for those with small budgets: those projects where using everything twice, if possible, becomes an art form.

Under What Circumstances Would I Possibly Want to Use Such a Method?

Let’s imagine a scenario where you have a set of Korn shell scripts that you would like centralised across the SAP landscape.
There is/was no budget for a common fileshare and there is no budget to have someone setting one up.
Instead you develop an offline deployment approach, where the scripts are pulled down from a central repository on a schedule.
You decide that the central repository needs to be a web server and the scripts will be downloaded by HTTP.
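
To make the scenario concrete, here is a minimal sketch of what the scheduled pull could look like on each target server. The host name, port, paths and bundle name are purely hypothetical placeholders (the port shown just happens to match the Instance Agent HTTP port we will meet later in this post):

#!/bin/ksh
# pull_scripts.sh - sketch of an offline deployment pull (hypothetical names and paths).
# Downloads a script bundle from a central HTTP repository and unpacks it locally.
REPO_URL="http://central-repo.example.com:50013/docroot/scripts.tar"
TARGET_DIR="/home/as1adm/scripts"

mkdir -p "${TARGET_DIR}" || exit 1
wget -q -O "${TARGET_DIR}/scripts.tar" "${REPO_URL}" || exit 1
tar -xf "${TARGET_DIR}/scripts.tar" -C "${TARGET_DIR}"

# Example cron entry to run the pull every morning at 06:00:
# 0 6 * * * /home/as1adm/bin/pull_scripts.sh >/dev/null 2>&1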

What Is the SAP Instance Agent?

The SAP Instance Agent is the agent that does the work when you run sapcontrol functions.
It is a small set of executables that come as part of the SAP Kernel and generally you get one Instance Agent installation per SAP instance.

For example, if you have an ASCS instance, there will be an instance agent installed under /usr/sap/<SID>/ASCS<##>/.

You can see the Instance Agent running by querying the list of running processes with ps:

ps -ef | grep sapstartsrv

as1adm     1969      1  0 06:31 ?        00:00:01 /usr/sap/AS1/ASCS00/exe/sapstartsrv pf=/usr/sap/AS1/SYS/profile/AS1_ASCS00_sapas1ase1 -D -u as1adm

You can see that the binary executable responsible is “sapstartsrv”.

You will also notice that the SAP Host Agent also has a “sapstartsrv”. This is because the Host Agent and the Instance Agent are siblings, sharing a similar code-set, just with different functions.
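
If you want to see the difference on your own host, the Host Agent’s copy of sapstartsrv normally runs from /usr/sap/hostctrl as the sapadm user, with its own host_profile rather than an SAP instance profile:

# List the Host Agent's sapstartsrv; note the hostctrl path and host_profile,
# compared with the instance profile used by the Instance Agent above.
ps -ef | grep hostctrl/exe/sapstartsrv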

How Do You Access the SAP Instance Agent?

In the history of the SAP product range, SAP created a Java-based GUI tool called the SAP MMC. The SAP MMC can be used to administer SAP instances on the local server via the Instance Agent, and so that it can be started easily, it is distributed from the Instance Agent over HTTP. I’m not going to go into the SAP MMC here, because eventually it will be going away completely.

Generally, whenever you run sapcontrol and call a function, you are accessing the SAP Instance Agent:

sapcontrol -nr 00 -function GetSystemInstanceList

07.05.2020 07:01:32
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
sapas1ase1, 0, 50013, 50014, 1, MESSAGESERVER|ENQUE, GREEN

What Is the Document Root For the Instance Agent?

In Web Server lingo, the “document root” is the highest level directory that the web server can serve files from.

For the Instance Agent, this is /usr/sap/<SID>/<INST>/exe/servicehttp.
Example: /usr/sap/AS1/ASCS00/exe/servicehttp.

If we create a simple text file in the document root, we can access it via HTTP:

echo "Hello Darryl" > /usr/sap/AS1/ASCS00/exe/servicehttp/index.html

We can use wget to access the file like so:

wget -q -O - http://127.0.0.1:50013/index.html

Hello Darryl

Can We Use HTTPS?

You can use HTTPS. The TCP port is the secure port 5<##>14, where <##> is the instance number (50014 in our example).
Because the certificate used by the SAP Instance Agent is self-signed, you will need to trust the certificate directly.

We can use wget again as follows, but with the secure port and telling wget to not check the SSL certificate:

wget --no-check-certificate -q -O - https://127.0.0.1:50014/index.html

Hello Darryl

Are the Files Persisted Forever?

The files that you create under the document root of the SAP Instance Agent are not persisted indefinitely.
They get removed when the SAP instance is started.
Do not confuse this with when the Instance Agent is started.

For example, the Instance Agent is started automatically by sapinit at server boot. Only when the SAP instance served by the Instance Agent is started (i.e. sapcontrol -nr 00 -function Start) are the files in the document root location cleansed, just before the instance starts.
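
If you want to prove this behaviour to yourself, a quick test along the following lines should do it (a sketch only, using our example ASCS00 instance and standard sapcontrol functions):

# Create a test file in the document root.
echo "still here?" > /usr/sap/AS1/ASCS00/exe/servicehttp/test.txt

# Restarting only the Instance Agent leaves the file in place.
sapcontrol -nr 00 -function RestartService
ls -l /usr/sap/AS1/ASCS00/exe/servicehttp/test.txt

# Stopping and starting the SAP instance itself cleanses the document root,
# so after the Start the file should be gone.
sapcontrol -nr 00 -function Stop
sapcontrol -nr 00 -function Start
ls -l /usr/sap/AS1/ASCS00/exe/servicehttp/test.txt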

I have previously put together a nice diagram about the interactions of various components during SAP instance startup. See the post here:

How an Azure hosted SAP LaMa Controlled SAP System Starts Up

How Can We Make Our Files Persist?

You can make your files persist by creating a filesystem soft link from the document root and re-creating this link as part of the SAP instance start process:

mkdir /home/as1adm/docroot
echo "Hello Darryl" > /home/as1adm/docroot/index.html
ln -s /home/as1adm/docroot /usr/sap/AS1/ASCS00/exe/servicehttp/

Then you need to add an entry to the SAP instance profile to re-create the link on startup.
First we need to establish the current number of “Execute” tasks in the profile:

cdpro
grep Execute_ *

AS1_ASCS00_sapas1ase1:Execute_00 = immediate $(DIR_CT_RUN)/sapcpe$(FT_EXE) pf=$(_PF) $(_CPARG0)
AS1_ASCS00_sapas1ase1:Execute_01 = immediate $(DIR_CT_RUN)/sapcpe$(FT_EXE) pf=$(_PF) $(_CPARG1)
AS1_ASCS00_sapas1ase1:Execute_02 = local rm -f $(_MS)
AS1_ASCS00_sapas1ase1:Execute_03 = local ln -s -f $(DIR_EXECUTABLE)/msg_server$(FT_EXE) $(_MS)
AS1_ASCS00_sapas1ase1:Execute_04 = local rm -f $(_EN)
AS1_ASCS00_sapas1ase1:Execute_05 = local ln -s -f $(DIR_EXECUTABLE)/enserver$(FT_EXE) $(_EN)

We add a new entry as follows. The highest existing entry is Execute_05, so the new entry becomes Execute_06 (note the single quotes, which stop the shell from expanding $(DIR_EXECUTABLE) before it is written to the profile):

echo 'Execute_06 = local ln -s -f /home/as1adm/docroot $(DIR_EXECUTABLE)/servicehttp/' >> AS1_ASCS00_sapas1ase1

Upon starting the ASCS instance, the profile is read and the link created.

You would then access the index.html as follows:

wget --no-check-certificate -q -O - https://127.0.0.1:50014/docroot/index.html

Voila!

Even if you don’t use the above method for serving basic file content over HTTP, there is another use if you are running your SAP system in Azure.
We can use the HTTP capability of the SAP Instance Agent as a method to dynamically control which back-end VMs are accessible through an Azure ILB. This is an interesting concept, especially when combined with a couple of other posts I have written over the past year.
I will elaborate further on this over the next few months.

List Your Azure VMs in Excel – Part Deux

If you’ve been following, I recently showed you how to get a list of Azure hosted VMs into Microsoft Excel (O365) using Power Query.

You can find the original post here: List Your Azure VMs in Excel

The list was basic, listing only the VM names.
I promised a follow-up post on how to get the Power State of the VM, which shows whether it is running, stopped or deallocated (powered off completely).
Here is the follow-up.

Can We Just Adjust our API Call?

If you look at the Azure REST API specification for the Compute collection, you will see that it looks like a simple enough task to get the VM Power State as part of the call to VirtualMachines ListAll (see here https://docs.microsoft.com/en-us/rest/api/compute/virtualmachines/listall).

However, all is not well when applying the “statusOnly” URI parameter to the call. It generates an error in Excel when called via “Json.Document”, and Googling around throws up some interesting comments from others with the same issue.
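
If you want to sanity-check the raw API behaviour outside of Excel, one option (just a sketch, assuming you have the Azure CLI installed and are signed in) is to issue the same request with az rest and inspect the response yourself:

# Hypothetical check of the ListAll call with the statusOnly parameter, outside of Power Query.
# Replace <subscription-id> with your own Azure subscription ID.
az rest --method get \
  --url "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.Compute/virtualMachines?api-version=2019-07-01&statusOnly=true"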

In short, it doesn’t seem to work the way the API docs say it should, and there appears to be no other way apart from looping on the “VirtualMachineListResult” array that you get back from the “ListAll” call, then making a separate call to get the VM details from the VMInstance view.

So, this is exactly what I did but from within Power Query 🙂

What Changes Were Needed?

As part of the changes to our code, we had to include the VM ID, so that we could use it to call the VMInstance view.
To do this we included a new function FnGetVMdisplayStatus:

// FnGetVMdisplayStatus gets the instanceView object for the passed VM ID 
// then parses out the displayStatus from one of two possible locations. 
FnGetVMdisplayStatus = (idURI) as text => 
 let Source = Json.Document(Web.Contents(endPoint & idURI & "/instanceView?api-version=" & apiVersion)), 
 statuses = Source[statuses], 
 vmDisplayStatus1 = try statuses{1}[displayStatus] otherwise "", 
 vmDisplayStatus2 = try statuses{2}[displayStatus] otherwise "", 
 vmDisplayStatus = vmDisplayStatus1 & vmDisplayStatus2 
in
 vmDisplayStatus,

The new function uses the VM ID, passed in as the function parameter idURI, to formulate the API call URL.
You can see that we’ve also had to parameterise the URL elements throughout, to reduce duplication.

 endPoint = "https://management.azure.com" as text, 
 subscription = "[your subscription]" as text,
 apiVersion = "2019-07-01" as text,

The Power State of the VM (we use the JSON field displayStatus) is returned in one of two items inside the JSON statuses array. Since there is no guarantee which item in the array will contain the displayStatus, we check both:

 vmDisplayStatus1 = try statuses{1}[displayStatus] otherwise "", 
 vmDisplayStatus2 = try statuses{2}[displayStatus] otherwise "", 

Finally, we augment our output table with an additional column, which calls FnGetVMdisplayStatus to get the related Power State for the VM, using the current table row’s id column value:

#"VMdetail-list-with-displayStatus" = Table.AddColumn(#"VMdetail-list", "displayStatus", each try FnGetVMdisplayStatus([id]) otherwise "??")

The End Result

The end result of our modifications is that we now have an additional column in our output table, populated by our additional function call to the Azure API to get the VMInstance view, parse it and extract the displayStatus.

Once again, thanks to Gil Raviv for the original code with the pagination technique (available here: datachant.com/2016/06/27/cursor-based-pagination-power-query/).

NOTE: I am a Power Query novice, so if YOU (Jon 😉 ) have any tips on how to make this better, neater, faster, please leave a comment and I will test any recommendations.

let 
 iterations = 10 as number,
 // Max Number of Pages of VMs. 
 endPoint = "https://management.azure.com" as text, 
 subscription = "[your subscription]" as text, 
 apiVersion = "2019-07-01" as text, 
 vmListUrl = endPoint & "/subscriptions/" & subscription & "/providers/Microsoft.Compute/virtualMachines?api-version=" & apiVersion as text,
 
// FnGetOnePage is the function that performs an import of single page. 
// The page consists of a record with the data and the URL in the 
// fields data and next. Other Web APIs hold the data and cursor in different formats 
// but the principle is the same. 
FnGetOnePage = (url) as record => 
 let Source = Json.Document(Web.Contents(url)), 
data = try Source[value] otherwise null, 
 next = try Source[nextLink] otherwise null, 
 res = [Data=data, Next=next] 
in
 res,

// FnGetVMdisplayStatus gets the instanceView object for the passed VM ID 
// then parses out the displayStatus from one of two possible locations. 
FnGetVMdisplayStatus = (idURI) as text => 
 let Source = Json.Document(Web.Contents(endPoint & idURI & "/instanceView?api-version=" & apiVersion)), 
 statuses = Source[statuses], 
 vmDisplayStatus1 = try statuses{1}[displayStatus] otherwise "", 
 vmDisplayStatus2 = try statuses{2}[displayStatus] otherwise "", 
 vmDisplayStatus = vmDisplayStatus1 & vmDisplayStatus2 
in
 vmDisplayStatus,

// GeneratedVMList is the function to page through the subscriptions and get the VM lists. 
GeneratedVMList = 
 List.Generate( ()=>[i=0, res = FnGetOnePage(vmListUrl)], 
 each [i]<iterations and [res][Data]<>null, 
  each [i=[i]+1, 
  res = FnGetOnePage([res][Next])], 
  each [res][Data]),

#"VMListTable" = Table.FromList(GeneratedVMList, Splitter.SplitByNothing(), null, null, ExtraValues.Error), 
#"Expanded-VMListTable-Column1" = Table.ExpandListColumn(#"VMListTable", "Column1"), 
#"VMdetail-list" = Table.ExpandRecordColumn(#"Expanded-VMListTable-Column1", "Column1", {"name","id"}), 
#"VMdetail-list-with-displayStatus" = Table.AddColumn(#"VMdetail-list", "displayStatus", each try FnGetVMdisplayStatus([id]) otherwise "??") 
in 
#"VMdetail-list-with-displayStatus"

You need to replace [your subscription] with your actual Azure subscription ID.

You can insert the code into an Excel O365 workbook by following the steps in my original post here:

List Your Azure VMs in Excel

Once you have entered the code, you will get back the VM name, the id URI and the displayStatus, which reflects the power state of the VM.

That’s Amazing, What Next?

Can we enhance this code even further? Yes we can.
Right now, if you have multiple subscriptions, you will need to create multiple Power Query queries and then combine them into another query with the other queries as sources. This is quite cumbersome.

In another blog post, I will show how you can provide a list of multiple subscriptions and get all the results in just one query. Much neater.

List Your Azure VMs in Excel

In this post I would like to show how you can use a Power Query inside the latest version of Excel to dynamically list your Azure VMs in Excel.

You are probably wondering why you would need to do such a thing.
Well, if you want to validate that all VMs in Azure have been picked up by your SAP LaMa system, then this is a fairly easy way to perform that validation: combine the SAP LaMa list and the Azure list, and check one against the other using an Excel VLOOKUP, for example.

Prerequisites

  • You’re going to need the latest version of Excel (O365).
  • You will also need read access to your Azure subscription (if you can log into the Azure portal and see VMs, then that should be good enough).

Create a Workbook

Open Excel and create a new blank workbook.
Select the “Data” tab:

Click “Get Data” and select “From Other Sources”, then click “Blank Query”:

Click “Advanced Editor”:

Remove any existing text from the query box:

Secret Sauce – The Code

We modified some cool Power Query code provided by Gil Raviv (available here: datachant.com/2016/06/27/cursor-based-pagination-power-query/).

Instead of querying Facebook like in Gil’s example, we change it to the URL of our Azure subscription and the specific Compute API for Virtual Machines in Azure (API details are here: docs.microsoft.com/en-us/rest/api/compute/virtualmachines/listall).

Adjust the code below, changing:
[YOUR SUBSCRIPTION] = Your Azure subscription ID.

You may also need to increase the “iterations” parameter if you have more than say 900 VMs.

 let
 iterations = 10, // Max Number of Pages of VMs, you may need more depending on how many 100s of VMs you have. 
 url = "https://management.azure.com/subscriptions/[YOUR SUBSCRIPTION]/providers/Microsoft.Compute/virtualMachines?api-version=2019-07-01", 

 // FnGetOnePage is the function that performs an import of a single page. 
 // The page consists of a record with the data and the URL in the 
 // fields data and next. Other Web APIs hold the data and cursor in different formats 
 // but the principle is the same. 
 FnGetOnePage = (url) as record => let Source = Json.Document(Web.Contents(url)), 
 data = try Source[value] otherwise null, 
 next = try Source[nextLink] otherwise null, 
 res = [Data=data, Next=next] in res, 
 GeneratedList = List.Generate( ()=>[i=0, res = FnGetOnePage(url)], 
 each [i]<iterations and [res][Data]<>null, each [i=[i]+1, 
 res = FnGetOnePage([res][Next])], 
 each [res][Data]),

 #"Converted to Table" = Table.FromList(GeneratedList, Splitter.SplitByNothing(), null, null, ExtraValues.Error), 
 #"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"), 
 #"Expanded Column2" = Table.ExpandRecordColumn(#"Expanded Column1", "Column1", {"name"}, {"VM-Name"}) 
 in 
 #"Expanded Column2"

Paste your modified code into the query box:

Check there are no syntax errors and click “Done”:

Click “Edit Credentials”:

Select “Organizational account” and select the first entry that contains your subscription ID, then click “Sign in”:

Sign in with your username and password for the Azure Portal.
Click “Connect” once signed in.

You will then see your VMs listed in the query output:

Click “Close & Load”:

The query is now embedded into a new worksheet:

That’s it.
For now this is a basic VM listing.
You may be wanting to extract more information about the VMs in the list, maybe the powerState, or the resourceGroup.
I’ll be showing you how to do this in the second post here.

Recreating SAP ASE Database I/O Workload using Fio on Azure

After you have deployed SAP databases onto Azure hosted virtual machines, you may find that sometimes you don’t get the performance you were expecting.

 

How can this be? It’s guaranteed isn’t it?
Well, the answer is, as with everything, sometimes it just doesn’t work that way and there are a large number of factors involved.
Even the Microsoft consultants I’ve spoken with have a checkpoint for customers, to confirm at the VM level that they are seeing the IOPS that they are expecting to see.
Especially when deploying high performance applications such as SAP HANA in Azure.
I can’t comment on the reasons why performance may not be as expected, although I do have my own theories.

Let’s look at how we can simply simulate an SAP ASE 16.0 SP03 database I/O operation, so that we can run a reasonably representative and repetitive test, without the need for ASE to even be installed.
Remember, your specific workload could be different due to the design of your database, type and size of transactions and other factors.
What I’m really trying to show here, is how you can use an approximation to provide a simple test that is repetitive and doesn’t need ASE installing.

Microsoft have their own page dedicated to running I/O tests in Azure, and they document the use of the Fio tool for this process.
Read further detail about Fio here: https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks

Since you may need to show your I/O results to your local Microsoft representative, I would recommend you use the tool that Microsoft are familiar with, and not some other tool. This should help speed up any fault resolution process.


In SAP ASE 16.0 SP03 (the version I had to hand) on a SUSE Linux 12.3 server, imagine we run a SQL operation like "SELECT * FROM MYTABLE WHERE COL2='X'", which in our example causes an execution path that performs a table scan of the table MYTABLE.
The table scan results in an asynchronous sequential read of the single database data file (data device) on the VM disk, which is an LVM logical volume striped over 3 physical disks that make up the one volume group.

We are going to assume that you have saptune installed and configured correctly for SAP ASE, so we will not touch on the Linux configuration.
One thing to note is that we are assuming the Linux file system hosting the database devices is configured to permit direct I/O (avoiding the Linux filesystem cache). This helps with the test configuration setup.

SAP ASE will try to optimise our SQL operation if ASE has been configured correctly, and use a read-ahead algorithm with large I/O pages of up to 128KB. But even with the right ASE configuration, the use of 128KB pages is not always possible, for example if the table is in some way fragmented.
As part of our testing we will assume that 128KB pages are not being used. Instead we will use 16KB, the database page size and therefore the smallest I/O that ASE will issue (the worst case scenario).
We will also assume that our SQL statement results in exactly 1GB of data to be read from the disk each time.
This is highly unlikely in a tuned ASE instance, due to the database datacache. However, we will assume this instance is not tuned and under slight load, causing the datacache to have re-used the memory pages between tests.

If we look at the help page for the Fio tool, it’s a fairly hefty read.
Let’s start by translating some of the notations used to something we can appreciate with regards to our test scenario:

Fio Config Item          Our Test Values/Setup
I/O type                 = sequential read
Block size               = 16KB
I/O size                 = 1024m (amount of data)
I/O engine               = async I/O, direct (unbuffered)
I/O depth                = 2048 (disk queue depth)
Target file/device       = /sybase/AS1/sapdata/AS1_data_001.dat
Threads/processes/jobs   = 1

We can see from the list above that the queue depth is the only thing we are not yet sure of.
The actual value can be determined by querying the Linux disk devices; in essence, it is a value that represents how much I/O can be queued for a specific disk device.
In checking my setup, I can see that I have 2048 defined on SLES 12 SP3.
More information on queue depth in Azure can be found here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage-performance#queue-depth

On SLES you can check the queue depth using the lsscsi command with the Long, Long, Long format (-lll):

lsscsi -lll

 

[5:0:0:4] disk Msft Virtual Disk 1.0 /dev/sdd
device_blocked=0
iocounterbits=32
iodone_cnt=0x2053eea
ioerr_cnt=0x0
iorequest_cnt=0x2053eea
queue_depth=2048
queue_type=simple
scsi_level=6
state=running
timeout=300
type=0

An alternative way to check is to output the content of the /proc/scsi/sg/devices file and look at the values in the 7th column:

cat /proc/scsi/sg/devices

 

2 0 0 0 0 1 2048 1 1
3 0 1 0 0 1 2048 0 1
5 0 0 0 0 1 2048 0 1
5 0 0 4 0 1 2048 0 1
5 0 0 2 0 1 2048 0 1
5 0 0 1 0 1 2048 0 1
5 0 0 3 0 1 2048 0 1

For the target file (the source file in our read test case), we can either use an existing data device file (if ASE is installed and a database exists), or we can create a new 1GB data file containing zeros.

Using “dd” you can quickly create a 1GB file full of zeros:

dd if=/dev/zero of=/sybase/AS1/sapdata/AS1_data_001.dat bs=1024 count=1048576

 

1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.4592 s, 166 MB/s

We will be using only 1 job/thread in Fio to perform the I/O test.
Generally in ASE 16.0 SP03, the number of “disk tasks” is configured using “sp_configure” and visible in the configuration file.
The configured value is usually 1 in a default installation and very rarely needs adjusting.

See here: https://help.sap.com/viewer/379424e5820941d0b14683dd3a992d5c/16.0.3.5/en-US/a778c8d8bc2b10149f11a28571f24818.html

Once we’re happy with the above settings, we just need to apply them to the Fio command line as follows:

fio --name=global --readonly --rw=read --direct=1 --bs=16k --size=1024m --iodepth=2048 --filename=/sybase/AS1/sapdata/AS1_data_001.dat --numjobs=1 --name=job1

You will see the output of Fio on the screen as it performs the I/O work.
In testing, the amount of clock time that Fio takes to perform the work is reflective of the performance of the I/O subsystem.
In extremely fast cases, you will need to look at the statistics that have been output to the screen.

The Microsoft documentation and examples show running very lengthy operations on Fio, to ensure that the disk caches are populated properly.
In my experience, I’ve never had the liberty to explain to the customer that they just need to do the same operation for 30 minutes, over and over and it will be much better. I prefer to run this test cold and see what I get as a possible worst-case.

job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=109MiB/s][r=6950 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=87654: Tue Jan 14 06:36:01 2020
read: IOPS=6524, BW=102MiB/s (107MB/s)(1024MiB/10044msec)
clat (usec): min=49, max=12223, avg=148.22, stdev=228.29
lat (usec): min=49, max=12223, avg=148.81, stdev=228.39
clat percentiles (usec):
| 1.00th=[ 61], 5.00th=[ 67], 10.00th=[ 70], 20.00th=[ 75],
| 30.00th=[ 81], 40.00th=[ 88], 50.00th=[ 96], 60.00th=[ 108],
| 70.00th=[ 125], 80.00th=[ 159], 90.00th=[ 322], 95.00th=[ 412],
| 99.00th=[ 644], 99.50th=[ 848], 99.90th=[ 3097], 99.95th=[ 5145],
| 99.99th=[ 7963]
bw ( KiB/s): min=64576, max=131712, per=99.98%, avg=104379.00, stdev=21363.19, samples=20
iops : min= 4036, max= 8232, avg=6523.65, stdev=1335.24, samples=20
lat (usec) : 50=0.01%, 100=54.55%, 250=32.72%, 500=10.48%, 750=1.59%
lat (usec) : 1000=0.31%
lat (msec) : 2=0.20%, 4=0.07%, 10=0.07%, 20=0.01%
cpu : usr=6.25%, sys=20.35%, ctx=65541, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=65536,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

 

Run status group 0 (all jobs):
READ: bw=102MiB/s (107MB/s), 102MiB/s-102MiB/s (107MB/s-107MB/s), io=1024MiB (1074MB), run=10044-10044msec

Disk stats (read/write):
dm-8: ios=64233/2, merge=0/0, ticks=7416/8, in_queue=7436, util=74.54%, aggrios=21845/0, aggrmerge=0/0, aggrticks=2580/2, aggrin_queue=2581, aggrutil=25.78%
sdg: ios=21844/0, merge=0/0, ticks=2616/0, in_queue=2616, util=25.78%
sdh: ios=21844/1, merge=0/0, ticks=2600/4, in_queue=2600, util=25.63%
sdi: ios=21848/1, merge=0/0, ticks=2524/4, in_queue=2528, util=24.92%

The lines of significance to you will be:

– Line: IOPS.

Shows the min, max and average IOPS that were obtained during the execution. This should roughly correspond to the IOPS expected for the type of Azure disk on which your source data file is located. Remember that if you have a striped file system with RAID under a logical volume manager, then you should expect to see more IOPS because you have more disks.

NOTE: The IOPS will not hit the maximum achievable, because our page/block size is too high for this. The Azure disk values are achievable only with random read, 8KB page sizes, multiple threads/jobs and a queue depth of 256 (https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks).

– Lines: “lat (usec)” and “lat (msec)”.

These are the proportions of latency in micro and milliseconds respectively.
If you have high percentages in the millisecond ranges, then you may have an issue. You would not expect this for the type of disks you would want to be running an SAP ASE database on.

In my example above, I am using 3x P40 Premium Storage SSD disks.
You can tell it is a striped logical volume setup, because the very last 3 lines of output show my 3 Linux disk device names (sdg, sdh and sdi), which sit under my volume group.
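
If you want to confirm the striped layout on your own VM, a couple of standard commands will show it (the volume group and logical volume names below are hypothetical placeholders, so adjust them for your setup):

# Show the block device tree - the data disks should appear under the same LVM logical volume.
lsblk

# Show the segment/stripe layout of the logical volume (names are placeholders).
lvdisplay -m /dev/sapvg/sapdatalv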

You can use the links referenced earlier in this post to determine what you should be seeing on your setup.

NOTE: If you are running SAP on the ASE database, then you will more than likely be using Premium Storage (it’s the only option supported by SAP) and it will be Azure Managed (not un-managed).

Let’s look at the same Fio output using a 128KB page size (like ASE would if it was using large I/O).
We use the same command line, but just change the “--bs” parameter to 128k:

fio --name=global --readonly --rw=read --direct=1 --bs=128k --size=1024m --iodepth=2048 --filename=/sybase/AS1/sapdata/AS1_data_001.dat --numjobs=1 --name=job1

 

job1: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=128MiB/s][r=1021 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=93539: Tue Jan 14 06:54:48 2020
read: IOPS=1025, BW=128MiB/s (134MB/s)(1024MiB/7987msec)
clat (usec): min=90, max=46843, avg=971.48, stdev=5784.85
lat (usec): min=91, max=46844, avg=972.04, stdev=5784.84
clat percentiles (usec):
| 1.00th=[ 101], 5.00th=[ 109], 10.00th=[ 113], 20.00th=[ 119],
| 30.00th=[ 124], 40.00th=[ 130], 50.00th=[ 137], 60.00th=[ 145],
| 70.00th=[ 157], 80.00th=[ 176], 90.00th=[ 210], 95.00th=[ 273],
| 99.00th=[42206], 99.50th=[42730], 99.90th=[43254], 99.95th=[43254],
| 99.99th=[46924]
bw ( KiB/s): min=130299, max=143616, per=100.00%, avg=131413.00, stdev=3376.53, samples=15
iops : min= 1017, max= 1122, avg=1026.60, stdev=26.40, samples=15
lat (usec) : 100=0.87%, 250=93.13%, 500=3.26%, 750=0.43%, 1000=0.13%
lat (msec) : 2=0.18%, 4=0.01%, 10=0.04%, 50=1.95%
cpu : usr=0.55%, sys=4.12%, ctx=8194, majf=0, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

Run status group 0 (all jobs):
READ: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=1024MiB (1074MB), run=7987-7987msec

Disk stats (read/write):
dm-8: ios=8059/0, merge=0/0, ticks=7604/0, in_queue=7640, util=95.82%, aggrios=5461/0, aggrmerge=0/0, aggrticks=5114/0, aggrin_queue=5114, aggrutil=91.44%
sdg: ios=5461/0, merge=0/0, ticks=564/0, in_queue=564, util=6.96%
sdh: ios=5461/0, merge=0/0, ticks=7376/0, in_queue=7376, util=91.08%
sdi: ios=5462/0, merge=0/0, ticks=7404/0, in_queue=7404, util=91.44%

You can see that we actually got a lower IOPS value, but we returned all the data quicker and got a higher throughput.
This is due to the laws of how IOPS and throughput interact. A higher page/block size means we can potentially read more data in each I/O request.
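
Plugging the numbers from the two runs into the "throughput = IOPS * block size" relationship shows where the extra throughput comes from:

16KB run: 6,524 IOPS * 16KB = 104,384 KB/s (~102 MiB/s)
128KB run: 1,025 IOPS * 128KB = 131,200 KB/s (~128 MiB/s)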

Some of the performance randomness now becomes apparent, with the inconsistency of the “util” for each disk device. However, there is a note on the Fio webpage about how this metric (util) is not necessarily reliable.

You should note that, although we are doing a simulated direct I/O (unbuffered) operation at the Linux level, outside of Linux at the Azure level, there could be caching (data disk caching, which is actually cached on the underlying Azure physical host).

You can check your current setup directly in Azure or at the Linux level, by reading through my previous post on how to do this easily.

https://www.it-implementor.co.uk/2019/12/17/listing-azure-vm-datadisks-and-cache-settings-using-azure-portal-jmespath-bash/

Now for the final test.
Can we get the IOPS that we should be getting for our current setup and disks?

Following the Microsoft documentation to create the fioread.ini and execute (note it needs 120GB of disk space – 4 reader jobs x 30GB):

cat <<EOF > /tmp/fioread.ini
[global]
size=30g
direct=1
iodepth=256
ioengine=libaio
bs=8k

 

[reader1]
rw=randread
directory=/sybase/AS1/sapdata/

[reader2]
rw=randread
directory=/sybase/AS1/sapdata/

[reader3]
rw=randread
directory=/sybase/AS1/sapdata/

[reader4]
rw=randread
directory=/sybase/AS1/sapdata/
EOF

fio --runtime 30 /tmp/fioread.ini
reader1: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader2: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader3: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader4: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
fio-3.10
Starting 4 processes
reader1: Laying out IO file (1 file / 30720MiB)
reader2: Laying out IO file (1 file / 30720MiB)
reader3: Laying out IO file (1 file / 30720MiB)
reader4: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [r(4)][100.0%][r=128MiB/s][r=16.3k IOPS][eta 00m:00s]
reader1: (groupid=0, jobs=1): err= 0: pid=120284: Tue Jan 14 08:16:38 2020
read: IOPS=4250, BW=33.2MiB/s (34.8MB/s)(998MiB/30067msec)
slat (usec): min=3, max=7518, avg=10.06, stdev=43.39
clat (usec): min=180, max=156683, avg=60208.81, stdev=32909.11
lat (usec): min=196, max=156689, avg=60219.59, stdev=32908.61
clat percentiles (usec):
| 1.00th=[ 1549], 5.00th=[ 3294], 10.00th=[ 4883], 20.00th=[ 45351],
| 30.00th=[ 47973], 40.00th=[ 49021], 50.00th=[ 51643], 60.00th=[ 54789],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[143655], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=25168, max=46800, per=26.07%, avg=34003.88, stdev=4398.09, samples=60
iops : min= 3146, max= 5850, avg=4250.45, stdev=549.78, samples=60
lat (usec) : 250=0.01%, 500=0.02%, 750=0.12%, 1000=0.28%
lat (msec) : 2=1.35%, 4=5.69%, 10=5.72%, 20=1.15%, 50=30.21%
lat (msec) : 100=45.60%, 250=9.86%
cpu : usr=1.29%, sys=5.58%, ctx=6247, majf=0, minf=523
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=127794,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader2: (groupid=0, jobs=1): err= 0: pid=120285: Tue Jan 14 08:16:38 2020
read: IOPS=4183, BW=32.7MiB/s (34.3MB/s)(983MiB/30067msec)
slat (usec): min=3, max=8447, avg= 9.92, stdev=54.73
clat (usec): min=194, max=154937, avg=61163.27, stdev=32365.78
lat (usec): min=217, max=154945, avg=61173.85, stdev=32365.26
clat percentiles (usec):
| 1.00th=[ 1778], 5.00th=[ 3294], 10.00th=[ 5145], 20.00th=[ 46400],
| 30.00th=[ 47973], 40.00th=[ 49546], 50.00th=[ 52167], 60.00th=[ 55313],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[111674], 99.50th=[145753], 99.90th=[147850], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=26816, max=43104, per=25.67%, avg=33474.27, stdev=3881.96, samples=60
iops : min= 3352, max= 5388, avg=4184.27, stdev=485.26, samples=60
lat (usec) : 250=0.01%, 500=0.03%, 750=0.08%, 1000=0.15%
lat (msec) : 2=1.02%, 4=6.31%, 10=5.05%, 20=1.12%, 50=27.79%
lat (msec) : 100=49.09%, 250=9.37%
cpu : usr=1.14%, sys=5.53%, ctx=6362, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=125800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader3: (groupid=0, jobs=1): err= 0: pid=120286: Tue Jan 14 08:16:38 2020
read: IOPS=3919, BW=30.6MiB/s (32.1MB/s)(921MiB/30066msec)
slat (usec): min=3, max=12886, avg= 9.40, stdev=56.68
clat (usec): min=276, max=151726, avg=65256.88, stdev=31578.48
lat (usec): min=283, max=151733, avg=65266.86, stdev=31578.73
clat percentiles (usec):
| 1.00th=[ 1958], 5.00th=[ 3884], 10.00th=[ 10421], 20.00th=[ 47449],
| 30.00th=[ 49021], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 65274],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[139461], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=21344, max=42960, per=24.04%, avg=31354.32, stdev=5530.77, samples=60
iops : min= 2668, max= 5370, avg=3919.27, stdev=691.34, samples=60
lat (usec) : 500=0.01%, 750=0.05%, 1000=0.12%
lat (msec) : 2=0.92%, 4=4.15%, 10=4.59%, 20=0.59%, 50=25.92%
lat (msec) : 100=53.48%, 250=10.18%
cpu : usr=0.96%, sys=5.22%, ctx=7986, majf=0, minf=521
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=117853,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader4: (groupid=0, jobs=1): err= 0: pid=120287: Tue Jan 14 08:16:38 2020
read: IOPS=3955, BW=30.9MiB/s (32.4MB/s)(928MiB/30020msec)
slat (usec): min=3, max=9635, avg= 9.57, stdev=52.03
clat (usec): min=163, max=151463, avg=64699.59, stdev=32233.21
lat (usec): min=176, max=151468, avg=64709.90, stdev=32232.66
clat percentiles (usec):
| 1.00th=[ 1729], 5.00th=[ 3720], 10.00th=[ 7832], 20.00th=[ 46924],
| 30.00th=[ 48497], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 87557],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[102237],
| 99.00th=[109577], 99.50th=[143655], 99.90th=[147850], 99.95th=[147850],
| 99.99th=[147850]
bw ( KiB/s): min=21488, max=46320, per=24.22%, avg=31592.63, stdev=4760.10, samples=60
iops : min= 2686, max= 5790, avg=3949.05, stdev=595.03, samples=60
lat (usec) : 250=0.02%, 500=0.07%, 750=0.07%, 1000=0.09%
lat (msec) : 2=1.31%, 4=4.04%, 10=5.13%, 20=1.28%, 50=24.76%
lat (msec) : 100=52.89%, 250=10.35%
cpu : usr=1.06%, sys=5.21%, ctx=8226, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=118743,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=127MiB/s (134MB/s), 30.6MiB/s-33.2MiB/s (32.1MB/s-34.8MB/s), io=3830MiB (4016MB), run=30020-30067msec

Disk stats (read/write):
dm-8: ios=490190/1, merge=0/0, ticks=30440168/64, in_queue=30570784, util=99.79%, aggrios=163396/0, aggrmerge=0/0, aggrticks=10170760/21, aggrin_queue=10172817, aggrutil=99.60%
sdg: ios=162989/1, merge=0/0, ticks=10134108/64, in_queue=10135484, util=99.59%
sdh: ios=163379/0, merge=0/0, ticks=10175316/0, in_queue=10177440, util=99.60%
sdi: ios=163822/0, merge=0/0, ticks=10202856/0, in_queue=10205528, util=99.59%

throughput = [IOPS] * [block size]
example: 3000 IOPS * 8 (8KB) = 24000KB/s (24MB/s)

From our output, we can see how the IOPS and blocksize affect the throughput calculation:
16,300 (IOPS total) * 8 (8KB) = 130400KB/s (127MB/s)

Simple answer, no, we don’t get what we expect for our P40 disks. Further investigation required. 🙁

With SAP LaMa you can auto-save on your Azure Managed Disk Costs

By now, most people know that after you’ve moved your SAP landscape to the cloud, you could save hosting costs by shutting down SAP system VMs when they are expected to not be used.
(There are caveats around this as it depends on whether you’re paying for reserved instances).

But did you know there’s also an extra saving that can be had in the cloud?

For SAP to support your SAP systems in Microsoft Azure, you must use Premium tier storage.
The reason for this is primarily because Premium tier storage comes with an SLA from Microsoft, which means you are expected to receive a certain level of performance from those disks.
However, you pay more for this SLA and the proposed performance, which is quite correct when you’re using the disk. But what about when you’re not using the disk?

Right now, in the Azure “West Europe” region, a Premium tier P10 disk (SSD, 128GiB in size with 500 IOPS and 100MB/s throughput), will cost you £16.16 per month, excluding any deals and discounts (such as Azure Managed Disk Reservation).
The P10 is probably the work-horse of the majority of mid-sized server estates. Microsoft recommend a P10 as the Linux root disk for SUSE Linux based HANA database M-Series Azure VMs.

At the other end, the cost of a Standard tier E10 disk (SSD, 128GiB with 500 IOPS and 60MB/s throughput) is £7.16 per month, with the only performance difference being the throughput and the SLA.

So for the same size disk, although with lower throughput, we pay £9 per month less (55% less). I am going to say this saving is roughly 30 pence per day.
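
For clarity, here is the back-of-an-envelope arithmetic behind those numbers:

P10 minus E10: £16.16 - £7.16 = £9.00 saved per month
Per day (approx.): £9.00 / 30 days = roughly £0.30 (30 pence)
Per weekend (2 days): 2 x £0.30 = roughly £0.60 (60 pence)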

(There is one caveat and that is for standard SSD disks like the E10, you pay a transaction fee of 0.1 pence (£0.001) on the disk for every 10,000 256KiB I/O operations.
However, we will see that this transaction fee will not impact us and our saving, in a moment.)

Here’s how we can save money on this Premium managed disk.

In Microsoft Azure, you can change the disk tier from Premium to Standard, when the VM on which the disk is attached is shutdown (deallocated).
It’s simple, you just use the Azure Portal to change the disk configuration once the VM is shutdown.

While this is nice for just a couple of disks, this is not something you’re going to want to do on a regular basis.
Don’t forget, before you start the VM you need to switch the disk tier back to Premium (to retain your SAP support).
So for mass-changes, you may want to use PowerShell to adjust the disks before starting the VMs.
This itself could become a bit of a burden, since you would then lose the ability to mass-power-on VMs from the Azure Portal completely. You would need to use PowerShell all of the time, or set up an Azure-based operation schedule (a.k.a. Power Automate, previously Microsoft Flow).
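
Just to illustrate the idea, here is a rough, non-production sketch of such a mass-change using the Azure CLI in bash (the post mentions PowerShell, but the principle is the same; the resource group and VM names are hypothetical):

#!/bin/bash
# Sketch: flip the managed disks of a VM between Premium and Standard SSD tiers.
# Resource group and VM name are placeholders - adjust for your environment.
RG="my-sap-rg"
VM="my-sap-vm"
SKU="${1:-StandardSSD_LRS}"   # pass Premium_LRS before starting the VM again (to retain SAP support)

# The disk SKU can only be changed while the VM is deallocated.
az vm deallocate --resource-group "$RG" --name "$VM"

# Collect the OS disk and data disk IDs attached to the VM.
OS_DISK=$(az vm show -g "$RG" -n "$VM" --query "storageProfile.osDisk.managedDisk.id" -o tsv)
DATA_DISKS=$(az vm show -g "$RG" -n "$VM" --query "storageProfile.dataDisks[].managedDisk.id" -o tsv)

for DISK_ID in $OS_DISK $DATA_DISKS; do
  az disk update --ids "$DISK_ID" --sku "$SKU"
done

# When switching back to Premium_LRS, start the VM afterwards:
# az vm start --resource-group "$RG" --name "$VM"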

This is where SAP Landscape Manager (LaMa) really comes into its own.
With SAP LaMa, your BASIS team can:

  • Perform the start-up & shutdown of the SAP relevant VMs.
  • Perform the start-up & shutdown of the SAP systems on the VMs once they have been started (or the reverse).
  • Use the inbuilt scheduling capability of SAP LaMa to schedule the VM and SAP system operations (full automation of start-up and shutdown operations of the whole stack).

The security capabilities of Azure, coupled with SAP LaMa mean that the BASIS team can only perform specific VM related operations on the SAP VMs. Which gives the cloud Ops team peace of mind.

Now for the best bit.
To be able to save money on managed disk costs in Azure, the SAP BASIS administrator has to merely tick a tickbox in the SAP LaMa cloud provider settings, to “Change Storage Type to save costs”:

The next time the VM is de-allocated, SAP LaMa automatically changes the disk configuration in Azure, to a lower cost disk tier.
As we mentioned earlier, since the start/stop is controlled by SAP LaMa, it knows to switch the disk back to Premium tier during the start-up operation.

How simple is that!

As mentioned, there are some complications around any reservation payments for managed disk, so you need to understand what you’re paying for, before just enabling the tick-box!

Here are my very basic calcs for our P10/E10 disk combination example:

  • Weekends per year: 52
  • Saving per weekend: 60 pence
  • Total possible saving per year for 1 disk if it was unused every weekend: £31.20

Now let’s imagine that saving opportunity was applied across your 100 server estate, whereby every server had at least 1x P10 disk.
You can’t shut down production, because it’s 24/7, but you don’t do development & testing round-the-clock and you have no international locations, so we are going to imagine our SAP estate is maybe 70% applicable to this saving opportunity. That’s 70 servers x £31.20, which equals a saving of £2,184 per year on managed disks, just by ticking a tickbox.

These are obviously just best guesses, but it shows how costs can build up and can also be reduced.

Happy ticking.