This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Recreating SAP ASE Database I/O Workload using Fio on Azure

After you have deployed SAP databases onto Azure hosted virtual machines, you may find that sometimes you don’t get the performance you were expecting.

 

How can this be? It’s guaranteed isn’t it?
Well, the answer is, as with everything, sometimes it just doesn’t work that way and there are a large number of factors involved.
Even the Microsoft consultants I’ve spoken with have a check point for customers to confirm at the VM level, that they are seeing the IOPS that they are expecting to see.
Especially when deploying high performance applications such as SAP HANA in Azure.
I can’t comment on the reasons why performance may not be as expected, although I do have my own theories.

Let’s look at how we can simply simulate an SAP ASE 16.0 SP03 database I/O operation, so that we can run a reasonably representative and repetitive test, without the need for ASE to even be installed.
Remember, your specific workload could be different due to the design of your database, type and size of transactions and other factors.
What I’m really trying to show here, is how you can use an approximation to provide a simple test that is repetitive and doesn’t need ASE installing.

Microsoft have their own page dedicated to running I/O tests in Azure, and they document the use of the Fio tool for this process.
Read further detail about Fio here: https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks

Since you may need to show your I/O results to your local Microsoft representative, I would recommend you use the tool that Microsoft are familiar with, and not some other tool. This should help speed up any fault resolution process.

NOTE: The IOPS will not hit the maximum achievable, because in our test, the page/block size is too high for this. Microsoft’s quoted Azure disk values are achievable only with random read, 8KB page sizes, multiple threads/jobs and a queue depth of 256 (see here: https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks).

In SAP ASE 16.0 SP03 (this is the version I had to hand) on a SUSE Linux 12.3 server, imagine we run a SQL operation like “SELECT * FROM MYTABLE WHERE COL2=’X'” which in our example causes an execution path that performs a table scan of the table MYTABLE.
The table scan results in an asynchronous sequential read of the single database data file (data device) on the VM disk which is an LVM logical volume striped over 3 physical disks that make up the one volume group.

We are going to assume that you have saptune installed and configured correctly for SAP ASE, so we will not touch on the Linux configuration.
One thing to note, is that our assumption includes that the Linux file system hosting the database devices is configured to permit direct I/O (avoiding the Linux filesystem cache). This helps with the test configuration setup.

SAP ASE will try and optimise our SQL operation if ASE has been configured correctly, and use a read-ahead algorithm with large I/O pages up-to 128KB. But even with the right ASE configuration, the use of 128KB pages is is not always possible, for example if the table is in some ways fragmented.
As part of our testing we will assume that 128KB pages are not being used. We will instead use 16KB, which is the smallest page size in ASE (worst case scenario).
We will also assume that our SQL statement results in exactly 1GB of data to be read from the disk each time.
This is highly unlikely in a tuned ASE instance, due to the database datacache. However, we will assume this instance is not tuned and under slight load, causing the datacache to have re-used the memory pages between tests.

If we look at the help page for the Fio tool, it’s a fairly hefty read.
Let’s start by translating some of the notations used to something we can appreciate with regards to our test scenario:

Fio Config Item            Our Test Values/Setup
I/O type                    = sequential read
Blocksize                 = 16KB
I/O size                    = 1024m (amount of data)
I/O engine               = asynch I/O – direct (unbuffered)
I/O depth                 = 2048 (disk queue depth)
Target file/device    = /sybase/AS1/sapdata/AS1_data_001.dat
Threads/processes/jobs = 1

We can see that from the list above, the queue depth is the only thing that we are not sure on.
The actual values can be determined by querying the Linux disk devices but in essence what this is doing is asking for a value that represents how much I/O can be queued for a specific disk device.
In checking my setup, I can see that I have 2048 defined on SLES 12 SP3.
More information on queue depth in Azure can be found here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage-performance#queue-depth

On SLES you can check the queue depth using the lsscsi command with the Long, Long, Long format (-lll):

lsscsi -lll

 

[5:0:0:4] disk Msft Virtual Disk 1.0 /dev/sdd
device_blocked=0
iocounterbits=32
iodone_cnt=0x2053eea
ioerr_cnt=0x0
iorequest_cnt=0x2053eea
queue_depth=2048
queue_type=simple
scsi_level=6
state=running
timeout=300
type=0

An alternative way to check is to output the content of the /proc/scsi/sg/devices file and look at the values in the 7th column:

cat /proc/scsi/sg/devices

 

2 0 0 0 0 1 2048 1 1
3 0 1 0 0 1 2048 0 1
5 0 0 0 0 1 2048 0 1
5 0 0 4 0 1 2048 0 1
5 0 0 2 0 1 2048 0 1
5 0 0 1 0 1 2048 0 1
5 0 0 3 0 1 2048 0 1

For the target file (source file in our read test case), we can either use an existing data device file (if ASE is installed and database exists), or we could create a new data file containing zeros, of 1GB in size.

Using “dd” you can quickly create a 1GB file full of zeros:

dd if=/dev/zero of=/sybase/AS1/sapdata/AS1_data_001.dat bs=1024 count=1048576

 

1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.4592 s, 166 MB/s

We will be using only 1 job/thread in Fio to perform the I/O test.
Generally in ASE 16.0 SP03, the number of “disk tasks” is configured using “sp_configure” and visible in the configuration file.
The configured value is usually 1 in a default installation and vary rarely needs adjusting.

See here: https://help.sap.com/viewer/379424e5820941d0b14683dd3a992d5c/16.0.3.5/en-US/a778c8d8bc2b10149f11a28571f24818.html

Once we’re happy with the above settings, we just need to apply them to the Fio command line as follows:

fio –name=global –readonly –rw=read –direct=1 –bs=16k –size=1024m –iodepth=2048 –filename=/sybase/AS1/sapdata/AS1_data_001.dat –numjobs=1 –name=job1

You will see the output of Fio on the screen as it performs the I/O work.
In testing, the amount of clock time that Fio takes to perform the work is reflective of the performance of the I/O subsystem.
In extremely fast cases, you will need to look at the statistics that have been output to the screen.

The Microsoft documentation and examples show running very lengthy operations on Fio, to ensure that the disk caches are populated properly.
In my experience, I’ve never had the liberty to explain to the customer that they just need to do the same operation for 30 minutes, over and over and it will be much better. I prefer to run this test cold and see what I get as a possible worst-case.

job1: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=109MiB/s][r=6950 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=87654: Tue Jan 14 06:36:01 2020
read: IOPS=6524, BW=102MiB/s (107MB/s)(1024MiB/10044msec)
clat (usec): min=49, max=12223, avg=148.22, stdev=228.29
lat (usec): min=49, max=12223, avg=148.81, stdev=228.39
clat percentiles (usec):
| 1.00th=[ 61], 5.00th=[ 67], 10.00th=[ 70], 20.00th=[ 75],
| 30.00th=[ 81], 40.00th=[ 88], 50.00th=[ 96], 60.00th=[ 108],
| 70.00th=[ 125], 80.00th=[ 159], 90.00th=[ 322], 95.00th=[ 412],
| 99.00th=[ 644], 99.50th=[ 848], 99.90th=[ 3097], 99.95th=[ 5145],
| 99.99th=[ 7963]
bw ( KiB/s): min=64576, max=131712, per=99.98%, avg=104379.00, stdev=21363.19, samples=20
iops : min= 4036, max= 8232, avg=6523.65, stdev=1335.24, samples=20
lat (usec) : 50=0.01%, 100=54.55%, 250=32.72%, 500=10.48%, 750=1.59%
lat (usec) : 1000=0.31%
lat (msec) : 2=0.20%, 4=0.07%, 10=0.07%, 20=0.01%
cpu : usr=6.25%, sys=20.35%, ctx=65541, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=65536,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

 

Run status group 0 (all jobs):
READ: bw=102MiB/s (107MB/s), 102MiB/s-102MiB/s (107MB/s-107MB/s), io=1024MiB (1074MB), run=10044-10044msec

Disk stats (read/write):
dm-8: ios=64233/2, merge=0/0, ticks=7416/8, in_queue=7436, util=74.54%, aggrios=21845/0, aggrmerge=0/0, aggrticks=2580/2, aggrin_queue=2581, aggrutil=25.78%
sdg: ios=21844/0, merge=0/0, ticks=2616/0, in_queue=2616, util=25.78%
sdh: ios=21844/1, merge=0/0, ticks=2600/4, in_queue=2600, util=25.63%
sdi: ios=21848/1, merge=0/0, ticks=2524/4, in_queue=2528, util=24.92%

The lines of significance to you, will be:

– Line: IOPS.

Shows the min, max and average IOPS that were obtained during the execution. This should roughly correspond to the IOPS expected for the type of Azure disk on which your source data file is located. Remember that if you have striped file system with RAID under a logical volume manager, then you should expect to see more IOPS because you have more disks.

NOTE: The IOPS will not hit the maximum achievable, because our page/block size is too high for this. The Azure disk values are achievable only with random read, 8KB page sizes, multiple threads/jobs and a queue depth of 256 (https://docs.microsoft.com/en-gb/azure/virtual-machines/linux/disks-benchmarks).

– Lines: “lat (usec)” and “lat (msec)”.

These are the proportions of latency in micro and milliseconds respectively.
If you have high percentages in the millisecond ranges, then you may have an issue. You would not expect this for the type of disks you would want to be running an SAP ASE database on.

In my example above, I am using 3x P40 Premium Storage SSD disks.
You can tell it is a striped logical volume setup, because the very last 3 lines of output shows my 3 Linux disk device names (sdg, sdh and sdi) which sit under my volume group.

You can use the useful links here to determine what you should be seeing on your setup:

NOTE: If you are running SAP on the ASE database, then you will more than likely be using Premium Storage (it’s the only option supported by SAP) and it will be Azure Managed (not un-managed).

Let’s look at the same Fio output using a 128KB page size (like ASE would if it was using large I/O).
We use the same command line but just change the “-bs” parameter to 128KB:

fio –name=global –readonly –rw=read –direct=1 –bs=128k –size=1024m –iodepth=2048 –filename=/sybase/AS1/sapdata/AS1_data_001.dat –numjobs=1 –name=job1

 

job1: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=2048
fio-3.10
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=128MiB/s][r=1021 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=93539: Tue Jan 14 06:54:48 2020
read: IOPS=1025, BW=128MiB/s (134MB/s)(1024MiB/7987msec)
clat (usec): min=90, max=46843, avg=971.48, stdev=5784.85
lat (usec): min=91, max=46844, avg=972.04, stdev=5784.84
clat percentiles (usec):
| 1.00th=[ 101], 5.00th=[ 109], 10.00th=[ 113], 20.00th=[ 119],
| 30.00th=[ 124], 40.00th=[ 130], 50.00th=[ 137], 60.00th=[ 145],
| 70.00th=[ 157], 80.00th=[ 176], 90.00th=[ 210], 95.00th=[ 273],
| 99.00th=[42206], 99.50th=[42730], 99.90th=[43254], 99.95th=[43254],
| 99.99th=[46924]
bw ( KiB/s): min=130299, max=143616, per=100.00%, avg=131413.00, stdev=3376.53, samples=15
iops : min= 1017, max= 1122, avg=1026.60, stdev=26.40, samples=15
lat (usec) : 100=0.87%, 250=93.13%, 500=3.26%, 750=0.43%, 1000=0.13%
lat (msec) : 2=0.18%, 4=0.01%, 10=0.04%, 50=1.95%
cpu : usr=0.55%, sys=4.12%, ctx=8194, majf=0, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=2048

Run status group 0 (all jobs):
READ: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=1024MiB (1074MB), run=7987-7987msec

Disk stats (read/write):
dm-8: ios=8059/0, merge=0/0, ticks=7604/0, in_queue=7640, util=95.82%, aggrios=5461/0, aggrmerge=0/0, aggrticks=5114/0, aggrin_queue=5114, aggrutil=91.44%
sdg: ios=5461/0, merge=0/0, ticks=564/0, in_queue=564, util=6.96%
sdh: ios=5461/0, merge=0/0, ticks=7376/0, in_queue=7376, util=91.08%
sdi: ios=5462/0, merge=0/0, ticks=7404/0, in_queue=7404, util=91.44%

You can see that we actually got a lower IOPS value, but we returned all the data quicker and got a higher throughput.
This is due to the laws of how IOPS and throughput interact. A higher page/block size means we can potentially read more data in each I/O request.

Some of the performance randomness now becomes apparent, with the inconsistency of the “util” for each disk device. However, there is a note on the Fio webpage about how this metric (util) is not necessarily reliable.

You should note that, although we are doing a simulated direct I/O (unbuffered) operation at the Linux level, outside of Linux at the Azure level, there could be caching (data disk caching, which is actually cached on the underlying Azure physical host).

You can check your current setup directly in Azure or at the Linux level, by reading through my previous post on how to do this easily.

https://www.it-implementor.co.uk/2019/12/17/listing-azure-vm-datadisks-and-cache-settings-using-azure-portal-jmespath-bash/

Now for the final test.
Can we get the IOPS that we should be getting for our current setup and disks?

Following the Microsoft documentation to create the fioread.ini and execute (note it needs 120GB of disk space – 4 reader jobs x 30GB):

cat <<EOF > /tmp/fioread.ini
[global]
size=30g
direct=1
iodepth=256
ioengine=libaio
bs=8k

 

[reader1]
rw=randread
directory=/sybase/AS1/sapdata/

[reader2]
rw=randread
directory=/sybase/AS1/sapdata/

[reader3]
rw=randread
directory=/sybase/AS1/sapdata/

[reader4]
rw=randread
directory=/sybase/AS1/sapdata/
EOF

fio –runtime 30 /tmp/fioread.ini
reader1: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader2: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader3: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
reader4: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
fio-3.10
Starting 4 processes
reader1: Laying out IO file (1 file / 30720MiB)
reader2: Laying out IO file (1 file / 30720MiB)
reader3: Laying out IO file (1 file / 30720MiB)
reader4: Laying out IO file (1 file / 30720MiB)
Jobs: 4 (f=4): [r(4)][100.0%][r=128MiB/s][r=16.3k IOPS][eta 00m:00s]
reader1: (groupid=0, jobs=1): err= 0: pid=120284: Tue Jan 14 08:16:38 2020
read: IOPS=4250, BW=33.2MiB/s (34.8MB/s)(998MiB/30067msec)
slat (usec): min=3, max=7518, avg=10.06, stdev=43.39
clat (usec): min=180, max=156683, avg=60208.81, stdev=32909.11
lat (usec): min=196, max=156689, avg=60219.59, stdev=32908.61
clat percentiles (usec):
| 1.00th=[ 1549], 5.00th=[ 3294], 10.00th=[ 4883], 20.00th=[ 45351],
| 30.00th=[ 47973], 40.00th=[ 49021], 50.00th=[ 51643], 60.00th=[ 54789],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[143655], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=25168, max=46800, per=26.07%, avg=34003.88, stdev=4398.09, samples=60
iops : min= 3146, max= 5850, avg=4250.45, stdev=549.78, samples=60
lat (usec) : 250=0.01%, 500=0.02%, 750=0.12%, 1000=0.28%
lat (msec) : 2=1.35%, 4=5.69%, 10=5.72%, 20=1.15%, 50=30.21%
lat (msec) : 100=45.60%, 250=9.86%
cpu : usr=1.29%, sys=5.58%, ctx=6247, majf=0, minf=523
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=127794,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader2: (groupid=0, jobs=1): err= 0: pid=120285: Tue Jan 14 08:16:38 2020
read: IOPS=4183, BW=32.7MiB/s (34.3MB/s)(983MiB/30067msec)
slat (usec): min=3, max=8447, avg= 9.92, stdev=54.73
clat (usec): min=194, max=154937, avg=61163.27, stdev=32365.78
lat (usec): min=217, max=154945, avg=61173.85, stdev=32365.26
clat percentiles (usec):
| 1.00th=[ 1778], 5.00th=[ 3294], 10.00th=[ 5145], 20.00th=[ 46400],
| 30.00th=[ 47973], 40.00th=[ 49546], 50.00th=[ 52167], 60.00th=[ 55313],
| 70.00th=[ 94897], 80.00th=[ 98042], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[111674], 99.50th=[145753], 99.90th=[147850], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=26816, max=43104, per=25.67%, avg=33474.27, stdev=3881.96, samples=60
iops : min= 3352, max= 5388, avg=4184.27, stdev=485.26, samples=60
lat (usec) : 250=0.01%, 500=0.03%, 750=0.08%, 1000=0.15%
lat (msec) : 2=1.02%, 4=6.31%, 10=5.05%, 20=1.12%, 50=27.79%
lat (msec) : 100=49.09%, 250=9.37%
cpu : usr=1.14%, sys=5.53%, ctx=6362, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=125800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader3: (groupid=0, jobs=1): err= 0: pid=120286: Tue Jan 14 08:16:38 2020
read: IOPS=3919, BW=30.6MiB/s (32.1MB/s)(921MiB/30066msec)
slat (usec): min=3, max=12886, avg= 9.40, stdev=56.68
clat (usec): min=276, max=151726, avg=65256.88, stdev=31578.48
lat (usec): min=283, max=151733, avg=65266.86, stdev=31578.73
clat percentiles (usec):
| 1.00th=[ 1958], 5.00th=[ 3884], 10.00th=[ 10421], 20.00th=[ 47449],
| 30.00th=[ 49021], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 65274],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[101188],
| 99.00th=[139461], 99.50th=[145753], 99.90th=[149947], 99.95th=[149947],
| 99.99th=[149947]
bw ( KiB/s): min=21344, max=42960, per=24.04%, avg=31354.32, stdev=5530.77, samples=60
iops : min= 2668, max= 5370, avg=3919.27, stdev=691.34, samples=60
lat (usec) : 500=0.01%, 750=0.05%, 1000=0.12%
lat (msec) : 2=0.92%, 4=4.15%, 10=4.59%, 20=0.59%, 50=25.92%
lat (msec) : 100=53.48%, 250=10.18%
cpu : usr=0.96%, sys=5.22%, ctx=7986, majf=0, minf=521
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=117853,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
reader4: (groupid=0, jobs=1): err= 0: pid=120287: Tue Jan 14 08:16:38 2020
read: IOPS=3955, BW=30.9MiB/s (32.4MB/s)(928MiB/30020msec)
slat (usec): min=3, max=9635, avg= 9.57, stdev=52.03
clat (usec): min=163, max=151463, avg=64699.59, stdev=32233.21
lat (usec): min=176, max=151468, avg=64709.90, stdev=32232.66
clat percentiles (usec):
| 1.00th=[ 1729], 5.00th=[ 3720], 10.00th=[ 7832], 20.00th=[ 46924],
| 30.00th=[ 48497], 40.00th=[ 51119], 50.00th=[ 53740], 60.00th=[ 87557],
| 70.00th=[ 96994], 80.00th=[ 99091], 90.00th=[100140], 95.00th=[102237],
| 99.00th=[109577], 99.50th=[143655], 99.90th=[147850], 99.95th=[147850],
| 99.99th=[147850]
bw ( KiB/s): min=21488, max=46320, per=24.22%, avg=31592.63, stdev=4760.10, samples=60
iops : min= 2686, max= 5790, avg=3949.05, stdev=595.03, samples=60
lat (usec) : 250=0.02%, 500=0.07%, 750=0.07%, 1000=0.09%
lat (msec) : 2=1.31%, 4=4.04%, 10=5.13%, 20=1.28%, 50=24.76%
lat (msec) : 100=52.89%, 250=10.35%
cpu : usr=1.06%, sys=5.21%, ctx=8226, majf=0, minf=522
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=118743,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=127MiB/s (134MB/s), 30.6MiB/s-33.2MiB/s (32.1MB/s-34.8MB/s), io=3830MiB (4016MB), run=30020-30067msec

Disk stats (read/write):
dm-8: ios=490190/1, merge=0/0, ticks=30440168/64, in_queue=30570784, util=99.79%, aggrios=163396/0, aggrmerge=0/0, aggrticks=10170760/21, aggrin_queue=10172817, aggrutil=99.60%
sdg: ios=162989/1, merge=0/0, ticks=10134108/64, in_queue=10135484, util=99.59%
sdh: ios=163379/0, merge=0/0, ticks=10175316/0, in_queue=10177440, util=99.60%
sdi: ios=163822/0, merge=0/0, ticks=10202856/0, in_queue=10205528, util=99.59%

throughput = [IOPS] * [block size]
example: 3000 IOPS * 8 (8KB) = 24000KB/s (24MB/s)

From our output, we can see how the IOPS and blocksize affect the throughput calculation:
16,300 (IOPS total) * 8 (8KB) = 130400KB/s (127MB/s)

Simple answer, no, we don’t get what we expect for our P40 disks. Further investigation required. 🙁

Complications of using SAP ASE 16.0 in a HADR pair plus DR node Setup

Firstly, we need to clarify that HADR in SAP ASE speak, is the SAP ASE feature-set name for a HA or DR setup consisting of 2 SAP ASE database instances with a defined replication mode.

The pair can be either for HA or DR, but rarely both, due to the problem of latency.
The problem of latency is inverse to the solution of DR. The further away your second datacentre, the better, from a DR perspective.
Conversely, the worse your latency will become, meaning it can only seriously be used for DR, and not for HA.

If you can find a sweet spot between distance (better for DR) and latency (better for HA), then you would have a HADR setup. But this is unlikely.

As of ASE 16 SP03, an additional DR node is supported to be incorporated into a HADR pair of ASE database instances.
This produces a 3 node setup, with 2 nodes forming a pair (designed to be for HA), then a remote 3rd node (designed for DR).
The reason you may consider such a setup is to provide HA between the two nodes, maybe within an existing datacentre, then DR is provided by a remote 3rd node.
Since the two nodes within the HA pair would likely have low latency, they would have one replication mode (e.g synchronous replication) keeping the data better protected, with the replication mode to the third database being asynchronous, for higher latency scenarios, but less protected data.

In the scenarios and descriptions below, we are highlighting the possibility of running a two node HADR pair plus DR node in public cloud using a paired region:

Whilst an SAP application layer is also supported on the 3 node setup, there are complications that should be understood prior to implementation.
These complications will drive up both cost of implementation and also administrative overhead, so you should ensure that you fully understand how the setup will work before embarking on this solution.

Setup Process:

We will briefly describe the process for setting up the 3 nodes.
In this setup we will use the remote, co-located replication server setup, whereby the SAP SRS (replication server) is installed onto the same servers as the ASE database instances.

1, Install primary ASE database instance.

2, Install Data Movement (DM) component into the binary software installation of the primary ASE database instance.

3, Install secondary ASE database instance.

4, Install Data Movement (DM) component into the binary software installation of the secondary ASE database instance.

5, Run the setuphadr utility to configure the replication between primary and secondary.

This step involves the materialisation of the master and <SID> databases. The master database materialisation is automatic, the <SID> database is manual and requires dump & load.

Therefore, if you have a large <SID> database, then materialisation can take a while.

6, Install tertiary ASE database instance.

7, Install Data Movement (DM) component into the binary software installation of the tertiary ASE database instance.

8, Run the setuphadr utility to configure the tertiary ASE instance as a DR node.

This step involves the materialisation of the master and <SID> databases. The master database materialisation is automatic, the <SID> database is manual and requires dump & load.
Therefore, if you have a large <SID> database, then materialisation can take a while.

In the above, you can adjust the replication mode between primary and secondary, depending on your latency.
In Public cloud (Microsoft Azure), we found that the latency between paired regions was perfectly fine for asynchronous replication mode.
This also permitted the RPO to be met, so we actually went asynchronous all the way through.

POINT 1:

Based on the above, we have our first point to make.

When doing the dump & load for the tertiary database, both master and <SID> databases are taken from the primary database, which in most cases will be in a different datacentre, so materialisation of a large <SID> database will take longer than the secondary database materialisation timings.

You will need to develop a process for quickly getting the dump across the network to the tertiary database node (before the transaction log fills up on the primary).

Developing this fast materialisation process is crucial to the operation of the 3 node setup, since you will be doing this step a lot.

Operational Process:

We now have a 3 node setup, with replication happily pushing primary database transactions from primary (they go from the Replication Agent within the primary ASE instance), to the SRS on the secondary ASE node.
The SRS on the secondary instance then pushes the transactions into the secondary ASE instance databases (master & <SID>) and also to the SRS on the tertiary ASE database instance.

While this is working, you can see the usual SRS output by connecting into the SRS DR Agent on the secondary node and issuing the “sap_status path” command.
The usual monitoring functions exist for monitoring the 3 ASE nodes. You can use the DBACockpit (DB02) in a Netweaver ABAP stack, the ASE Fault Manager or manually at the command line.

One of the critical processes with an ASE HADR setup, is the flow of transactions from primary. You will be constantly engaged trying to prevent the backlog of transactions, which could cause primary to halt database commits until transaction log space is freed.
By correctly sizing the whole chain (primary, secondary and tertiary transaction logs) plus sizing the inbound queues of the SRS, you should have little work to do on a daily basis.

POINT 2:

It’s not the daily monitoring that will impact, but the exceptional change scenarios.
As an example, all 3 ASE database instances should have the same database device sizes, transaction log sizes and configuration settings.
Remembering to increase the device, database, transaction log, queue on each of them can be arduous and mistakes can be made.
Putting a solid change process around the database and SRS is very important to avoid primary database outages.
Since all 3 databases are independent, you can’t rely on auto-growby to grow the devices and databases in sync. So you may need to consider manually increasing the device and database sizes.

Failover Process:

During a failover, the team need to be trained in the scenario of recovery of the data to whichever database server node is active/available/healthy.
The exact scenario training could be difficult as it may involve public cloud, in which case it may not be possible to accurately simulate.
For the 3 node SAP ASE HADR + DR node, the failure scenario that you experience could make a big difference to you overall recovery time.

When we mention recovery time, we are not just talking about RPO/RTO for getting production systems working, we are talking about the time to actually recover the service to a protected state.
For example, recovery of the production database to a point where it is once again adequately protected from failure through database replication.

Loss of the primary database in a 3 node setup, means that the secondary node is the choice to become primary.
In this scenario, the secondary SRS is no longer used. Instead the SRS on the DR node would be configured to be the recipient of transactions from the Replication Agent of the secondary ASE.
If done quickly enough, then re-materialisation of the tertiary database can be avoided as both secondary and tertiary should have the same point-in-time.
In practice however, you will find more often than not, that you are just re-materialising the DR node from the secondary.
In some cases, you may decide not to both until the original primary is back in action. The effort is just too much.

Loss of the secondary database in a 3 node setup, means that the primary becomes instantly unprotected!
Both the secondary node and the tertiary node will drift out of sync.
In this scenario, you will more than likely find that you will be pushed for time and need to teardown the replication on the primary database to prevent the primary transaction lo filling.

Loss of the tertiary database in a 3 node setup, means that you no longer have DR protection for your data!
The transaction log on the primary will start to fill because secondary SRS will be unable to commit transactions in the queue to the tertiary database.
In this scenario, you will more than likely find that you will be pushed for time and need to re-materialise the DR database from the primary.
Time will be of the essence, because you will need transaction log space available in the primary database and queue space in the SRS, for the time to perform the re-materialsation.

POINT 3:

Sizing of the production transaction log size is crucial.
The same size is needed on the secondary and tertiary databases (to allow materialisation (dump & load) to work.
The SRS queue size also needs to be a hefty size (bigger than the transaction log size) to accommodate the transactions from the transaction log.
The primary transaction log size is no longer now just about daily database transactional throughput, but is also intertwined with the requirement for the time it takes to dmp & load the DB across the network to the DR node (slowest link in the chain).
Plus, on top of the above sizings, you should accommodate some additional buffer space for added delays, troubleshooting, decision making.

You should understand your dump & load timings intricately to be able to understand your actual time to return production to a protected state. This will help you decide which is the best route to that state.

Maintenance Process:

Patching a two node ASE HADR setup, is fairly simple and doesn’t take too much effort in planning.
Patching a three node setup (HADR + DR node), involves a little more thought due to the complex way you are recommended to patch.
The basics of the process are that you should be patching the inactive portions of the HADR + DR setup.
Therefore, you end up partially patching the ASE binary stack, leaving the currently active primary SRS (on the secondary node) until last.
As well patching the ASE binaries, you will also have to patch the SAP Hostagent on each of the three nodes. Especially since the Hostagent is used to perform the ASE patching process.
Since there is also a SAP instance agent present on each database node, you will also need to patch the SAP Kernel (SAPEXE part only) on each database node.

POINT 4:

Database patching & maintenance effort increases with each node added. Since the secondary and DR nodes have a shared nothing architecture, you patch specific items more than once across the three nodes.

Summary:

The complexity of managing a two node SAP ASE HADR pair plus DR node should not be underestimated.
You can gain the ability to have HA and DR, especially in a public cloud scenario, but you will pay a heavy price in overhead from maintenance and potentially lose time during a real DR due to the complexity.
It really does depend on how rigid you can be at defining your failover processes and most importantly, testing them.

Carefully consider the cost of HA and DR, versus just DR (using a two node HADR setup with the same asynchronous replication mode).
Do you really need HA? Is your latency small enough to permit a small amount of time running across regions (in public cloud)?

Locking HANA Database Users During Maintenance

Running SAP S/4HANA means there are now more direct HANA DB accesses through a variety of Analytics tools, development tools and external reporting systems.
This can present a problem when it comes to patching and maintenance of the system, since you would not want to officially release the HANA database back to end-users until you had performed your preliminary checks to conclude th patching was successful at all levels of the application stack.

Most BASIS administrators are familiar with the usual “tp locksys” command to be able to lock everyone except SAP* and DDIC out of the SAP ABAP application layer.
But what can be done to stop connections direct into the HANA database?
SAP note 1986645 “Allow only administration users to work on HANA database”, provides an attached SQL file which delivers a few new stored procedures and some new database tables.

The stored procedures include:
– 1 for “locking” out non-system users.
– 1 for “unlocking” non-system users (the exact reverse operation against the exact same set of users that was initially locked).
– 1 for adding users to an exception list table.
– 1 for removing users from an exception list table.

The tables are used to store an exception list of users to be excluded from the locking operation.
You will need to add the “SAPABAP1” S/4HANA schema, XSA DB user and cockpit user to the exception list!
Also add any backup operator user accounts needed to perform backups or if you need to leave enabled a specific set of test user accounts.
There is also a table used for storing the list of users on which the last “locking” operation was performed.

As well as “locking” (the HANA DB accounts are actually disabled) the user accounts, any active sessions for those user accounts are kicked off the database instantly.
This feature is useful in other ways (for example, emergency access to a severely overloaded/failing HANA database system).
Of course if you are running something other than S/4HANA on HANA (maybe Solman), then direct database access may not be a requirement, therefore this set of SQL stored procedures are not so relevant.

How do you implement the SQL?
– Download the SQL from the SAP note and save to a file.
– Either execute the SQL using in the TenantDB as the SYSTEM user in HANA Studio, HANA Cockpit or use hdbsql in batch mode (hdbsql doesn’t like the code to be pasted at the prompt).

How do you add users to the exception list:
– As SYSTEM in the TenantDB, simply execute the store procedures:

CALL SESSION_ADMINS_ADD_TO_EXCEPTED_USER_LIST (‘SAPABAP1’);

How do you utilise the feature?
– As SYSTEM in the TenantDB, simply execute the store procedures:

CALL START_SESSION_ADMINS_ONLY;

When you’ve finished and wish to “unlock” the previously locked accounts:

CALL STOP_SESSION_ADMINS_ONLY;

HANA 2.0 – Calc View – SAP DBTech JDBC 2048 Column Store Error

Scenario: During DB access of a HANA 2.0 SPS3 Calculation View from S/4HANA ABAP stack (via ABAP) or even directly in HANA Studio (via “Raw Data”), an error is displayed in a short dump or on screen along the lines of “SAP DBTech JDBC (2048: column store error: search table error: (nnnn) Instantiation of calculation model failed: exception 30600. An Internal error occurred”.

After investigation you observe the following error inside the indexserver trace log: “Could not get template scenario <SID>::_SYS_BIC:_SYS_SS_CE_<nnnn>_vers2_lang6_type1_CS_2_2_TMP (t -1) of calculation index <SID>::_SYS_BIC:<PACKAGE>/<CALCVIEW> (t -1). reason: CalculationEngine read from metadata failed.; Condition ‘aScenarioHandle.is_valid()’ failed.”.

The error clearly references the name of your Calculation View (calculation index) but it also references another object with a name like “_SYS_SS_CE_*”.

SAP note 1646743 explains that objects with a naming convention of “_SYS_SS_CE_<guid>_TMPTBL” are temporary tables created during compilation of procedure objects. Whilst our objects naming convention is not an exact match, the assumption is that the object is temporary in nature and created during the compilation of our Calculation View.

To backup the above theory, SAP note 2717365 matches the initial error message in some respects and shows the method to recompile the temporary object.
The note explains that to reproduce its described situation you must “Create a script calculation view which is created based on a procedure.”.

With this in mind, after checking our erroring Calculation View, it is clearly possible to see that ours utilises a “Script” as part of its design.

Therefore, we can assume that the temporary object with naming convention “_SYS_SS_CE_<nnnn>_vers2_lang6_type1_CS_2_2_TMP” is the temporary representation of the script from within our Calculation View.

Following the SAP note, we can recompile the object via its source Calculation View as follows using HANA Studio SQL execution (or hdbsql command line):

(NOTE: in our case the object is owned by user SAPABAP1, so we login/connect as that user in order to execute)

ALTER PROCEDURE “_SYS_BIC”.”<PACKAGE>/<CALCVIEW>/proc” RECOMPILE;

The execution succeeds.
However on retrying to access the data within the view, we still get an error.
What happened, well looking again at our Calculation View, it appears that it references another Calculation View!
So we must recompile all referencing Calculation Views also.

To cut a long story short, it turned out that we had over 4 levels Calculation Views before I decided to just recompile all procedures (if existing) of all known Calculation Views. Some of the views were even in different schemas.

How do we obtain a list of all Calculation Views that use a script and would have temporary procedures?

We can use this SQL string to create the required list of “type 6” objects:

SELECT ‘ALTER PROCEDURE “‘||schema||'”.”‘||name||'” RECOMPILE;’ FROM sys.p_objects_ WHERE type=6 and name like ‘%/proc’

How did I find this? All (or most) HANA objects are represented in the SYS.P_OBJECTS table.

Even temporary SQL objects need to be accounted for in the general administrative operations of the database, they need to be listed somewhere and have a corresponding object ID.
By executing the SQL as the SAPABAP1 user, we get output in a similar fashion as to that shown below, with the first line being a column header:

‘ALTER PROCEDURE “‘||SCHEMA||'”.”‘||NAME||'” RECOMPILE;’

ALTER PROCEDURE “_SYS_BIC”.”sap.erp.sfin.co.pl/FCO_C_ACCOUNT_ASSIGNMENT/proc” RECOMPILE;

ALTER PROCEDURE “_SYS_BIC”.”sap.erp.sfin.rtc/RTC_C_FISCMAPA/proc” RECOMPILE;

We can then simply execute the output SQL lines for each object to be recompiled.
On attempting access to the Calculation View, it now correctly returns data (or no data), and does not show an error message.

The next question is, why did we get this problem?

Looking back at SAP note 2717365 it says “This error happens because the temporary created objects were not cleared up properly when this happened with an error.”.
Remember that this is not an exact match for our error, but I think the explanation is good enough.

An error occurred during the creation of the temporary procedures that underpin our scripted Calculation Views.

We don’t know what the error or issue was, but subsequently recompiling those Calculation View temporary procedures fixes the issue.