Databases Archives » Page 4 of 27 » Musings of an IT Implementor

Best Disk Topology for SAP ASE Databases on Azure

Maybe you are considering migration of on-premise SAP ASE databases to Microsoft Azure, or you may be considering migrating from your existing database vendor to SAP ASE on Azure.
Either way, you will benefit from understanding a good, practical disk topology for SAP ASE on Azure.

In this post, I show how you can optimise use of the SAP ASE, Linux and Azure technical layers to provide a balanced approach to disk use, considering both performance and disk (ASE device) management.

The Different Layers

In an ASE on Linux on Azure (IaaS) setup, you have the following layers:

Azure Storage Services
Azure Data Disk Cache Settings
Linux Physical Disks
Linux Logical Volumes
Linux File Systems
ASE Database Data Devices
ASE Instance

Each layer has different options around tuning and setup, which I will highlight below.

Azure Storage Services

Starting at the bottom of the diagram we need to consider the Azure Disk Storage that we wish to use.
There are 2 design considerations here:

size of disk space required.
performance of disk device.

For performance, you are more than likely tied by the SAP requirements for running SAP on Azure.
Currently these require a minimum of Premium SSD storage, since it provides a guaranteed SLA. However, as of June 2020, Standard SSD was also given an SLA by Microsoft, potentially paving the way for cheaper disk (when certified by SAP) provided that it meets your SLA expectations.

Generally, the size of disk determines the performance (IOPS and MBps), but this can also be influenced by the quantity of data disk devices.
For example, by using 2 data disks striped together you can double the available IOPS. The IOPS are an important factor for databases, especially on high throughput database systems.

When considering multiple data disks, you also need to remember that each VM has limitations. There is a VM level IOPS limit, a VM level throughput limit (megabytes per second) plus a limit to the number of data disks that can be attached. These limit values are different for different Azure VM types.

Also remember that in Linux, each disk device has its own set of queues and buffers. Making use of multiple Linux disk devices (which translates directly to the number of Azure data disks) usually means better performance.

Essentials:

Choose minimum of Premium SSD (until Standard SSD is supported by SAP).
Spread database space requirements over multiple data disks.
Be aware of the VM level limits.

Azure Data Disk Cache Settings

Correct configuration of the Azure data disk cache settings on the Azure VM can help with performance and is an easy step to complete.
I have already documented the best practice Azure Disk Cache settings for ASE on Azure in a previous post.

Essentials:

Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.

Use LVM For Managing Disks

Always use a logical volume manager, instead of formatting the Linux physical disk devices directly.
This allows the most flexibility for growing, shrinking and striping the disks for size and performance.

You should stripe the data logical volumes with a minimum of 2 physical disks and a maximum stripe size of 128KB (test it!). This fits within the window of testing that Microsoft have performed in order to achieve the designated IOPS for the underlying disk. It’s also the maximum size that ASE will read at. Depending on your DB read/write profile, you may choose a smaller stripe size such as 64KB, but it depends on the amount of Large I/O and pre-fetch. When reading the Microsoft documents, consider ASE to be the same as MS SQL Server (they are are from the same code lineage).

Stripe the transaction log logical volume(s) with a smaller stripe size, maybe start at 32KB and go lower but test it (remember HANA is 2KB stripe size for log volumes, but HANA uses Azure WriteAccelerator).

Essentials:

Use LVM to create volume groups and logical volumes.
Stipe the data logical volumes with (max) 128KB stripe size & test it.

Use XFS File System

You can essentially choose to use your preferred file system format; there are no restrictions – see note 405827.
However, if you already run or are planning to run HANA databases in your landscape, then choosing XFS for ASE will make your landscape architecture simpler, because HANA is recommended to run on an XFS file system (when on local disk) on Linux; again see SAP note 405827.

Where possible you will need to explicitly disable any Linux file system write barrier caching, because Azure will be handling the caching for you.
In SUSE Linux this is the “nobarrier” setting on the mount options of the XFS partition and for EXT4 partitions it is option “barrier=0”.

Essentials:

Choose disk file system wisely.
Disable write barriers.

Correctly Partition ASE

For SAP ASE, you should segregate the disk partitions of the database to avoid certain system specific databases or logging areas, from filling other disk locations and causing a general database system crash.

If you are using database replication (maybe SAP Replication Server a.k.a HADR for ASE), then you will have additional replication queue disk requirements, which should also be segregated.

A simple but flexible example layout is:

Volume Group	Logical Volume	Mount Point	Description
vg_ase	lv_ase<SID>	/sybase/<SID>	For ASE binaries
vg_sapdata	lv_sapdata<SID>_1	./sapdata_1	One for each ASE device for SAP SID database.
vg_saplog	lv_saplog<SID>_1	./saplog_1	One for each log device for SAP SID database.
vg_asedata	lv_asesec<SID>	./sybsecurity	ASE security database.
	lv_asesyst<SID>	./sybsystem	ASE system databases (master, sybmgmtdb).
	lv_saptemp<SID>	./saptemp	The SAP SID temp database.
	lv_asetemp<SID>	./sybtemp	The ASE temp database.
	lv_asediag<SID>	./sapdiag	The ASE saptools database.
vg_asehadr	lv_repdata<SID>	./repdata	The HADR queue location.
vg_backups	lv_backups<SID>	./backups	Disk backup location.

The above will allow each disk partition usage type to be separately expanded, but more importantly, it allows specific Azure data disk cache settings to be applied to the right locations.
For instance, you can use read-write caching on the vg_ase volume group disks, because that location is only for storing binaries, text logs and config files for the ASE instance. The vg_asedata contains all the small ASE system databases, which will not use too much space, but could still benefit from read caching on the data disks.

TIP: Depending on the size of your database, you may decide to also separate the saptemp database into its own volume group. If you use HADR you may benefit from doing this.

You may not need the backups disk area if you are using a backup utility, but you may benefit from a scratch area of disk for system copies or emergency dumps.

You should choose a good naming standard for volume groups and logical volumes, because this will help you during the check phase, where you can script the checking of disk partitioning and cache settings.

Essentials:

Segregate disk partitions correctly.
Use a good naming standard for volume groups and LVs.
Remember the underlying cache settings on those affected disks.

Add Whole New ASE Devices

Follow the usual SAP ASE database practices of adding additional ASE data devices on additional file system partitions sapdata_2, sapdata_3 etc.
Do not be tempted to constantly (or automatically) expand the ASE device on sapdata_1 by adding new disks, you will find this difficult to maintain because striped logical volumes need at least 2 disks in the stripe set.
It will get complicated and is not easy to shrink back from this.

When you add new disks to an existing volume group and then expand an existing lv_sapdata<SID>_n logical volume, it is not as clean as adding a whole new logical volume (e.g. lv_sapdata<SID>_n+1) and then adding a whole new ASE data device.
The old problem of shrinking data devices is more easily solved by being able to drop a whole ASE device, instead of trying to shrink one.

NOTE: The Microsoft notes suggest enabling automatic DB expansion, but on Azure I don’t think it makes sense from a DB administration perspective.
Yes, by adding a new ASE device, as data ages you may end up with “hot” devices, but you can always move specific devices around and add more underlying disks and re-stripe etc. Keep the layout flexible.

Essentials:

Add new disks to new logical volumes (sapdata_n+1).
Add big whole new ASE devices to the new LVs.

Summary:

We’ve been through each of the layers in detail and now we can summarise as follows:

Choose a minimum of Premium SSD.
Spread database space requirements over multiple data disks.
Correctly set Azure VM disk cache settings on Azure data disks at the point of creation.
Use LVM to create volume groups and logical volumes.
Stipe the logical volumes with (max) 128KB stripe size & test it.
Choose disk file system wisely.
Disable write barriers.
Segregate disk partitions correctly.
Use a good naming standard for volume groups (and LVs).
Remember the underlying cache settings on those affected disks.
Add new disks to new logical volumes (sapdata_n).
Add big whole new ASE devices to the new LVs.

Useful Links:

SAP ASE on Azure:
https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_sapase
Configuring LVM:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/configure-lvm
Azure Premium Storage Performance:
https://docs.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance
SAP ASE Performance Tuning Series Basics:
https://help.sap.com/doc/a6147023bc2b10149066f7b7a35888a6/16.0.3.6/en-US/SAP_ASE_Performance_and_Tuning_Series_Basics_en.pdf

Cookies, SAP Analytics Cloud and CORS in Netweaver & HANA

Back in 2019 (now designated as 2019AC – Anno-Covid19), I wrote a post explaining in simple terms what CORS is and how it can affect a SAP landscape.
In that post I showed a simple “on-premise” setup using Fiori, a back-end system and how a Web Dispatcher can help alleviate CORS issues without needing too much complexity.
This post is about a recent CORS related issue that impacts access to back-end SAP data repositories.

Back To The Future

If we hit the “Fast-Forward” button to 2020MC (Mid-Covid19), CORS is now an extremely important technical setup to enable Web Browser based user interfaces to be served from Internet based SAP SaaS services (like SAP Analytics Cloud) and communicate with back-end on-premise/private data sources such as SAP BW systems or SAP HANA databases.

We see that CORS is going to become ever more important going forward, since Web Browser based user interfaces will become more abundant (due to the increase of SaaS products) for the types of back-end data access. The old world of installing a software application on-premise takes too much time and effort to keep up with changing technology.
Using SaaS applications as user interfaces to on-premise data allows a far more agile delivery of user functionality.

The next generation of Web Interfaces will be capable of processing ever larger data sets, with richer capabilities and more in-built intelligence. We’re talking about the Web Browser being a central hub of cross-connected Web Based services.
Imagine, one “web application” that needs a connection to a SaaS product that provides the analytical interface and version management, a connection to one or more back-end data repositories, a connection to a separate SaaS product for AI data analysis and pattern matching (deep insights), a connection to a separate SaaS product for content management (publishing), a connection to a separate SaaS product for marketing and customer engagement.

All of that, from one central web “origin” will mean CORS will become critical to prevent unwanted connections and data leaks. The Web Browser is already the target of many cyber security exploits, therefore staying secure is extremely important, but security is always at the expense of functionality.

IETF Is On It

The Internet Engineering Task Force already have this in hand. That’s how we have CORS in the first place (tools.ietf.org/html/rfc6454).
The Web Origin Concept is constantly evolving to provide features for useability and also security. Way back in 2016 an update to RFC 6265 was proposed, to enhance the HTTP state management mechanism, which is commonly known to you and I as “cookies”.

This amendment (the RFC details are here: tools.ietf.org/html/draft-ietf-httpbis-cookie-same-site-00) was the SameSite attribute that can be set for cookies.
Even in this RFC, you can see that it actually attributes the idea of “samedomain-cookies” back to Mozilla, in 2011. So this is not really a “new” security feature, it’s a long time coming!

The Deal With SAC

The “problem” that has brought me back around to CORS, is recent experience with a CORS issue and SAP Analytics Cloud (SAC).
The issue led me to a blog post by Dong Pan of SAP Canada in Feb 2020 and a recent blog post by Ian Henry, also of SAP in Aug 2020.

Dong Pan wrote quite a long technical blog post on how to fix or work-around the full introduction of the SameSite cookie attribute in Google Chrome version 80 when using SAP Analytics Cloud (SAC).

Ian Henry’s post is also based on the same set of solutions that Dong Pan wrote about, but his issue was accessing a backend HANA XS Engine via Web Dispatcher.

The problem in both cases is that SAP Analytics Cloud (SAC) uses the Web Browser as a middleman to create a “Live Connection” back to an “on-premise” data repository (such as SAP BW or SAP S/4HANA), but the back-end SAP Netweaver/SAP ABAP Platform stack/HANA XS engine, that hosts the “on-premise” data repository does not apply the “SameSite” attribute to cookies that it creates.

You can read Dong Pan’s blog post here: www.sapanalytics.cloud/direct-live-connections-in-sap-analytics-cloud-and-samesite-cookies/
You can read Ian Henry’s blog post here: https://blogs.sap.com/2020/08/26/how-to-fix-google-chrome-samesite-cookie-issue-with-sac-and-hana/

By not applying the “SameSite” attribute to the cookie, Google Chrome browsers of version 80+ will not allow SAC to establish a full session to the back-end system.
You will see an HTTP 400 “session expired” error when viewing the HTTP browser traffic, because SAC tries to establish the connection to the back-end, but no back-end system cookies are allowed to be visible to SAC. Therefore SAC thinks you have no session to the back-end.

How to See the Problem

You will need to be proficient at tracing HTTP requests to be able to capture the problem, but it looks like the following in the HTTP response from the back-end system:

You will see (in Google Chrome) two yellow warning triangles on the “set-cookie” headers in the response from the back-end during the call to “GetServerInfo” to establish the actual connection.
The call is the GET for URL “/sap/bw/ina/GetServerInfo?sap-client=xxx&sap-language=EN&sap-sessionviaurl=X“, with the sap-sessionviaurl in the query-string being the key part.
The text when you hover over the yellow triangle is: “This Set-Cookie didn’t specify a “SameSite” attribute and was defaulted to “SameSite=Lax,” and was blocked because it came from a cross-site response which was not the response to a top-level navigation. The Set-Cookie had to have been set with “SameSite=None” to enable cross-site usage.“.

The Fix(es)

SAP Netweaver (or SAP ABAP Platform) needs some code fixes to add the required cookie attribute “SameSite”.

A workaround (it is a workaround) is possible by using the rewrite module capability of the Internet Communication Management (ICM) or using a rewrite rule in a Web Dispatcher, to re-write the responses and include a generic “SameSite” attribute on each cookie.
This is a workaround for a reason, because using the rewrite method causes unnecessary extra work in the ICM (or Web Dispatcher) for every request (matched or not matched) by the rewrite engine.

It’s always better (more secure, more efficient) to apply the code fix to Netweaver (or ABAP Platform) so the “SameSite” attribute is added at the point of the cookie creation.
For HANA XS, it will need a patch to be applied (if it ever gets fixed in the XS since it is soon deprecated).
With the workaround, we are forcing a setting onto cookies outside of the creation process of those cookies.

Don’t get me wrong, I’m not saying that the workaround should not be used. In some cases it will be the only way to fix this problem in some older SAP systems. I’m just pointing out that there are consequences and it’s not ideal.

Dong Pan and Ian Henry have done a good job of providing options for fixing this in a way that should work for 99% of cases.

Is There a Pretty Picture?

This is something I always find useful when I try and work something through in my mind.
I’ve adjusted my original CORS diagram to include an overview of how I think this “SameSite” attribute issue can be imagined.
Hopefully it will help.

We see the following architecture setup with SAC and it’s domain “sapanalytics.cloud”, issuing CORS requests to back-end system BE2, which sits in domain “corp.net”:

Using the above picture for reference, we can now show where the “SameSite” issue occurs in the processing of the “Resource Response” when it comes back to the browser from the BE2 back-end system:

The blocking, by the Chrome Web browser, of the cookies set by the back-end system in domain “corp.net”, means that from the point of view of SAC, no session was established.
There are a couple more “Request”, “Response” exchanges, before the usual HTTP Authorization header is sent from SAC, but at that point it’s really too late as the returned SAP SSO cookie will also be blocked.

At this point you could see a number of different error messages in SAC, but in the Chrome debugging you will see no HTTP errors because the actual HTTP request/response mechanism is working and HTTP content is being returned. It’s just that SAC will know it does not have a session established, because it will not be finding the usual cookies that it would expect from a successfully established session.

Hopefully I’ve helped explain what was already a highly technical topic, in a more visual way and helped convey the problem and the solution.

Useful Links:

The Web Origin Concept – RFC6454 tools.ietf.org/html/rfc6454
RFC 6265 amendment draft: tools.ietf.org/html/draft-ietf-httpbis-rfc6265bis-03
CORS In a SAP Context – www.it-implementor.co.uk/2019/06/cors-in-sap-netweaver-landscape.html
SAP Analytics Cloud SameSite Cookie Issue – www.sapanalytics.cloud/direct-live-connections-in-sap-analytics-cloud-and-samesite-cookies/
SameSite Cookies Issue with SAC and HANA – blogs.sap.com/2020/08/26/how-to-fix-google-chrome-samesite-cookie-issue-with-sac-and-hana

SAP ASE Error – Process No Longer Found After Startup

This post is about a strange issue I was hitting during the configuration of SAP LaMa 3.0 to start/stop a SAP ABAP 7.52 system (with Kernel 7.53) that was running with a SAP ASE 16.0 database.

During the LaMa start task, the task would fail with an error message: “ASE process no longer found after startup. (fault code: 127)“.

When I logged directly onto the SAP server Linux host, I could see that the database had indeed started up, eventually.
So what was causing the failure?

The Investigation

At first I thought this was related to the Kernel, but having checked the versions of the Kernel components, I found that they were the same as another SAP system that was starting up perfectly fine using the exact same LaMa system.

The next check I did was to turn on tracing on the hostagent itself. This is a simple task of putting the trace value to “3” in the host_profile of the hostagent and restarting it:

service/trace = 3

The trace output is shown in a number of different trace files in the work directory of the hostagent but the trace file we were interested in is called dev_sapdbctrl.

The developer trace file for the sapdbctrl binary executable is important, because the sapdbctrl binary is executed by SAP hostagent (saphostexec) to perform the database start. If you observe the contents of the sapdbctrl trace output, you will see that it loads the Sybase specific shared library which contains the required code to start/stop the ASE database.

The same sapdbctrl also contains the ability to load the required libraries for other database systems.

As a side note, it is still not known to me, how the Sybase shared library comes to exist in the hostagent executable directory. When SAP ASE is patched, this library must also be patched, otherwise how does the hostagent stay in-step with the ASE database that it needs to talk with?

Once tracing was turned on, I shut the SAP ASE instance down again and then used SAP LaMa to initiate the SAP system start once again.
Sure enough, the LaMa start task failed again.

Looking in the trace file dev_sapdbctrl I could see the same error message that I was seeing in SAP LaMa:

Error: Command execution failed. : ASE process no longer found after startup. 
(fault code: 127) Operation ID: 000D3A3862631EEAAEDDA232BE624001
----- Log messages ---- 
Info: saphostcontrol: Executing StartDatabase 
Error: sapdbctrl: ASE process no longer found after startup. 
Error: saphostcontrol: StartDatabase failed (sapdbctrl exit code = 1)

This was great. It confirmed that SAP LaMa was just seeing the symptom of some other issue, since LaMa just calls the hostagent to do the start.

Now I knew the hostagent was seeing the error, I tried using the hostagent directly to perform the start, using the following:

/usr/sap/hostctrl/exe/saphostctrl -debug -function StartDatabase -dbname <SID> -dbtype syb -dbhost <the-ASE-host>

NOTE: The hostagent “-debug” command line option puts out the same information without the need for the hostagent tracing to be turned on in the host_profile.

Once again, the start process failed and the same error message was present in the dev_sapdbctrl trace file.

This was now really strange.
I decided that the next course of action was to start the process of raising the issue with SAP via an incident.
If you suspect that something could take a while to fix, then it’s always best to raise it with SAP early and continue to look at the issue in parallel.

Continuing the Diagnosis

While the SAP incident was in progress, I continued the process of trying to self-diagnose the issue further.
I tried a couple more things such as:

Starting and stopping SAP ASE manually using stopdb/startdb commands.
Restarting the whole server (this step has a place in every troubleshooting process, eventually).
Checking the server patch level.
Checking the environment of the Linux user, the shell, the profile files, the O/S limits applied.
Checking what happens if McAfee anti-virus was disabled (I’ve seen the ePO blocking processes before).

Eventually exhaustion set in and I decided to give the SAP support processor time to get back to me with some hints.

Some Sleep

I spend a lot of time solving SAP problems. A lot of time.
Something doesn’t work according to the docs, something did work but has stopped working, something has never worked well…
It builds up in your mind and you carry this stuff around in your head.
Subconsciously you think about these problems.

Then, at about 3am when you can’t get back to sleep, you have a revelation.
The hostagent is forking the process to start the database as the syb<sid> Linux user (it uses “su”), from the root user (hostagent runs as the root user).

Linux Domain Users

The revelation I had regarding the forking of the user, was just the trigger I needed to make me consider the way the Linux authentication was setup on this specific server with the problem ASE startup.

I remembered at the beginning of the project that I had hit an issue with the SSSD Linux daemon, which is responsible for interfacing between Linux and Microsoft Active Directory. At that time, the issue was causing the hostagent to hang when operations were executed which required a switch to another Linux user.
This previous issue was actually a hostagent issue that was fixed in a later hostagent patch. During that particular investigation, I requested that the Linux team re-configure the SSSD daemon to be more efficient with its Active Directory traversals, when it was looking to see if the desired user account was local to Linux or if it was a domain account.

With this previous issue in mind, I checked the SSSD configuration on the problem server. This is a simple conf file in /etc/sssd.

The Solution

After all the troubleshooting, the raising of the incident, the sleeping, I had finally got to the solution.

After checking the SSSD daemon configuration file /etc/sssd/sssd.conf, I could clearly see that there was one entry missing compared to the other servers that didn’t experience the SAP ASE start error.

The parameter: “subdomain_enumerate = none” was missing.
Looking at the manual page for SSSD it would seem that without this parameter there is additional overhead during any Active Directory traversal.

I set the parameter accordingly in the /etc/sssd/sssd.conf file and restarted the SSSD daemon using:

service sssd restart

Then I retried the start of the database using the hostagent command shown previously.
It worked!
I then retried with SAP LaMa and that also now started ASE without error messages.

Root Cause

What it seems was happening, was some sort of internal pre-set timeout in the sapdbctrl binary, when hit, the sapdbctrl just abandons and throws the error that I was seeing. This leaves the ASE database to continue and start (the process was initiated), but in the hostagent it looked like it had failed to start.
By adding the “subdomain_enumerate = none” parameter, any “delay”, caused by inappropriate call to Active Directory was massively reduced and subsequent start activities were successful.

Analysing & Reducing HANA Backup Catalog Records

In honour of DBA Appreciation Day today 3rd July, I’ve written a small piece on a menial but crucial task that HANA database administrators may wish to check. It’s very easy to overlook but the impact can be quite amazing.

HANA Transaction Logging

In “normal” log mode (for recoverability), the HANA database, like Oracle, has an automatic transaction log backup process, which is responsible for backing up transaction log segments so that the HANA log volume disk space can be re-used by new transactions.
No free disk space in the HANA log volume, means the database will hang, until free space becomes available.

It is strongly recommended by SAP, to have your HANA database in log mode “normal”, since this offers the point-in-time recovery capability through the use of the transaction log backups.

By default a transaction log backup will be triggered automatically by HANA every time a log segment becomes full or if the timeout for an individual service is hit, whichever of those is sooner.
This is known as “immediate” interval mode.

I’m not going to go into the differences of the various interval options and the pros and cons of each since this is highly scenario specific. A lot of companies have small HANA databases and are quite happy with the default options. Some companies have high throughput, super low latency requirements, and would be tuning the log backup process for maximum throughput, while other companies want minimal data-loss and adjust the parameters to ensure that transactions are backed up off the machine as soon as possible.

The SITREP

In this specific situation that I encountered, I have a small HANA database of around ~200GB in memory, serving a SAP Solution Manager 7.2 system (so it has 2x tenant databases plus the SystemDB).

The settings are such that all databases run in log_mode “normal” with consolidated log backups enabled in “immediate” mode and a max_log_backup_size of 16GB (the default, but specified).

All backups are written to a specific disk area, before being pushed off the VM to an Azure Storage Account.

The Issue

I noticed that the local disk area was becoming quite full where the HANA database backups are written. Out of context you might have said it’s normal for an increase of activity in the system, but I know that this system is not doing anything at all (it’s a test system for testing Solution Manager patches and nobody was using it).

What Was Causing the Disk Usage?

Looking at the disk backup file system, I could easily see at the O/S level, that the HANA database log backups were the reason for the extra space usage.
Narrowing that down even further, I could be specific enough to see that the SYSTEMDB was to blame.

The SYSTEMDB in a very lightly used HANA database should not be transacting enough to have a day-to-day noticeable increase in log backup disk usage.
This was no ordinary increase!
I was looking at a total HANA database size on disk of ~120GB (SYSTEMDB plus 2x TenantDBs), and yet I was seeing ~200GB of transaction log backups per day from just the SYSTEMDB.

Drilling down further into the log backup directory for the SYSTEMDB, I could see the name of the log backup files and their sizes.
I was looking at log backup files of 2.8GB in size every ~10 to ~15 minutes.
The files that were biggest were….

… log_backup_0_0_0_0.<unix epoch time>
That’s right, the backup catalog backups!

Whenever HANA writes a backup, whether it is a complete data backup, or a transaction log backup, it also writes a backup of the backup catalog.
This is extremely useful if you have to restore a system and need to know about the backups that have taken place.
By default, the backup catalog backups are accumulated, which means that HANA doesn’t need to write out multiple backups of the backup catalog for each log backup (remember, we have 2x tenantDBs).

Why Were Catalog Backup Files So Big?

The catalog backups include the entire backup catalog.
This means every prior backup is in the backup file, so by default the backup catalog backup file will increase in size at each backup, unless you do some housekeeping of the backup catalog records.

My task was to write some SQL to check the backup catalog to see how many backup catalog records existed, for what type of backups, in which database and how old they were.

I came up with the following SQL:

--- Breakdown of age of backup records in months, by type of record.
SELECT smbc.DATABASE_NAME,
smbc.ENTRY_TYPE_NAME,
MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE) as AGE_MONTHS,
COUNT(MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE)) RECORDS,
t_smbc.YOUNGEST_BACKUP_ID
FROM	"SYS_DATABASES"."M_BACKUP_CATALOG" AS smbc,
		(SELECT xmbc.DATABASE_NAME, 
				xmbc.ENTRY_TYPE_NAME, 
				MONTHS_BETWEEN(xmbc.SYS_START_TIME, CURRENT_DATE) as AGE_MONTHS, 
				max (xmbc.BACKUP_ID) as YOUNGEST_BACKUP_ID 
				FROM "SYS_DATABASES"."M_BACKUP_CATALOG" xmbc 
				GROUP BY xmbc.DATABASE_NAME, 
						xmbc.ENTRY_TYPE_NAME, 
						MONTHS_BETWEEN(xmbc.SYS_START_TIME, CURRENT_DATE) 
		) as t_smbc 
WHERE t_smbc.DATABASE_NAME = smbc.DATABASE_NAME 
AND t_smbc.ENTRY_TYPE_NAME = smbc.ENTRY_TYPE_NAME 
AND t_smbc.AGE_MONTHS = MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE) 
GROUP BY 	smbc.DATABASE_NAME, 
			smbc.ENTRY_TYPE_NAME, 
			MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE), 
			t_smbc.YOUNGEST_BACKUP_ID 
ORDER BY DATABASE_NAME, 
		AGE_MONTHS DESC,
		RECORDS

The key points to note are:

I use the SYS_DATABASES.M_BACKUP_CATALOG view in the SYSTEMDB to see across all databases in the HANA system instead of checking in each one.
For each database, the SQL outputs:
– type of backup (complete or log).
– age in months of the backup.
– number of backup records in that age group.
– youngest backup id for that age group (so I can do some cleanup).

An example execution is:

(NOTE: I made a mistake with the last column name, it’s correct in the SQL now – YOUNGEST_BACKUP_ID)

You can see that the SQL execution took only 3.8 seconds.
Based on my output, I could immediately see one problem, I had backup records from 6 months ago in the SYSTEMDB!

All of these records would be backed up on every transaction log backup.
For whatever reason, the backup process was not able to honour the “BACKUP CATALOG DELETE” which was meant to keep the catalog to less than 1 month of records.
I still cannot adequately explain why this had failed. The same process is used on other HANA databases and none had exhibited the same issue.

I can only presume something was preventing the deletion somehow, since in the next few steps you will see that I was able to use the exact same process with no reported issues.
For reference this is HANA 2.0 SPS04 rev47, patched all the way from SPS02 rev23.

Resolving the Issue

How did I resolve the issue? I simply re-ran the catalog deletion that was already running after each backup.
I was able to use the backup ID from the YOUNGEST_BACKUP_ID column to reduce the backup records.

In the SYSTEMDB:

BACKUP CATALOG DELETE ALL BEFORE BACKUP_ID xxxxxxxx

Then for each TenantDB (still in the SYSTEMDB):

BACKUP CATALOG DELETE FOR <TENANTBD> ALL BEFORE BACKUP_ID xxxxxxxx

At the end of the first DELETE execution *in the first Tenant*, I re-ran the initial SQL query to check and this was the output:

We now only have 1 backup record, which was the youngest record in that age group for that first tenant database (compare to screenshot of first execution of the SQL query with backup id 1,590,747,286,179).
Crucially we have way less log backups for that tenant. Weve gone down from 2247 to 495.
Nice!
I then progressed to do the delete in the SYSTEMDB and other TenantDB of this HANA system.

Checking the Results

As a final check, I was able to compare the log backup file sizes:

The catalog backup in file “log_backup_0_0_0_0.nnnnnnn” at 09:16 is before the cleanup and is 2.7GB in size.
Whereas the catalog backup in “log_backup_0_0_0_0.nnnnnnn” at 09:29 is after the cleanup and is only 76KB in size.
An absolutely massive reduction!

How do we know that file “log_backup_0_0_0_0.nnnnnnn” is a catalog backup?
Because we can check using the Linux “strings” command to see the file string contents.
Way further down the listing it says it is a catalog backup, but I thought it was more interesting to see the “MAGIC” of Berlin:

UPDATE: August 2020 – SAP note 2962726 has been released which contains some standard SQL to help remove failed backup entries from the catalog.

Summary

Check your HANA backup catalog backup sizes.
Ensure you have alerting on file systems (if doing backups to disk).
Double check the backup catalog record age.
Give tons of freebies and thanks to your DBAs on DBA Appreciation Day!

Useful Links

Enable and Disable Automatic Log Backup
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/241c0f0020b2492fb93a69a40b1b1b9a.html

Accumulated Backups of the Backup Catalog
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/3def15378b954aac85f2b93bb3f85a49.html

Log Modes
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/c486a0a3bb571014ab46c0633224f02f.html

Consolidated Log Backups
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/653b5c6d5f9d41808011a5bd0fac6709.html

Azure Disk Cache Settings for an SAP Database on Linux

One of your go-live tasks once you have built a VM in Azure, should be to ensure that the Azure disk cache settings on the Linux VM data disks, are set correctly in accordance with the Microsoft recommended settings.
In this post I explain the disk cache options and how they apply to SAP and especially to SAP databases such as SAP ASE and SAP HANA, to ensure you get optimum performance.

What Are the Azure Disk Cache Settings?

In Microsoft Azure you can configure different disk cache settings on data disks that are attached to a VM.
NOTE: You do not need to consider changing the O/S root disk cache settings, as by default they are applied as per the Azure recommendations.

Only specific VMs and specific disks (Standard or Premium Storage) have the ability to use caching.
If you use Azure Standard storage, the cache is provided by local disks on the physical server hosting your Linux VM.
If you use Azure Premium storage, the cache is provided by a combination of RAM and local SSD on the physical server hosting your Linux VM.

There are 3 different Azure disk cache settings:

None
ReadOnly (or “read-only”)
ReadWrite (or “read/write”)

The cache settings can influence the performance and also the consistency of the data written to the Azure storage service where your data disks are stored.

Cache Setting: None

By specifying “None” as the cache setting, no caching is used and a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache, will be read from the storage service.

Cache Setting: ReadOnly

By specifying “ReadOnly” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the storage service.
All read operations for data not already in the VM O/S file system cache, will be read from the read cache on the underlying physical machine, before being read from the storage service.

Cache Setting: ReadWrite

By specifying “ReadWrite” as the cache setting, a write operation at the VM O/S level is confirmed as completed once the data is written to the cache on the underlying physical machine.
All read operations for data not already in the VM O/S file system cache, will be read from the read cache on the underlying physical machine, before being read from the storage service.

Where Do We Configure the Disk Cache Settings?

The disk cache settings are configured in Azure against the VM (in the Disks settings), since the disk cache is both physical host and VM series dependent. It is *not* configured against the disk resource itself, as explained in my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

Any Recommendations for Disk Cache Settings?

There are specific recommendations for Azure disk cache settings, especially when running SAP and especially when running databases like SAP ASE or SAP HANA.

In general, the rules are:

Disk Usage	Azure Disk Cache Setting
Root O/S disk (/)	ReadWrite – ALWAYS!
HANA Shared	ReadOnly
ASE Home (/sybase/<SID>)	ReadOnly
Database Data	HANA=None, ASE=ReadOnly
Database Log	None

The above settings for SAP ASE have been obtained from SAP note 2367194 (SQL Server is same as ASE) and from the general deployment guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general
The use of write caching on the ASE home is optional, you could choose ReadOnly, it would help protect the ASE config file in a very specific scenario. It is envisaged that using ASE 16.0 with SRS/HADR you would have a separate data disk for the Replication Server data (I’ll talk about this in another post).

The above settings for HANA have been taken from the updated guide here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage which is designed to meet the KPIs mentioned in SAP note 2762990.

The reason for not using a write cache every time, is because an issue at the physical host level, affecting the cache, could cause the application (e.g database) to think it has committed data, when it actually isn’t written to disk. This is not good for databases, especially if the issue affects the transaction/redo log area. Data loss could occur.

It’s worth noting that this cache “issue” has always been true of every caching technology ever created, on which databases run. Storage tech vendors try to mitigate this by putting batteries into the storage appliances, but since the write cache in Azure is at the physical host level, there’s just no guarantee that when the VM O/S thinks the write operation has committed to disk, that it has actually been written to disk.

How About Write Accelerator?

There are specific Azure VM series (M-series at current) that support something known as “Write Accelerator”.
This is an extra VM level setting for Premium Storage disks attached to M-series VMs.

Enabling the Write Accelerator setting is a requirement by Microsoft for production SAP HANA transaction log disks on M-Series VMs. This setting ebales the Azure VM to meet the SAP HANA key performance indicators in note 2762990. Azure Write Accelerator is designed to provide lower latency write times on Premium Storage.

You should ensure that the Write Accelerator setting is enabled where appropriate, for your HANA database transaction log disks. You can check if it is enabled following my previous blog post: Listing Azure VM DataDisks and Cache Settings Using Azure Portal JMESPATH & Bash

I’ve tried my best to find more detailed information on how the Write Accelerator feature is actually provided, but unfortunately it seems very elusive. Robert Boban (of Microsoft) commented on a LinkedIn post here: “It is special caching impl. for M-Series VM to fulfill SAP HANA req. for <1ms latency between VM and storage layer.“.

Check the IOPS

Once you have configured your disks and the cache settings, you should ensure that you test the IOPS achieved using the Microsoft recommended process.
You can follow similar steps as my previous post: Recreating SAP ASE Database I/O Workload using Fio on Azure

As mentioned in other places in the Microsoft documentation and SAP notes such as 2367194, you need to ensure that you choose the correct size and series of VM to ensure that you align the required VM maximum IOPS with the intended amount of data disks and their potential IOPS maximum. Otherwise you could hit the VM max IOPS before touching the disk IOPS maximum.

Enable Accelerated Networking

Since the storage is itself connected to your VM via the network, you should ensure that Accelerator Networking is enabled in your VMs Network Settings:

Checking Cache Settings Directly on the VM

As per my previous post Checking Azure Disk Cache Settings on a Linux VM in Shell, you can actually check the Azure disk cache settings on the VM itself. You can do it manually, or write a script (better option for whole landscape validation).

Summary:

I discussed the two types of storage (standard or premium) that offer disk caching, plus where in Azure you need to change the setting.
The table provided a list of cache settings for both SAP ASE and SAP HANA databases and their data disk areas, based on available best-practices.

I mentioned Write Accelerator for HANA transaction log disks and ensuring that you enable Accelerated Networking.
Also provided was a link to my previous post about running a check of IOPS for your data disks, as recommended by Microsoft as part of your go-live checks.

A final mention was made another post of mine, with a great way of checking the disk cache settings across the VMs in the landscape.

Useful Links:

Windows File Cache

https://docs.microsoft.com/en-us/azure/virtual-machines/linux/premium-storage-performance

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/how-to-enable-write-accelerator

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage#production-storage-solution-with-azure-write-accelerator-for-azure-m-series-virtual-machines

https://petri.com/digging-into-azure-vm-disk-performance-features

https://techcommunity.microsoft.com/t5/running-sap-applications-on-the/sap-on-azure-general-update-march-2019/ba-p/377456

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/dbms_guide_general

https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/hana-vm-operations-storage

SAP Note 2762990 – How to interpret the report of HWCCT File System Test

SAP Note 2367194 – Use of Azure Premium SSD Storage for SAP DBMS Instance

The Different Layers

Azure Storage Services

Azure Data Disk Cache Settings

Use LVM For Managing Disks

Use XFS File System

Correctly Partition ASE

Add Whole New ASE Devices

Summary:

Useful Links:

You may also be interested in:

Back To The Future

IETF Is On It

The Deal With SAC

How to See the Problem

The Fix(es)

Is There a Pretty Picture?

Useful Links:

You may also be interested in:

The Investigation

Continuing the Diagnosis

Some Sleep

Linux Domain Users

The Solution

Root Cause

HANA Transaction Logging

The SITREP

The Issue

What Was Causing the Disk Usage?

Why Were Catalog Backup Files So Big?

Resolving the Issue

Checking the Results

Summary

Useful Links

You may also be interested in:

What Are the Azure Disk Cache Settings?

Cache Setting: None

Cache Setting: ReadOnly

Cache Setting: ReadWrite

Where Do We Configure the Disk Cache Settings?

Any Recommendations for Disk Cache Settings?

How About Write Accelerator?

Check the IOPS

Enable Accelerated Networking

Checking Cache Settings Directly on the VM

Summary:

Useful Links:

You may also be interested in: