This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Cookies, SAP Analytics Cloud and CORS in Netweaver & HANA

Back in 2019 (now designated as 2019AC – Anno-Covid19), I wrote a post explaining in simple terms what CORS is and how it can affect a SAP landscape.
In that post I showed a simple “on-premise” setup using Fiori, a back-end system and how a Web Dispatcher can help alleviate CORS issues without needing too much complexity.
This post is about a recent CORS related issue that impacts access to back-end SAP data repositories.

Back To The Future

If we hit the “Fast-Forward” button to 2020MC (Mid-Covid19), CORS is now an extremely important technical setup to enable Web Browser based user interfaces to be served from Internet based SAP SaaS services (like SAP Analytics Cloud) and communicate with back-end on-premise/private data sources such as SAP BW systems or SAP HANA databases.

We see that CORS is going to become ever more important going forward, since Web Browser based user interfaces will become more abundant (due to the increase of SaaS products) for these types of back-end data access. The old world of installing a software application on-premise takes too much time and effort to keep up with changing technology.
Using SaaS applications as user interfaces to on-premise data allows a far more agile delivery of user functionality.

The next generation of Web Interfaces will be capable of processing ever larger data sets, with richer capabilities and more in-built intelligence. We’re talking about the Web Browser being a central hub of cross-connected Web Based services.
Imagine, one “web application” that needs a connection to a SaaS product that provides the analytical interface and version management, a connection to one or more back-end data repositories, a connection to a separate SaaS product for AI data analysis and pattern matching (deep insights), a connection to a separate SaaS product for content management (publishing), a connection to a separate SaaS product for marketing and customer engagement.

All of that, from one central web “origin” will mean CORS will become critical to prevent unwanted connections and data leaks. The Web Browser is already the target of many cyber security exploits, therefore staying secure is extremely important, but security is always at the expense of functionality.

IETF Is On It

The Internet Engineering Task Force already has this in hand. That’s how we have the Web Origin Concept, which underpins CORS, in the first place (tools.ietf.org/html/rfc6454).
The Web Origin Concept is constantly evolving to provide features for usability and also security. Way back in 2016 an update to RFC 6265 was proposed, to enhance the HTTP state management mechanism, which is commonly known to you and me as “cookies”.

This amendment (the draft details are here: tools.ietf.org/html/draft-ietf-httpbis-cookie-same-site-00) was the SameSite attribute that can be set for cookies.
Even in this draft, you can see that it actually attributes the idea of “samedomain-cookies” back to Mozilla, in 2011. So this is not really a “new” security feature, it’s a long time coming!

The Deal With SAC

The “problem” that has brought me back around to CORS, is recent experience with a CORS issue and SAP Analytics Cloud (SAC).
The issue led me to a blog post by Dong Pan of SAP Canada in Feb 2020 and a recent blog post by Ian Henry, also of SAP in Aug 2020.

Dong Pan wrote quite a long technical blog post on how to fix or work-around the full introduction of the SameSite cookie attribute in Google Chrome version 80 when using SAP Analytics Cloud (SAC).

Ian Henry’s post is also based on the same set of solutions that Dong Pan wrote about, but his issue was accessing a backend HANA XS Engine via Web Dispatcher.

The problem in both cases is that SAP Analytics Cloud (SAC) uses the Web Browser as a middleman to create a “Live Connection” back to an “on-premise” data repository (such as SAP BW or SAP S/4HANA), but the back-end SAP Netweaver/SAP ABAP Platform stack/HANA XS engine, that hosts the “on-premise” data repository does not apply the “SameSite” attribute to cookies that it creates.

You can read Dong Pan’s blog post here: www.sapanalytics.cloud/direct-live-connections-in-sap-analytics-cloud-and-samesite-cookies/
You can read Ian Henry’s blog post here: https://blogs.sap.com/2020/08/26/how-to-fix-google-chrome-samesite-cookie-issue-with-sac-and-hana/

By not applying the “SameSite” attribute to the cookie, Google Chrome browsers of version 80+ will not allow SAC to establish a full session to the back-end system.
You will see an HTTP 400 “session expired” error when viewing the HTTP browser traffic, because SAC tries to establish the connection to the back-end, but no back-end system cookies are allowed to be visible to SAC. Therefore SAC thinks you have no session to the back-end.

How to See the Problem

You will need to be proficient at tracing HTTP requests to be able to capture the problem, but it looks like the following in the HTTP response from the back-end system:

You will see (in Google Chrome) two yellow warning triangles on the “set-cookie” headers in the response from the back-end during the call to “GetServerInfo” to establish the actual connection.
The call is the GET for URL “/sap/bw/ina/GetServerInfo?sap-client=xxx&sap-language=EN&sap-sessionviaurl=X“, with the sap-sessionviaurl in the query-string being the key part.
The text when you hover over the yellow triangle is: “This Set-Cookie didn’t specify a “SameSite” attribute and was defaulted to “SameSite=Lax,” and was blocked because it came from a cross-site response which was not the response to a top-level navigation. The Set-Cookie had to have been set with “SameSite=None” to enable cross-site usage.“.
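
To make the browser's decision in that warning text a little more concrete, here is a minimal Python sketch of the rule. It is a simplified model of the Chrome 80+ behaviour and not Chrome's actual implementation; the function names and the sample cookie values are hypothetical.

```python
def parse_set_cookie_attributes(header_value):
    """Return the attributes of a Set-Cookie value as a lower-cased dict."""
    attributes = {}
    # The first segment is name=value; the remaining segments are attributes.
    for segment in header_value.split(";")[1:]:
        name, _, value = segment.strip().partition("=")
        if name:
            attributes[name.lower()] = value.strip().lower()
    return attributes


def accepted_in_cross_site_response(header_value):
    """Simplified model: would a Chrome 80+ browser store this cookie when it
    arrives in a cross-site response that is not a top-level navigation?"""
    attributes = parse_set_cookie_attributes(header_value)
    # A missing SameSite attribute is treated as "Lax", which blocks this case.
    same_site = attributes.get("samesite", "lax")
    # SameSite=None is only honoured when the cookie is also marked Secure.
    return same_site == "none" and "secure" in attributes


# Hypothetical cookie values, for illustration only.
without_attribute = "sap-usercontext=sap-client=100; path=/"
with_attribute = "sap-usercontext=sap-client=100; path=/; SameSite=None; Secure"
```

In this model the first sample cookie is rejected (it defaults to Lax) and the second is accepted, which mirrors the hover text on the yellow triangles.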

The Fix(es)

SAP Netweaver (or SAP ABAP Platform) needs some code fixes to add the required cookie attribute “SameSite”.

A workaround (it is a workaround) is possible by using the rewrite module capability of the Internet Communication Manager (ICM) or using a rewrite rule in a Web Dispatcher, to re-write the responses and include a generic “SameSite” attribute on each cookie.
This is a workaround for a reason, because using the rewrite method causes unnecessary extra work in the ICM (or Web Dispatcher) for every request (matched or not matched) by the rewrite engine.
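
As a rough illustration of what such a rewrite has to do, here is a Python sketch of the logic only. It is not ICM or Web Dispatcher rewrite rule syntax (see the linked posts for the real rules), and the function names and the sample cookie are hypothetical.

```python
def add_samesite(set_cookie_value, policy="None"):
    """Append SameSite (and Secure, if needed) to a Set-Cookie value that lacks it."""
    segments = [segment.strip() for segment in set_cookie_value.split(";")]
    attribute_names = {segment.partition("=")[0].lower() for segment in segments[1:]}
    if "samesite" in attribute_names:
        # Already set at cookie creation time, so leave it untouched.
        return set_cookie_value
    segments.append(f"SameSite={policy}")
    # SameSite=None is only accepted by the browser together with Secure.
    if policy.lower() == "none" and "secure" not in attribute_names:
        segments.append("Secure")
    return "; ".join(segments)


def rewrite_response_headers(headers):
    """Apply add_samesite to every Set-Cookie header in a list of (name, value) pairs."""
    return [
        (name, add_samesite(value) if name.lower() == "set-cookie" else value)
        for name, value in headers
    ]
```

Notice that every header of every response has to be inspected, whether or not it ends up being changed. That is the extra work referred to above.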

It’s always better (more secure, more efficient) to apply the code fix to Netweaver (or ABAP Platform) so the “SameSite” attribute is added at the point of the cookie creation.
For HANA XS, it will need a patch to be applied (if it ever gets fixed in the XS since it is soon to be deprecated).
With the workaround, we are forcing a setting onto cookies outside of the creation process of those cookies.

Don’t get me wrong, I’m not saying that the workaround should not be used. In some cases it will be the only way to fix this problem in some older SAP systems. I’m just pointing out that there are consequences and it’s not ideal.

Dong Pan and Ian Henry have done a good job of providing options for fixing this in a way that should work for 99% of cases.

Is There a Pretty Picture?

This is something I always find useful when I try and work something through in my mind.
I’ve adjusted my original CORS diagram to include an overview of how I think this “SameSite” attribute issue can be imagined.
Hopefully it will help.

We see the following architecture setup with SAC and its domain “sapanalytics.cloud”, issuing CORS requests to back-end system BE2, which sits in domain “corp.net”:

Using the above picture for reference, we can now show where the “SameSite” issue occurs in the processing of the “Resource Response” when it comes back to the browser from the BE2 back-end system:

The blocking, by the Chrome Web browser, of the cookies set by the back-end system in domain “corp.net”, means that from the point of view of SAC, no session was established.
There are a couple more “Request”, “Response” exchanges, before the usual HTTP Authorization header is sent from SAC, but at that point it’s really too late as the returned SAP SSO cookie will also be blocked.

At this point you could see a number of different error messages in SAC, but in the Chrome debugging you will see no HTTP errors because the actual HTTP request/response mechanism is working and HTTP content is being returned. It’s just that SAC will know it does not have a session established, because it will not find the usual cookies that it would expect from a successfully established session.

Hopefully I’ve helped explain what was already a highly technical topic, in a more visual way and helped convey the problem and the solution.



Azure Front Door in a SAP Context

In April 2019, Microsoft announced the general availability of the Azure Front Door service.
The highlight of this service is layer 7 (HTTP/S) load balancing.
In this post I want to briefly explore how Azure Front Door could sit in an example SAP landscape.

But We Have Azure Application Gateway…

Yes, while the Azure Front Door service does provide similar capabilities with regards to load balancing an HTTP/s based back-end service, the similarities end when we start to consider multi-regional distribution of services. That is, multiple Azure regions actively servicing global clients.

Azure Application Gateway

The Azure Application Gateway service is the go-to service for HTTP/S load balancing for your Azure hosted HTTP/S IaaS or Container based services that are contained within a region.

Even for some, limited, SAP uses, the Azure Application Gateway may be sufficient, but you really need an experienced SAP Solution Architect to help you plan your SAP landscape architecture at this point. The consequences of doing it wrong could cause you to completely re-implement a new architecture pattern in your landscape and, of course, incur additional cost.

… and SAP Web Dispatcher

I have discussed the features of the SAP Web Dispatcher before.
The need for a SAP Web Dispatcher in a SAP landscape is clear and even more appropriate in a cloud deployment of SAP.
Just like Azure Application Gateway, the SAP Web Dispatcher’s context should be limited to a single region. This is especially true because it is IaaS, which means the VMs on which the Web Dispatcher is deployed, are themselves bound to a specific region.

However, what is not clear is how disparate Web Dispatcher systems (i.e. different SAPGLOBALHOST values) can be used in different regions to correctly load balance. This is not the same as a single system with different instances in different regions!

How It All Hangs Together

If we go back to the purpose of this post, I wanted to show how Azure Front Door could be used within the context of a SAP system deployment in Azure.

To help convey the idea, I’ve put together a simple diagram:

In the above diagram, you can see that the Azure Front Door service is used to balance inbound requests from a customer booking system, across multiple Azure regions, directly from the internet. This means that Azure Front Door is most definitely suited as a global customer facing load balancer.
An example scenario is a 2 (or more) region architecture with primary region and disaster recovery region. If the primary region for our customer booking system is unavailable, a DR could be invoked and customers could be routed to the DR region, allowing customer bookings to be taken.
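
The primary/DR scenario can be sketched in a few lines of Python. This is only a model of priority-based failover under my own assumptions, not the Azure Front Door API; the class, field and region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionBackend:
    name: str
    priority: int   # lower number = preferred region
    healthy: bool   # result of the most recent health probe


def choose_region(backends):
    """Return the healthy backend with the lowest priority number, or None."""
    healthy = [backend for backend in backends if backend.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda backend: backend.priority)


# Hypothetical two-region landscape: primary plus disaster recovery.
regions = [
    RegionBackend("primary-region", priority=1, healthy=True),
    RegionBackend("dr-region", priority=2, healthy=True),
]
```

While the primary region is healthy it receives the bookings; once its probe fails, the same call returns the DR region.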

In the diagram, traffic routed from Azure Front Door, is then (for the sake of example) routed through Azure Application Gateway. This is just for example, but in reality it’s not really needed. It could be that you have a real mixture of SAP and non-SAP in some converged sub-domain, and it may be easier to load balance this mix of URLs at this level.
The main point is that, by this stage, you are committed to returning data from a single region.

In our example diagram, the Azure Application Gateway then routes traffic to the SAP Web Dispatcher, which load balances the traffic over the available application servers of the back-end SAP ECC system using the ABAP stack message server (a feature that is not easily replicated in any other load balancer).

Where Does Azure Traffic Manager Sit?

The Azure Traffic Manager service is a DNS based routing and distribution service. If your company is a multinational conglomerate with a latency sensitive web based customer service, then the Azure Traffic Manager can be used to route customers to their nearest region, where you have your web service hosted and where they can potentially get the speediest and most appropriate content.
If you have only 2 or 3 regions, do not have latency issues and have no need to provide region specific content, then Azure Front Door is probably what you need.

Summary:

I’ve tried to show how the Azure Front Door service can provide your internet sourced, customer entry point into your multi-region web service.
The diagram I’ve provided hopefully shows how Azure Front Door can be distinguished from other similar technologies in a SAP landscape including how Azure Application Gateway could also be in the mix (although rare).
Finally I discuss how Azure Traffic Manager may not always be appropriate for load distribution.


SAP Web Dispatcher Reverse Proxy Features

If you read through any SAP documentation, you may be forgiven for thinking that the SAP Web Dispatcher is just a reverse HTTP proxy.
It can be located in front of a SAP WebAS and balance the load.
Therefore, it is a simple reverse proxy, right?

In this post, I am going to highlight some of the core features of the SAP Web Dispatcher, so that you may understand its strengths in comparison with other solutions such as Azure Application Gateway and even Azure Front Door.

Heavily Engineered

There’s a common misconception that SAP is just another piece of software using an array of different components lumped together with some bits of Open Source. In some small cases this may be true of acquired software.
However, the core SAP software offerings are actually far more coherent and intricately linked than you may first imagine.

Ask any Oracle EBS administrator about their software stack and you will be impressed at how well the SAP software stack has been engineered.
This is especially true for the lower SAP Kernel level software components. The older parts of the software stack are reused so often because of their robustness.

3 Routing Principles

The main thing to remember is that the SAP Web Dispatcher can route requests according to 3 main principles:

  1. Capability
    Is the desired target URL path served by the configured target back-end system(s)?

  2. Availability
    Is the desired target URL path served by a configured back-end system that is available (i.e. not in maintenance mode)?

  3. Capacity
    Is there more than one target back-end server capable of handling the request and, if so, which one has more capacity?
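
The three principles can be read as three successive filters. The Python sketch below models that idea only; it is not how the Web Dispatcher is implemented, and the class, field and server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendServer:
    name: str
    url_prefixes: tuple   # URL paths this server can serve (capability)
    available: bool       # False when in maintenance mode (availability)
    free_capacity: int    # relative spare capacity, higher is better (capacity)


def route(path, servers):
    """Pick a target server for the path, or None if nothing can serve it."""
    # 1. Capability: keep only servers that serve this URL path.
    capable = [s for s in servers if any(path.startswith(p) for p in s.url_prefixes)]
    # 2. Availability: drop servers that are in maintenance mode.
    available = [s for s in capable if s.available]
    if not available:
        return None
    # 3. Capacity: prefer the server with the most spare capacity.
    return max(available, key=lambda s: s.free_capacity)


# Hypothetical back-end servers, for illustration only.
servers = [
    BackendServer("app1", ("/sap/bc",), available=True, free_capacity=20),
    BackendServer("app2", ("/sap/bc",), available=True, free_capacity=70),
    BackendServer("app3", ("/sap/bc",), available=False, free_capacity=99),
]
```

Here a request for a path under /sap/bc would go to app2: app3 has more spare capacity but is in maintenance mode, so it never reaches the capacity comparison.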

Load Balancing Act

The SAP Web Dispatcher takes the HTTP/S request from the end-user and as part of the routing determination it analyses the target back-end system load.
It’s actually continually aware of the back-end systems.

There’s a great picture here, which highlights the load balancing methods used for the different types of SAP back-end: https://help.sap.com/viewer/683d6a1797a34730a6e005d1e8de6f22/7.40.18/en-US/4899c3d999273987e10000000a421937.html

What is not mentioned on the help.sap.com page linked above, is target back-end systems configured as “EXTSRV” (non-SAP routing) and also the “flat-file” routing method (very rarely used – at least, I’ve not used it).

The “EXTSRV” back-end systems will use basic round-robin to distribute the request between a comma separated list of target servers. Stickiness is achieved through HTTP headers, allowing the Web Dispatcher to determine which back-end system it routed your previous request to.
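
A minimal Python sketch of round-robin with stickiness follows. It only models the behaviour described above; the class name, the server names and the idea of passing the previously used server in as a parameter (standing in for the header information) are assumptions for illustration.

```python
from itertools import cycle


class RoundRobinRouter:
    def __init__(self, servers):
        self._servers = list(servers)
        self._next_server = cycle(self._servers)

    def route(self, sticky_server=None):
        """Return the target server for a request.

        sticky_server is the server the client was routed to previously,
        as recovered from the request headers (hypothetical mechanism).
        """
        if sticky_server in self._servers:
            # Known client: keep it on the same back-end server.
            return sticky_server
        # New client (or unknown server): take the next server in turn.
        return next(self._next_server)


router = RoundRobinRouter(["srv1", "srv2"])
```

New clients alternate between srv1 and srv2, while a client that presents a known server from its previous request stays where it was.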

Even though “EXTSRV” is really designed for non-SAP back-ends, I have used “EXTSRV” for SAP systems, especially when using the SAP Web Dispatcher to avoid issues for system-to-system communications and wanting to avoid CORS issues (see CORS in a SAP Netweaver Landscape).

The “flat-file” method simply uses a static text file as a kind of false load response from a Message Server. The flat-file can be generated by anything and the Web Dispatcher configuration is then defined to route to whatever is in the flat-file.

Back-End

Apart from “EXTSRV” and “flat-file”, all other routing mechanisms use SAP proprietary methods to determine the back-end system load.
As you can see in the SAP Help page link referenced above, the SAP Web Dispatcher knows about the back-end because in the SAP Web Dispatcher configuration, we tell it what it is going to be routing to.

As an example, ABAP back-end systems are added to the Web Dispatcher profile file with the ABAP Message Server described in the configuration.
The Web Dispatcher connects to the target system’s Message Server and says “hello”.
Once connected, the SAP Web Dispatcher retrieves the list of URLs that are provided by the ABAP back-end system, the servers that are served by the Message Server and the relative load of those servers.
All of this information is used during the routing determination.

Protocols

The Web Dispatcher can handle HTTP 1.0, 1.1 and 2.0 (HTTP/2) protocols delivered over TLS (SSL).

Since Kernel 7.49, HTTP/2 has been supported in the Web Dispatcher and also in the ABAP Netweaver stack. This is significant for the latest HTTP based SAP UX known as SAP Fiori. The use of HTTP/2 allows request multiplexing over a single continuous TCP connection, reducing latency and increasing throughput.

NOTE: There are some great SAP blogs out there on how and why to enable HTTP/2 for Fiori!

For many years now, the SAP Web Dispatcher has supported the Web Socket protocol.
The Web Socket protocol allows developers to utilise push-notifications and provide a more real-time interactive experience for HTML 5.0 content, bringing a closer level of integration with the consuming Web Browser.

Security

Some of the more complex uses of the SAP Web Dispatcher involve specific security scenarios.

One such scenario that comes to mind, is Principal Propagation, which can use the Web Dispatcher to front a set of common back-end systems.
The whole premise of Principal Propagation is that the IdP (identity provider) is “impersonating” the authenticated user, by issuing a generated certificate of authenticity to the target system, on behalf of the user.
With a reverse proxy between the Web Browser and the target HTTP service, things can become complex because the generated X.509 client certificate can become consumed by the proxy server, instead of being forwarded to the target HTTP server.
To prevent the certificate from being interpreted in the wrong way, the SAP Web Dispatcher can be configured to shift the client certificate out to a predefined HTTP header, allowing a kind of X.509 client certificate forwarding.
(More information can be found here: Principal Propagation with SAP Cloud Platform).

Update Aug-2020: As pointed out by a reader, the SAP Web Dispatcher is also capable of reverse invocation. This is an added security feature which allows the target SAP system to open the connection to the SAP Web Dispatcher (instead of the other way around). The SAP Web Dispatcher then uses this open connection channel to send load balanced requests back to the target SAP system. The Reverse Invoke feature is obviously meant for scenarios where the Web Dispatcher exists in a separate network segment (DMZ) to the target SAP system, meaning you only need to open the firewall in the outbound (from the target SAP system) direction.
(Details here: https://help.sap.com/doc/7b196aab728810148a4b1a83b0e91070/1511%20000/en-US/frameset.htm)

Manageability

There’s nothing I like about trying to trace an HTTP call through a proxy server.
The SAP Web Dispatcher comes with its own secure administration page from where an administrator can enable advanced tracing capabilities.

The SAP Web Dispatcher makes it much easier to trace requests and responses, with the ability to show the complete unencrypted trace of SSL encrypted sessions (not using pass-through encryption).

The trace is able to show the exact ABAP work process number that processed the request in the target back-end system.

An administrator is able to move individual back-end systems into “maintenance mode” and provide a custom HTTP 503 (service unavailable) message, without affecting the other back-end systems serviced by the same Web Dispatcher.

The SAP Web Dispatcher comes with a vast array of configuration parameters to hone the characteristics of the service you are trying to deliver.
As an example, parameter “wdisp/handle_webdisp_ap_header” can be set to allow the Web Dispatcher to add additional HTTP headers to the request, thereby informing the target back-end system of the Web Dispatcher forward-facing TCP ports. This feature allows the target back-end systems to correctly rewrite HTML links and referral URLs, with the ports on which the SAP Web Dispatcher is listening for requests.
This is just one example of where the back-end SAP system is actually aware that it is being called via a SAP Web Dispatcher.

The Future

With the seemingly constant evolution of cloud based services, what do I imagine the future is for the SAP Web Dispatcher?
In my opinion it is here for another few years yet. The feature list is too specific to SAP landscapes for any real profit to be made by a competitive product.
However, what we may see in this hyper-competitive race for cloud adoption, is the use of a SaaS based version of SAP Web Dispatcher, provided for by the major cloud providers.
Right now, a SAP Web Dispatcher consumes far more cost/resources/effort than it needs to. Therefore, a simple button click and subsequent configuration in something like the Azure Portal, would be a great saving and more importantly, a great incentive to potential cloud customers with SAP landscapes.

Summary

In this short article, we have discussed how the robust engineering of the SAP Web Dispatcher makes it the ideal front-end reverse proxy for the back-end systems of a SAP landscape.

In fact, in some situations it is the only possibility due to the way the Web Dispatcher is acutely SAP back-end aware, with many features built for native SAP compatibility.

Conversely we’ve seen how, in some situations, the back-end system is actually aware of the presence of the SAP Web Dispatcher and can rewrite HTML links and referral URLs accordingly.

We know the latest HTTP/2 protocol is supported and that this is in line with SAP’s goal of having Fiori as the future SAP presentation layer.

We discussed the extensive tracing capabilities, helping SAP administrators to diagnose complex HTTP connectivity and authentication issues.

We can conclude that SAP Web Dispatcher is not just a simple reverse proxy and its use within your SAP landscape is more than likely going to be beneficial in some way or another.
The SAP Web Dispatcher will be with us for a while longer.


SAP ASE Error – Process No Longer Found After Startup

This post is about a strange issue I was hitting during the configuration of SAP LaMa 3.0 to start/stop a SAP ABAP 7.52 system (with Kernel 7.53) that was running with a SAP ASE 16.0 database.

During the LaMa start task, the task would fail with an error message: “ASE process no longer found after startup. (fault code: 127)“.

When I logged directly onto the SAP server Linux host, I could see that the database had indeed started up, eventually.
So what was causing the failure?

The Investigation

At first I thought this was related to the Kernel, but having checked the versions of the Kernel components, I found that they were the same as another SAP system that was starting up perfectly fine using the exact same LaMa system.

The next check I did was to turn on tracing on the hostagent itself. This is a simple task of putting the trace value to “3” in the host_profile of the hostagent and restarting it:

service/trace = 3

The trace output is shown in a number of different trace files in the work directory of the hostagent but the trace file we were interested in is called dev_sapdbctrl.

The developer trace file for the sapdbctrl binary executable is important, because the sapdbctrl binary is executed by SAP hostagent (saphostexec) to perform the database start. If you observe the contents of the sapdbctrl trace output, you will see that it loads the Sybase specific shared library which contains the required code to start/stop the ASE database.

The same sapdbctrl also contains the ability to load the required libraries for other database systems.

As a side note, it is still not known to me, how the Sybase shared library comes to exist in the hostagent executable directory. When SAP ASE is patched, this library must also be patched, otherwise how does the hostagent stay in-step with the ASE database that it needs to talk with?

Once tracing was turned on, I shut the SAP ASE instance down again and then used SAP LaMa to initiate the SAP system start once again.
Sure enough, the LaMa start task failed again.

Looking in the trace file dev_sapdbctrl I could see the same error message that I was seeing in SAP LaMa:

Error: Command execution failed. : ASE process no longer found after startup. 
(fault code: 127) Operation ID: 000D3A3862631EEAAEDDA232BE624001
----- Log messages ---- 
Info: saphostcontrol: Executing StartDatabase 
Error: sapdbctrl: ASE process no longer found after startup. 
Error: saphostcontrol: StartDatabase failed (sapdbctrl exit code = 1)

This was great. It confirmed that SAP LaMa was just seeing the symptom of some other issue, since LaMa just calls the hostagent to do the start.

Now I knew the hostagent was seeing the error, I tried using the hostagent directly to perform the start, using the following:

/usr/sap/hostctrl/exe/saphostctrl -debug -function StartDatabase -dbname <SID> -dbtype syb -dbhost <the-ASE-host>

NOTE: The hostagent “-debug” command line option puts out the same information without the need for the hostagent tracing to be turned on in the host_profile.

Once again, the start process failed and the same error message was present in the dev_sapdbctrl trace file.

This was now really strange.
I decided that the next course of action was to start the process of raising the issue with SAP via an incident.
If you suspect that something could take a while to fix, then it’s always best to raise it with SAP early and continue to look at the issue in parallel.

Continuing the Diagnosis

While the SAP incident was in progress, I continued the process of trying to self-diagnose the issue further.
I tried a couple more things such as:

  • Starting and stopping SAP ASE manually using stopdb/startdb commands.
  • Restarting the whole server (this step has a place in every troubleshooting process, eventually).
  • Checking the server patch level.
  • Checking the environment of the Linux user, the shell, the profile files, the O/S limits applied.
  • Checking what happens if McAfee anti-virus was disabled (I’ve seen the ePO blocking processes before).

Eventually exhaustion set in and I decided to give the SAP support processor time to get back to me with some hints.

Some Sleep

I spend a lot of time solving SAP problems. A lot of time.
Something doesn’t work according to the docs, something did work but has stopped working, something has never worked well…
It builds up in your mind and you carry this stuff around in your head.
Subconsciously you think about these problems.

Then, at about 3am when you can’t get back to sleep, you have a revelation.
The hostagent is forking the process to start the database as the syb<sid> Linux user (it uses “su”), from the root user (hostagent runs as the root user).

Linux Domain Users

The revelation I had regarding the forking of the user was just the trigger I needed to make me consider the way the Linux authentication was set up on this specific server with the problem ASE startup.

I remembered at the beginning of the project that I had hit an issue with the SSSD Linux daemon, which is responsible for interfacing between Linux and Microsoft Active Directory. At that time, the issue was causing the hostagent to hang when operations were executed which required a switch to another Linux user.
This previous issue was actually a hostagent issue that was fixed in a later hostagent patch. During that particular investigation, I requested that the Linux team re-configure the SSSD daemon to be more efficient with its Active Directory traversals, when it was looking to see if the desired user account was local to Linux or if it was a domain account.

With this previous issue in mind, I checked the SSSD configuration on the problem server. This is a simple conf file in /etc/sssd.

The Solution

After all the troubleshooting, the raising of the incident, the sleeping, I had finally got to the solution.

After checking the SSSD daemon configuration file /etc/sssd/sssd.conf, I could clearly see that there was one entry missing compared to the other servers that didn’t experience the SAP ASE start error.

The parameter: “subdomain_enumerate = none” was missing.
Looking at the manual page for SSSD it would seem that without this parameter there is additional overhead during any Active Directory traversal.
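
If you have several servers to compare, a check like this can be scripted. Below is a minimal Python sketch that parses sssd.conf-style text and reports any domain section that does not set the parameter. The function name and the sample configuration (including the domain name) are hypothetical, and I am assuming the file parses as a plain INI file.

```python
import configparser


def domains_missing_option(conf_text, option="subdomain_enumerate"):
    """Return the [domain/...] sections of an sssd.conf that do not set the option."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return [
        section
        for section in parser.sections()
        if section.startswith("domain/") and not parser.has_option(section, option)
    ]


# Hypothetical sample configuration, for illustration only.
SAMPLE_CONF = """
[sssd]
domains = corp.net
services = nss, pam

[domain/corp.net]
id_provider = ad
"""

print(domains_missing_option(SAMPLE_CONF))
```

On a real server you would read /etc/sssd/sssd.conf as root and pass its contents to the function instead of the sample text.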

I set the parameter accordingly in the /etc/sssd/sssd.conf file and restarted the SSSD daemon using:

service sssd restart

Then I retried the start of the database using the hostagent command shown previously.
It worked!
I then retried with SAP LaMa and that also now started ASE without error messages.

Root Cause

What seems to have been happening is that an internal pre-set timeout in the sapdbctrl binary was being hit, at which point sapdbctrl simply gave up and threw the error that I was seeing. The ASE database was left to continue its start (the process had already been initiated), but to the hostagent it looked like it had failed to start.
By adding the “subdomain_enumerate = none” parameter, any “delay” caused by an inappropriate call to Active Directory was massively reduced and subsequent start activities were successful.
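
One way to get a feel for this kind of delay is to time an account lookup directly. The Python sketch below times a single lookup through the operating system's name service (which is where SSSD gets involved when it is configured). It does not reproduce the sapdbctrl timeout, whose value I do not know, and the function name is hypothetical.

```python
import pwd
import time


def time_user_lookup(username):
    """Return (found, seconds) for a single account lookup through NSS."""
    started = time.perf_counter()
    try:
        pwd.getpwnam(username)
        found = True
    except KeyError:
        # The account is not known locally or in any configured domain.
        found = False
    return found, time.perf_counter() - started


found, seconds = time_user_lookup("root")
print(f"found={found} lookup took {seconds:.3f}s")
```

Running it for the syb<sid> user before and after the SSSD change, and comparing against a server that starts ASE without problems, should show whether user resolution is the slow part.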

Analysing & Reducing HANA Backup Catalog Records

In honour of DBA Appreciation Day today 3rd July, I’ve written a small piece on a menial but crucial task that HANA database administrators may wish to check. It’s very easy to overlook but the impact can be quite amazing.

HANA Transaction Logging

In “normal” log mode (for recoverability), the HANA database, like Oracle, has an automatic transaction log backup process, which is responsible for backing up transaction log segments so that the HANA log volume disk space can be re-used by new transactions.
No free disk space in the HANA log volume, means the database will hang, until free space becomes available.

It is strongly recommended by SAP, to have your HANA database in log mode “normal”, since this offers the point-in-time recovery capability through the use of the transaction log backups.

By default a transaction log backup will be triggered automatically by HANA every time a log segment becomes full or if the timeout for an individual service is hit, whichever of those is sooner.
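
The "whichever is sooner" rule can be written as a single condition. This Python sketch is only a model of that rule, not HANA's implementation; the function and parameter names are hypothetical and the timeout of 900 seconds used in the example call is just an illustrative value.

```python
def log_backup_due(segment_is_full, seconds_since_last_backup, timeout_seconds):
    """Return True when either trigger condition for a log backup is met."""
    return segment_is_full or seconds_since_last_backup >= timeout_seconds


# A quiet service: the segment is not full, but the timeout has been reached.
print(log_backup_due(False, 900, 900))
```

The consequence is that even an idle database keeps producing log backups at the timeout interval.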
This is known as “immediate” interval mode.

I’m not going to go into the differences of the various interval options and the pros and cons of each since this is highly scenario specific. A lot of companies have small HANA databases and are quite happy with the default options. Some companies have high throughput, super low latency requirements, and would be tuning the log backup process for maximum throughput, while other companies want minimal data-loss and adjust the parameters to ensure that transactions are backed up off the machine as soon as possible.

The SITREP

In the specific situation that I encountered, I have a small HANA database of around 200GB in memory, serving a SAP Solution Manager 7.2 system (so it has 2x tenant databases plus the SystemDB).

The settings are such that all databases run in log_mode “normal” with consolidated log backups enabled in “immediate” mode and a max_log_backup_size of 16GB (the default, but specified).

All backups are written to a specific disk area, before being pushed off the VM to an Azure Storage Account.

The Issue

I noticed that the local disk area where the HANA database backups are written was becoming quite full. Out of context, you might have put this down to an increase in system activity, but I knew that this system was not doing anything at all (it’s a test system for testing Solution Manager patches and nobody was using it).

What Was Causing the Disk Usage?

Looking at the disk backup file system, I could easily see at the O/S level, that the HANA database log backups were the reason for the extra space usage.
Narrowing that down even further, I could see that the SYSTEMDB was to blame.

The SYSTEMDB in a very lightly used HANA database should not be transacting enough to have a day-to-day noticeable increase in log backup disk usage.
This was no ordinary increase!
I was looking at a total HANA database size on disk of ~120GB (SYSTEMDB plus 2x TenantDBs), and yet I was seeing ~200GB of transaction log backups per day from just the SYSTEMDB.

Drilling down further into the log backup directory for the SYSTEMDB, I could see the name of the log backup files and their sizes.
I was looking at log backup files of 2.8GB in size every ~10 to ~15 minutes.
The biggest files were…

… log_backup_0_0_0_0.<unix epoch time>
That’s right, the backup catalog backups!
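
The drill-down I did by eye at the O/S level can be sketched in Python. This is an illustration only: the function name is made up, and the only assumption about file naming is the one shown above, that catalog backups start with “log_backup_0_0_0_0.” while other log backups start with “log_backup_”.

```python
from collections import defaultdict
from pathlib import Path

# Prefix of the backup catalog backups, as seen in the directory listing.
CATALOG_PREFIX = "log_backup_0_0_0_0."


def summarise_log_backups(directory):
    """Return {category: (file_count, total_bytes)} for log_backup_* files."""
    totals = defaultdict(lambda: [0, 0])
    for entry in Path(directory).iterdir():
        if not entry.is_file() or not entry.name.startswith("log_backup_"):
            continue  # ignore sub-directories and unrelated files
        category = "catalog" if entry.name.startswith(CATALOG_PREFIX) else "log"
        totals[category][0] += 1
        totals[category][1] += entry.stat().st_size
    return {name: (count, size) for name, (count, size) in totals.items()}
```

Pointing this at the log backup directory of one database shows straight away whether the catalog backups or the real log segment backups are eating the space.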

Whenever HANA writes a backup, whether it is a complete data backup, or a transaction log backup, it also writes a backup of the backup catalog.
This is extremely useful if you have to restore a system and need to know about the backups that have taken place.
By default, the backup catalog backups are accumulated, which means that HANA doesn’t need to write out multiple backups of the backup catalog for each log backup (remember, we have 2x tenantDBs).

Why Were Catalog Backup Files So Big?

The catalog backups include the entire backup catalog.
This means every prior backup is in the backup file, so by default the backup catalog backup file will increase in size at each backup, unless you do some housekeeping of the backup catalog records.
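
To get a feel for why this adds up so quickly, here is a deliberately simplified model in Python. It assumes every backup adds exactly one catalog record and that every record takes the same number of bytes. Both are assumptions for illustration, not measured HANA behaviour.

```python
def total_catalog_bytes_written(existing_records, new_backups, bytes_per_record):
    """Sum the catalog backup sizes written over `new_backups` backups.

    Simplified model: each backup adds one catalog record, and each
    catalog backup contains every record that exists at that moment.
    """
    total = 0
    records = existing_records
    for _ in range(new_backups):
        records += 1                          # the new backup is recorded
        total += records * bytes_per_record   # the whole catalog is written out
    return total
```

In this model the total written grows with the square of the number of backups, which is why a catalog that is never trimmed eventually dominates the backup area even on an idle system.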

My task was to write some SQL to check the backup catalog to see how many backup catalog records existed, for what type of backups, in which database and how old they were.

I came up with the following SQL:

-- Breakdown of age of backup records in months, by type of record.
SELECT smbc.DATABASE_NAME,
       smbc.ENTRY_TYPE_NAME,
       MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE) as AGE_MONTHS,
       COUNT(MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE)) RECORDS,
       t_smbc.YOUNGEST_BACKUP_ID
FROM   "SYS_DATABASES"."M_BACKUP_CATALOG" AS smbc,
       -- Youngest (highest) backup ID per database, entry type and age group.
       (SELECT xmbc.DATABASE_NAME,
               xmbc.ENTRY_TYPE_NAME,
               MONTHS_BETWEEN(xmbc.SYS_START_TIME, CURRENT_DATE) as AGE_MONTHS,
               max (xmbc.BACKUP_ID) as YOUNGEST_BACKUP_ID
        FROM   "SYS_DATABASES"."M_BACKUP_CATALOG" xmbc
        GROUP BY xmbc.DATABASE_NAME,
                 xmbc.ENTRY_TYPE_NAME,
                 MONTHS_BETWEEN(xmbc.SYS_START_TIME, CURRENT_DATE)
       ) as t_smbc
WHERE  t_smbc.DATABASE_NAME = smbc.DATABASE_NAME
AND    t_smbc.ENTRY_TYPE_NAME = smbc.ENTRY_TYPE_NAME
AND    t_smbc.AGE_MONTHS = MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE)
GROUP BY smbc.DATABASE_NAME,
         smbc.ENTRY_TYPE_NAME,
         MONTHS_BETWEEN(smbc.SYS_START_TIME, CURRENT_DATE),
         t_smbc.YOUNGEST_BACKUP_ID
ORDER BY DATABASE_NAME,
         AGE_MONTHS DESC,
         RECORDS

The key points to note are:

  • I use the SYS_DATABASES.M_BACKUP_CATALOG view in the SYSTEMDB to see across all databases in the HANA system instead of checking in each one.
  • For each database, the SQL outputs:
    – type of backup (complete or log).
    – age in months of the backup.
    – number of backup records in that age group.
    – youngest backup id for that age group (so I can do some cleanup).
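
If the grouping is easier to follow outside SQL, the same idea can be sketched in Python over some made-up records. The months_between() helper here is my own whole-month approximation and may not match HANA’s MONTHS_BETWEEN exactly, and the record tuples are sample data, not real catalog output.

```python
from collections import defaultdict
from datetime import date


def months_between(start, today):
    """Whole calendar months from `start` to `today` (simplified)."""
    months = (today.year - start.year) * 12 + (today.month - start.month)
    if today.day < start.day:
        months -= 1  # the current month is not yet complete
    return months


def age_breakdown(records, today):
    """records: iterable of (database, entry_type, start_date, backup_id)."""
    groups = defaultdict(lambda: {"records": 0, "youngest_backup_id": 0})
    for database, entry_type, start, backup_id in records:
        key = (database, entry_type, months_between(start, today))
        groups[key]["records"] += 1
        groups[key]["youngest_backup_id"] = max(
            groups[key]["youngest_backup_id"], backup_id
        )
    # Same ordering as the SQL: database, oldest age group first, then count.
    return sorted(
        (
            (db, etype, age, g["records"], g["youngest_backup_id"])
            for (db, etype, age), g in groups.items()
        ),
        key=lambda row: (row[0], -row[2], row[3]),
    )
```

Each returned tuple corresponds to one output row of the SQL: database, entry type, age in months, number of records and the youngest backup ID in that age group.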

An example execution is:

(NOTE: I made a mistake with the last column name, it’s correct in the SQL now – YOUNGEST_BACKUP_ID)

You can see that the SQL execution took only 3.8 seconds.
Based on my output, I could immediately see one problem: I had backup records from 6 months ago in the SYSTEMDB!

All of these records would be backed up on every transaction log backup.
For whatever reason, the backup process was not able to honour the “BACKUP CATALOG DELETE” which was meant to keep the catalog to less than 1 month of records.
I still cannot adequately explain why this had failed. The same process is used on other HANA databases and none had exhibited the same issue.

I can only presume something was preventing the deletion somehow, since in the next few steps you will see that I was able to use the exact same process with no reported issues.
For reference this is HANA 2.0 SPS04 rev47, patched all the way from SPS02 rev23.

Resolving the Issue

How did I resolve the issue? I simply re-ran the catalog deletion that was already running after each backup.
I was able to use the backup ID from the YOUNGEST_BACKUP_ID column to reduce the backup records.

In the SYSTEMDB:

BACKUP CATALOG DELETE ALL BEFORE BACKUP_ID xxxxxxxx

Then for each TenantDB (still in the SYSTEMDB):

BACKUP CATALOG DELETE FOR <TENANTDB> ALL BEFORE BACKUP_ID xxxxxxxx
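
If you have several tenants to clean up, the statements can be generated rather than typed. This is a small Python sketch that only builds the two statement forms shown above as text. The function name is hypothetical and it does not connect to HANA or execute anything.

```python
import re

# Conservative check for a plain database name (an assumption, adjust as needed).
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def catalog_delete_statement(backup_id, tenant=None):
    """Build the BACKUP CATALOG DELETE statement for SYSTEMDB or a tenant."""
    backup_id = int(backup_id)  # raises ValueError for non-numeric strings
    if backup_id <= 0:
        raise ValueError("backup_id must be positive")
    if tenant is None:
        return f"BACKUP CATALOG DELETE ALL BEFORE BACKUP_ID {backup_id}"
    if not _IDENTIFIER.match(tenant):
        raise ValueError(f"unexpected tenant name: {tenant!r}")
    return f"BACKUP CATALOG DELETE FOR {tenant} ALL BEFORE BACKUP_ID {backup_id}"
```

Feed it the YOUNGEST_BACKUP_ID values from the query output, review the generated statements, then run them from the SYSTEMDB yourself.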

At the end of the first DELETE execution *in the first Tenant*, I re-ran the initial SQL query to check and this was the output:

We now only have 1 backup record, which was the youngest record in that age group for that first tenant database (compare to screenshot of first execution of the SQL query with backup id 1,590,747,286,179).
Crucially, we have far fewer log backups for that tenant. We’ve gone down from 2247 to 495.
Nice!
I then progressed to do the delete in the SYSTEMDB and other TenantDB of this HANA system.

Checking the Results

As a final check, I was able to compare the log backup file sizes:

The catalog backup in file “log_backup_0_0_0_0.nnnnnnn” at 09:16 is before the cleanup and is 2.7GB in size.
Whereas the catalog backup in “log_backup_0_0_0_0.nnnnnnn” at 09:29 is after the cleanup and is only 76KB in size.
An absolutely massive reduction!

How do we know that file “log_backup_0_0_0_0.nnnnnnn” is a catalog backup?
Because we can use the Linux “strings” command to see the printable strings inside the file.
Way further down the listing it says it is a catalog backup, but I thought it was more interesting to see the “MAGIC” of Berlin:
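
If “strings” is not available, a rough stand-in is easy to sketch in Python. This is my own approximation of what the command does (runs of printable ASCII of at least 4 characters), and I have deliberately not hard-coded any marker text, since what to search for depends on what you see in your own files.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII at least `min_len` long, like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def strings_in_file(path, max_bytes=1_000_000, min_len=4):
    """Scan only the first `max_bytes` of a file, to avoid reading gigabytes."""
    with open(path, "rb") as handle:
        return printable_strings(handle.read(max_bytes), min_len)
```

Increase max_bytes if the text you are looking for sits further into the file.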

UPDATE: August 2020 – SAP note 2962726 has been released which contains some standard SQL to help remove failed backup entries from the catalog.

Summary

  • Check your HANA backup catalog backup sizes.
  • Ensure you have alerting on file systems (if doing backups to disk).
  • Double check the backup catalog record age.
  • Give tons of freebies and thanks to your DBAs on DBA Appreciation Day!

Useful Links

Enable and Disable Automatic Log Backup
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/241c0f0020b2492fb93a69a40b1b1b9a.html

Accumulated Backups of the Backup Catalog
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/3def15378b954aac85f2b93bb3f85a49.html

Log Modes
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/c486a0a3bb571014ab46c0633224f02f.html

Consolidated Log Backups
https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.05/en-US/653b5c6d5f9d41808011a5bd0fac6709.html