This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Simple in-Cloud SAP LaMa DR Setup

When running the SAP Landscape Management tool (LaMa) in the cloud, you need to be aware of the tool’s importance in your SAP landscape in the context of disaster recovery (DR).

In this post I will highlight the DR strategies for hosting SAP LaMa with your favourite cloud provider.

What is SAP LaMa?

For those not yet accustomed to SAP LaMa, it is SAP’s complete SAP/non-SAP landscape management and orchestration tool for both on-premise and cloud.

SAP LaMa comes in two guises:

  • Standard Edition
  • Enterprise Edition

The Enterprise edition comes with many additional features, but crucially, it includes the “Cloud Connectors” for all the mainstream cloud vendors.
A “Cloud Connector” allows seamless start/stop/provisioning of cloud hosted VMs.

Using SAP LaMa to execute a pre-configured, ordered startup of VMs and the applications on those VMs can be a huge time saving during a disaster.

What Installation Patterns Can We Use with SAP LaMa?

SAP LaMa is a software component installed inside a standard SAP Netweaver Java stack. Therefore, you may use the standard Netweaver Java installation patterns such as single-system or distributed system.
SAP LaMa will work in either pattern.

What is a Normal Installation Pattern in the Cloud?

In the cloud (e.g. Azure, GCP, AWS), when installing SAP Netweaver, you would usually want to use the distributed system architecture pattern, to prevent a single VM outage from disrupting the SAP Netweaver application too much. The distributed system pattern is preferred because you have slightly less control over the patching of the physical host systems, so it affords you that little bit of extra up-time.

This usually means having: a Web Dispatcher tier, at least 2 application servers in the application tier, the Central Services (SCS) instance having failover and using Enqueue Replication Server (ERS), plus database replication technology on the database tier.


How is DR catered for in SAP LaMa?

For large organisations with business critical SAP systems like SAP S/4HANA, SAP ECC etc, you would usually have a “hot” DR database server (i.e. running and actively replicating from the primary database) in your designated DR cloud region.
This means there is minimal data loss, as the DR database is mere minutes behind the primary database in transactional consistency.
The application tier and Web Dispatcher tier would use the cloud provider’s VM replication technology (e.g. in Azure this is called Azure Site Recovery), ensuring that the application patching and config is also replicated.

I would designate the above pattern as a “hot” DR architecture pattern.

For SAP LaMa the situation is slightly more flexible because:

  1. It is not business critical, only operations critical.
  2. The database is only a repository for configuration and monitoring data. Therefore, transactional data loss is not critical.
    In fact, the configuration data in SAP LaMa can be exported into a single XML file and re-imported into another LaMa system.

Due to the above, we have some different options that we can explore for Disaster Recovery.
Excluding the “hot” DR architecture pattern, we could classify the DR architecture pattern options for SAP LaMa as “restore”, “cold”, “cool” and finally “warm”. (These are my own designators, you can call them what you like really).

What is a “restore” DR pattern for SAP LaMa?

A “restore” DR setup for SAP LaMa, is when you have no pre-existing VM in your cloud DR region. Instead you are replicating your VM level backups into a geo-replicated storage service (in Azure this is the Recovery Services vault used by Azure Backup).

In this setup, during a DR scenario, the VM backups from your primary region would need to be accessible to restore to a newly built VM in the DR region.

This is the most cost-friendly option, but there is a significant disadvantage here. Your system administrators will not have the benefit of LaMa to see the current state of the landscape and they will not be able to make use of the start/stop technology.

Instead they will need a detailed DR runbook with start/stop commands and system/VM startup priority, to be able to start your critical systems in a DR scenario. You are also placing your trust in the VM backup and restore capability to get LaMa back online.
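
As a rough illustration, part of that runbook could be scripted with the cloud provider CLI so the start order is at least repeatable. The resource group, VM names and instance number below are made-up placeholders, not a definitive script:

#!/bin/bash
# Hypothetical runbook fragment: start the DR VMs in priority order (database,
# central services, application servers, web dispatcher).
RG="rg-sap-dr"
for VM in vm-prd-db vm-prd-ascs vm-prd-app01 vm-prd-wd; do
  az vm start --resource-group "$RG" --name "$VM"
done
# Then, on each host as the <sid>adm user, start the SAP instances, e.g.:
# sapcontrol -nr 00 -function StartSystem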

The VM backup timing could actually be an issue depending on the state of the running database at the time of backup. Therefore, you may need to also replicate and restore the database backup itself.

During a DR scenario, the pressure will be immense and time will be short.

Cost: $
Effort: !!!! (mainly all during DR)
Bonus: 0

What is a “cold” DR pattern for SAP LaMa?

A “cold” DR setup for SAP LaMa, is when you have a duplicate SAP LaMa system that is installed in the DR cloud region, but the duplicate system is completely shutdown, including the VM(s).

In this setup, during a DR scenario, the VM would need to be started using the cloud provider tools (or other method) and then the SAP LaMa system would be started.

Once running, the latest backup of the LaMa configuration would need restoring (it’s an XML file) and the cloud connectors would need connecting to the cloud provider. After connecting to the cloud provider, LaMa can then be used to start/provision the other software components of the SAP landscape, into the DR cloud region.
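
To give a feel for those first steps, something like the following would be involved; the resource group, VM name and instance number are placeholders and this is only a sketch:

# Start the shut-down DR LaMa VM using the cloud provider tools (Azure CLI here).
az vm start --resource-group rg-sap-dr --name vm-lama-dr

# Then, on the VM as the <sid>adm user, start the Netweaver Java stack hosting LaMa
# and check that all instances report GREEN before importing the configuration.
sapcontrol -nr 00 -function StartSystem
sapcontrol -nr 00 -function GetSystemInstanceList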

Compared to the “restore” pattern, we can have our DR LaMa system up and running and start using it to start the VMs and applications in a pre-defined DR operation template (like a runbook).
However, we need a process in place to export the configuration from the primary LaMa system and back up that export, so that the configuration file is available during a DR scenario.

In Azure, for example, we would store the configuration file export on a geo-replicated file storage service that is accessible from multiple regions. We also have the associated hosting costs and the required patching/maintenance of the DR VM and LaMa system. As an added bonus, this pattern allows us to apply patches first to the DR LaMa system, which could remove the need for a Development LaMa system.
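
A minimal sketch of that export/backup step, assuming the XML has already been exported from the primary LaMa system, the storage account is geo-redundant (RA-GRS) and the CLI session has rights on it; all names below are placeholders:

# Hypothetical nightly job: copy the latest LaMa configuration export to a
# geo-redundant storage account so it can be read from the DR region.
EXPORT_FILE=/backup/lama/lama_config_$(date +%Y%m%d).xml

az storage blob upload \
  --account-name stlamaconfig \
  --container-name lama-config \
  --name "$(basename "$EXPORT_FILE")" \
  --file "$EXPORT_FILE" \
  --auth-mode login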

Cost: $$
Effort: !!! (some during DR, patching)
Bonus: +

What is a “cool” DR pattern for SAP LaMa?

A “cool” DR setup for SAP LaMa, is when you have a duplicate SAP LaMa system that is installed in the DR cloud region, and the duplicate system is frequently started (maybe daily) and the configuration synchronised with the primary SAP LaMa system.

The synchronisation could be using the in-built configuration synchronisation of the LaMa software layer, or it could be a simple automated configuration file import from a shared file location where the configuration file has previously been exported from the primary LaMa system.

In this setup, during a DR scenario, the VM *may* need to be started (depending on when the failure happens), using the cloud provider tools (or other method) and then the SAP LaMa system *may* need to be started. Once running, the latest backup of the LaMa configuration would probably not need restoring (it’s an XML file), because your business critical systems would already exist and be configured in LaMa as a result of the frequent synchronisation. The cloud connectors would need connecting to the cloud provider.
After connecting to the cloud provider, LaMa can then be used to start/provision the other software components of the SAP landscape, into the DR cloud region.

Compared to the “cold” pattern, we save a little time because the frequent configuration synchronisation is already done. We can still keep a process in place to export and back up the configuration from the primary LaMa system, should we also want to use that configuration file.
There is an obvious cost to the frequent starting of the VM, since you pay for the time the VM is running.
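
One way to keep that cost predictable is to schedule the start and stop. A minimal sketch using cron and the Azure CLI, assuming an ops host with an authenticated CLI session and placeholder names:

# Start the DR LaMa VM each morning so it can synchronise its configuration...
0 6 * * * az vm start --resource-group rg-sap-dr --name vm-lama-dr
# ...and deallocate it again later to stop the compute charges.
0 8 * * * az vm deallocate --resource-group rg-sap-dr --name vm-lama-dr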

As an added bonus, this pattern allows us to apply patches first to the DR LaMa system, which could remove the need for a Development LaMa system.

Cost: $$$
Effort: !! (a little during DR, patching)
Bonus: +

What is a “warm” DR pattern for SAP LaMa?

A “warm” DR setup for SAP LaMa, is when you have a duplicate SAP LaMa system that is installed in the DR cloud region, and the duplicate system is constantly running with frequent (could be hourly) synchronisation with the primary SAP LaMa system.
The synchronisation could be using the in-built configuration synchronisation in the LaMa software component, or it could be a simple automated file import from a shared file location where the configuration file has been exported from the primary LaMa system.

In this setup, during a DR scenario, the cloud connectors would need connecting to the cloud provider. After connecting to the cloud provider, LaMa can then be used to start/provision the other software components of the SAP landscape, into the DR cloud region.

Like the “cool” pattern, we get an added bonus that this pattern allows us to apply patches first to the DR LaMa system, which could remove the need for a Development LaMa system.

Compared to the other patterns, we gain the immediate advantage of being able to start/stop VMs and SAP systems in the DR region. However, there is a constant cost for the VM to be running (if using a PAYG VM pricing model).

Cost: $$$$
Effort: ! (hardly any during DR, patching)
Bonus: +

Summary

Depending on your strategy, you may choose to stick to your existing architecture patterns.

You could choose to use a “hot” DR pattern, and ensure that your DR LaMa system is in sync with the primary.
However, for the most risk averse, I would be inclined to calculate the costs/benefits for the “warm” pattern.
A “warm” pattern also means you could forgo the distributed system installation pattern for the DR system, choosing the more cost-effective single-system pattern instead and removing the extra complexity of database-level replication.

For SMEs, I would favour the “cool” pattern. This could remove the need for a Development system, allowing testing of patching on the DR system instead. I feel it represents the middle ground between using the technology vs cost.

Complications of using SAP ASE 16.0 in a HADR pair plus DR node Setup

Firstly, we need to clarify that HADR, in SAP ASE speak, is the SAP ASE feature-set name for a HA or DR setup consisting of 2 SAP ASE database instances with a defined replication mode.

The pair can be used either for HA or for DR, but rarely both, due to the problem of latency.
Latency pulls in the opposite direction to DR: from a DR perspective, the further away your second datacentre, the better.
Conversely, the further away it is, the worse your latency becomes, meaning the pair can only seriously be used for DR, and not for HA.

If you can find a sweet spot between distance (better for DR) and latency (better for HA), then you would have a HADR setup. But this is unlikely.

As of ASE 16 SP03, an additional DR node is supported to be incorporated into a HADR pair of ASE database instances.
This produces a 3 node setup, with 2 nodes forming a pair (designed to be for HA), then a remote 3rd node (designed for DR).
The reason you may consider such a setup is to provide HA between the two nodes, maybe within an existing datacentre, then DR is provided by a remote 3rd node.
Since the two nodes within the HA pair would likely have low latency, they would have one replication mode (e.g. synchronous replication) keeping the data better protected, with the replication mode to the third database being asynchronous, for higher latency scenarios, but less protected data.

In the scenarios and descriptions below, we are highlighting the possibility of running a two node HADR pair plus DR node in public cloud using a paired region.

Whilst an SAP application layer is also supported on the 3 node setup, there are complications that should be understood prior to implementation.
These complications will drive up both cost of implementation and also administrative overhead, so you should ensure that you fully understand how the setup will work before embarking on this solution.

Setup Process:

We will briefly describe the process for setting up the 3 nodes.
In this setup we will use the remote, co-located replication server setup, whereby the SAP SRS (replication server) is installed onto the same servers as the ASE database instances.

1, Install primary ASE database instance.

2, Install Data Movement (DM) component into the binary software installation of the primary ASE database instance.

3, Install secondary ASE database instance.

4, Install Data Movement (DM) component into the binary software installation of the secondary ASE database instance.

5, Run the setuphadr utility to configure the replication between primary and secondary.

This step involves the materialisation of the master and <SID> databases. The master database materialisation is automatic, the <SID> database is manual and requires dump & load.

Therefore, if you have a large <SID> database, then materialisation can take a while.

6, Install tertiary ASE database instance.

7, Install Data Movement (DM) component into the binary software installation of the tertiary ASE database instance.

8, Run the setuphadr utility to configure the tertiary ASE instance as a DR node.

This step involves the materialisation of the master and <SID> databases. The master database materialisation is automatic, the <SID> database is manual and requires dump & load.
Therefore, if you have a large <SID> database, then materialisation can take a while.

In the above, you can adjust the replication mode between primary and secondary, depending on your latency.
In public cloud (Microsoft Azure), we found that the latency between paired regions was perfectly fine for asynchronous replication mode.
This also permitted the RPO to be met, so we actually went asynchronous all the way through.

POINT 1:

Based on the above, we have our first point to make.

When doing the dump & load for the tertiary database, both master and <SID> databases are taken from the primary database, which in most cases will be in a different datacentre, so materialisation of a large <SID> database will take longer than the secondary database materialisation timings.

You will need to develop a process for quickly getting the dump across the network to the tertiary database node (before the transaction log fills up on the primary).

Developing this fast materialisation process is crucial to the operation of the 3 node setup, since you will be doing this step a lot.
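
The sketch below shows the general shape of such a process; the logins, server names, paths and compression level are placeholders, and setuphadr tells you exactly when to perform the dump and the load:

# On the primary node: compressed dump of the <SID> database (here called PRD);
# use a compression level that your ASE release supports.
isql -U sapsa -P '<password>' -S PRD <<EOF
dump database PRD to '/dumps/PRD_mat.dmp' with compression = 101
go
EOF

# Get the dump across the network to the DR node as quickly as the link allows.
scp /dumps/PRD_mat.dmp dr-node:/dumps/

# On the DR node: load the dump and bring the database online when setuphadr asks for it.
isql -U sapsa -P '<password>' -S PRD_DR <<EOF
load database PRD from '/dumps/PRD_mat.dmp'
go
online database PRD
go
EOF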

Operational Process:

We now have a 3 node setup, with replication happily pushing database transactions from the primary (they go from the Replication Agent within the primary ASE instance) to the SRS on the secondary ASE node.
The SRS on the secondary instance then pushes the transactions into the secondary ASE instance databases (master & <SID>) and also to the SRS on the tertiary ASE database instance.

While this is working, you can see the usual SRS output by connecting into the SRS DR Agent on the secondary node and issuing the “sap_status path” command.
The usual monitoring functions exist for monitoring the 3 ASE nodes. You can use the DBACockpit (DB02) in a Netweaver ABAP stack, the ASE Fault Manager or manually at the command line.
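
For example, something along these lines; the DR_admin login and the RMA port are whatever you defined during setuphadr, so treat this as a sketch:

# Connect to the DR agent (RMA) on the secondary node and check the replication paths.
isql -U DR_admin -P '<password>' -S secondary-host:4909 -w 999 <<EOF
sap_status path
go
EOF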

One of the critical processes with an ASE HADR setup is the flow of transactions from the primary. You will be constantly engaged in trying to prevent a backlog of transactions, which could cause the primary to halt database commits until transaction log space is freed.
By correctly sizing the whole chain (primary, secondary and tertiary transaction logs) plus sizing the inbound queues of the SRS, you should have little work to do on a daily basis.
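
A couple of hedged spot checks that can feed a daily report; server names, ports and logins below are placeholders:

# On the primary ASE instance: how full is the <SID> transaction log?
isql -U sapsa -P '<password>' -S PRD <<EOF
use PRD
go
sp_spaceused syslogs
go
EOF

# Against the active SRS (on the secondary node): how full are the stable queue partitions?
isql -U sa -P '<password>' -S secondary-host:4905 <<EOF
admin disk_space
go
EOF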

POINT 2:

It’s not the daily monitoring that will have the biggest impact, but the exceptional change scenarios.
As an example, all 3 ASE database instances should have the same database device sizes, transaction log sizes and configuration settings.
Remembering to increase the device, database, transaction log and queue sizes on each of them can be arduous, and mistakes can be made.
Putting a solid change process around the database and SRS is very important to avoid primary database outages.
Since all 3 databases are independent, you can’t rely on auto-growby to grow the devices and databases in sync, so you may need to consider manually increasing the device and database sizes.
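
For example, extending the log has to be repeated against the primary, secondary and tertiary instances in turn; device name, size and login below are placeholders and this is only a sketch:

# Grow the log device, then extend the database log on it by the same amount.
isql -U sapsa -P '<password>' -S PRD <<EOF
disk resize name = 'PRD_log_001', size = '10G'
go
alter database PRD log on PRD_log_001 = '10G'
go
EOF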

Failover Process:

During a failover, the team need to be trained in the scenario of recovery of the data to whichever database server node is active/available/healthy.
The exact scenario training could be difficult as it may involve public cloud, in which case it may not be possible to accurately simulate.
For the 3 node SAP ASE HADR + DR node, the failure scenario that you experience could make a big difference to your overall recovery time.

When we mention recovery time, we are not just talking about RPO/RTO for getting production systems working, we are talking about the time to actually recover the service to a protected state.
For example, recovery of the production database to a point where it is once again adequately protected from failure through database replication.

Loss of the primary database in a 3 node setup, means that the secondary node is the choice to become primary.
In this scenario, the secondary SRS is no longer used. Instead the SRS on the DR node would be configured to be the recipient of transactions from the Replication Agent of the secondary ASE.
If done quickly enough, then re-materialisation of the tertiary database can be avoided as both secondary and tertiary should have the same point-in-time.
In practice however, you will find more often than not, that you are just re-materialising the DR node from the secondary.
In some cases, you may decide not to bother until the original primary is back in action. The effort is just too much.

Loss of the secondary database in a 3 node setup, means that the primary becomes instantly unprotected!
Both the secondary node and the tertiary node will drift out of sync.
In this scenario, you will more than likely find that you will be pushed for time and need to tear down the replication on the primary database to prevent the primary transaction log filling.

Loss of the tertiary database in a 3 node setup, means that you no longer have DR protection for your data!
The transaction log on the primary will start to fill because secondary SRS will be unable to commit transactions in the queue to the tertiary database.
In this scenario, you will more than likely find that you will be pushed for time and need to re-materialise the DR database from the primary.
Time will be of the essence, because you will need transaction log space available in the primary database and queue space in the SRS, for the time it takes to perform the re-materialisation.

POINT 3:

Sizing of the production transaction log size is crucial.
The same size is needed on the secondary and tertiary databases, to allow materialisation (dump & load) to work.
The SRS queue size also needs to be a hefty size (bigger than the transaction log size) to accommodate the transactions from the transaction log.
The primary transaction log size is no longer just about daily database transactional throughput, but is also intertwined with the time it takes to dump & load the database across the network to the DR node (the slowest link in the chain).
Plus, on top of the above sizings, you should accommodate some additional buffer space for added delays, troubleshooting and decision making.

You should understand your dump & load timings intricately to be able to understand your actual time to return production to a protected state. This will help you decide which is the best route to that state.

Maintenance Process:

Patching a two node ASE HADR setup, is fairly simple and doesn’t take too much effort in planning.
Patching a three node setup (HADR + DR node), involves a little more thought due to the complex way you are recommended to patch.
The basics of the process are that you should be patching the inactive portions of the HADR + DR setup.
Therefore, you end up partially patching the ASE binary stack, leaving the currently active primary SRS (on the secondary node) until last.
As well as patching the ASE binaries, you will also have to patch the SAP Hostagent on each of the three nodes, especially since the Hostagent is used to perform the ASE patching process.
Since there is also a SAP instance agent present on each database node, you will also need to patch the SAP Kernel (SAPEXE part only) on each database node.
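
As a reminder of the repetition involved, the Hostagent upgrade alone has to be run on all three nodes; the SAR path below is a placeholder for whichever patch level you have downloaded:

# On each of the three database nodes, as root:
cd /usr/sap/hostctrl/exe
./saphostexec -upgrade -archive /stage/SAPHOSTAGENT.SAR
./saphostexec -version    # confirm the new patch level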

POINT 4:

Database patching & maintenance effort increases with each node added. Since the secondary and DR nodes have a shared nothing architecture, you patch specific items more than once across the three nodes.

Summary:

The complexity of managing a two node SAP ASE HADR pair plus DR node should not be underestimated.
You can gain the ability to have HA and DR, especially in a public cloud scenario, but you will pay a heavy price in overhead from maintenance and potentially lose time during a real DR due to the complexity.
It really does depend on how rigid you can be at defining your failover processes and most importantly, testing them.

Carefully consider the cost of HA and DR, versus just DR (using a two node HADR setup with the same asynchronous replication mode).
Do you really need HA? Is your latency small enough to permit a small amount of time running across regions (in public cloud)?

Recovery From: Operation start is not allowed on VM since the VM is generalized – Linux

Scenario: In Azure you had a Linux virtual machine.  In the Azure portal you clicked the “Capture” button in the Portal view of your Linux virtual machine, now you are unable to start the virtual machine as you get the Azure error: “Operation ‘start’ is not allowed on VM ‘abcd’ since the VM is generalized.“.

What this error/information prompt is telling you, is that the “Capture” button actually creates a generic image of your virtual machine, which means it is effectively a template that can be used to create a new VM.
The process that is applied to your original VM modifies it in such a way that it is now unable to boot up normally.  The process is called “sysprep” on Windows; for Linux VMs the equivalent deprovisioning is done by the Azure agent (waagent).

Can you recover your original VM?  No.  It’s not possible to recover it properly using the Azure Portal capabilities.  You could do it if you downloaded the disk image, but there’s no need.
Plus, there is no telling what changes have been made to the O/S that might affect your applications that have been installed.

It’s possible for you to create a new VM from your captured image, or even to use your old VM’s O/S disk to create a new VM.
However, both of the above mean you will have a new VM.  Like I said, who knows what changes could have been introduced from the sysprep process.  Maybe it’s better to rebuild…

Because the disk files are still present you can rescue your data and look at the original O/S disk files.
Here’s how I did it.

I’m starting from the point of holding my head in my hands after clicking “Capture“!
The next steps I took were:

– Delete your original VM (just the VM).  The disk files will remain, but at least you can create a new VM of the same name (I liked the original name).

– Create a new Linux VM, same as you did for the one you’ve just lost.
Use the same install image if possible.

– Within the properties of your new VM, go to the “Disks” area.

– Click to add a new data disk.
We will then be able to attach the existing O/S disk to the virtual machine (you will need to find it in the list).
You can add other data disks from the old VM if you need to.
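
If you prefer the CLI for this step, something like the following should do the same job (resource group, VM and disk names here are made up):

az vm disk attach --resource-group my-rg --vm-name my-new-vm --name old-vm-os-disk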

Once your disks are attached to your new Linux VM, you just need to mount them up.
For my specific scenario, I could see that the root partition “/” on my new Linux VM, was of type “ext4” (check the output of ‘df -h’ command).
This means that my old VM’s root partition format would have also been ext4.
Therefore I just needed to find and mount the new disk in the O/S of my new VM.

As root on the new Linux VM find the last disk device added:

# ls -ltr /dev/sd*

The last line is your old VM disk.  Mine was device /dev/sdc and specifically, I needed partition 2 (the root partition), so I would choose /dev/sdc2.
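
Optionally, as root you can double check the partition layout and filesystem type of the newly attached disk before mounting it:

# lsblk -f /dev/sdc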

Mount the disk:

# mkdir /old_vm
# mount -t ext4 /dev/sdc2 /old_vm

I could then access the disk and copy any required files/settings:

# cd /old_vm

Once completed, I unmounted the old O/S disk in the new Linux VM:

# umount /old_vm

Then, back in the Azure Portal in the disks area of the new VM (in Edit mode), I detached the old disk.

Once those disks are not owned by a VM anymore (you can see in the properties for the specific disk), then it’s safe to delete them.

Disaster Recovery – Where To Start

Perhaps IT implementors think that Disaster Recovery is a dark art, or maybe it’s just another item to de-scope from the implementation project due to time restrictions or budget cuts.
But I find that it’s coming up more and more as part of the yearly financial system audits (the ones we love to hate!).
Not only is it an action on the main financial systems, but the auditors are training their sights on the ancillary systems too.

So where do I normally start?
Usually it’s not a case of where I would like to start, it’s where I *have* to start due to missing system documentation and configuration information.
Being a DBA, I’m responsible for ensuring the continuity of business with respect to the database system and the preservation of the data.
So no matter what previous work has been completed to provide a DR capability, or a HA capability (the two are so often confused), it’s useless without any documentation or analysis to show what the tolerances are: how much data you can afford to lose (RPO) and how much time it will take to get a system back (RTO).

First things first, ensure that the business recognises that DR is not just an IT task.
You should point them at the host of web pages devoted to creating the perfect Business Continuity Plan (BCP).
These state that a BCP should include what *the business* intends to do in a disaster.  This more often than not involves a slight re-work of the standard business processes.  Whether this means going to a paper-based temporary system or just using a cut-down process, the business needs a plan.
It’s not all about Recovery Point Objective and Recovery Time Objective!
A nice “swim lane” diagram will help a lot when picturing the business process.
You should start at the beginning of your business process (normally someone wants to buy something), then work through to the end (your company gets paid).  Add all the steps in between and overlay the systems used.  Imagine if those systems were not present, how could you work around them.

Secondly, you need to evaluate the risks among the systems you support.  Identify points of failure.
To be able to do this effectively, you will need to know your systems inside out.
You should be aware of *all* dependencies (e.g. interfaces) especially with the more complex business systems like ERP, BPM, BI, middleware etc.
Start by creating a baseline of your existing configuration:
– What applications do you have?
– What versions are they?
– How are they linked (interfaces)?
– Who uses them (business areas)?
– What type of business data do they store (do they store data at all)?

Third, you should make yourself aware of the hardware capabilities of important things like your storage sub-system (SAN, NAS etc).  Can you leverage these devices to make yourself better protected/prepared?
An overall architecture diagram speaks a thousand words.

Once you understand what you have got, only then can you start to formulate a plan for the worst.
Decide what “the worst” actually is!  I’ve worked for companies that are convinced an aeroplane would decimate the data centre at any moment.  All the prep-work in the world couldn’t prevent it from happening, but maybe you can save the data (if not yourself 😉 ).