This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

SAP ASE HADR Overview – Part5

In this multi-part post, I’m explaining the basics behind how SAP Replication Server works when replicating from a SAP ASE database to a SAP ASE database as part of an HADR (ASE always-on) setup for a SAP Business Suite system.
The post will be based on SAP ASE (Adaptive Server Enterprise) 16.0 HADR with SAP Replication Server (SRS) 16.0.

In Part 1 we started with:

  • What is SRS.
  • The basic premise of HADR with SRS.
  • What a transaction is.

In Part 2 we went on to discuss:

  • What is the ASE transaction log.
  • Which databases are replicated.
  • How do transactions move to the SRS.

In Part 3 we covered:

  • What is the Active SRS.
  • What are the Key Parts of SRS.
  • Replication Synchronisation Modes
  • Open Transaction Handling
  • What are the Implications of Replication Delay

In Part 4 we stepped through the replication of Bob’s data change and saw how the transactional data was replicated first to the SRS and eventually to the companion (secondary) database.

This is the last part of my ASE 16.0 HADR mini-series, and in this final part I will discuss possible issues that can impact an ASE 16.0 HADR system, which might be useful when planning operational acceptance testing.

Companion ASE Unavailable

When the companion (secondary) database is unavailable for whatever reason (undergoing maintenance, it’s broken or for other reasons), then the replicated transactions will still continue to move through the SRS until they get to the outbound queue (OBQ).
The assumption is that the active SRS (on same server as the companion DB) is still up and working.

In the OBQ the transactions will wait until the companion is available again.
As soon as the companion is available, the transactions will move from the OBQ into the companion.
The primary database will be unaffected during this time and transactions will continue through the SRS until the OBQ becomes full.

If the OBQ fills up, then transactions will start to accumulate in the inbound queue (IBQ).

If the companion database is down for a long period of time, you may need to make a decision:

  • Increase the stable queue partition space to hold more transactions.
    With the hope that the companion can be brought back online.
  • Disable replication, removing the Secondary Truncation Point (STP) from the primary database and acknowledging that the companion will need re-materialisation to bring it back in-sync with primary.

Inbound Queue Full

When the inbound queue becomes full, the transactions will start to build up into the simple persistent queue (SPQ).
You should note that the IBQ, by default, is only allowed to use up to 70% of the stable queue partition size. The rest is for the OBQ. So “full” is not actually 100% of the queue space.

There can be two common reasons for the IBQ to fill:

  1. The OBQ is also full due to an issue with the connection to the companion ASE database, or the companion is unavailable.
    or
  2. There is an open transaction in the IBQ and the SRS is waiting for the “commit” or “rollback” command to come through from the SPQ for the open transaction.

To resolve the full IBQ, you are going to need to establish which of the two issues is occurring.
An easy way to do this is to check the OBQ fill level.
If transactions are moving from the OBQ to the companion, then the issue is an open transaction.

If an open transaction has caused the IBQ to fill, then the “commit” or “rollback” transaction could now be stuck in the SPQ. Since there is no space in the IBQ, the SRS is also unable to process the SPQ records, which leaves the IBQ open transaction in a stale-mate situation.
You will need to make a decision:

  • Add more space to the stable queues to increase the IBQ size.
    or
  • Increase the proportion of stable queue size that the IBQ can use (if OBQ is empty).
    or
  • Zap (remove) the open transaction from the IBQ (will mean data-loss on companion so a rematerialise may be needed).

Normally, you can just add more space by adding another partition to the stable queues, hopefully resolve the issue, then remove the extra space again. How much is needed? Nobody will know.
However, if you have to zap the open transaction, then make sure you dump the queue contents out first, so you can see what DML was open, you can then make a decision on how the missing transaction will affect the companion database integrity (could negate the need for rematerialisation).

During this problematic period, the SPQ has remained functional, which has meant that the primary database has been able to continue to send transactions to the active SRS and therefore allowed it to continue to commit data in a timely manner. The primary database will have no issues.

Simple Persistent Queue Full

This is probably the most serious of the scenarios.
Once the SPQ becomes full, it immediately impacts the Replication Agent running in the primary ASE database.

Transactions are unable to move from the primary database transaction log to the active SRS.
You will start to get urgent warnings in the primary ASE database error log, informing you that the SPQ is full and that the buffers in the primary Replication Agent are full.

You will also see that the Replication Agent will be producing error messages and informing you that it has switched from synchronous to asynchronous replication mode.

The integrity of your production ASE database is now at risk!

It is possible you can add more space to the SPQ if you think you can resolve the problem in the IBQ and have the time to wait for the IBQ to empty!

You should note that this scenario has the symptoms if the active SRS is not working at all. If the SPQ is not available, then you need to troubleshoot the SRS error log. It’s also possible you may have issues with the Replication Agent in the primary ASE.

Primary Transaction Log Full

If your active SRS has now filled up to the SPQ, your primary ASE database is now at serious risk.
With the Replication Agent unable to move the Secondary Truncation Point (STP) then the transaction log of the primary ASE will start to fill up.
You will not be able to release the segments, even with a transaction log dump, because they are still needed by the Replication Agent.

You have the following options available:

  • Add more space to the transaction log by adding a new device (recommended, instead of expanding the existing device).
    This will give you more time to maybe fix the SRS.
    or
  • Disable replication, which removes the STP and allows the transaction log to be dumped. A rematerialise of the companion will be needed at a later date.
    or
  • If the transaction log is already full, then even trying to disable replication will not work (no transactions will be permitted).
    You will have to use DBCC to remove (ignore) the “LTM” (Last Truncation Marker), which will have a similar outcome to disabling replication.

At this point, if all else fails and your transaction log is full DO NOT RESTART THE PRIMARY ASE!
If you restart the primary ASE with a full transaction log, then you will be in a world of pain trying to get it to start up again.
You need to stop all processing (it won’t be working anyway), then try and resolve the issue.

Summary

This concludes the 5 part mini-series on SAP ASE 16.0 HADR.
I hope it has given you enough of an overview to be able to explain a typical setup, the way replication occurs and the most common issues that put your production ASE database at risk.

Feel free to contact me with any questions and feedback.


5 thoughts on SAP ASE HADR Overview – Part5

  1. Hello Darryl,

    It’s an excellent blog. Thank you very much for the detailed explanation of how ASE replication works. I am new to ASE and this helps me a lot.

    1. Thanks Vihari.
      There is a lot of material to read as SRS is a very complex product.

      All the best

      Darryl

    1. Hi Raji,

      No FM is optional.
      The FM is useful for automated fault detection and failover of HADR.

      Darryl

  2. Hi, what is the difference between outbound queue backlog and outbound queue truncation backlog on the ? Which is more critical between those?

Add Your Comment

* Indicates Required Field

Your email address will not be published.

*