In this multi-part post, I’m explaining the basics behind how SAP Replication Server works when replicating from a SAP ASE database to a SAP ASE database as part of an HADR (ASE always-on) setup for a SAP Business Suite system.
The post will be based on SAP ASE (Adaptive Server Enterprise) 16.0 HADR with SAP Replication Server (SRS) 16.0.
In Part 1 we started with:
- What is SRS.
- The basic premise of HADR with SRS.
- What a transaction is.
In Part 2 we went on to discuss:
- What is the ASE transaction log.
- Which databases are replicated.
- How do transactions move to the SRS.
In this part we discuss the role of the active SRS and how the internal processing moves the transactions into the secondary ASE database.
What is the Active SRS?
In a standard HADR setup with primary and secondary databases, there are two instances of SRS.
- The inactive SRS is running on the primary database node.
- The active SRS is running on the secondary database node.
The active SRS receives the transactions (commands) from the Replication Agents in the primary databases.
To those with DB2 or Oracle experience, this replication hop seems strange at first. On closer inspection it achieves the same desired result: the transactional data is successfully persisted on a server separate from the primary database.
The inactive SRS is unused until a failover occurs.
During a failover the inactive SRS, on the old primary server, can switch replication paths to become the active SRS. Therefore the inactive SRS is the reverse path of replication.
What are the Key Parts of SRS?
In my view, there are 5 key parts to the SRS 16.0 architecture.
At each of these stages the data is persisted.
- The primary database.
Sends transactions from the transaction logs, via the Replication Agent threads, to the active SRS.
- The SRS Simple Persistent Queue (SPQ).
This is a simple set of disk files for persisting the unordered transactions received on the SRS.
In synchronous replication mode, once a replicated transaction is persisted in the SPQ, it is classified as “received” on the secondary, which allows the transaction to commit on the primary database.
A backlog on the SPQ could mean the underlying disk is not fast enough for the replication workload, or that the server hosting the active SRS is suffering CPU/IO saturation.
If you have anti-virus installed, you should treat the SPQ disk just like you treat other database data files (i.e. exclude them from the A/V scanning).
- The SRS Stable Inbound Queue (IBQ).
An ordered queue of the replicated transaction commands, representing both open and completed transactions.
In the case of transactions that are rolled back, the transactions are removed from the IBQ once the rollback command is seen (which needs to come via the SPQ).
There is one IBQ for each primary database, and it only ever holds transactions from that one database, never from another Rep Server.
The SRS internals process the transactions on the IBQ that have a corresponding commit record (needs to come via the SPQ).
These committed transactions are grouped/compacted, ordered and translated into the correct target language for the target database platform and moved to the respective outbound queue for the target databases.
A backlog on the IBQ could mean the SRS internals may not be keeping up with the replication workload.
It is important to re-state that transactions on the IBQ could be open, waiting for a rollback or commit. This means the SPQ needs enough space for the transactions that contain those commands to make it through to the IBQ.
- The SRS Stable Outbound Queue (OBQ).
Committed transactions are moved from the IBQ onto the OBQ.
The OBQ actually shares a portion of the same partition space as the IBQ, so moving between the IBQ and OBQ is very quick.
There is one OBQ for each target database, plus one if the target is another SRS (in scenarios with a third DR node).
- The Target (a.k.a Standby or Secondary or Companion) Databases.
The SRS has a set of distribution threads (DIST) that apply the transactions from the OBQs to the respective target databases via the DSI (Data Server Interface).
NOTE: In my diagrams I’ve positioned the DSI as slightly separate, but it is actually a module/component of the SRS.
For scenarios with a DR node also, the target is the DR Rep Server.
In the target databases you will see that the <SID>_maint user is used to apply the transactions.
A backlog on the OBQ could indicate a problem with the performance of the target database.
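The five stages above can be sketched as a simple pipeline. The following is an illustrative model only, not SRS code: all the class and method names are invented for this sketch, and it simulates just the flow described above (SPQ receives everything, the IBQ holds open transactions, only committed transactions reach the OBQ and the target).

```python
# Simplified simulation of the SRS queue pipeline:
# SPQ -> IBQ (open transactions held) -> OBQ (committed only) -> target.
from collections import defaultdict

class RepServerModel:
    def __init__(self):
        self.spq = []                    # unordered commands, persisted first
        self.ibq = defaultdict(list)     # per-transaction command lists (may be open)
        self.obq = []                    # committed transactions, ready for the DSI
        self.target = []                 # commands applied to the secondary database

    def receive(self, txn_id, command):
        # 1. Persist to the SPQ; in synchronous mode this is the point
        #    at which the primary is allowed to commit.
        self.spq.append((txn_id, command))

    def process_spq(self):
        # 2. Drain the SPQ into the IBQ, ordering commands by transaction.
        while self.spq:
            txn_id, command = self.spq.pop(0)
            if command == "rollback":
                self.ibq.pop(txn_id, None)  # rolled-back txns removed at the IBQ
            elif command == "commit":
                # 3. Only transactions with a commit record move to the OBQ.
                self.obq.append((txn_id, self.ibq.pop(txn_id)))
            else:
                self.ibq[txn_id].append(command)  # open txn: held on the IBQ

    def apply_obq(self):
        # 4. The DSI applies committed transactions to the target database.
        while self.obq:
            txn_id, commands = self.obq.pop(0)
            self.target.extend(commands)

model = RepServerModel()
for cmd in [("t1", "insert row 1"), ("t2", "insert row 2"),
            ("t1", "commit"), ("t2", "rollback")]:
    model.receive(*cmd)
model.process_spq()
model.apply_obq()
print(model.target)  # only t1's work arrives at the secondary
```

Note how the rolled-back transaction t2 consumes SPQ and IBQ space while open, but never reaches the OBQ or the target, matching the IBQ behaviour described above.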
Replication Synchronisation Modes
With ASE HADR there are 3 replication modes available. The different modes affect how transactions are treated.
In asynchronous replication mode (a.k.a “Warm Standby” or “DR”), the Replication Agent threads operate in a lazy way, scanning for and sending transactions to the active SRS in batches when they can.
There is a replication delay in asynchronous mode, but it does mean that the responsiveness of the primary database may seem better to the client applications (SAP Business Suite) because the commit to the primary database does not wait for the Replication Agent to send the transaction to the SRS.
Because of the inherent delay, there is a high possibility of data-loss in a failover scenario.
Even in asynchronous mode, the transaction log of the primary database cannot be freed until transactions are actually committed to the secondary (the secondary truncation point (STP) cannot be moved until transactions are committed on the secondary).
When HADR is configured in synchronous replication mode (a.k.a “Hot Standby” or “HA”), each transaction is immediately sent by the replication agent to the SRS for persisting on disk at the SPQ.
Once safely in the SPQ, the primary database is allowed to commit the transaction.
This means synchronous replication mode has a direct impact on the responsiveness of the primary database transactions, because the commit on primary will be delayed by the Replication Agent + network transfer + memory + I/O latency for persisting to the SPQ of the SRS.
Lastly, near-synchronous mode works the same way as synchronous, but it is designed for slower disks hosting the SPQ on the SRS. This means that the SRS acknowledgement is sent back to the Replication Agent as soon as the transaction is received in memory, but before it is persisted to the SPQ files on disk.
Compared to synchronous mode, near-sync has a slightly higher possibility of data-loss in exchange for a slight reduction in latency.
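The difference between the three modes comes down to where (if anywhere) the primary's commit waits. The following back-of-envelope sketch captures that; the latency figures are invented example numbers, not measurements from a real system.

```python
# Where the primary-side commit waits in each replication mode.
# All latency values are illustrative assumptions.
def commit_delay_ms(mode, network_ms=0.3, spq_memory_ms=0.05, spq_disk_ms=1.2):
    """Extra latency added to a primary-side commit by replication."""
    if mode == "async":
        # Replication Agent sends lazily; the commit does not wait at all.
        return 0.0
    if mode == "near-sync":
        # Ack returned once the transaction is in SRS memory, before disk.
        return network_ms + spq_memory_ms
    if mode == "sync":
        # Ack returned only after the transaction is persisted to the SPQ files.
        return network_ms + spq_memory_ms + spq_disk_ms
    raise ValueError(mode)

for mode in ("async", "near-sync", "sync"):
    print(f"{mode:>9}: +{commit_delay_ms(mode):.2f} ms per commit")
```

The ordering is the point: async adds nothing to the commit (at the cost of data-loss exposure), near-sync removes the SPQ disk write from the critical path, and sync pays the full network-plus-disk round trip for the strongest guarantee.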
Open Transaction Handling
As mentioned previously, transactions are replicated from the primary almost as soon as they are started (as soon as the transaction log records the start of the transaction). This means “open” transactions flow all the way to the IBQ before they are committed.
(See Part 1 for a description of “Open Transactions”)
If transactions were only replicated once they were committed on primary, there would be a large delay before that transaction was safely applied to the secondary database.
When enabled, the “early dispatch” feature (“parallel_dist”) even allows large open transactions to start being applied to the secondary database early (using dedicated SRS threads), once the commit record is seen at the SPQ.
What are the Implications of Replication Delay?
There are two main implications of delay (latency) in replication:
- Potential Data Loss.
Any delay in the replication from primary to SRS introduces potential data-loss.
This is because the aim of replication is to get the transactional data off the primary database as soon as possible and safely onto the active SRS node (server) in another location. Any delay between the active SRS and the secondary database could also mean data-loss, but only if the active SRS itself is lost or corrupted.
The ultimate goal is to get from the primary database to the active SRS as quickly as possible, with the second goal being to get from the primary database to the secondary database as quickly as possible.
This opens up possibilities regarding architecture design and sizing of the secondary databases, which I will cover in another post.
- Response Time.
Any delay in synchronous replication mode can also delay the commit of the transactions on the primary databases. In a Netweaver ABAP stack, this delay would be seen in the SAP Business Suite as a longer database response time in the Netweaver UPDATE work processes, therefore the delay is unlikely to be seen by the users themselves. In a Java stack, there is no such UPDATE work process, so the Java stack is likely to be more sensitive to longer response times.
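To make the response-time point concrete, here is some rough arithmetic. Both numbers are invented purely for illustration (the real per-commit delay depends on your network and SPQ disk, and the number of commits per update task depends on the application).

```python
# Rough arithmetic for the response-time impact of synchronous mode:
# every commit on the primary waits for the SPQ round trip.
per_commit_delay_ms = 1.5       # assumed network + SPQ persist latency
commits_per_update_task = 4     # assumed DB commits in one UPDATE task

added_ms = per_commit_delay_ms * commits_per_update_task
print(f"Added DB response time per UPDATE task: {added_ms:.1f} ms")
# In an ABAP stack this cost lands in the asynchronous UPDATE work
# process, so end users rarely notice it; in a Java stack the same
# cost sits on the request path and is felt directly.
```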
In the next part (part 4 is here), we will step through the replication of a single transaction and discuss the impact of different scenarios along the way.