This post is about a strange issue I was hitting during the configuration of SAP LaMa 3.0 to start/stop a SAP ABAP 7.52 system (with Kernel 7.53) that was running with a SAP ASE 16.0 database.
During the LaMa start task, the task would fail with an error message: “ASE process no longer found after startup. (fault code: 127)”.
When I logged directly onto the SAP server Linux host, I could see that the database had indeed started up, eventually.
So what was causing the failure?
At first I thought this was related to the Kernel, but having checked the versions of the Kernel components, I found that they were the same as another SAP system that was starting up perfectly fine using the exact same LaMa system.
The next check I did was to turn on tracing on the hostagent itself. This is a simple task of setting the trace value to “3” in the host_profile of the hostagent and restarting it.
The trace output is written to a number of different trace files in the work directory of the hostagent, but the trace file I was interested in is called dev_sapdbctrl.
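The exact commands are not reproduced in this post. As a rough sketch, assuming the default hostagent install path /usr/sap/hostctrl/exe and that the trace level is controlled by the profile parameter service/trace (check this against your hostagent version), the change could look like this. The helper name set_hostagent_trace is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: set the hostagent trace level in a profile file.
# Assumptions: the trace parameter is "service/trace" and the profile lives at
# /usr/sap/hostctrl/exe/host_profile (default install path). Uses GNU sed.

set_hostagent_trace() {
    local profile="$1" level="$2"
    if grep -q '^service/trace' "$profile"; then
        # Replace the existing entry in place.
        sed -i "s|^service/trace.*|service/trace = ${level}|" "$profile"
    else
        # Append the entry if it is not present yet.
        printf 'service/trace = %s\n' "$level" >> "$profile"
    fi
}

# Usage (as root), followed by a hostagent restart:
#   set_hostagent_trace /usr/sap/hostctrl/exe/host_profile 3
#   /usr/sap/hostctrl/exe/saphostexec -restart
```

Remember to set the trace level back once the investigation is finished, since level 3 tracing is verbose.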
The developer trace file for the sapdbctrl binary executable is important, because the sapdbctrl binary is executed by SAP hostagent (saphostexec) to perform the database start. If you observe the contents of the sapdbctrl trace output, you will see that it loads the Sybase specific shared library which contains the required code to start/stop the ASE database.
The same sapdbctrl also contains the ability to load the required libraries for other database systems.
As a side note, I still do not know how the Sybase shared library comes to exist in the hostagent executable directory. When SAP ASE is patched, this library must also be patched, otherwise how does the hostagent stay in step with the ASE database that it needs to talk to?
Once tracing was turned on, I shut the SAP ASE instance down again and then used SAP LaMa to initiate the SAP system start once again.
Sure enough, the LaMa start task failed again.
Looking in the trace file dev_sapdbctrl I could see the same error message that I was seeing in SAP LaMa:
Error: Command execution failed. : ASE process no longer found after startup.
(fault code: 127) Operation ID: 000D3A3862631EEAAEDDA232BE624001
----- Log messages ----
Info: saphostcontrol: Executing StartDatabase
Error: sapdbctrl: ASE process no longer found after startup.
Error: saphostcontrol: StartDatabase failed (sapdbctrl exit code = 1)
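To pull just the error lines out of a trace file, something like the following can help. This is a minimal sketch, assuming the default hostagent work directory /usr/sap/hostctrl/work and that error lines contain the literal text "Error"; trace_errors is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list error lines from a hostagent developer trace, with line numbers.
# Assumptions: default work directory /usr/sap/hostctrl/work, and error lines
# contain the literal text "Error".

trace_errors() {
    local trace_file="$1"
    grep -n 'Error' "$trace_file"
}

# Usage:
#   trace_errors /usr/sap/hostctrl/work/dev_sapdbctrl
```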
This was great. It confirmed that SAP LaMa was just seeing the symptom of some other issue, since LaMa just calls the hostagent to do the start.
Now I knew the hostagent was seeing the error, I tried using the hostagent directly to perform the start, using the following:
/usr/sap/hostctrl/exe/saphostctrl -debug -function StartDatabase -dbname <SID> -dbtype syb -dbhost <the-ASE-host>
NOTE: The hostagent “-debug” command line option puts out the same information without the need for the hostagent tracing to be turned on in the host_profile.
Once again, the start process failed and the same error message was present in the dev_sapdbctrl trace file.
This was now really strange.
I decided that the next course of action was to start the process of raising the issue with SAP via an incident.
If you suspect that something could take a while to fix, then it’s always best to raise it with SAP early and continue to look at the issue in parallel.
Continuing the Diagnosis
While the SAP incident was in progress, I continued the process of trying to self-diagnose the issue further.
I tried a couple more things such as:
- Starting and stopping SAP ASE manually using stopdb/startdb commands.
- Restarting the whole server (this step has a place in every troubleshooting process, eventually).
- Checking the server patch level.
- Checking the environment of the Linux user, the shell, the profile files, the O/S limits applied.
- Checking what happened when McAfee anti-virus was disabled (I’ve seen the ePO blocking processes before).
Eventually exhaustion set in and I decided to give the SAP support processor time to get back to me with some hints.
I spend a lot of time solving SAP problems. A lot of time.
Something doesn’t work according to the docs, something did work but has stopped working, something has never worked well…
It builds up in your mind and you carry this stuff around in your head.
Subconsciously you think about these problems.
Then, at about 3am when you can’t get back to sleep, you have a revelation.
The hostagent runs as the root user, and it forks the process that starts the database as the syb<sid> Linux user (it uses “su”).
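One way to test this idea is to time the user lookup and the user switch on their own. The following is a minimal sketch, assuming GNU date and getent are available; lookup_ms is a hypothetical helper name and syb<sid> stands for the real ASE Linux user.

```shell
#!/usr/bin/env bash
# Sketch: measure how long name-service lookups for a user take, in milliseconds.
# A slow result for the database user hints that the user switch itself is slow.

lookup_ms() {
    local user="$1" start end
    start=$(date +%s%N)                        # nanoseconds since epoch (GNU date)
    getent passwd "$user" > /dev/null || true  # passwd lookup via NSS (may hit SSSD)
    id "$user" > /dev/null 2>&1 || true        # also resolves group memberships
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Usage (replace <sid> with the lowercase system ID):
#   lookup_ms syb<sid>
#   time su - syb<sid> -c true     # as root: times the full user switch
```

Comparing the numbers between a server that works and one that does not should show whether the user switch is where the time goes.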
Linux Domain Users
The revelation I had regarding the forking of the user was just the trigger I needed to make me consider the way the Linux authentication was set up on the specific server with the problem ASE startup.
I remembered at the beginning of the project that I had hit an issue with the SSSD Linux daemon, which is responsible for interfacing between Linux and Microsoft Active Directory. At that time, the issue was causing the hostagent to hang when operations were executed which required a switch to another Linux user.
This previous issue was actually a hostagent issue that was fixed in a later hostagent patch. During that particular investigation, I requested that the Linux team re-configure the SSSD daemon to be more efficient with its Active Directory traversals, when it was looking to see if the desired user account was local to Linux or if it was a domain account.
With this previous issue in mind, I checked the SSSD configuration on the problem server. This is a simple conf file in /etc/sssd.
After all the troubleshooting, the raising of the incident, the sleeping, I had finally got to the solution.
After checking the SSSD daemon configuration file /etc/sssd/sssd.conf, I could clearly see that there was one entry missing compared to the other servers that didn’t experience the SAP ASE start error.
The parameter “subdomain_enumerate = none” was missing.
Looking at the manual page for SSSD it would seem that without this parameter there is additional overhead during any Active Directory traversal.
I set the parameter accordingly in the /etc/sssd/sssd.conf file and restarted the SSSD daemon.
Then I retried the start of the database using the hostagent command shown previously, and this time it started without error.
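The restart command itself is not reproduced in this post. As a sketch, assuming a systemd-managed sssd service, checking for the parameter and restarting the daemon could look like this. The helper name subdomain_enumerate_value is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether an sssd.conf file sets subdomain_enumerate.
# Prints the configured value, or "missing" if the parameter is absent.

subdomain_enumerate_value() {
    local conf="$1" value
    # Match "subdomain_enumerate = <value>"; commented-out lines do not match.
    value=$(sed -n 's/^[[:space:]]*subdomain_enumerate[[:space:]]*=[[:space:]]*//p' "$conf" | head -n 1)
    echo "${value:-missing}"
}

# Usage (as root):
#   subdomain_enumerate_value /etc/sssd/sssd.conf
#   systemctl restart sssd     # assumption: systemd-managed sssd service
```

Running the check on both a working and a failing server makes the difference in configuration easy to spot.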
I then retried with SAP LaMa and that also now started ASE without error messages.
It seems that the sapdbctrl binary has some sort of internal pre-set timeout. When that timeout is hit, sapdbctrl simply gives up and throws the error that I was seeing. The ASE database carries on and starts (the process had already been initiated), but to the hostagent it looks like the start has failed.
By adding the “subdomain_enumerate = none” parameter, the delay caused by the unnecessary calls to Active Directory was massively reduced, and subsequent start activities were successful.
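To illustrate the kind of behaviour I suspect, here is a toy sketch. It is not sapdbctrl code, and the names and timings are invented. A caller polls for a slow startup with a fixed deadline, reports a failure, and the startup completes anyway.

```shell
#!/usr/bin/env bash
# Illustration only: a caller that polls with a fixed deadline reports failure
# even though the slow startup later succeeds.
# This is NOT sapdbctrl code; names and timings are made up.

wait_for_file() {
    local marker="$1" deadline_s="$2" waited=0
    while [ "$waited" -lt "$deadline_s" ]; do
        [ -e "$marker" ] && { echo "started"; return 0; }
        sleep 1
        waited=$((waited + 1))
    done
    echo "not found after startup"
    return 1
}

demo() {
    local marker
    marker="$(mktemp -u)"
    # Stand-in "database" that needs 2 seconds to come up (e.g. a slow user switch).
    ( sleep 2; touch "$marker" ) &
    wait_for_file "$marker" 1 || true   # the caller gives up after 1 second
    wait                                 # the background start still completes
    [ -e "$marker" ] && echo "but the database is up"
    rm -f "$marker"
}

# Usage:
#   demo
```

Shortening the startup delay to below the deadline makes the same caller report success, which matches what I saw once the SSSD lookups became fast.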