This blog contains experience gained over the years of implementing (and de-implementing) large scale IT applications/software.

Oracle 11gR2, KEWBMBTA: Maintain BSLN Thresholds failed

If you tighten up the security in an Oracle 11gR2 database, you may be tempted to remove access to the DBMS_OBFUSCATION_TOOLKIT package from PUBLIC (“revoke execute on DBMS_OBFUSCATION_TOOLKIT from PUBLIC;“).

This causes an issue, seen in 11gR2 11.2.0.3.
The error was visible in the user traces:

*** 2012-09-11 20:00:43.692
KEWBMBTA: Maintain BSLN Thresholds failed, check for details.

This was tracked to an invalid package body DBSNMP.BSLN.

SQL> alter package DBSNMP.BSLN compile body;

Warning: Package Body altered with compilation errors.

SQL> show errors
Errors for PACKAGE BODY DBSNMP.BSLN:

LINE/COL ERROR
-------- -----------------------------------------------------------------
88/5 PL/SQL: Statement ignored
88/19 PLS-00201: identifier 'DBMS_OBFUSCATION_TOOLKIT' must be declared
200/7 PL/SQL: Statement ignored
200/21 PLS-00201: identifier 'DBMS_OBFUSCATION_TOOLKIT' must be declared
241/7 PL/SQL: Statement ignored
242/8 PLS-00201: identifier 'DBMS_OBFUSCATION_TOOLKIT' must be declared
1332/7 PL/SQL: Statement ignored
1332/21 PLS-00201: identifier 'DBMS_OBFUSCATION_TOOLKIT' must be declared

Re-granting access to the DBMS_OBFUSCATION_TOOLKIT package to the PUBLIC user, fixes the issue again:

SQL> connect / as sysdba

Connected.

SQL> grant execute on DBMS_OBFUSCATION_TOOLKIT to PUBLIC;
Grant succeeded.

SQL> alter package DBSNMP.BSLN compile body;

Package body altered.

SAP Short Dump SYSTEM_NO_TASK_STORAGE

A background job has produced a short dump with SYSTEM_NO_TASK_STORAGE.
The “Heap Memory” section in ST02 showed a maximum use of 1.5GB, but we had allocated nearly 1.9GB according to the “abap/heap_area_nondia” and “abap/heap_area_dia” instance profile parameters.

The ST02 short dump analysis shows the source code line where the problem occurred. The line of code doesn’t look specifically interesting, it is, however, requiring a slight increase in memory allocation.
If the program is running in DIALOG then it will be allocated Extended Memory first, followed by HEAP (local process) memory.
If the program is running in BACKGROUND then it will be allocated HEAP memory first, followed by Extended Memory.

Extended Memory is pre-allocated at system startup according to the EM initial setting in the SAP instance profile.  It is then increased up to the maximum specified in the instance profile per user and per application server.
HEAP memory is only allocated as it is needed within each of the SAP dw.* OS processes, up to the maximum specified (in instance profile) per user or per application server.  It is then released back to the OS when the process finishes processing (SAP restarts the work process) and it has used over a specific amount of memory as set in parameter abap/heaplimit.

Since this was a background job, we can assume that we exhausted HEAP memory and should have automatically switched to Extended Memory.
However, the Kernel section of the short dump showed that we may have experienced an issue obtaining more virtual memory from the OS:

*** Error in libunwind: Out of memory. Try with a higher value >
 > for UNWIND_RESERVE_MEM (current value = 16).
 (0) 0x40000000017ae480 [dw.sapXXX_DVEBMGS01]
 (1) 0x40000000017ae2b0 [dw.sapXXX_DVEBMGS01]
 (2) 0x40000000021de880 [dw.sapXXX_DVEBMGS01]
 (3) 0x40000000021e45d0 [dw.sapXXX_DVEBMGS01]
 (4) 0x400000000115e860 [dw.sapXXX_DVEBMGS01]
 (5) 0x400000000120b3a0 [dw.sapXXX_DVEBMGS01]
 (6) 0x40000000010542f0 [dw.sapXXX_DVEBMGS01]
 (7) 0x4000000001111940 [dw.sapXXX_DVEBMGS01]
 (8) 0x40000000011ae790 [dw.sapXXX_DVEBMGS01]
 (9) 0x40000000012fb090 [dw.sapXXX_DVEBMGS01]
 (10) 0x40000000012fd7d0 [dw.sapXXX_DVEBMGS01]
 (11) 0x400000000188cae0 [dw.sapXXX_DVEBMGS01]
 (12) 0x4000000001896e70 [dw.sapXXX_DVEBMGS01]
 (13) 0x4000000001891670 [dw.sapXXX_DVEBMGS01]
 (14) 0x40000000018949a0 [dw.sapXXX_DVEBMGS01]
 (15) 0x400000000187f1e0 [dw.sapXXX_DVEBMGS01]
 (16) 0x40000000014d1ce0 [dw.sapXXX_DVEBMGS01]
 (17) 0x400000000149e8d0 [dw.sapXXX_DVEBMGS01]
 (18) 0x4000000001496fc0 [dw.sapXXX_DVEBMGS01]
 (19) 0x4000000001364c30 [dw.sapXXX_DVEBMGS01]
 (20) 0x4000000000ed4af0 [dw.sapXXX_DVEBMGS01]
 (21) 0x4000000000ed4a90 [dw.sapXXX_DVEBMGS01]
 (22) 0xc000000000045880 main_opd_entry + 0x50 [/usr/lib/hpux64/dld.so]

Is this case, you could either add more OS memory (essentially you have over allocated memory somewhere), or you could resize the abap/heap_area_nondia parameter to a smaller amount, so that it will switch to EM sooner.

“Fix” Control and ORA-00940: invalid ALTER command

Upon startup of an Oracle 10.2.0.4.0 database I received Oracle error:
ORA-00940: invalid ALTER command

I eventually managed to narrow this down to a “_fix_control” setting.

The database is for an SAP system and the init.ora was updated after following SAP note 830576 (Parameter recommendations for Oracle 10g).
I noticed that I had misread the note and applied a “_fix_control” parameter setting for a slightly (only by a patchset update) higher version of Oracle.

Take care when reading this note and make sure you test the startup and shutdown of the database before going home!

UPDATE: This is a bug.  See here.

SAP BI – ORA-04031 – Oracle 10.2.0.5 – Bug 9737897 – End Of.


If you use SAP BI 7.0 on an Oracle 10.2.0.5 64bit database you may experience this issue.
Our BI system performs tha delta loads for FI data overnight.
These are large loads of records into the cubes.

The problem is Oracle error 04031:
----------------------------------------------------------------------------------
Mon Jul 11 00:46:26 GMT Daylight Time 2011
Errors in file j:oraclebwpsaptraceusertracebwp_ora_3716.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit")

ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit")

ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit")

Mon Jul 11 00:46:26 GMT Daylight Time 2011
Errors in file j:oraclebwpsaptracebackgroundbwp_smon_8756.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select file#, block#, ts# fr...","Typecheck","seg:kggfaAllocSeg")

Mon Jul 11 00:46:27 GMT Daylight Time 2011
Errors in file j:oraclebwpsaptracebackgroundbwp_smon_8756.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select file#, block#, ts# fr...","Typecheck","seg:kggfaAllocSeg")

----------------------------------------------------------------------------------


Once it hits this problem, it causes the rest of the chain to fail on the index rebuilds.
Sometimes the user queries also fail in the Bex Web front-end. The user sees a similar error as those shown above.

Using V$SGASTAT I was able to take snapshots of the Shared Pool usage via a Windows scheduled task.
I traced the problem to the memory allocation of “obj stat memo” in the Shared Pool of the database.
It’s constantly growing, and although the “free memory” reduces slightly, it doesn’t seem to correlate to the increase in “obj stat memo” usage.

So next step was looking on Oracle Support.
I found bug 9737897, which is valid *upto* 10.2.0.4. But it does not state that it has actually been fixed in 10.2.0.5.
The bug description says: “An SGA memory leak is possible for “obj stat memo” and “obj htab chun”
memory if a segment is dropped and recreated many times.”

The bug note recommends querying x$ksolsfts (see here about this table https://yong321.freeshell.org/computer/x$table.html).
The following SQL is recommended by Oracle:

select distinct fts_objd, fts_objn from x$ksolsfts x
where not exists (select 1 from obj$ o
where o.dataobj#=x.fts_objd and o.obj#=x.fts_objn);

The note says that a high number of records returned could indicate that you are seeing the symptoms of the bug.
I was seeing over 160,000 records in a database that had been restarted 24 hours prior.

I produced a lovely Excel chart showing the snapshot times against the memory use of the “obj stat memo” component.
The chart showed significant increases (step ups) of approximately 10MB at specific points in time.
I was then able to search SM37 using the “extended” option, to show background jobs active during the memory “step ups” identified by the chart.
Generally I saw 10-20 jobs active at the time, but the correlation showed that more often than not, the “step ups” were at near enough the same time as a BI_PROCESS_INDEX job.

Looking in the background job log, I was able to identify the process chain and the type of index/load that was being performed.
The indexes in question were FI cube related. They were quite large indexes and they were bitmap indexes (“CREATE BITMAP INDEX…”).
Each night these indexes got re-built after a load.
The re-build command was visible in the background job log and included “COMPUTE STATISTIC”.

So I had successfully identified a segment(s) that was being dropped and re-created regularly.
This met with the Oracle bug description.

At the bottom of the bug note, it states:

“Note:
If you suffer “obj stat memo” memory leak problem in 10.2.0.5, you can try
_object_statistics = false as a workaround, or following init parameter to see
if it solves problem.
_disable_objstat_del_broadcast=false”

So what does SAP say about this.
Well the following SAP notes were/are available:
1120481 – ORA-4031 – memory allocation errors for object-level stat
1070102 – High memory allocation by ‘obj stat memo’
869006 – Composite SAP note: ORA-04031

The only note of relevance is 1120481 which has a short reference to Oracle 10.2.0.4.
The other notes tend to point the read at just blindly increasing the Share Pool. I tried this, it didn’t work when 900MB was used by “obj stat memo”!

So, stuck with only one possible solution (other than upgrading to 11g), I can confirm that setting “_object_statistics = false” (alter system set “_object_statistics”=false scope=spfile;) and bouncing the database solved the problem as it completely removes the use of the “obj stat memo” memory area (it’s no longer visible in V$SGASTAT).

SAP RFC Gateway Error 10060

Yesterday I spent the majority of the morning tracking a problem in the SAP ERP system.

It has been occurring for a while but I have had far more important issues.  Finally I got round to looking at it.

In hosta SM21 system log, the gateway process (RD) had registered a network connect error (Q0I) “Operating system call connect failed (error no. 10060)”:

SM21 Q01 operating system call connect failed 10060

The architecture of the SAP systems involved in the landscape looks like the following diagram, which shows an SAP ERP single ABAP stack (hosta), which is connected to an SAP Enterprise Portal system which is clustered (hostb & hostc):

Looking in the gateway error log (TX: SMGW), I saw the network interface (Ni) timeout error when looking up the hostname for IP address 192.168.1.1:

NIHSLGETHOSTNAME

The business process involved means that an RFC connection is established from the EP server processes (hostb) to the ERP gateway process (hosta).
As you can see in the architecture diagram, the IP 192.168.1.1 is the private IP address of the hostb server in the EP cluster.
The private IP address is only used as a heartbeat address as part of the Microsoft Cluster service.
So the question was, why was the SAP ERP gateway process receiving inbound connections from hostb via the private IP address?

The method of analysis involves understanding how the IP to hostname lookups are performed on the Windows servers:

First, if you open a command prompt and use NSLOOKUP to check the hostname (“C:> nslookup hostb”) it will return the hostname from the Name Service configured for the Windows Server. Usually this is in the order of hosts file first, then DNS. It performs the lookup using the appropriate network interface according to the network routings (“C:> route print”).
In our example, the nslookup correctly returned the public IP address.

Second, you can issue a PING of hostb from hostb (itself) using “C:> ping hostb”.
When using PING, it sends a network packet using the appropriate network interface according to the network routings (“C:> route print”). Since we are pinging ourselves, it will use the first network interface in bind order of the Windows network connections.
In our case, it was pinging the *private* IP address.
According to Microsoft, the PING command in Windows uses gethostbyname() to find the IP address for the destination to be ping’d(https://support.microsoft.com/kb/822713).

So we have established that the private IP address is seen when we ping our own servername on hostb, why does this mean that the private IP is seen as the client in a SAP RFC gateway connection?
Well, in the same way that PING uses gethostbyname(), the RFC connection that is established through jlaunch.exe (host of the JAVA stack server processes) also uses gethostbyname(). Of course, we know what IP address a gethostbyname() call returns, as this is used in the PING command .

(findstr /M gethostbyname jlaunch.exe)

So, the RFC connection performed by the JAVA stack is created from the jlanuch.exe binary, which uses gethostbyname() during the bind() call, which returns the private IP.
The RFC connection is established and part of the RFC connection process the jlanuch.exe process passes it’s calling IP address. Which is the private IP address.
The ERP system gateway at the other end of the connection receives the incoming connection and notes that it is coming from 192.168.1.1. Should the ERP system on hosta wish to make an RFC callback (https://help.sap.com/saphelp_nw04/helpdata/en/22/042a77488911d189490000e829fbbd/content.htm)
(https://help.sap.com/saphelp_nw04/helpdata/en/22/042a91488911d189490000e829fbbd/content.htm), it has no chance of returning the call to the private IP address on hostb. This is where we get the error “NiHsLGetHostByName failed”.

So how do we resolve the problem?
We could code the lookup into the hosts file (172.x.x.2 hostb) on hostb, but this might cause other issues.
A quick look into the SAP documentation for installing Netweaver 7.0 SR3 in a Microsoft Clustered system reveals that a simple step was missed during installation on hostb.
In the document it states “The card of the public network must be displayed before that of the private network. If necessary, change the order in which the cards are listed by using the Move Up and Move Down arrows”.

So I check the binding orders on hostb in Control Panel -> Network Connections -> Advanced Settings:

Cluster heartbeat adapters and bindings

They are the wrong way around. The “Cluster Heartbeat” (Private IP) should be beneath the Public IP (“Local Area Connection” in our example).
The Microsoft article here (https://support.microsoft.com/kb/894564) lists the process to change it and provides alternative solutions (I would recommend to follow the SAP document).

Changing the binding order on Windows Server 2003 does not require a reboot, but the change will not be effective until a reboot is performed.