DBA Sensation

June 15, 2009

Oracle connection idle timeout with firewall

Since our oracle 10g RAC has been moved behind firewall, we always get disconnected/timeout by firewall if the connection was idle.
By searching the metalink i found this article is really useful:

Resolving Problems with Connection Idle Timeout With Firewall

An Overview

Firewall(FW) has become common in today’s networking to protect the network environment. The firewall recognizes the TCP protocol and it records the client server socket end-points. Also, FW recognize the TCP connection closure, and then will release the resources allocated for recording the opening connection. For every end-point pairs , the firewall must also allocate some resources(may be small).

When the client or server closes the communication it sends TCP FIN type packet, this is a normal socket closure. However, it is not uncommon that the client server communication abruptly ending without closing the end points properly by sending FIN packet, for example, when the client or server crashed, power down or a network error which prevents sending the closure packet to the other end. In that cases, the firewall will not know that the end-points will no longer use the opened channel. As a passive intermediary, it had no way to determine if the endpoints are still active. As is it not possible to maintain resources forever, and also, it is a security threat keeping a port open for undefined time. So, firewall imposes a BLACKOUT on those connections that stay idle for a predefined amount of time.

Initially FW were designed to protect the application servers, network and then to protect client/server connection. With these in mind, a time-out in terms of hours (1 hour is the default for most FW) is reasonable. With the advent of more complex security schemes, FW are not only between client and server, but also between different application servers ( intranet, demilitarized zone (DMZ) , and such) and database servers. So, the horizon of 1 hour idle time for communication between servers maybe not be appropriate.

Idle connections can be expected from an application server. There is the case of J2EE using pooled JDBC connections. The pool usually returns the first available connection to the requester, so the first connections of the pool list are the most likely to be active. The last one, which are at the end of the list, are only used at peek loads, and most of the time it will be inactive.

Other cases are the connections established from a HTTP Server, either SQL connections from mod_plsql, or AJP connections from mod_oc4j.


One of the inconvenience of theses blackout, is that they are passive. None of the endpoints will be notified that the communication was banned . Only when the client or server tries to contact its peer, it comes to know that the peer end is no more active and the communication has already been broken.

The worst of all scenarios are the so called “passive listeners” . They will never know. Because, passive listeners are those processes at an endpoint that are simply waiting for commands to arrive from the other end. A typical example of this are the backend database server processes, which are reading from the socket looking new SQL statements to execute , and after the request is answered, they return to their passive state. When a blackout occurs, they will stay forever in this reading state, unless some of the following techniques are applied.

Resolving problems with connection idle time-out

TCP KeepAlive

You can enable TCP KeepAlive option at the Operating System(OS) level. Once TCP keepalive option is enabled and configured, a small probe packet will be sent to the other end at every predefined in-activity interval and it expects an ACK from the other end. And ACK will be returned only when the other end is alive and is reachable. If ACK is not returned, then after some retry, the OS will close the end points and will release the resources allocated for that. The application which is listening on that particular socket will recieve the error, so that application can take necessary action upon receiving the error signal from the OS.

When a communication is blacked out by the firewall, the probe will not reach its other end, and then the OS will close the socket end points and the application will be notified of the exception.

Steps to configure TCP KeepAlive depends on a specific Operating Systems. You will have to refer the appropriate OS documentation for it.

It is common to enable TCP KeepAlive option at the server end. Because server is the one which holds many resources for a communication, it any communication is broken, then those resources at the server will be released than holding it for indefinite time. By default TCP KeepAlive is not enabled at the OS.

TCP KeepAlive is applicable for all network applications running on that particular Operating System.

DCD for DataBase Servers

For database connections, one of the endpoints is a passive listener, either is a dedicated process or a dispatcher process. If the connection becomes blacked out , this backend will never know that client cannot send any more requests, and then will lock important resources as database sessions, locks , and at least , a file descriptor used for maintaining the socket.

A solution is to make this backend “not so” passive, using the DCD (dead connection detection) to figure out if the communication is still possible.

Simply, set in the $ORACLE_HOME/network/admin/sqlnet.ora, in the server side SQLNET.EXPIRE_TIME=10 (10 minutes, for example). With this parameter in place, after 10 minutes of inactivity, the server send a small 10 bytes probe packet to the client. If this packet is not acknowledge, the connection will be closed and the associated resources will be released.

There are two benefits with this DCD
1. If the SQLNET.EXPIRE_TIME is less than the FW connection idle time-out, then the firewall will consider this packet as activity, and the idle time-out (firewall blackout) will never happen until both the client and the server processes are alive.

2. If the SQLNET.EXPIRE_TIME (let’s say a little bit higher) than the FW idle limit, then , as soon as the blackout happens , the RDBMS will know and will close the connection.

The first case is recommended when the connection comes from another application server , and the second makes sense for client applications.

DCD works at the application level and also works on top of TCP/IP protocol. If you have set the SQLNET.EXPIRE_TIME=10 then do not expect that the connections will be closed exactly after 10 minutes of the blackout or network outage. The TCP timeout and TCP retransmission values also adds to this time.
Please note that some latest firewalls may not see DCD packets as a valid traffic, and thus the DCD may not be useful. In this case, firewall timeout should be increased or users should not leave the application idle for longer than the idle time out configured on the firewall.

AJP Connections

It is not a default behavior in 9.0.2, but if Patch 2862660 is installed, the connection between and OHS server process and the J2EE can be maintained for more than a single request. If the parameter Oc4jConnTimeout is set, the OHS will maintain the connection for at least that time. The problem is that the child process may became inactive before that time-out occurs, and then the connection will remain open. While the child process is inactive, the connection will be idle, and there is chance to be blackout by the FW.

If this happens, the first thing that the child will do is to close it when it becomes active. But at this time, the TCP socket closing cannot be completed, due the blackout. Although the http child process can simply ignore the closing failure and continue the creation of a new connection, the passive listener at the
j2ee side (the worker thread) will be hook without a chance for the resources to be released.

To solve this , the Patch 3151686 must be installed and the java-option
must be enabled.

After this, the blackout detection will rely on the TCP KeepAlive provided by the operating system.

As DCD , this process consist in send probes -empty packages- when a socket had been inactive for a period of time. If there is no response, the socket will be closed, and then, even the passive listener, will receive and exception or signal to let him know that the no further communication is possible.


As the firewalls extend their functionality , and are now are placed in between application servers, some tuning and parameter adjusting must be made to overcome the default rules established for client/server communications.

Note id: 257650.1
Link: https://metalink2.oracle.com/metalink/plsql/f?p=130:14:9497610239050450074::::p14_database_id,p14_docid,p14_show_header,p14_show_help,p14_black_frame,p14_font:NOT,257650.1,1,1,1,helvetica

November 6, 2008

Error when Execute Sqlplus as non-Oracle User on UNIX

Filed under: [client related] — zhefeng @ 1:10 pm

When execute ‘sqlplus’ as a user outside of the dba or Oracle group, you get the following errors.

Error 6 initializing SQL*Plus
Message file sp1<lang>.msb not found
SP2-0750: You may need to set ORACLE_HOME to your Oracle software directory

According to Metalink, this is a bug for oracle 10g by setting wrong file/folder permissions.


I.  Logged in as the Oracle user (or the user that installed the 10gR2 software), manually change the permissions on the client.  For example:

chmod -R 755 <client_home>

In our case:

chmod -R 755 $ORACLE_HOME/sqlplus

II.  If doing a recursive permissions command is not acceptable, then you will need to pinpoint exactly what files the client is reading at the time of execution, and manually change permissions only on those files.  In our case, we need to pinpoint what files are being accessed by SQL*Plus.   To implement this workaround, please execute the following steps:

1. As the non-Oracle user, run the truss utility to find out which files are being accessed.  Sample command:
truss -aefo /tmp/truss_sqlplus.out sqlplus username/password

2. Use this truss_sqlplus.out trace file to see what files have error “EACCES” when attempting to access.   In our case, the truss_sqlplus.out showed a problem accessing the following file:

Another possible error to search for in the truss output is ENOENT.  For example:
9775: open(“./sqlplus/mesg/sp1us.msb”, O_RDONLY) Err#2 ENOENT

3. Logged in as the Oracle user, change permissions on folders leading up to, and including sp1us.msb:

chmod 755 $ORACLE_HOME/sqlplus
chmod 755 $ORACLE_HOME/sqlplus/mesg
chmod 755 $ORACLE_HOME/sqlplus/mesg/sp1us.msb

4. After making above permission changes, a different error may appear when executing sqlplus as non-Oracle user, such as:

$ sqlplus username/password
SP2-1503: Unable to initialize Oracle call interface
SP2-0152: ORACLE may not be functioning properly

5. At this point, you need to re-run the truss (as non-Oracle) to see what other files are trying to be accessed.  In our case, the following files were trying to get accessed, but showed “EACCES” failure:


6. Logged in as the Oracle user, change permissions on these files and the directories leading up to these files.

chmod 755 $ORACLE_HOME/nls
chmod 755 $ORACLE_HOME/nls/data
chmod 755 $ORACLE_HOME/nls/data/lx1boot.nlb
chmod 755 $ORACLE_HOME/oracore
chmod 755 $ORACLE_HOME/oracore/zoneinfo
chmod 755 $ORACLE_HOME/oracore/zoneinfo/timezlrg.dat

7. Now, invoking sqlplus as a non-Oracle user was successful in our case.

September 24, 2008

ORA-12516 Error TNS:listener could not find available handler with matching protocol stack

Filed under: [client related] — zhefeng @ 3:01 pm

today when a user was doing heavy data injection against our 10g RAC database, got error “ORA-12516, TNS:listener could not find available handler with matching protocol stack”.

It looks like the databases was rejecting the connections. According to metalink, by increase the “processes” parameters should resolve this problem.

#Step1: take a look at the processes limition

select * from gv$resource_limit;

check the “MAX_UTILIZATION” (the peak value) and “INITIAL_ALLOCATION” (your current setting) for processes and sessions.

In my case, sometime the maximum of processes has been reached.

#Step2: increase the parameter from 150 (default) to 300

sql>alter system set processes=300 scope=spfile;

#Step3: reboot the database to let parameter taking effect.

SQL> show parameters processes

NAME                                 TYPE        VALUE
———————————— ———– ——————————

processes                            integer     300

After adjusted processes, sessions was automatically adjusted to 335 by system.

SQL> show parameters sessions

NAME                                 TYPE        VALUE
———————————— ———– ——————————

sessions                             integer     335

Problem resolved with no more complains.

P.S. some people is sugguesting to use this formula to decide the value of processes.

User connections = processes*10%*shared_server

July 7, 2008

oracle client got aio error on aix platform

Filed under: [client related] — zhefeng @ 11:26 pm
Today i got lots complains about our NFS shared oracle client on AIX plaform. The sqlplus totally doesn't work and gives the aio64 error. Search the metalink, found the solution as below:
    Oracle executables fail to run on AIX: Symbol resolution failed for /usr/lib/libc.a[aio_64.o]
Note:235537.1 symptom: 0509-130 Symbol resolution failed for /usr/lib/libc.a[aio_64.o] because: symptom: 0509-136 Symbol listio64 (number 1) is not exported from dependent module /unix. symptom: 0509-136 Symbol iosuspend64 (number 3) is not exported from dependent module /unix. symptom: 0509-136 Symbol aio_nwait64 (number 5) is not exported from dependent module /unix. symptom: 0509-136 Symbol aio_nwait_timeout64 (number 7) is not exported from dependent module /unix. symptom: 0509-192 Examine .loader section symbols with the 'dump -Tv' command. change: Filesystem mode changed from JFS2 to Solution: Make Asynchronous I/O available on system, following the next steps to set it up: 1. Log in as root 2. Type: mkdev -l aio0 This command attempts to verify that Asynchronous I/O services are running (available on system), if not, they are started. 3. To ensure it is configured at each system restart : a. Type: smit aio b. Select the menu item "Change/Show Characteristics of Asynchronous I/O" c. Change the field "STATE to be configured at system restart" from DEFINED to AVAILABLE. or just execute the next command and it will do the same as the previous a, b and c steps: chdev -l aio0 -a autoconfig='available' which should return the message "aio0 changed". Step 3 (either of the two ways) will toggle the state to be configured at system restart to be available, making asynchronous I/O available with each system reboot.

Create a free website or blog at WordPress.com.