1.04.2010

Oracle bugs - Part 1 (an Instance hung on 'WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK')

For the last couple of months, I have been dealing with various Oracle bugs on our 8-node RAC environment running Oracle 10.2.0.4 on HP-UX. That has really kept my life busy and given me enough nightmares. I thought of sharing my experience dealing with these bugs in a three-part series. In this first part of the series, I am going to discuss a scenario where a RAC database with two instances hung and how we handled the situation.

A RAC database with two instances hung with the 'WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK!' error message in the alert.log file. Initially, I was neither able to connect to the database via SQL*Plus nor able to shut it down using the SRVCTL STOP DATABASE command. I had to connect to one of the instances using the 'sqlplus -prelim' option, and before aborting the instance, I collected hanganalyze (level 3) and systemstate (level 266) dumps using the oradebug utility. Investigating the hanganalyze trace file showed the following contention:

63CA629C:BE596F7A 300 215 10005 2 KSL WAIT END [enq: SQ - contention] 1397817350/0x53510006 144/0x90 0/0x0 time=2929720
63CA62C9:BE596F7B 300 215 10005 1 KSL WAIT BEG [enq: SQ - contention] 1397817350/0x53510006 144/0x90 0/0x0
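For reference, the dumps mentioned above were taken through the preliminary connection roughly as follows. This is only a minimal sketch assuming the standard 10.2 SQL*Plus/oradebug syntax; the trace file name and location depend on your diagnostic settings:

$ sqlplus -prelim "/ as sysdba"

SQL> oradebug setmypid
SQL> REM remove the trace file size limit before dumping
SQL> oradebug unlimit
SQL> REM hang analysis at level 3
SQL> oradebug hanganalyze 3
SQL> REM systemstate dump at level 266
SQL> oradebug dump systemstate 266
SQL> REM show which trace file the dumps were written to
SQL> oradebug tracefile_name
SQL> REM only after the dumps are safely captured
SQL> shutdown abort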

 
Further analysis showed a deadlock between two processes which were, directly or indirectly, blocking all other sessions in the database. Digging further into the trace file reveals the real problem, as you can see in the following lines:

Deadlock between process 33 (J000) running DBMS_STATS_SCH and process 45.
Process 33 is waiting for 'cursor: pin S wait on X' mutex idn=57bc44f9 in shared mode while holding library cache lock handle=c00000010e9123c0 in exclusive mode

Mutex c0000001074df688(490, 0) idn 57bc44f9 oper GET_SHRD --> waiting
LIBRARY OBJECT LOCK: lock=c000000101141390 handle=c00000010e9123c0 mode=X

Process 45 is waiting for 'library cache lock' on SYS.C_COBJ# handle address=c00000010e9123c0 in shared mode while holding mutex idn 57bc44f9 in mode EXCL.

Mutex c0000001074df688(490, 0) idn 57bc44f9 oper EXCL
LIBRARY OBJECT LOCK: lock=c000000108b0b208 handle=c00000010e9123c0 request=S
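As a side note, when an instance is still responsive enough to run SQL, a similar picture can often be seen from the dictionary without resorting to dumps. A minimal sketch, assuming the 10.2 V$SESSION wait columns (BLOCKING_SESSION is not always populated for mutex waits in this release; for 'cursor: pin S wait on X', P2RAW also encodes the holder's SID in its high-order bytes):

SELECT sid, serial#, event, p1, p2raw, blocking_session
FROM   v$session
WHERE  event IN ('cursor: pin S wait on X', 'library cache lock');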




As usual, Oracle Support informed me that it is bug 6011045 and that the behavior is due to the following reasons:





Abstract:            DBMS_STATS causes deadlock between 'cursor: pin S wait on X' and 'library cache lock'
Fixed-Releases:     B1
Tags:               DBMSPKG DEADLOCK OPS
Details:          
 Gathering statistics (with DBMS_STATS) on bootstrap objects (like ind$) can cause a deadlock between sessions waiting on 'cursor: pin S wait on X' and 'library cache lock' in RAC. In some instances, sessions waiting on 'DFS lock handle' and 'row cache lock' are also part of the deadlock path.



Yes, we had a job scheduled daily at midnight to collect the database statistics (dbms_stats.gather_database_stats). Though it had been running smoothly for more than a year, on any unlucky day you can face this sort of trouble. My initial impression was that this bug is fixed in 10.2.0.4. However, Oracle Support informed me that the bug is not fixed in 10.2.0.4, and that I either need to upgrade to 11g or request a one-off patch for my platform.
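Just for illustration (this is not our exact job definition; the job name and options below are made up), the nightly gather was along these lines:

-- Hypothetical example only: job name, schedule, and options are illustrative.
-- gather_sys => TRUE includes SYS-owned (bootstrap) objects such as IND$,
-- which is exactly the case the bug note warns about.
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'NIGHTLY_GATHER_DB_STATS',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN DBMS_STATS.gather_database_stats(options => ''GATHER AUTO'', cascade => TRUE, gather_sys => TRUE); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=0',
    enabled         => TRUE);
END;
/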

In part 2, I am going to discuss a situation where one of the RAC database instances crashed due to an ORA-07445 error.

Happy reading

Jaffar

1 comment:

Yasser said...

Bugs are crazy...

Waiting for your future post..

-Yasser