5.30.2011

Yet another Oracle 11g upgrade aftermath - ORA-07445: exception encountered: core dump [evaopn3()]

Apparently, the Oracle 11gR2 aftermath stories is like never ending story from my part. Yet another aftermath fairy tale for you.

It was brought to our notice a couple of days ago that one of the application query failed to execute of late (perhaps after 11gR2 upgrade) and the following error has been logged in the application log file:

Dynamic Query PreparedStatement SQLException: java.sql.SQLException: No more data to read from socket SQLState: null ErrorCode: 17410

During my initial investigation the following errors have been found in the database alert.log:

Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x1200000000] [PC:0x400000000631B880, evaopn3()+384] [flags: 0x0, coun
t: 1]
Errors in file /u00/app/oracle/diag/rdbms/pmsit/XXXXX/trace/XXXXX1_ora_23652.trc  (incident=9016):
ORA-07445: exception encountered: core dump [evaopn3()+384] [SIGSEGV] [ADDR:0x1200000000] [PC:0x400000000631B880] [Address not mappe
d to object] []
Incident details in: /u00/app/oracle/diag/rdbms/pmsit/XXXXX1/incident/incdir_9016/PMSIT1_ora_23652_i9016.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon May 30 11:13:54 2011
Dumping diagnostic data in directory=[cdmp_20110530111354], requested by (instance=1, osid=23652), summary=[incident=9016].
Mon May 30 11:14:10 2011
Sweep [inc][9016]: completed
Sweep [inc][8952]: completed
Sweep [inc][8680]: completed
Sweep [inc2][9016]: completed
Sweep [inc2][8952]: completed
Sweep [inc2][8680]: completed


Additionally, the following has been found in the trace file in the context:

DDE:Problem Key 'ORA 7445[evaopn3()+384]'was flood controlled(0x2)(incident: 8657)
ORA-07445: exception encountered: core dump [evaopn3()+384] [SIGSEGV] 

ORA-07445 error caught my attention and as a immediate fix or thumb of rule, perhaps, I have flushed the SHARED POOL and asked the person to execute the query again. Unfortunately, the issue persists. Since this is a development database, I took the permission from the concerning application team to re-start the database which also never worked in our favor.

After a little bit of research over the MOS (metalink), I found  a potential workaround  to set the initialization parameter _simple_view_merging value to false to get rid of the issue. This indeed did the trick and the statement (query) was able to run without any issues this time. Additionally, found a recommendation to turn of _complex_view_merging initialization parameter too.

Moving on further with my investigation, I used the ORA-07445 look up tool by providing the first argument of the error '[evaopn3()+384]'  that suggests me to have a look at MOS ORA-7445 (evaopn3) [ID 860969.1] and the following is the excerpt from the document:

ERROR:
  ORA-7445 failing function "evaopn3"

VERSIONS:
  versions 10.1 to 11.2

DESCRIPTION:

  ORA-7445 [evaopn3] generally occurs during the execution of a query.

  The error typically only occurs with a specific stream of data being
  fetched from the objects in the execution plan.

  The problem is caused often by an optimization but it is not until a
  row is fetched and evaluated that the error happens.
 


The error is largely because of  a few Oracle bugs (depends on the version of database). Nevertheless, the document indeed suggests a few strategies to cope up the error. In my case, the above solution worked perfectly, though not really aware of the consequence of disabling the feature. I need to monitor the database behavior for sometime to make the change permanent.

References:
http://www.freelists.org/post/oracle-l/ora3113-related-to-histograms-11gr2,1
MOS ORA-7445 (evaopn3) [ID 860969.1]
MOS Customer Introduction to ORA-7445 Errors [ID 211909.1]
 
Happy reading,

Jaffar

5.28.2011

Terminating sessions on a specific RAC database instance

No matter how many years of experience or knowledge we possess, in the modern days, every passing day is a learning day, we surely come across something new every day and learn a new thing.

Perhaps people out there who already discovered and discussed about this feature earlier,  but, I have learned about this feature today and felt quite useful, at least for me, as we manage around 4 cluster environments with nearly 200 databases and 300 instance running across.

As a DBA, we typically use the 'ALTER SYSTEM KILL SESSION' statement to terminate (kill) any specific session on the database, and if the session is active and involved with any operations, we tend to get the 'ORA-00031: session marked for kill' and the session won't be terminated until the outstanding activity complete.

I then learned about the IMMEDIATE (ALTER SYSTEM KILL SESSION ...... IMMEDIATE') clause that most probably terminates the session without actually waiting for the outstanding activity to complete.  In addition, it also records the information (SID,PID) with the database alert.log file.
Today, I come across about terminating sessions on a specific RAC database instance in Oracle 11gR2 (I am not sure about Oracle 11gR1). This will give the flexibility to terminate a session on a specific RAC database instance across cluster environment. The following is the syntax and example of the enhancement:

Syntax 
KILL SESSION 'sid, serial#[, @integer3]'
 
Example
ALTER SYSTEM KILL SESSION '1287, 9823, @2' IMMEDIATE; 
 
Addition

My friend Aman Sharma just tweeted that the feature is available with 11gR1. Thank you buddy.



5.26.2011

Download a free chapter from 'Oracle 11g R1/R2 Real Application Cluster Essentials' book

Friends, you can now download a free chapter from our updated (latest) RAC book 'Oracle 11g R1/R2 Real Application Cluster Essentials' here.

What next?

Does the free chapter excites you to read the entire book? You wanna go ahead and have your own copy? Then, what are you waiting for, place your order right way and get your copy.

Your comments, suggestion and reviews are most welcomed.

5.25.2011

Oracle 11g upgrade aftermath - ORA-24247: network access denied by access control list (ACL)

Following a successful Oracle 10g Clusterware and databases upgrade to Oracle 11g R2, some of the database user defined procedures execution were failed with the 'ORA-24247: network access denied by access control list (ACL)' error and it was brought to our notice instantly.

With a little bit of research over the internet about the 'ORA-24247' error,  it was clear that it is an expected behavior post 11g upgrade if  any of Oracle supplied package, like, UTL_TCP, UTL_HTTP, UTL_SMTP or UTL_MAIL is being referred or used in any user defined database procedures/packages/functions. This behavior is largely due to an enhanced security level with the Access Control Lists(ACL) with Oracle 11g to access External Network Services.

The problem can be resolved by explicitly granting the user or role in the question to access the External network Services by making use of the new Oracle supplied package, DBMS_NETWORK_ACL_ADMIN.

The following example demonstrates the steps that are required to address the error:

BEGIN   dbms_network_acl_admin.create_acl(acl => 'http_service_acl.xml',                                     description => 'HTTP ACL Access',                                     principal   => 'USER_NAME',                                     is_grant    => TRUE,                                     privilege   => 'connect',                                     start_date  => null,                                     end_date    => null);   dbms_network_acl_admin.add_privilege(acl=> 'http_service_acl.xml',                                     principal  => 'USER_NAME',                                    is_grant   => TRUE,                                     privilege  => 'resolve',                                     start_date => null,                                     end_date   => null);   dbms_network_acl_admin.assign_acl(acl => 'http_service_acl.xml',                                     host       => '*',                                     lower_port => 8801,                                     upper_port => 8810);   commit; END;


In a nutshell, the above script does the following (ensure the user got the DBA privileges to execute the above script successfully on the database):
  • Creates a new ACL with and grants the 'connect' privilege to the specified user.
  • Grants the 'resolve' privilege to the user.
  • Finally assign the ACL to specific (or open) host with defined ports.
 You can subsequently query the dictionary table dba_network_acls to extract the ACL configured information.

In order to modify the host and port information, you must unassign the ACL using the UNASSIGN_ACL procedure.

Punchline:
Ensure you go through the above exercise after you upgrade your 10g database to Oracle 11g, if the above mentioned Oracle Supplied packages are referred.

References
http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_networkacl_adm.htm#BABBJCJD
MOS: [ID 1209644.1]


5.22.2011

Unable to mount a diskgroup with ORA-15032,ORA-15202,ORA-15041 errors

Post 8 node Oracle 10g cluster upgrade to 11g R2 with nearly 60 databases on HPUX OS, we have observed a significant increase in the server resources consumption, in particular for CPU and Memory. Due to the heavy  resource consumption (or lack of sufficient resources time to time), a couple of nodes with nearly 10 databases running across each node started evicting more frequently. However, a couple of days back, one of the node went down for the same reason, and the subsequent node start up, the databases on the node refuse to start up automatically and we have noticed the memory and virtually memory consumption was just shooting up in no time. We identified the symptoms of the behavior and repaired the issue after spending nearly a half day time (it was a very silly mistake, I will blog about it later on). However, one particular RAC database was not able to start as its diskgroup was not mounted. When we tried mount the diskgroup in question manauly, we come across the following ORA error:

ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15202: cannot create additional ASM internal change segment
ORA-15041: diskgroup "DG_XXXX" space exhausted

It was a bit of surprise to receive the ORA-15041 error while mounting the ASM diskgroup as I was in a impression that the error would come when the diskgroup is mounted and when it runs out of adequate space to cope up with the space reqruiements from the database. After all R&D and trying all the options in our capacity, we have logged a SR with the Oracle support and the issue was opened for 4 days... yes, the database was down for 4 days, luckily it was a development database plus a week-end too.

The following has been recorded in the ASM alert.log :

Thu May 19 16:04:47 2011
GMON dismounting group 57 at 143 for pid 32, osid 23046
NOTE: Disk in mode 0x8 marked for de-assignment
ERROR: diskgroup DG_XXXX was not mounted
ORA-15032: not all alterations performed
ORA-15202: cannot create additional ASM internal change segment
ORA-15041: diskgroup "DG_XXXX" space exhausted
ERROR: alter diskgroup DG_XXXX mount
Thu May 19 16:05:10 2011
SQL> alter diskgroup DG_XXXX mount
NOTE: cache registered group DG_XXXX number=57 incarn=0xdc516e6a
NOTE: cache began mount (not first) of group DG_XXXX number=57 incarn=0xdc516e6a
NOTE: Assigning number (57,0) to disk (/dev/rdsk/oracle/data/ln1/xxxxxxx)
Thu May 19 16:05:14 2011
GMON querying group 57 at 145 for pid 31, osid 23739
NOTE: cache opening disk 0 of grp 57: DG_XXXX_0000 path:/dev/rdsk/oracle/data/ln1/xxxxxx
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (not first) external redundancy group 57/0xDC516E6A (DG_XXXX)
Thu May 19 16:05:14 2011
kjbdomatt send to inst 1
kjbdomatt send to inst 3
kjbdomatt send to inst 4
kjbdomatt send to inst 5
kjbdomatt send to inst 6
Thu May 19 16:05:14 2011
NOTE: attached to recovery domain 57
NOTE: redo buffer size is 256 blocks (1053184 bytes)
Thu May 19 16:05:14 2011
NOTE: LGWR attempting to mount thread 4 for diskgroup 57 (DG_XXXX)
NOTE: ACD expansion required for disk group 57
Thu May 19 16:05:33 2011
WARNING: unable to grow ACD, probably out of space
ERROR: ORA-15041 signalled during mount of diskgroup DG_XXXX
NOTE: cache dismounting (clean) group 57/0xDC516E6A (DG_XXXX)
NOTE: lgwr not being msg'd to dismount

The above messages indicates that ASM is trying to update or recover some metadata information and no space has been left in the diskgroup.

Nearly four different Oracle Engineers handled the issue and neither a workaround nor a solution was offered by any of them. Upon escalating the issue further, the support engineer asked me to use the AMDU utility and asked us to execute the following script:

./amdu -diskstring 'ASM disks location' -dump 'diskgroup_name'

When the above script was executed, the following was appeared on the screen:

AMDU-00204: Disk N0109 is in currently mounted diskgroup DG_XXXXX
AMDU-00201: Disk N0109: 'ASM disks location'

The first message raised my eyebrows, and started to think where the hell the diskgroup is mounted as I had tried on multiple instances to mount the diskgroup and ended up with the error. I then decided to go through the rest of the 7 ASM instances to check where the diskgroup is mounted. I found that the diskgroup was mounted on the first node and I was bit disappointed and sort of cursing myself why I didn't check all the ASM instances before. Upon adding a new asm disk to the diskgroup from the node1 ASM instance, I was able to mount the diskgroup successfully on the other nodes in the context.

The good thing about this investigation was coming across of the AMDU utility which is ported with Oracle 11g Grid Control software. I have also learned from MOS ID 553639.1 that the utility also can be configured on Oracle 10g (read the note for further instructions).

A directory with a couple of files will be created in the current location after executing the script. You can read the report.txt file, however, you can't read the other informative files, i.e., .map, as Map files are ASCII files that describe the data in the image files for a particular disk group.
 
Here is the extract of the note:

On Oracle 10g the content of a diskgroup can be reviewed if the diskgroup is mounted and there are internal views that display the specific allocation of the files (ASM files and Database files) inside of the disks.

When the diskgroup is not mounted, this information is not available, which makes difficult to diagnose the errors avoiding the diskgroup to be mounted.   This problem has been resolved with AMDU.

AMDU is a tool introduced in 11g where it is posible to extract all the available metadata from one or more ASM disks, generate formatted block printouts from the dump output, extract one or more files from a diskgroup (mounted/unmounted) and write them to the OS file system.

This tool is very important when dealing with internal errors related to the ASM metadata. Although this tool was released with 11g, it can be used with ASM 10g.

You can get the help about the utility with amdu -help

References:

https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_utilities
Placeholder for AMDU binaries and using with ASM 10g [ID 553639.1]


Happy reading,

Jaffar

5.16.2011

Advt. Finally the update version of our RAC book is live

Here is a quick update about my book for those who were anticipating very curiously ....

The update version of  'Oracle 11g R1/R2 Real Application Clusters Essentials' book is live on packtpub.com:

For more details, visit http://www.packtpub.com/oracle-11g-r1-r2-real-application-clusters-essentials/book

Hope you guys like it...

Regards,

Jaffar

5.13.2011

Exclusive first look of our updated upcoming Oracle RAC book

Here is the first look of our upcoming Oracle 11g R1/R2 Real Application Clusters Essentials book, exclusively for you.


In contrast to the earlier version  (Oracle 11g R1/R2 Real Application Cluster Handbook' cover page, this is pretty awesome, at least I think so... Do you guys agree...
However, the book should hit the stores sometime the first of week of next month, also, you can place an order at Packt Publishing in advance.
For more updates about the book, stay tuned.... as I will be blogging more in the coming days...

Yours truly,

Jaffar

5.06.2011

Yet another RMAN quiz: What do you do in this situation?

I will surely write a detailed blog entry later about the issue talking here. But for now, excuse me for the quiz format blog entry.
A few days ago, I was trapped into a very tricky RMAN restore situation for one of our Oracle 10g (10.2.0.4) RAC databases and would like to get your perspective over the issue. Al right, here is the QUIZ for you.


Took an RMAN offline backup to tape with 5 channels. (DB running in noarchive log mode)
The database has around 22 datafiles.
dropped the database, including the controlfiles.
Restore the controlfile from the backupset followed by full database restore.

Do you think the restore is going to finish without any issues?
What sort of issues do you anticipate with the above situation?

By the way, no controlfile auto backup was set on the RMAN.

Updates on 8/May/2011
A lot of curiosity about the quiz has been generated in the Oracle community and many DBAs indeed tried their hands giving various solutions here and at linkedIn discussions.
Just to add more spice to the curiosity, let me tell you guys that I restored the controlfile from the backuppiece and followed by complete database restore. However, when I restored the entire database, to my surprise, only 4 data files were restored from the last backup and 18 datafiles were restored last but one backup sets. When I query the rman backup, it only showed 4 datafiles backup information for the last backup, of course there were older backups info too. What has astonished me is, despite the complete successful backup operations, the controlfile got only 4 datafiles backup info for that date.
Any wild guess why ?????



Regards,

Jaffar