6.16.2013

ORA-01115: IO error reading block from file (block # ) - a list of common causes

A single error can have several different causes. This post lists some of the common causes of the 'ORA-01115: IO error reading block from file  (block # )' error.

Typically, Oracle raises ORA-01115 when it fails to read a data block from an open data file. Before you suspect a database issue, look closely at the error messages that appear before and after ORA-01115 in the database alert.log file. The cause in your case may differ from what you find on the net or from a previous occurrence of the same error, so it is important to isolate the issue by examining the other messages reported along with it.

The most common causes are:
  • the datafile in question is OFFLINE
  • the database has lost communication with the underlying ASM instance
  • a hardware problem
  • physical data block corruption at the storage level
  • an Oracle bug

We encountered this error in one of our RAC databases because of a communication loss with the underlying ASM instance. Here are the alert.log entries:

WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
ERROR: paging ASM fault extent map failed gn=28 fn=256 extet=715
Errors in file /u00/app/oracle/diag/rdbms/xxxDB/xxxDB2/trace/xxxDB2D2_ora_21886.trc:
ORA-00704: bootstrap process failure
ORA-01115: IO error reading block from file  (block # )
ORA-01110: data file 1: '+DG_DATA/xxxdb/datafile/system.256.680619545'
ORA-15055: unable to connect to ASM instance
ORA-15055: unable to connect to ASM instance
ORA-00020: maximum number of processes (100) exceeded
ORA-00704: bootstrap process failure
ORA-01115: IO error reading block from file  (block # )
ORA-15055: unable to connect to ASM instance
ORA-15055: unable to connect to ASM instance
ORA-00020: maximum number of processes (100) exceeded
Error 704 happened during db open, shutting down database
USER (ospid: 21886): terminating the instance due to error 704
Instance terminated by USER, pid = 21886



In a nutshell, when you encounter a similar issue, examine the full error stack to diagnose the real cause of the problem.
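
As a rough illustration of reading the error stack, here is a minimal Python sketch that collects the ORA codes reported alongside ORA-01115 in an alert.log excerpt and maps a few of them to the causes listed earlier. The CAUSE_HINTS table and the function names are my own assumptions for illustration, not an official Oracle mapping, and the hints only say where to look first.

```python
import re

# Illustrative hints only (an assumption, not an official Oracle table):
# companion ORA codes mapped to the kind of cause they usually point to.
CAUSE_HINTS = {
    "ORA-15055": "lost communication with the ASM instance",
    "ORA-00020": "process limit exceeded",
    "ORA-01578": "data block corruption",
    "ORA-27072": "file I/O error at the OS or hardware level",
    "ORA-00376": "datafile cannot be read (possibly OFFLINE)",
}

ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def companion_codes(alert_lines, target="ORA-01115"):
    """Return the other ORA codes seen in the excerpt, in first-seen order.

    Returns an empty list when the target error is not in the excerpt.
    """
    codes = []
    for line in alert_lines:
        for code in ORA_CODE.findall(line):
            if code not in codes:
                codes.append(code)
    if target not in codes:
        return []
    return [code for code in codes if code != target]


def likely_causes(alert_lines):
    """Pair each recognised companion code with its hint."""
    return [(code, CAUSE_HINTS[code])
            for code in companion_codes(alert_lines)
            if code in CAUSE_HINTS]
```

Fed the alert.log excerpt above, the ORA-15055 and ORA-00020 entries are the ones that would be flagged, which matches the ASM communication loss we saw.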

NOTE: I welcome your input and additions, and would be glad to hear about your experience with this issue.

6.13.2013

Rejoining a node to a 2 node cluster on Windows Platform - Yet another learning experience

A couple of days ago, a node (the second node) had to rejoin a 2 node cluster on the Windows 2008 platform. The Windows OS admin team had rebuilt the second node after an OS crash without informing the DBAs, and then asked the DBA team to rejoin the node to the existing cluster. Although the second node was clean after the rebuild, it had not been properly removed from the cluster.

When the addNode.bat procedure was initiated on node 1, it failed with the following errors:

Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
SEVERE:Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.


The error string doesn't point to any obvious reason for the failure. The following was reported in the addnode_action*.logs:

Node is okay
--------------------------------------------------------------------------
INFO: Setting variable 'REMOTE_CLEAN_MACHINES' to 'xxxxxdb2'. Received the value from a code block.
INFO:  cannot initialize cluster interface skgxn error number 1311719766   operation skgxncin   location forced vacuo Could not initialize cluster
INFO: Vendor clusterware is not detected.
INFO: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
SEVERE: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
INFO: User Selected: Yes/OK

Both messages were misleading in my case, though. I had verified that the cluster home was properly set before executing the addNode.bat procedure. However, the other message, 'INFO:  cannot initialize cluster interface skgxn error number 1311719766   operation skgxncin   location forced vacuo Could not initialize cluster', caught my attention. A little research on the net and My Oracle Support (MOS) pointed in another direction. One of the MOS (Metalink) notes mentioned that this issue occurs while adding the RDBMS (RAC) home, not the GI home, and is due to a missing olsnodes executable. If you find lsnodes, you must replace it with olsnodes and retry the addNode.sh procedure. In our case, however, the olsnodes executable existed in both the GI and RAC homes, so this was not the cause.


Since the node hadn't been removed properly, I looked at the inventory to verify the current node list and found that it still listed both nodes. All I had to do was update the inventory (which is usually a post delete-node step) so that only one node was listed. I updated the inventory with one node for both the GI and RDBMS homes. After that, the addNode.bat procedure went smoothly and the node was added (rejoined) successfully.
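
To check what the inventory currently lists before retrying, a small Python sketch like the one below can read the NODE_LIST entries per home from the central inventory.xml. It assumes the usual HOME / NODE_LIST / NODE layout of that file. The home names and the SAMPLE content are hypothetical, and the node names reuse the masked hostnames from the log above. It only reports the node list. The update itself is normally done through the Oracle Universal Installer's -updateNodeList option rather than by editing the file by hand.

```python
import xml.etree.ElementTree as ET


def nodes_per_home(inventory_xml):
    """Map each HOME name to the node names in its NODE_LIST."""
    root = ET.fromstring(inventory_xml)
    result = {}
    for home in root.iter("HOME"):
        result[home.get("NAME")] = [node.get("NAME") for node in home.iter("NODE")]
    return result


def homes_listing(inventory_xml, node):
    """Return the homes whose node list still contains the given node."""
    return [name for name, nodes in nodes_per_home(inventory_xml).items()
            if node in nodes]


# Hypothetical inventory content: the GI home still lists the rebuilt node,
# while the RDBMS home has already been updated to a single node.
SAMPLE = """<INVENTORY>
<HOME_LIST>
<HOME NAME="GI_home" LOC="GI_HOME_PATH" TYPE="O" IDX="1" CRS="true">
   <NODE_LIST>
      <NODE NAME="xxxxxdb1"/>
      <NODE NAME="xxxxxdb2"/>
   </NODE_LIST>
</HOME>
<HOME NAME="DB_home" LOC="DB_HOME_PATH" TYPE="O" IDX="2">
   <NODE_LIST>
      <NODE NAME="xxxxxdb1"/>
   </NODE_LIST>
</HOME>
</HOME_LIST>
</INVENTORY>"""

if __name__ == "__main__":
    for home, nodes in nodes_per_home(SAMPLE).items():
        print(home, nodes)
    print("Homes still listing xxxxxdb2:", homes_listing(SAMPLE, "xxxxxdb2"))
```

Any home still reported for the rebuilt node needs its node list corrected before addNode.bat is retried.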


The bottom line is that, in many circumstances, the errors reported don't point in the right direction. Hence, it is important to review the different logs, understand the history, and sometimes think outside the box.



5.31.2013

Introducing Java EE 7 - Live Webcast


Wednesday, June 12, 2013 / Thursday, June 13, 2013 

Two opportunities to come together with the Java community, chat with experts, and explore Java EE 7: 
9 a.m. PT / 12 p.m. ET / 5 p.m. London or 
9 p.m. PT / 12 a.m. ET (Thursday) / 2 p.m. Sydney (Thursday) 

The introduction of Java EE 7 is a free online event where you can connect with Java users from all over the world as you learn about the power and capabilities of Java EE 7. Join us for presentations from Oracle technical leaders and Java users from both large and small enterprises, deep dives into the new JSRs, and scheduled chats with Java experts.

Register for the event here.

Java EE 7 updates (session recording and PDF)
https://java.net/projects/jugs/downloads/download/JavaEE_Update_ArunGupta_May30.mp3



5.28.2013

Expert Oracle RAC 12c - upcoming book

Here is the TOC for the upcoming Expert Oracle RAC 12c book, published by Apress, which is slated for release sometime in August 2013 (subject, of course, to the Oracle 12c announcement).


Table of contents

  1. Overview of Oracle RAC
  2. Clusterware Management and Troubleshooting
  3. RAC Operational Practices
  4. RAC New Features
  5. Storage and ASM Practices
  6. Application Design Issues
  7. Managing and Optimizing a Complex RAC Environment
  8. Backup and Recovery in RAC
  9. Network Practices in RAC
  10. RAC Database Optimization
  11. Locks and Deadlocks
  12. Parallel Query in RAC
  13. Clusterware and Database Upgrades
  14. Oracle RAC One Node
  15. Virtualized RAC - Setup DB Clouds - Part 1
  16. Virtualized RAC - Setup DB Clouds - Part 2

You might get about 29% off a pre-order copy at Amazon.

5.18.2013

A tricky standby database situation

A tricky and interesting situation came up this morning while configuring a standby database of over 1.5TB. While the database was being cloned to the DR site with the DUPLICATE..ACTIVE DATABASE command, which took more than 1.5 days, a couple of new datafiles were added to the primary database. After the cloning process was over, the newly built DR database was almost 2 days behind the primary. I knew I could bring the PRIMARY and STANDBY in sync using the standby roll-forward method, but I already had daily cumulative incremental backups on TAPE. Taking a fresh incremental backup for the roll-forward would have taken considerable time, so I decided to make use of the existing backups. When the roll-forward method was followed, the following error was encountered:

RMAN> SWITCH DATABASE TO COPY;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 05/18/2013 10:25:29
RMAN-06571: datafile 58 does not have recoverable copy


This was expected, because the datafile in question was added after the standby database creation had been initiated.
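
To see which datafiles need to be restored on the standby, the RMAN output can be scanned for RMAN-06571 lines. The Python snippet below is a minimal sketch of that idea. The function name is hypothetical, and RMAN may stop at the first such datafile, so the result may cover only part of what is missing.

```python
import re

# Matches the message format shown in the RMAN output above.
MISSING_COPY = re.compile(
    r"RMAN-06571: datafile (\d+) does not have recoverable copy"
)


def datafiles_without_copy(rman_output):
    """Return the sorted, de-duplicated datafile numbers reported by RMAN-06571."""
    return sorted({int(number) for number in MISSING_COPY.findall(rman_output)})
```

Run against the error stack above, this yields datafile 58, one of the files added on the primary while the clone was running.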

Workaround:
I had to try an out-of-the-box solution (roll-forward method):
  1. Re-create and restore the standby controlfile
  2. Restore missing datafiles on the standby
  3. Catalog standby database datafiles (diskgroup was different from primary)
  4. Recover the database
  5. Complete the rest of the standby configuration to bring it in sync

I will be writing a detailed article on this. Stay tuned for more.

Happy reading

Jaffar