1.27.2014

CRS-1615:No I/O has completed after 50% of the maximum interval

A few days ago, the clusterware on one of our production cluster nodes became unhealthy and caused the instances on that node to terminate. Had it been a node eviction, everything would have come back automatically after the node reboot. In this scenario, however, only the cluster components were in an unhealthy state, so no instance started on the node.

On checking the ocssd.log file, we found the following error:

[cssd(21532)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/oracle/ocr/ora_ocr_01 will be considered not functional in 13169 milliseconds
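
If you want to spot this message in a script rather than by eye, a minimal Python sketch along the following lines could pull out the voting file and the remaining time. The regular expression is based only on the single line quoted above, so treat the pattern and the function name parse_crs_1615 as assumptions; other clusterware versions may format the message differently.

```python
import re

# Pattern derived from the one CRS-1615 line quoted in this post (assumption).
# The leading "[" is optional so the line matches with or without it.
CRS_1615 = re.compile(
    r"\[?cssd\((?P<pid>\d+)\)\]CRS-1615:.*?"
    r"Voting file (?P<voting_file>\S+) will be considered not functional "
    r"in (?P<ms>\d+) milliseconds"
)


def parse_crs_1615(line):
    """Return pid, voting file and remaining milliseconds, or None if no match."""
    match = CRS_1615.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "voting_file": match.group("voting_file"),
        "ms_remaining": int(match.group("ms")),
    }


# The message exactly as it appeared in our log.
sample = (
    "cssd(21532)]CRS-1615:No I/O has completed after 50% of the maximum "
    "interval. Voting file /dev/rdisk/oracle/ocr/ora_ocr_01 will be "
    "considered not functional in 13169 milliseconds"
)

info = parse_crs_1615(sample)
print(info)
```

Feeding each line of the log through parse_crs_1615 and reporting the non-None results would be enough for a simple alert.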

The node was unable to communicate with the rest of the cluster nodes. If so, why didn't my node crash or get evicted? Doesn't that question come to your mind?

Well, the error makes it clear that the node suffered I/O issues: it was unable to access the voting disk and started complaining (in ocssd.log) that it couldn't see the other nodes in the cluster. When we contacted the storage and OS teams, the OS team quickly identified an issue with the PCI card. For some reason, the I/O channel from the node to the storage was suspended for 10 seconds and then re-established. In the meantime, the clusterware on the node became unhealthy and couldn't proceed with any further action.

The only workaround we had to perform was to restart the CSS component using the following command:

crsctl start res ora.cssd -init

Once CSS was started, everything returned to normal. Luckily, we didn't have to do much research into why the cluster became unhealthy or why the instances crashed.

1.26.2014

AUTO Vs Manual PSU patch - when and why?

The purpose of this blog entry is to share my thoughts on AUTO vs. manual PSU patch deployment, covering when and why to use each, along with a success story: we recently reduced our overall patching time by almost 50%. I thought this might help you as well.

From my own perspective, I think we are one of the organizations in the Middle East with a large and complex Oracle cluster setup in place. With six cluster environments, covering production, non-production and DR, maintaining them is always going to be an uphill and challenging task. One of the tasks that most often demands attention and effort is none other than PSU patch deployment across all six environments on our premises.

We are currently applying the 11.2.0.2.11 PSU patch in all our environments, and the challenge in front of us is to bring down the patching time on each server. Our past experience says that AUTO patching needs a minimum of 2 hours per node, so patching a 10-node cluster requires a window of at least 22 hours.

AUTO Patching
No doubt AUTO patching is the coolest enhancement from Oracle: it automates the entire patching procedure smoothly and, more importantly, without much human intervention. Its downside is that AUTO patching of the GI and RDBMS homes together either fails or cannot be used in the following situations:

  1. When you have multiple Oracle homes with different software owners. For example, we have Oracle E-Business Suite and typical RDBMS homes (10g and 11g) under different ownership.
  2. When you have multiple versions of Oracle databases running. For example, we have Oracle 10g and Oracle 11g databases.
  3. When, during the course of patching, one of the files fails to be copied, or its backup copy fails to be rolled back, because the file is busy with another OS process. The patch is then rolled back, and the cluster and the other services on the node are automatically restarted.

In the above circumstances, one may choose to run AUTO patching separately, first for the GI home and then for the RDBMS homes. However, when you hit point 3 above, the same thing is going to happen, which is, of course, time consuming. While patching one particular node, AUTO patching of the GI home failed three times because the cluster did not shut down successfully, and we ended up rebooting the node three times.

Manual Patching
In contrast, manual patching requires heavy human intervention during the course of patch deployment. A set of steps needs to be followed carefully.

Since we were challenged to look at every possibility for reducing the overall patching time, we started analyzing the options between AUTO and manual patch deployment and where the time was being consumed or wasted. We found that after each AUTO patching attempt, successful or not, the cluster and the services on the node have to restart, and this was the time-consuming factor. In a large, complex cluster environment with many instances, ASM disks and disk groups, starting everything up certainly takes a good amount of time. This caught our attention, and we thought of giving manual patching a try.

When we tried the manual patching method, we managed to patch the GI and RDBMS homes in about 1 hour, almost 50% less than the AUTO patching time frame. Imagine: we finished 6 nodes in about 7 hours, in contrast to a 12-13 hour time frame.
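
To make the arithmetic behind these figures concrete, here is a rough Python sketch for estimating a patching window. The per-node times (about 2 hours for AUTO, about 1 hour for manual) come from our experience above. The function names and the optional overhead_hours parameter are assumptions added for illustration, standing in for the extra time on top of the raw per-node totals that our real windows included.

```python
def estimate_window_hours(nodes, hours_per_node, overhead_hours=0.0):
    """Rough window: nodes patched one after another, plus a flat overhead."""
    return nodes * hours_per_node + overhead_hours


def saving_percent(auto_hours, manual_hours):
    """Percentage of the AUTO window saved by going manual."""
    return 100.0 * (auto_hours - manual_hours) / auto_hours


# Six-node cluster using the approximate per-node times from this post.
auto_6 = estimate_window_hours(6, 2.0)    # AUTO: about 2 hours per node
manual_6 = estimate_window_hours(6, 1.0)  # manual: about 1 hour per node

print("AUTO window (hours):  ", auto_6)
print("Manual window (hours):", manual_6)
print("Saving (%):           ", saving_percent(auto_6, manual_6))

# With an assumed 1 hour of overhead, the manual estimate matches the
# roughly 7 hours we actually needed for 6 nodes.
manual_6_with_overhead = estimate_window_hours(6, 1.0, overhead_hours=1.0)
print("Manual with overhead (hours):", manual_6_with_overhead)
```

Plug in your own node count and per-node timings; the bigger the cluster, the more the per-node difference adds up.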

In a nutshell, if you have a small cluster environment, say a 2-node cluster, you may feel you don't gain much in terms of time saved. However, if you are going to patch a large, complex cluster environment, consider the manual method, which could save a considerable amount of patching downtime. At the same time, keep in mind that this method requires more DBA intervention than AUTO patching.