5.03.2013

Things to be considered before/after an OS patch deployment

The objective of this write-up is to emphasize the importance of verifying patch compatibility and relinking the Oracle home after patching the underlying Operating System (OS) in any Oracle environment. I would like to share an incident (a little story) we encountered a few days ago in one of our non-production RAC environments, where the Clusterware stack didn't start after an OS patch deployment.

As part of the organization's patching policy, our HP-UX admin scheduled the latest quarterly HP-UX v11.3x OS patch deployment on all servers, and both non-RAC and Oracle RAC environments were patched as part of the exercise. Although the patching activity went smoothly in both environments, we faced issues starting the cluster stack in the cluster environment. When we verified the cluster stack status, we noticed that the Cluster Synchronization Services daemon (cssd) was stuck in the 'STARTING' state, as shown below:


$ ./crsctl stat res -init -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE      rac1                     
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE      rac1                     
ora.crsd
      1        ONLINE  OFFLINE      rac1                     
ora.cssd
      1        ONLINE  OFFLINE      rac1                     STARTING                
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1                      


The Oracle High Availability Services daemon (ohasd) started without any issues; however, crsd couldn't be started on any of the nodes after the patch deployment. Upon examining the ocssd.log, we found that the process was somehow unable to discover the voting disks, and hence the crsd process couldn't start. The following messages appeared in the ocssd.log:

CRS-1714:Unable to discover any voting files
2013-04-23 18:47:16.553: [ SKGFD][6]Discovery with str:/dev/rdsk/c0t5d5,/dev/rdsk/c0t5d4:

2013-04-23 18:47:16.553: [ SKGFD][6]UFS discovery with :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]Fetching UFS disk :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]OSS discovery with :/dev/rdsk/c0t5d5:
2013-04-23 18:47:16.559: [ SKGFD][6]Discovery advancing to nxt string :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.559: [ SKGFD][6]UFS discovery with :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ SKGFD][6]Fetching UFS disk :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ SKGFD][6]OSS discovery with :/dev/rdsk/c0t5d4:
2013-04-23 18:47:16.564: [ CSSD][6]clssnmvDiskVerify: Successful discovery of 0 disks
2013-04-23 18:47:16.564: [ CSSD][6]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2013-04-23 18:47:16.564: [ CSSD][6]clssnmvFindInitialConfigs: No voting files found
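
For reference, the ocssd.log excerpted above sits under the Grid Infrastructure home in the standard 11gR2 layout; the node-name directory shown here is a placeholder and may differ in your setup:

$ tail -100 $GRID_HOME/log/<node_name>/cssd/ocssd.log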

From the messages it was pretty clear that, for some reason, the voting disks (placed on the shared storage) were inaccessible to the node(s). When we searched the internet and My Oracle Support (MOS) using the error codes, all the links pointed to verifying the ownership and permissions on the voting disks. We found no issues with the ownership or permissions on the voting disks; we even dumped the disks with the dd command and found no corruption. After an hour of hard struggle, there was a little hope when we came across MOS note 1508899.1, which described an incident very close to ours.
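For illustration, the checks we ran were along these lines; the device names are specific to this environment, and the dd block size/count values are only indicative:

$ ls -l /dev/rdsk/c0t5d5 /dev/rdsk/c0t5d4                        # confirm grid owner/group and permissions
$ dd if=/dev/rdsk/c0t5d5 of=/tmp/vote1.dmp bs=8192 count=128     # confirm the device is readable (dumps ~1 MB)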
According to the note, the issue was caused by bug 14810756, and the workaround is to apply patch 14810756 or roll back the OS patch PHCO_43004. Applying the patch was not an option for us since we were unable to start the cluster, so we checked with the OS admin whether PHCO_43004 was part of the bundle patch deployed a while ago on the HP-UX 11.3x platform. The OS admin confirmed that the particular patch was indeed part of the patch bundle. We then requested the OS admin to roll back the patch in question to try our luck. After rolling back the patch on one node, the cluster stack started successfully on that node. We did the same on the rest of the nodes and everything came back up successfully.
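For completeness, verifying and rolling back an individual patch on HP-UX is done with the SD-UX tools; a rough sketch is below (run as root, and note that swremove only works if the patch was installed with rollback files saved and has not been committed):

swlist -l patch | grep PHCO_43004      # check whether the patch is installed on this node
swremove PHCO_43004                    # roll back the patch (requires saved rollback files)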
The MOS note states that the issue is likely to occur during execution of the rootupgrade.sh script as part of a cluster upgrade from 11.2.0.2 to 11.2.0.3 on the HP-UX 11.3x platform, when the voting disks are placed on disk/raw devices.
We fail to understand why HP didn't mention this behavior, despite similar issues having been recorded and addressed on the HP forums.

Conclusion:
The motive of this blog entry is to emphasize the importance of verifying the compatibility of a PATCH before deploying it in any environment.
It is also highly advisable to relink the Oracle binaries manually right after an OS patch deployment. The following demonstrates how to relink the binaries in an 11gR2 GI RAC environment:

As the root user, unlock the CRS home (ensure the cluster stack is not running on the server):
$GRID_HOME/crs/install/rootcrs.pl -unlock

As the Oracle software owner, relink the RDBMS home with the RAC option enabled:
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk rac_on ioracle
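
Once the relink completes, the Grid home must be locked again and the stack restarted. A minimal sketch of the closing steps, assuming the standard 11gR2 layout (confirm the exact procedure against the MOS relink note for your version):

As the root user:
$GRID_HOME/rdbms/install/rootadd_rdbms.sh       # restores root ownership/setuid bits on relinked binaries
$GRID_HOME/crs/install/rootcrs.pl -patch        # locks the GI home and restarts the cluster stack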


References:
  • How to Check Whether Oracle Binary/Instance is RAC Enabled and Relink Oracle Binary in RAC [ID 284785.1]
  • hp-ux: 11gR2 GI Fails to Start or rootupgrade.sh Fails with "clsfmt: Received unexpected error 4 from skgfifi for file" if PHCO_43004 is Applied [ID 1508899.1]



2 comments:

Anonymous said...

Dear Jaffar,

Good Day !

Great FYI... After you sorted out this critical CRS issue on 03-05-2013, HP provided the solution to address this issue on 13-05-2013.

- These behaviors are corrected in PHCO_43503, which is
available. HP recommends installing PHCO_43503 to
correct these behaviors.

For more details click below link:

http://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/mostViewedDisplay/?sp4ts.oid=3955589&spf_p.tpst=psiContentDisplay&spf_p.prp_psiContentDisplay=wsrp-navigationalState%3DdocId%253Dpdb_na-PHCO_43004-6%257CdocLocale%253Den_US&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

KamranAghayev A. said...

Interesting post Jaffar, thanks for sharing