6.13.2013

Rejoining a node to a 2 node cluster on Windows Platform - Yet another learning experience

A couple of days ago, a node (the second node) had to be rejoined to a 2-node cluster on the Windows 2008 platform. The Windows OS admin team had rebuilt the second node after an OS crash without informing the DBAs, and then asked the DBA team to rejoin the node to the existing cluster. Although the second node was clean after the rebuild, it had never been properly removed from the cluster.

When the addNode.bat procedure was initiated on node 1, it failed with the following errors:

Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
SEVERE:Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.


The error string gives no obvious reason for the failure. The following was reported in the addnode_action*.log files:

Node is okay
--------------------------------------------------------------------------
INFO: Setting variable 'REMOTE_CLEAN_MACHINES' to 'xxxxxdb2'. Received the value from a code block.
INFO:  cannot initialize cluster interface skgxn error number 1311719766   operation skgxncin   location forced vacuo Could not initialize cluster
INFO: Vendor clusterware is not detected.
INFO: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
SEVERE: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
INFO: User Selected: Yes/OK

Both messages turned out to be misleading in my case. I had verified that the clusterware home was properly set before executing the addNode.bat procedure. However, the other error, 'INFO:  cannot initialize cluster interface skgxn error number 1311719766   operation skgxncin   location forced vacuo Could not initialize cluster', caught my attention. A little research on the net and on My Oracle Support (MOS) pointed me in another direction. One of the MOS (Metalink) notes mentioned that this issue occurs while adding the RDBMS (RAC) home, not the GI home, and that it is caused by a missing olsnodes executable: if you find lsnodes instead, you must replace it with olsnodes and retry the addNode.sh procedure. In our case, however, the olsnodes executable existed in both the GI and RAC homes, so that was not the cause.
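
The useful clue above was logged at INFO level rather than SEVERE, so filtering a log by severity alone would have hidden it. Below is a minimal Python sketch of scanning an addnode action log by keyword instead. The sample text is copied from the log excerpt in this post; the keyword list and the function name suspicious_lines are my own assumptions, not part of any Oracle tooling.

```python
import re

# Sample lines copied from the addnode_action log excerpt in this post.
SAMPLE_LOG = """\
INFO: Setting variable 'REMOTE_CLEAN_MACHINES' to 'xxxxxdb2'. Received the value from a code block.
INFO:  cannot initialize cluster interface skgxn error number 1311719766   operation skgxncin   location forced vacuo Could not initialize cluster
INFO: Vendor clusterware is not detected.
INFO: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
SEVERE: Error ocurred while retrieving node numbers of the existing nodes. Please check if clusterware home is properly configured.
INFO: User Selected: Yes/OK
"""

# Assumed keyword list: phrases that hint at a failure whatever the log level.
SUSPECT = re.compile(r"error|cannot|could not|not detected", re.IGNORECASE)
LEVELS = {"INFO", "WARNING", "SEVERE"}


def suspicious_lines(log_text):
    """Return (line_number, level, message) for lines that mention a failure."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        level, _, message = line.partition(":")
        level = level.strip()
        if level in LEVELS and SUSPECT.search(message):
            # Collapse runs of spaces so the message is easier to read.
            hits.append((number, level, " ".join(message.split())))
    return hits


if __name__ == "__main__":
    for number, level, message in suspicious_lines(SAMPLE_LOG):
        print(f"{number:>3} [{level}] {message}")
```

Run against a real log file by reading it into a string first; the keyword list will probably need tuning for your environment.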


Since the node hadn't been removed properly, I wanted to look at the inventory to verify the current node list. I found that the inventory still listed two nodes. All I had to do was update the inventory (which is usually a post-delete-node step) so that only one node was listed. I updated the inventory with one node for both the GI and RDBMS homes. After that, the addNode.bat procedure ran smoothly and added (rejoined) the node successfully.
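
The inventory check described above can be sketched in Python. This assumes the central inventory file (ContentsXML\inventory.xml) uses the usual HOME / NODE_LIST / NODE layout. The sample XML is made up for illustration: the home names, paths and the node name xxxxxdb1 are hypothetical, and only xxxxxdb2 appears in the log above. The last helper only builds the argument list for OUI's -updateNodeList option as I understand its documented syntax; treat it as an assumption and verify it against the documentation for your release before running anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; home names, paths and xxxxxdb1 are illustrative only.
SAMPLE_INVENTORY = r"""<INVENTORY>
  <HOME_LIST>
    <HOME NAME="OraGI_home1" LOC="C:\app\grid" TYPE="O" IDX="1" CRS="true">
      <NODE_LIST>
        <NODE NAME="xxxxxdb1"/>
        <NODE NAME="xxxxxdb2"/>
      </NODE_LIST>
    </HOME>
    <HOME NAME="OraDb_home1" LOC="C:\app\oracle\db_1" TYPE="O" IDX="2">
      <NODE_LIST>
        <NODE NAME="xxxxxdb1"/>
        <NODE NAME="xxxxxdb2"/>
      </NODE_LIST>
    </HOME>
  </HOME_LIST>
</INVENTORY>"""


def nodes_per_home(inventory_xml):
    """Map each home location to the node names the inventory lists for it."""
    root = ET.fromstring(inventory_xml)
    return {
        home.get("LOC"): [n.get("NAME") for n in home.findall("./NODE_LIST/NODE")]
        for home in root.iter("HOME")
    }


def stale_entries(inventory_xml, live_nodes):
    """Return, per home, the listed nodes that are not really in the cluster."""
    live = set(live_nodes)
    stale = {}
    for loc, nodes in nodes_per_home(inventory_xml).items():
        leftover = [n for n in nodes if n not in live]
        if leftover:
            stale[loc] = leftover
    return stale


def update_node_list_args(oracle_home, nodes, is_crs_home):
    """Build (but do not run) an OUI -updateNodeList command line.

    Assumption: syntax as documented for OUI on Windows; check your release.
    """
    args = [
        oracle_home + r"\oui\bin\setup.exe",
        "-updateNodeList",
        "ORACLE_HOME=" + oracle_home,
        "CLUSTER_NODES={" + ",".join(nodes) + "}",
    ]
    if is_crs_home:
        args.append("CRS=TRUE")
    return args


if __name__ == "__main__":
    for loc, leftover in stale_entries(SAMPLE_INVENTORY, ["xxxxxdb1"]).items():
        print(loc, "still lists", leftover)
```

In the situation described here, every home would be reported as still listing the second node, which is the stale entry that had to be cleared before addNode.bat could succeed.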


The bottom line is that, in most circumstances, the errors reported don't point in the right direction. Hence, it is important to review the different logs, understand the history and, sometimes, think outside the box.


