A few days back, the clusterware on one of our production cluster nodes became unhealthy, which caused the instances on that node to terminate. Had it been a node eviction, everything would have come back automatically after the node rebooted; in this scenario, however, only the cluster components were in an unhealthy state, so no instance was started on the node.
Upon reviewing the ocssd.log file, the following error was found:
cssd(21532)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/oracle/ocr/ora_ocr_01 will be considered not functional in 13169 milliseconds
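As a quick sanity check in a situation like this, the voting files known to the node and their current states can be listed with crsctl (a generic check, not taken from the original incident output):

crsctl query css votedisk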
The node was unable to communicate with the rest of the cluster nodes. If so, why wasn't the node crashed/evicted? Isn't that the question that comes to your mind?
Well, from the error it is clear that the node suffered I/O issues: it was unable to access the voting disk and started complaining (in the ocssd.log) that it couldn't see the other nodes in the cluster. When the storage and OS teams were contacted, the OS team was quick to identify an issue with a PCI card. For some reason, the I/O channel from the node to the storage was suspended for 10 seconds, after which the connection was re-established. In the meantime, the cluster on the node became unhealthy and couldn't proceed with any further action.
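In this state, the overall health of the stack and of the low-level (init) resources such as ora.cssd can be confirmed with crsctl; a minimal sketch of the checks one would typically run (generic commands, not from the incident logs):

crsctl check crs
crsctl stat res -t -init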
The only workaround we had to perform was restarting the CSS component using the following command:
crsctl start res ora.cssd -init
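Once the command completes (it is typically run as root from the Grid Infrastructure home), the state of the CSS daemon can be verified with:

crsctl stat res ora.cssd -init
crsctl check css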
Once CSS was started, everything was back to a normal state. Luckily, we didn't have to do a lot of research into why the cluster became unhealthy or why the instances crashed.