10.28.2014

AIOUG annual Oracle conference - SANGAM14

The All India Oracle User Group (AIOUG) annual Oracle conference, Sangam14, is less than 10 days away. This is the largest Oracle conference in India; it takes place every year in a different city, with thousands of attendees and over 100 sessions delivered by Oracle experts from across the globe.

This year's SANGAM is scheduled for Nov 7-9 in Bangalore. Don't let the opportunity go in vain; grab it if you are in India. I am super excited about the conference and look forward to attending Tom Kyte's 'Optimizer master class', a full-day class, and also Maria's 'Oracle Database In-Memory option' session.

My sessions are as follows:



For more details on the agenda, speakers and enrollment, visit http://sangam14.info/.

I look forward to seeing you in person at the conference.

8.19.2014

Oracle 11204 Clusterware upgrade - ASM glitch

Yet another tough challenge was thrown at my team right after the disaster recovery (DR) simulation drill we performed barely a couple of weeks ago. The new task in hand is to upgrade the existing four cluster environments from 11.2.0.2 to 11.2.0.4, as Oracle has already stopped supporting v11.2.0.2.

Although we had successfully upgraded a 3-node cluster last week, we encountered ASM upgrade trouble whilst running rootupgrade.sh in a new cluster environment (7 nodes). The following errors were reported during the rootupgrade.sh script execution:

CRS-2672: Attempting to start 'ora.asm' on 'node01' 
CRS-5017: The resource action "ora.asm start" encountered the following error: 
ORA-48108: invalid value given for the diagnostic_dest init.ora parameter 
ORA-48140: the specified ADR Base directory does not exist [/u00/app/11.2.0/grid/dbs/{ORACLE_BASE}] 
ORA-48187: specified directory does not exist 
HPUX-ia64 Error: 2: No such file or directory 
Additional information: 1 
CRS-2674: Start of 'ora.asm' on 'node01' failed 
CRS-2679: Attempting to clean 'ora.asm' on 'node01' 
CRS-2681: Clean of 'ora.asm' on 'node01' succeeded 
CRS-4000: Command Start failed, or completed with errors. 

When we tried to start up the ASM instance manually from the SQL*Plus prompt, the following errors were thrown:

SQL> 
ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance 
ORA-48108: invalid value given for the diagnostic_dest init.ora parameter 
ORA-48140: the specified ADR Base directory does not exist [/u00/app/11.2.0/grid/dbs/{ORACLE_BASE}] 
ORA-48187: specified directory does not exist 
HPUX-ia64 Error: 2: No such file or directory

Sadly, there wasn't much information available about the nature of this problem. As usual, after trying different options for about an hour, we opened an SR with Oracle Support and agreed to roll back the upgrade on the node where the rootupgrade.sh script had failed. Luckily, this was the first node we tried, and the other 6 nodes were running just fine. Even after rolling back to the previous cluster version, the ASM instance error still persisted.

To resolve the ASM instance startup issue, the following actions were taken:

  • export diagnostic_dest=/u00/app/oracle
  • From active ASM instance on another node, executed the following statement:
    • SQL> ALTER SYSTEM STOP ROLLING MIGRATION;
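Put together, the sequence looks roughly like this (a sketch only; the node name and the /u00/app/oracle path are from our environment, and the final ASM startup step is implied rather than spelled out above):

On the failing node (node01), as the grid owner:
export diagnostic_dest=/u00/app/oracle

From a surviving ASM instance on another node:
SQL> ALTER SYSTEM STOP ROLLING MIGRATION;

Back on the failing node:
sqlplus / as sysasm
SQL> STARTUP;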

Cause:
The problem that caused the ASM instance startup issue has been logged as a known bug (17449823).

Workaround:
According to MOS Doc ID 1598959.1, the bug is still being worked on by the development team; they suggest the following workaround on each node just before running the rootupgrade.sh script:
  • mkdir <New-GI-HOME>/dbs/{ORACLE_BASE} 
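In other words, a directory literally named {ORACLE_BASE} (curly braces included) must exist under the new Grid home's dbs directory on every node before rootupgrade.sh runs. A sketch, assuming the Grid home path reported in our error messages above:

mkdir -p '/u00/app/11.2.0/grid/dbs/{ORACLE_BASE}'
ls -ld '/u00/app/11.2.0/grid/dbs/{ORACLE_BASE}'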
Third successful attempt
The upgrade failed in the first 2 attempts; the 3rd attempt was successful, and we managed to upgrade all 7 nodes from 11.2.0.2 to 11.2.0.4. It was also learnt that CRS_HOME, ORACLE_HOME and ORACLE_BASE had not been unset before runInstaller was initiated. In the 3rd attempt, with those variables unset, the upgrade went through successfully.
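As a note to self, clearing the environment before launching the installer is simple (variable names as mentioned above; run in the shell session that will start the upgrade):

unset ORACLE_HOME ORACLE_BASE CRS_HOME
./runInstaller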

Addendum (24-Aug-2014)
A couple of new challenges were encountered in the last upgrade task, on a 10-node cluster.

  1. The OUI window from which runInstaller was initiated got closed because the PC rebooted.
  2. Although the {ORACLE_BASE} directory was created under the new Grid home, the issue kept recurring.
Here is the solution:
  1. How to Complete 11gR2 Grid Infrastructure Configuration Assistant(Plug-in) if OUI is not Available (Doc ID 1360798.1)
  2. Ensure the diagnostic_dest is updated in the ASM spfile to the new location before running rootupgrade.sh (see the sketch below)
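A minimal sketch of the second point, assuming the ASM instances use an spfile and /u00/app/oracle is the intended ADR base as in our setup (connect to the ASM instance as SYSASM):

SQL> ALTER SYSTEM SET diagnostic_dest='/u00/app/oracle' SCOPE=BOTH SID='*';
SQL> SHOW PARAMETER diagnostic_dest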


References:

  • Things to Consider Before Upgrading to 11.2.0.3/11.2.0.4 Grid Infrastructure/ASM (Doc ID 1363369.1)
  • Things to Consider Before Upgrading to 11.2.0.4 to Avoid Poor Performance or Wrong Results (Doc ID 1645862.1)
  • GI rootupgrade.sh on last node: ASM rolling upgrade action failed (Doc ID 1598959.1)
  • Bug 17449823




8.06.2014

Disaster Recovery Simulation test - performed 30 databases failover

Successfully failed over (switched the roles of) more than 30 physical standby databases this morning as part of the Disaster Recovery (DR) simulation test. Fortunately, there were no technical glitches or hassles during the course of testing, as anticipated. It was indeed a great test, and a very successful one too.
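For context, the failover of an individual physical standby boils down to something like the following on the standby side (a sketch of the standard SQL*Plus method; a Data Guard broker FAILOVER command achieves the same result):

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
SQL> ALTER DATABASE OPEN;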

The next big challenge for the team will be reconstructing those 30 physical standby databases, which range from 100 GB to 5 TB in size, and getting them back in sync.
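One likely approach for the rebuild is RMAN active duplication from the new primary, roughly as below (a sketch; prod and stby are hypothetical TNS aliases for the primary and the standby being rebuilt):

rman TARGET sys@prod AUXILIARY sys@stby
RMAN> DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE NOFILENAMECHECK;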

Anyway, my team loves the challenges and is truly enjoying every moment.





7.23.2014

EID Holidays and things to do

Looking forward to a much-anticipated 9-day EID holiday break to work through the to-do list I have been carrying for a while now. I am determined to complete some of the writing assignments that have been pending for a long time. At the same time, I will have to explore the new features of v12.1.0.2 and Exadata, as we might be going for that combination in the coming weeks for a Data Warehouse project.

I will surely blog about my test scenarios and share my findings on the Oracle 12c new features.

I wish everyone a very happy and prosperous EID in advance.

6.04.2014

Monthly article publication on Toad World portal

Let me quickly give you an update, in case you are wondering why I have kept quiet and not made much buzz on my blog. Well, over the past year or so, I have been contributing monthly articles to the Toad World newsletter, and all my articles are published on their portal. I encourage you all to visit the Toad World website, where many Oracle experts and gurus contribute and share knowledge.

5.19.2014

Middle East & North Africa (MENA) 2014 OTN Tour - Tunis/Riyadh/Jeddah/Dubai

The first ever MENA OTN Tour is scheduled from 26 May to 1 June 2014 across major cities in Tunisia, Saudi Arabia and Dubai.

2 Continents, 3 Countries, 5 Cities: 50+ Action-Packed Oracle Sessions.
Courtesy of the Oracle Technology Network (OTN) and ARABOUG, the inaugural 2014 OTN MENA Tour brings a star-studded cast, consisting of some of the world's best Oracle ACEs, ACE Directors and Rock Star speakers, to the region. The tour aims at sharing cutting-edge knowledge and independent research in the MENA region by accomplished Oracle experts from all over the world. This is a 100% FREE event for the benefit of the local Oracle communities.

Speakers

Bjoern Rost, Edward Roske, Syed Jaffar Hussain, Osama Mustafa, Mohamed Houri, Jim Czupyrinski, Tariq Farooq, Mike Ault, Joel Peres, Basheer Khan, Mohamed Chargui, Souhaib Amdouni, and others.

5.11.2014

Middle East North Africa (MENA) OTN Tour - May 26 - June 1

The Middle East North Africa (MENA) OTN Tour is scheduled from May 26 until June 1 in Tunisia, Saudi Arabia and Dubai.

More updates about the agenda and registration will be published very shortly. Stay tuned.

2.16.2014

Oracle 11g upgrade and SYS.AUD$ obstacles

During a manual upgrade of one of our Oracle 10gR2 databases to Oracle 11gR2 a couple of days ago, to our surprise, the upgrade procedure took nearly 5.34 hrs to finish. Although we had earlier done hundreds of manual database upgrades comfortably within a 2 hr window per database, the delay surprised us this time. Unfortunately, there was no log that could help us understand what was really going on. We started investigating the delay once the anticipated upgrade duration had passed. The only new addition in this upgrade was a standby database in place; but that shouldn't be the genuine reason for the delay, right?
When we queried v$session, we found the following statement executing:

update SYS.AUD$ set dbid = 223249361 where dbid is null and rownum <= 10000

It turned out that during the Oracle Server component upgrade to 11gR2, the upgrade procedure performs the following actions on the SYS.AUD$ table:

  • Adds 11 new columns
  • Updates the ntimestamp column
  • Updates the DBID column with the appropriate value, 10,000 records at a time, in a loop
Unfortunately, the SYS.AUD$ table in this database has over 6.4 million records, and the update statement, processing 10,000 records at a time, took 4.30 hrs to complete. The remaining component upgrades subsequently took another hour.
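If you want to gauge this ahead of an upgrade, a quick row count as SYS shows how much work the update loop will have to do; for example:

SQL> SELECT COUNT(*) FROM sys.aud$;
SQL> SELECT COUNT(*) FROM sys.aud$ WHERE dbid IS NULL;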

When we did our earlier upgrades, the auditing option was not enabled on those databases, hence we had never encountered such a delay before. To speed up the upgrade when SYS.AUD$ contains a huge number of records, it is best to follow the recommended practice:
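The commonly recommended approach is to archive the audit trail somewhere safe and then empty SYS.AUD$ before starting the upgrade; a sketch (aud$_archive is just an illustrative table name, and the truncate should only be run once the audit data is safely preserved or no longer needed):

SQL> CREATE TABLE sys.aud$_archive AS SELECT * FROM sys.aud$;
SQL> TRUNCATE TABLE sys.aud$;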


This will certainly help us when we plan the Oracle 12c upgrade in the future.



2.10.2014

Bug 18223021 : DISK FILE OPERATIONS I/O ON HP-UX.

We have been dealing with strange ASM behavior over the past few months across all ASM instances in our multiple Oracle 11gR2 (11.2.0.2) cluster environments on HP-UX 11.31. Even simple, typical ASM tasks like adding a new disk to a disk group, mounting/dismounting a disk group, or querying for CANDIDATE ASM disks were taking a minimum of 20 minutes, and sometimes never completed. This behavior caused a lot of performance degradation across all database instances in the cluster, and most of the databases suffered from the 'Disk file operations I/O' wait, among other consequences.
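For reference, even an ordinary candidate-disk lookup against v$asm_disk, along the lines of the query below, would crawl for those 20-plus minutes:

SQL> SELECT path, header_status FROM v$asm_disk WHERE header_status = 'CANDIDATE';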

I must say, Oracle Support made a few unsuccessful attempts to address the issue, suggesting that we increase the ASM instance SGA, enable async I/O at the OS level, reduce the number of disks, etc. Finally, they logged bug 18223021 for our issue, and we are yet to receive the fix.

Below are a few of the consequences of this behavior that we are confronting:

  • Oracle 10g database instances running on the node where a new ASM disk was being added to an existing disk group suffered control file locking and database hang issues
  • We now ensure that the disk group to which a new disk is being added is not mounted on the ASM instances of the other (uninvolved) nodes
  • When adding a disk takes forever on one of the ASM instances, the only way to speed up and complete the procedure was to shut down that instance and subsequently dismount the ASM disk group
  • Most databases are suffering from 'Disk file operations I/O' waits (see the query sketch below)
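A simple way to see which instances are hit by this wait at any given moment is a query like the following, run from any node:

SQL> SELECT inst_id, COUNT(*)
       FROM gv$session
      WHERE event = 'Disk file operations I/O'
      GROUP BY inst_id;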

If you have MOS access, you may refer to the bug for more details.

Stay tuned for the fix and solution...

1.27.2014

CRS-1615:No I/O has completed after 50% of the maximum interval

A few days back, the clusterware on one of the production cluster nodes became unhealthy and caused the instances on that node to terminate. If it had been a node eviction, everything would have come back automatically after the node reboot; in this scenario, however, only the cluster components were in an unhealthy state, so no instance was started on the node.

Upon checking the ocssd.log file, the following error was found:

cssd(21532)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rdisk/oracle/ocr/ora_ocr_01 will be considered not functional in 13169 milliseconds

The node was unable to communicate with the rest of the cluster nodes. If so, why wasn't the node crashed or evicted? Doesn't that question come to your mind?

Well, from the error it is clear that the node suffered I/O issues: it was unable to access the voting disk and started complaining (in ocssd.log) that it couldn't see the other nodes in the cluster. When we contacted the storage and OS teams, the OS team was quick to identify an issue with a PCI card. For some reason, the I/O channel from the node to the storage was suspended for 10 seconds before the connection was re-established. In the meantime, the cluster stack on the node became unhealthy and couldn't proceed with any further action.

The only workaround we had to perform was restarting the CSS component using the following command:

crsctl start res ora.cssd -init
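Once it is up, the health of the stack can be confirmed with the usual checks, for example:

crsctl check crs
crsctl stat res -t -init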

Once the CSS was started, everything came back to a normal state. Luckily, we didn't have to do a lot of research into why the cluster became unhealthy or why the instances crashed.

1.26.2014

AUTO Vs Manual PSU patch - when and why?

The purpose of this blog entry is to share my thoughts on AUTO vs Manual PSU patch deployment, when and why to use each, and also a success story of reducing overall patching time by almost 50% that we achieved recently. I thought this would help you as well.

From my own perspective, I think we are one of the organizations in the Middle East with a large and complex Oracle cluster setup in place. With six (06) cluster environments, covering production, non-production and DR, maintaining them is always going to be an uphill and challenging task. One of the tasks that requires the most attention and effort, and most often, is none other than PSU patch deployment across all six environments.

We are currently in the process of applying the 11.2.0.2.11 PSU patch in all our environments, and the challenge in front of us is to bring down the patching time on each server. Our past patching experience says that AUTO patching needs a minimum of 2 hours per node, so if you are going to patch a 10-node cluster, you need a time frame of at least 20-22 hrs to complete the deployment.

AUTO Patching
No doubt AUTO patching is the coolest enhancement from Oracle: it automates the entire patching procedure smoothly and, more importantly, without much human intervention. The downside is that in the following situations AUTO patching of the GI and RDBMS homes together fails, or simply cannot be used for both homes at once:

  1. When you have multiple Oracle homes with different software owners; for example, we have an Oracle E-Business Suite home and typical RDBMS homes (10g and 11g) under different ownership.
  2. When you have multiple versions of Oracle databases running; for example, we have Oracle v10g and v11g databases.
  3. During the course of patching, if one of the files fails to be copied or rolled back because it is busy with another OS process, the patch is rolled back and the cluster and the other services on the node are automatically restarted afterwards.
In the above circumstances one may choose to run the AUTO patch separately for the GI home and then for the RDBMS homes. However, when you hit point 3 above, the same thing happens, which is of course time consuming. While patching one particular node, the AUTO patching of the GI home failed 3 times due to an unsuccessful cluster shutdown, and we ended up rebooting the node 3 times.
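For reference, the AUTO method itself is a single OPatch invocation per node, run as root from the GI home's OPatch directory, along these lines (the staging paths here are placeholders; always follow the PSU README for your exact version):

opatch auto /stage/psu_11.2.0.2.11 -ocmrf /stage/ocm.rsp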

Manual Patching
In contrast, manual patching requires heavy human intervention during the course of patch deployment. A set of steps needs to be followed carefully; a rough outline is sketched below.
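The exact steps depend on the PSU README, but for 11.2 the per-node manual flow is roughly the following (a sketch; the home paths and patch IDs are placeholders):

# as root: unlock the GI home for patching
<GI_HOME>/crs/install/rootcrs.pl -unlock

# as the GI owner: apply the GI portion of the PSU
<GI_HOME>/OPatch/opatch napply /stage/<gi_patch_id> -oh <GI_HOME> -local

# as the RDBMS owner: apply the DB portion to each RDBMS home
<RDBMS_HOME>/OPatch/opatch napply /stage/<db_patch_id> -oh <RDBMS_HOME> -local

# as root: lock the GI home again and restart the stack on this node
<GI_HOME>/crs/install/rootcrs.pl -patch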

Since we were challenged to look at all possibilities for reducing the overall patching time, we started off analyzing the various options between AUTO and manual patch deployment and where the time was being consumed or wasted. We figured out that after each successful or unsuccessful AUTO patching attempt, the cluster and the services on the node have to restart, and this was the time-consuming factor. In a complex, large cluster environment with many instances, ASM disks and disk groups, it is certainly going to take a good amount of time to start everything up. This caught our attention, and we thought of giving manual patching a try.

When we tried the manual patching method, we managed to patch the GI and RDBMS homes on a node in about 1 hour, almost 50% less than the AUTO patching time frame. Imagine: we finished 6 nodes in about 7 hours, in contrast to a 12-13 hour time frame.

In a nutshell, if you have a small cluster environment, say a 2-node cluster, you may feel you don't gain much in terms of time savings; however, if you are going to patch a large or complex cluster environment, consider the manual method, which could save a pretty huge amount of patching downtime. At the same time, keep in mind that this method requires more DBA intervention than the AUTO patching method.