11.25.2006

String of Disasters - a human error.

I was reading Alex Gorbachev's article on his blog, 'Which Risks Are You Protected From?', http://www.pythian.com/blogs/author/alex/, in which he discusses how human error is underestimated as a potential cause of disasters.
While reading the article, I recalled the hardest time I had at one of my previous organizations, where I faced the music of a disaster and spent 4 days bringing the database back.
At one of my previous companies, where state-of-the-art technology was in use in terms of system hardware and software, I came across a string of disasters caused by human error.
My first experience of disaster recovery was with a 2 TB data warehouse database. Due to a human error, one of the disks on the UNIX host, along with its mirror disk, got corrupted, and unfortunately the filesystem that held all the redo log groups and their members was on the corrupted disk.
Since we had lost all the redo log groups and their members, the only option left was a full database restore and recovery, followed by opening the database with resetlogs.
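A layout like this can be spotted ahead of time by checking that every redo log group has members on more than one filesystem. The sketch below is only an illustration in Python, not something we had in place at the time. The member paths are made up, and treating each top-level directory as its own filesystem is an assumption for the example; on Oracle the real list of groups and members would come from V$LOGFILE.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def mount_of(member_path, depth=1):
    """Approximate the filesystem of a path by its first `depth` components.

    Assumption for this sketch: each top-level directory (/u01, /u02, ...)
    is a separate filesystem. A real check would compare against the
    host's actual mount table.
    """
    parts = PurePosixPath(member_path).parts  # ('/', 'u01', 'oradata', ...)
    return "/" + "/".join(parts[1:1 + depth])


def groups_on_single_fs(members):
    """Return the redo group numbers whose members all share one filesystem.

    `members` is an iterable of (group_number, member_path) pairs.
    """
    fs_by_group = defaultdict(set)
    for group, path in members:
        fs_by_group[group].add(mount_of(path))
    return sorted(g for g, fs in fs_by_group.items() if len(fs) < 2)


# Hypothetical layouts, for illustration only.
all_on_one_fs = [
    (1, "/u01/oradata/dwh/redo01a.log"),
    (1, "/u01/oradata/dwh/redo01b.log"),
    (2, "/u01/oradata/dwh/redo02a.log"),
    (2, "/u01/oradata/dwh/redo02b.log"),
]
spread_out = [
    (1, "/u01/oradata/dwh/redo01a.log"),
    (1, "/u02/oradata/dwh/redo01b.log"),
    (2, "/u01/oradata/dwh/redo02a.log"),
    (2, "/u02/oradata/dwh/redo02b.log"),
]

print(groups_on_single_fs(all_on_one_fs))  # groups at risk
print(groups_on_single_fs(spread_out))     # empty list when members are spread
```

Separate filesystems only help if they also sit on separate physical disks, so a check like this is a starting point rather than a guarantee.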
We used to back up the complete database every weekend and back up the archived logs every 4 hours. The database generated around 100 archived logs a day.
The disaster occurred just before the next scheduled backup.
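To put rough numbers on that timing: with a weekly full backup and about 100 archived logs a day, a failure just before the next backup means close to a week of logs to apply after the restore. A back-of-the-envelope sketch in Python (the function name is made up for illustration; the figures are the approximate ones quoted above):

```python
def archived_logs_to_apply(days_since_backup, logs_per_day):
    """Estimate how many archived logs recovery must apply after a restore."""
    return round(days_since_backup * logs_per_day)


# Weekly full backup, around 100 archived logs a day, failure on day 7.
print(archived_logs_to_apply(7, 100))  # 700
```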
When the restore was started, we faced further problems with the backup utility we were using.
However, the complete database was restored in 3.5 days, and recovery took a good amount of time because one week of archived logs had to be applied. Finally, at the end of the 4th day, the database was back online and available to the users.
Other kinds of human error later forced us to restore other databases two more times as well. No matter how good your resources (hardware) are, there is always the risk of human error.
For me it was a good learning experience to handle a restore and recovery in production for the very first time, and this disaster cleared up many of my doubts about backup and recovery.
I felt that, sometimes, people learn by making mistakes.

Jaffar

2 comments:

Alex Gorbachev said...

"...all the redo groups and their members were placed on the corrupted disk"
Well, this is a design issue. Since you already mirror on hardware, it doesn't make sense to quadruple on the same disks. If you had placed the members on different FS's, then you'd have been just fine.

"When restore was started we have faced other problems with the backup utility which we were using."
Never underestimate testing restore procedures. ;)

The Human Fly said...

I do agree with you, Alex, that it's purely a design issue: all the redo groups and members were kept on a single FS.

Exactly, backup and recovery procedures should be tested.

Jaffar