Tales from the past – Disaster Recovery testing
July 14, 2015
A long time ago in a datacenter far, far away….
Turmoil has engulfed the IT landscape. Within the newly formed digital universe,
corporate empires are becoming more and more
dependent on their digital data and computer systems.
To avoid downtime when getting hit by an evil strike, the corporations are
starting to build disaster recovery capabilities in their operational architectures.
While the congress of the Republic endlessly debates whether
the high cost of decent recovery methods is justified,
the Supreme CIO Chancellor has secretly dispatched a Jedi Apprentice,
one of the guardians of reliability and availability,
to validate existing recovery plans…
Another story from my days as a UNIX engineer in the late nineties. I obfuscated all company and people names to protect their reputation and to avoid disclosing sensitive information, but former colleagues might recognize parts of the stories or maybe everything. Also, some of it happened a long time ago and I cannot be sure all I say is factually correct. The human memory is notoriously unreliable.
In those days, our company was still relying on tape backup as the only Disaster Recovery (DR) strategy. The main datacenter had a bunch of large tape silos, where, on a daily basis, trays of tapes were unloaded, packed and labeled in a small but strong suitcase, and sent to an off-site location (Pickup Truck Access Method) so the invaluable data could be salvaged in case our entire datacenter went up in flames.
There were no standby systems, as in those days Recovery Time Objectives were measured in days rather than minutes. We only had DR plans for the most mission-critical apps so that our business could hopefully survive in case of disaster, albeit severely handicapped because all the usual support systems were not part of the DR plan. How one should operate a business with the key apps up and running, but without email, communications gateways, document libraries, test and development systems and so on, was frequently discussed but never solved.
As a junior UNIX engineer I was assigned the task of running a D/R test for our most mission-critical UNIX system (an Oracle database application). Such a test was never announced and you could suddenly get asked to drop your current tasks and start the test without any prior notice. Which makes sense given that real disasters will most likely not announce themselves in advance either.
Another precondition was that the D/R test was not to be executed in the real datacenter (to reflect the fact that it would no longer exist if the shit hits the fan). So for a D/R test, a small maintenance room in another office building was selected.
As we did not have any standby equipment, we rented servers and other stuff to be able to restore the database. Upon entering the maintenance room, I saw an old model of the UNIX server we were using in production, side by side with a large box that contained a spaghetti of cables and other components. A single tape drive unit was also provided so we could restore the data, and a bunch of hard disks of varying capacity.
A box with a number of 8mm tapes that was retrieved from the off-site storage location was waiting in the corner as well.
Good luck with that.
So I started connecting disk drives to the server, working my way through the box of cables to find the right SCSI connectors and terminators, hooked up the tape drive, connected a serial console and flipped the power button.
The first thing you had to do was get UNIX bootstrapped so you could continue with the application and data restore. I will not go into details about trouble with SCSI cabling, dip switches, terminators that physically connect but have the wrong electrical termination and the like. By the time I had the machine booting without errors, a few hours had passed and I had not restored a single bit of data.
The next step was to install the standalone backup client that allowed you to do command-line restores. After getting that working, I looked at the paper schema that listed the first of each set of tapes that you had to load, so the backup client knew what to restore, including the layout of data across many tapes. I restored the application and database binaries (Oracle 7.3) and started the restore of the real crown jewels, the database.
The tapes in the red box were labeled with (if I remember correctly) 5-digit numbers so a backup set would start with tape 00117, after which you had to manually eject and load tape 00118, and so on. The labels written on the tape matched those written in the tape header metadata. At least, until tape 00120. When I ejected 00120 and inserted the tape labeled 00121, the backup client complained that I had inserted tape 00122 and not 00121. Rats…
I tried to insert the tape labeled 00122 but the backup app complained that this was tape 00123. The numbers seemed to have shifted by 1 for the rest of the tape set. I tried several tapes without any luck. The real tape 00121 was missing from the red box. Even Jedi powers can’t recover from that. It was already well into the afternoon, the deadline for the recovery was end of day, and the restore was only half-way or so. I reported back to the operations manager that I could not continue the restore because of a missing tape.
After a short discussion about our options, I was instructed to call the backup operator in the datacenter (yes, the one that was supposed to no longer exist) to ask for the real tape numbered 00121. The operator arrived half an hour later with an unlabelled ghost tape, freshly retrieved from the virtually exploded backup pool. Try this one, he said…
I inserted the tape and the restore process continued. I kept feeding the tapes each with an offset of 1 in numbering and after a while the restore was complete. Of course I had some issues with starting the database (wrong UNIX version, had to relink binaries, change kernel parameters, you name it) but in the end we got it up and running.
IT department reporting back to the business: “D/R test completed and successful”. Well done guys, we’re good to go when the real disaster strikes…
Things I learned from this adventure:
- IT will twist reality a little to keep the business happy (and ignorant)
- Ad hoc disaster recovery without a proven and consistent plan is prone to errors (I consider it a near-miracle we got the app up and running in the end)
- If your datacenter has evaporated, you cannot do what we did and fetch a few more bits of data to fill in the gaps
- Relying on rental equipment that has been battered and punished by other customers is not a good idea (we had no virtualization in those days to isolate virtual and physical worlds to work around that)
- Relying on manual transport of tapes, manual labeling and manual retrieval is Russian Roulette (with the latter having better odds for survival)
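The missing-tape problem in this story is the kind of thing a simple audit can catch before the suitcase leaves the building. Below is a minimal sketch in Python, not anything we had at the time: it assumes you can somehow obtain, for every tape in a set, both the number written on the sticker and the number stored in the tape header. The function name `audit_tape_set` and the input format are hypothetical, made up for illustration.

```python
def audit_tape_set(tapes):
    """Check a tape set for mislabeled and missing tapes.

    `tapes` is a list of (sticker_label, header_label) pairs, both given as
    5-digit strings. This input format is an assumption for illustration.
    Returns (mislabeled, missing):
      mislabeled: pairs where the sticker disagrees with the header
      missing:    header numbers absent between the lowest and highest header
    """
    if not tapes:
        return [], []

    # A sticker that disagrees with the header means a labeling mistake.
    mislabeled = [(sticker, header) for sticker, header in tapes
                  if sticker != header]

    # The headers are the truth: any gap in their sequence is a missing tape.
    headers = sorted(int(header) for _, header in tapes)
    present = set(headers)
    missing = [f"{n:05d}" for n in range(headers[0], headers[-1] + 1)
               if n not in present]

    return mislabeled, missing


# The situation from the story: stickers run on, headers skip 00121.
box = [
    ("00117", "00117"),
    ("00118", "00118"),
    ("00119", "00119"),
    ("00120", "00120"),
    ("00121", "00122"),
    ("00122", "00123"),
]
mislabeled, missing = audit_tape_set(box)
print("Mislabeled:", mislabeled)
print("Missing:", missing)
```

Note that checking the stickers alone would not have helped here, since they were numbered consecutively. Only the headers reveal the gap, which is why the sketch treats them as the source of truth.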
Later I wrote an internal memo to highlight the challenges with the existing D/R strategy, and pointed to a new product called EMC SRDF – the first remote storage replication solution, introduced in 1995 – that could finally deal with all the D/R adventures we had been through (with varying success). The memo ended up in a drawer, where it stayed for a few years after I had left for another company (the one I now work for… coincidence?).
Afterwards, I was told that one of the reasons they went for SRDF was my (by then a few years old) memo, resurrected from the drawer…
SRDF worked like a charm and I know a few of our local customers who suffered datacenter problems – but didn’t lose a single drop of data because they were using SRDF. Side note: SRDF (like most other EMC-based replication products) has no application “awareness” – these products just replicate data. Garbage in, garbage out. The unfortunate events of 9/11 caused a lot of damage and cost the lives of many innocent people, but many of the companies in the Twin Tower buildings were able to restart their mission-critical applications because of SRDF. SRDF is still actively developed for better sizing and performance, as well as new features (stay tuned…) but in the meantime, EMC has introduced other methods of replicating data for D/R as well.
One thing they all have in common:
They offer a one-point-of-control, consistent, reliable, fire-and-forget method of replicating data for a whole business application landscape,
rather than for a single infrastructure component, which would require deploying and maintaining multiple versions and methods, increasing risk and complexity.
If your preferred software vendor tells you that disaster recovery suddenly requires “application awareness”, take it with a grain of salt.
This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2015. All rights reserved. Not to be reproduced for commercial purposes without written permission.