Duplexing Oracle Redo logs?

A customer asked me recently what EMC’s advice is regarding duplexing Oracle redo logs. There is a thought behind this – Oracle redo logs are sensitive to data corruption – if redo logs are corrupt, there is no way to nicely recover the database to a technically consistent state (at least not without restoring data from backups).

This is what Oracle tells you:

Oracle recommends that you multiplex your redo log files; the loss of the log file information can be catastrophic if a recovery operation is required.

 

Duplexed Redo


So what is redo log duplexing?

Basically, it means Oracle is keeping a double physical copy of each redo log file. The idea behind it is that if one file becomes corrupt for whatever reason, then the other is still good and no data is lost. Hopefully the database can keep running (it will probably complain with system errors but continue to be online).

In the old days, a very long time ago when Oracle databases were still running on JBOD (Just a Bunch of Disks), a disk failure could be a real disaster. Protection against disk failures with RAID technology (Redundant Array of Independent Disks), as opposed to JBOD, emerged in the late 1980s and became mainstream in the years that followed. Using RAID, any single disk in a RAID group may fail without data being lost. Of course, multiple disk failures can still lead to data loss, but statistically, two disk failures within the same group (typically between 2 and 15 disks) before the first one is repaired is a rare occurrence.

Needless to say, this only holds if broken disks are replaced and the RAID data is rebuilt ASAP. (I know of some non-EMC, El Cheapo disk subsystems that did not report in any way that a disk was broken, so the system kept running unprotected for months, maybe years, and then the next disk to fail caused a real disaster.)
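
To put rough numbers on why the rebuild window matters, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the 8-disk group, the 3% annual failure rate, and the simplification that disks fail independently at a constant rate (real failures tend to be correlated, so treat the result as optimistic).

```python
import math

HOURS_PER_YEAR = 24 * 365


def second_failure_probability(group_size, annual_failure_rate, exposure_hours):
    """Chance that at least one surviving disk fails while the group is unprotected.

    Assumes independent disks with a constant failure rate (hypothetical model).
    """
    # Probability that one given disk fails during the exposure window.
    p_single = 1 - math.exp(-annual_failure_rate * exposure_hours / HOURS_PER_YEAR)
    survivors = group_size - 1
    # At least one of the survivors failing = 1 - (none of them fail).
    return 1 - (1 - p_single) ** survivors


# Hypothetical figures: 8-disk RAID group, 3% annual failure rate per disk.
quick_rebuild = second_failure_probability(8, 0.03, exposure_hours=24)
ignored_failure = second_failure_probability(8, 0.03, exposure_hours=24 * 180)

print(f"rebuilt within a day:   {quick_rebuild:.4%}")
print(f"ignored for six months: {ignored_failure:.4%}")
```

With these made-up inputs, the unreported broken disk turns a very unlikely event into one you should expect to see sooner or later across a large installed base.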

So you could wonder why, if storage systems protect against disk failures, you would still need to “duplex” redo log files.

The answer is that data corruption can occur without disk failures. Hard disk drives are not perfect, and once in a while they either mis-write data (silent write errors) or read data back incorrectly (silent read errors). More information can be found here:
Silent Data Corruption Whitepaper

And it is not only the drive itself that can make errors. What about:

  • The I/O channel (SCSI, Fibre Channel, iSCSI over IP, etc.). Every now and then a few bits get mangled along the way. Mostly the corruption is detected, causing an I/O retry, and that’s it. Once in a great while, the corruption goes undetected.
  • The host hardware. Most servers (Intel-based architectures in particular) have lower standards than enterprise-class storage for detecting corruption, such as on the PCI bus, in memory, on adapter cards, etc. Servers are a low-margin business for vendors, and making them more resilient against data corruption would be very expensive.
  • The software (such as SCSI drivers, OS kernels, multipath software, volume managers, etc.) can also corrupt data, albeit extremely rarely, especially if you make sure you run the right patch levels. (This is one reason why EMC performs very extensive testing in the E-labs and publishes the results in the famous “EMC support matrix”.)

So given the fact that the hardware/software stack is not flawless, that a block corruption on an Oracle data file can be fixed using Oracle recovery tools, but that a redo log corruption can ruin your whole day – it might make sense to duplex the redo logs.

So what was my recommendation to this (EMC) customer?

You don’t need it (at least not for technical reasons).

Why? Because of the I/O overhead (negative performance impact) and because, in my opinion, duplexing does not add any value to the protection that EMC already offers.

I had the same discussion about five years ago. In that discussion, with a senior architect of that customer, we went through all possible scenarios (and I must admit it was not easy to convince him). But in the end, we could find no realistic scenario where:

a) The I/O stack could cause silent data corruption on redo logs, and
b) Duplexing the redo logs would prevent data loss.

I have to add that this assumed the customer was running EMC storage (either Symmetrix or CLARiiON) and was compliant with our “Support Matrix”. I can’t make the same claim for non-EMC storage (I suggest you consult your storage vendor to get a definitive answer on this).

So although we could not rule out that corruption can still occur (even with EMC storage, though I will show you in a moment why it is even rarer there than with less robust storage systems), we found that duplexing redo logs could not guarantee protection against data loss. Of course, there are occasions where a block corruption happens only once and the other redo file is still OK. The question is: can you detect this later when reading, and if so, why couldn’t you detect it at the time of writing?

That discussion was a long time ago, but I will try to show a few possible scenarios (top down) and why duplexing does not help.

1. Corruption caused in the application (database) software itself. In this case, the host will nicely write the wrong data without complaining. If it has to write two copies of corrupted data, it will nicely do so. No way duplexing can prevent corruption. Besides, this would be an extremely rare event.

2. Software errors in the OS, volume manager or filesystem. Again, both copies of the redo logs use the same OS software components, so the likelihood of both redo logs becoming corrupted is high. Duplexing does not help much.

3. Hardware errors in the server, host bus adapter, SAN or storage ports (including theoretically impossible errors caused by network collisions or overload). If these errors cause one redo log to go silently corrupt, then the other will likely be affected as well, unless you completely isolate the physical end-to-end I/O path of one redo log from the other (which is complex to set up and even harder to maintain, given all the configuration changes to infrastructure over time).

4. Errors in the disk backend (storage cache, disk paths, physical disks). As EMC systems have multiple layers of corruption detection, the chance of this happening is astronomically low. I bet having your storage box destroyed by a meteorite is about as likely :-)
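
The scenarios above boil down to one question: does the corruption enter before or after the point where the write is split into two copies? Here is a toy Python model of that. It is only a sketch under my own assumptions: the function names and fault labels are hypothetical, CRC-32 stands in for whatever checksum the stack uses, and none of this reflects Oracle’s actual redo log format.

```python
import zlib


def flip_first_byte(data: bytes) -> bytes:
    """Simulate a corruption by inverting the first byte."""
    return bytes([data[0] ^ 0xFF]) + data[1:]


def store_duplexed(payload: bytes, fault=None):
    """Write one record to two log members and return both stored copies.

    fault=None            : clean write
    fault="before_fanout" : data is damaged before the checksum and the split
    fault="member_a"      : data is damaged on the path to one member only
    """
    if fault == "before_fanout":
        payload = flip_first_byte(payload)
    checksum = zlib.crc32(payload)
    member_a = (payload, checksum)
    member_b = (payload, checksum)
    if fault == "member_a":
        member_a = (flip_first_byte(payload), checksum)
    return member_a, member_b


def is_valid(member) -> bool:
    data, checksum = member
    return zlib.crc32(data) == checksum


def recover(members):
    """Return the data of the first member whose checksum verifies, else None."""
    for member in members:
        if is_valid(member):
            return member[0]
    return None


original = b"commit txn 42"

# Corruption upstream of the split: both copies verify, both are wrong.
upstream = store_duplexed(original, fault="before_fanout")
print(is_valid(upstream[0]), is_valid(upstream[1]), recover(upstream) == original)

# Corruption on one path only: the bad copy is detectable, the other is fine.
one_path = store_duplexed(original, fault="member_a")
print(is_valid(one_path[0]), is_valid(one_path[1]), recover(one_path) == original)
```

In this model, the second copy only saves you in the one-path case, and that is exactly the case where a checksum already exposes the bad copy. That is the detectable kind of corruption, which a decent storage system should refuse or retry at write time.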

Again, and this is very important, we’re talking about non-detectable corruptions.

Detectable corruptions are much more likely, but you can do something about them (refusing the I/O with an error, “please try again”) before allowing your data to become corrupt. You don’t need log duplexing if your storage box is best-of-breed. If the error keeps occurring, the storage system should be redundant enough to allow the failing component to be disabled, and the I/O flow should use alternate paths.

So how is EMC protecting you against corruptions?

  • Quality testing when our disk systems are assembled, before shipping to customers. We use hot and cold rooms, shocks and other sophisticated tests to isolate potential failures at the beginning of the “bathtub curve”.
  • Saving extra CRC information with every block of data written to disk. Although this adds a (negligible) overhead, data integrity to us is even more important than raw performance. The checksum is verified after reading a disk block to see if the data is valid.
  • Continuous backend disk “scrubbing” (test reading of disk blocks and their CRC checksum) to find potential (future) disk failures – and of course preventive replacement of disks that are about to fail
  • Not relying on battery backup power to keep write cache alive during power failures. In case of a power failure, we use batteries only to save data to special vault areas on disk. Of course you need to periodically check the batteries as well to see if they can deliver enough power to do the job when you need it most.
  • Other advanced stuff like Triple Modular Redundancy with Majority Voting to find out if data processing components fail. I won’t go into details, but unique features like this help prevent data corruption “in transit”.
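
The checksum and scrubbing ideas from the list above can be sketched in a few lines of Python. This is not how any EMC array implements it internally. CRC-32, the 512-byte block size, the layout (checksum appended to the block) and all function names are my own assumptions, chosen only to show the principle.

```python
import zlib

BLOCK_SIZE = 512  # hypothetical block size


def seal(data: bytes) -> bytes:
    """Append a 4-byte CRC-32 to a data block before it goes to disk."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def verify(stored: bytes) -> bytes:
    """Return the data if its CRC matches, otherwise raise so the I/O can be retried."""
    data, crc = stored[:-4], int.from_bytes(stored[-4:], "big")
    if zlib.crc32(data) != crc:
        raise IOError("checksum mismatch")
    return data


def scrub(disk) -> list:
    """Background pass: test-read every block and return the indexes that fail."""
    bad = []
    for index, stored in enumerate(disk):
        try:
            verify(stored)
        except IOError:
            bad.append(index)
    return bad


# A tiny "disk" of four sealed blocks.
disk = [seal(bytes([n]) * BLOCK_SIZE) for n in range(4)]

# Simulate a silent single-bit flip in block 2.
damaged = bytearray(disk[2])
damaged[10] ^= 0x01
disk[2] = bytes(damaged)

bad_blocks = scrub(disk)
print("blocks failing verification:", bad_blocks)
```

The point of scrubbing is that the damaged block is found by the background pass, while redundancy is still available to repair it, rather than at the moment the database needs that block.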

That leaves us with the non-technical reasons why some customers still use redo log duplexing.

– An application vendor demands this (in my customer’s case, this was SAP) because otherwise they don’t “certify” the application stack.
– The “just in case” reason. “I don’t know if it helps but you never know. It does not hurt.”

Well, if you can’t convince your apps vendor, then by all means, go ahead and duplex your redo logs. It is probably better to have some overhead than to have a vendor refuse support because you are not compliant.

The “just in case” reason is fine with me, too. As long as you’re aware that duplexing causes I/O overhead (especially for writes to redo logs, which are response time sensitive anyway).
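
To get a feel for the response time effect, here is a small Python simulation. It rests on assumptions of mine, not on measurements: the writes to both log members are issued in parallel, a commit has to wait for the slower of the two, and individual write latencies are drawn from an exponential distribution with a mean of 1 ms. The function name and all figures are hypothetical.

```python
import random


def simulate(commits=10_000, mean_ms=1.0, seed=1):
    """Return (average single-member latency, average duplexed latency) in ms."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    single_total = 0.0
    duplexed_total = 0.0
    for _ in range(commits):
        write_a = rng.expovariate(1.0 / mean_ms)
        write_b = rng.expovariate(1.0 / mean_ms)
        single_total += write_a
        # Duplexed: the commit completes only when both writes are done.
        duplexed_total += max(write_a, write_b)
    return single_total / commits, duplexed_total / commits


single_ms, duplexed_ms = simulate()
print(f"single member: {single_ms:.2f} ms per commit")
print(f"duplexed:      {duplexed_ms:.2f} ms per commit")
```

Even with parallel writes, waiting for the slower of two I/Os pushes the average commit time up, and that comes on top of doubling the redo write volume the storage has to absorb.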

For EMC, it does not really matter. It causes a bit more storage to be consumed and requires more I/O performance, so you would need a slightly faster EMC system. And probably more CPU power too (Oracle will also be happy to sell an extra CPU licence). We really don’t mind selling you this if you really want it. But I’m not paid to upsell larger storage boxes to customers, I’m paid to advise them how to run databases as efficiently as possible. So my advice: don’t use it.

And if you really want to go all the way in protecting against corruptions and data loss of any kind, put a business continuity plan in place which covers disaster recovery to another location, quick and reliable restores from backups (are you still using physical tape by the way?) and disk based snapshots or even disk journalling (continuous data protection) to allow you to recover, even from redo log corruptions.
