Data Guard or Storage based replication?
September 28, 2011
A comparison between Oracle (Active) Data Guard and EMC replication for disaster recovery purposes
This is an article I wrote a while ago for customers’ Database Administrators (DBAs) and application managers, to help them select the right Disaster Recovery tools for their business applications.
It has been slightly modified to reflect new insights and to make it more readable on the web.
Oracle Data Guard
Data Guard is a tool that has evolved over time. It started out in the Oracle version 7 and 8 period (or maybe even before), when customers used scripting and self-made tools to replicate Oracle archive logs over some kind of network connection (often over TCP/IP, using FTP or NFS to transfer files) to keep a standby database more or less in sync with the primary. Starting with version 8, Oracle developed a management toolset on top of this, initially called Managed Standby, and this eventually evolved into a separate tool, currently called Oracle Data Guard.
The purpose of Data Guard is primarily Disaster Recovery (D/R) but, in some cases, you can also use it for backup offloading, data distribution and some other things. Bear in mind that database log shipping for Disaster Recovery evolved over time and was never designed from scratch as a Disaster Recovery tool. It still depends more or less on the old log shipping method, even though Oracle made some tweaks and changes to the database software to make it more of an enterprise D/R tool.
For example, if you perform non-logged operations on the primary database (such as creating an index or loading tables using the NOLOGGING option), the changes normally do not get shipped to the standby, so you lose data in the event of a fail-over. Oracle solved this by implementing a “force logging” mode which, if set, forces all transactions to be logged (and therefore shipped to the standby). The penalty is lower performance and more logging data, which can have a severe impact on things like Data Warehouse loading and other sorts of business processing. This effect is often not considered during performance Proof-of-Concepts (POCs), because D/R is almost never one of the key decision factors in a POC. This behavior already suggests that Data Guard depends on careful installation, configuration and monitoring, and that a minor bug in the software stack or a wrong config setting could render the standby database useless until after a full re-sync. Also, it only replicates database data, one database at a time, and does not care at all about non-database application data, host software, middle-ware or anything else.
EMC Remote Replication
Tools such as EMC SRDF and EMC Recoverpoint were designed from scratch as business continuity tools (i.e. for Disaster Recovery and to support Backup/Restore scenarios) and, by architecture, do not depend on any host platform hardware, operating system, driver, database, application, filesystem, network stack, volume manager or any other component on the application host.
Therefore, these replication methods are enterprise level D/R tools that make sure that all data is always consistent at the fail-over location, regardless of how the host stack is configured. Of course, if the host messes up data due to, for example, application bugs, then the D/R copy also suffers data corruption, but this is explicitly the way these tools are architected. They do not even attempt to solve this form of data corruption or data loss, because, in my opinion, this is architecturally a problem that needs to be solved (or better, prevented) by other tools, such as backup/recovery management software, storage snapshot/cloning capabilities, disk check-summing, etc. These storage replication tools can replicate any kind of data, independent of platform, database, application, OS level etc., and don’t even require the host to be up and running at either location. Modern D/R tools allow for multi-application consistency. The tools are designed more or less with “fire and forget” in mind, which means that once they are set up correctly you don’t have to monitor each individual application; you typically only monitor the logical link “up” or “down” state per “application consistency group”.
A reported “Link up and in sync” state means that D/R is functional and you’re fully protected. You don’t have to worry about wrong application settings or whatever. Enterprise grade tools also can incrementally re-sync even if there have been changes to app data at both locations (by smartly comparing metadata bitmaps that track differences on both locations).
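The bitmap comparison mentioned above can be illustrated with a toy model. This is only a sketch of the general idea, not how SRDF or Recoverpoint implement it: the function name, the track-level granularity and the use of a Python integer as a bitmap are all assumptions made for illustration.

```python
def incremental_resync(source, target, source_bitmap, target_bitmap):
    """Copy only the tracks that changed on either side since the link was split.

    source, target : lists of track contents (hypothetical stand-ins for disk tracks)
    *_bitmap       : integers used as bitmaps; bit N set means track N was modified
    Returns the list of track numbers that were copied.
    """
    # A track must be re-copied if it changed at the source (new production data)
    # OR at the target (e.g. writes made during a D/R test that must be undone).
    dirty = source_bitmap | target_bitmap
    copied = []
    for track in range(len(source)):
        if (dirty >> track) & 1:
            target[track] = source[track]
            copied.append(track)
    return copied


# Track 1 changed at the primary, track 2 was overwritten at the standby.
source = ["a0", "b1", "c0", "d0"]
target = ["a0", "b0", "c9", "d0"]
copied = incremental_resync(source, target, 0b0010, 0b0100)
print(copied)            # only tracks 1 and 2 are sent over the link
print(target == source)  # both sides are identical again
```

The point of the sketch is that the amount of data sent depends on how much changed, not on the size of the volumes.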
EMC has advanced deployment capabilities that even allow triple-site replication, to make sure you still have remote protection even after one datacenter fails completely.
It is also possible to have zero-data loss even at extended distances. At EMC, we did not break the light speed barrier to achieve this (as they recently attempted at CERN in Geneva). We use a method called “cascaded SRDF” where a buffer system is deployed at a bunker site at short distance to store-and-forward transactions. The replication from primary to bunker is synchronous and onwards from bunker to the far remote system is asynchronous. In case of a disaster the bunker site contains all committed transactions and needs only a short time to send them to the far location (typically before it gets destroyed by the same disaster a few minutes later).
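To see why the synchronous hop is kept short, a rough back-of-the-envelope calculation helps. The sketch below assumes that light travels through optical fiber at roughly 200,000 km/s and ignores switch, protocol and array overhead; the distances of 50 km and 2000 km are hypothetical examples, not EMC recommendations.

```python
# Assumption: signal speed in optical fiber is roughly 200,000 km/s
# (about two thirds of the speed of light in a vacuum).
FIBER_KM_PER_SECOND = 200_000


def round_trip_ms(distance_km: float) -> float:
    """Minimum time in milliseconds for a write to travel out and its acknowledgement back."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


bunker_km = 50    # hypothetical distance from primary to bunker (synchronous hop)
remote_km = 2000  # hypothetical distance from primary to far site

print(f"sync to bunker:   {round_trip_ms(bunker_km):.1f} ms added per write")
print(f"sync to far site: {round_trip_ms(remote_km):.1f} ms added per write")
```

Under these assumptions, every synchronous write waits about half a millisecond for the bunker but about 20 milliseconds for the far site, which is why the long hop is asynchronous.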
“Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.” – Douglas Adams, “The Hitchhiker’s Guide to the Galaxy” – English humorist & science fiction novelist (1952 – 2001)
Storage and SAN based tools are completely independent from anything that happens in servers, operating systems, databases, applications or middle-ware. The concept is very simple and very similar to RAID protection in storage arrays, except that you basically “stretch” the mirrors of a RAID-1 disk set across two different storage systems in different locations. A very interesting EMC feature called “consistency groups” makes it possible to combine multiple disk volumes that span more than one application into one large, write-consistent pool of data. It does not matter if the database or applications use journalling, logging, or anything else. There does not even need to be a server at the target location (unless you want to perform quick and automated recovery).
By nature, storage mirroring is very simple – you only need two storage systems connected together; you define what volumes belong together for consistency and you switch on remote mirroring. As long as the storage tools tell you that the mirrors are “in sync” then you can be certain that you can fail-over in case of a disaster – and all data is there; including (if you want) Operating System data; middle-ware, application binaries, export/dump files, databases, etc. It only takes a bit of time to enable the remote mirror, start up the servers and bring up the databases to use the remote data set.
Database or application replication (including Oracle Data Guard) works on individual databases only and has the advantage that the standby database is continuously in recovery mode, so in case of a disaster it is very easy and fast to enable the standby database as the new primary. The challenge is then to fail over the remaining application servers, middle-ware, monitoring systems, etc. quickly, consistently and in the right order. After bringing up multiple individual databases, there may be application (business) inconsistency between the databases. For example, a logistics database might be interacting with an invoice system. An invoice sent out to an end customer should always match a transaction in the logistics system, to avoid sending the same invoice twice or, vice versa, not sending the right product to a customer even though the order has been paid. If one database is recovered to time-stamp X and the other to time-stamp X+Y, then the latter could contain transactions not reflected in the first.
Such problems can happen even with both databases technically consistent – but recovered to a different logical point in time (Recovery Point). You can imagine even bigger damage of such situations on the financial markets where multiple financial processing systems, each with their own database, work together as one logical entity.
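The logistics and invoice example can be made concrete with a small sketch. Everything here is hypothetical (the `Txn` record, the `recover` helper, the timestamps); it models each database as a list of committed transactions and shows what happens when the two are recovered to different points in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    order_id: int
    committed_at: int  # seconds on a shared clock (hypothetical)


def recover(log, recovery_point):
    """Return the order ids visible after recovering a database up to recovery_point."""
    return {t.order_id for t in log if t.committed_at <= recovery_point}


# Each order is first shipped (logistics), then invoiced ten seconds later.
logistics_log = [Txn(1, 100), Txn(2, 200), Txn(3, 300)]
invoice_log = [Txn(1, 110), Txn(2, 210), Txn(3, 310)]

# Independent recovery: logistics to time X, invoicing to time X+Y.
logistics = recover(logistics_log, 250)
invoices = recover(invoice_log, 320)
orphans = invoices - logistics
print(orphans)  # invoices without a matching logistics transaction

# Recovering both to the same point in time leaves no orphans.
consistent = recover(invoice_log, 250) - recover(logistics_log, 250)
print(consistent)
```

Both recovered databases are perfectly valid on their own; the inconsistency only shows up when their contents are compared.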
Also, setting up Data Guard is typically more complex than storage mirroring, and depends on a lot of settings being in place. By the way, this is my personal opinion, and I’ve spoken to many Oracle Expert DBAs claiming the opposite – leading to many good and interesting discussions on this topic. But most DBAs will confirm that if you make mistakes when configuring the database replication instance, you might not be able to recover at all in case of a disaster, so constant monitoring and re-validation (for each database and application) becomes an important issue and can claim a lot of human and system resources in large, complex landscapes with many applications.
On top of that, you need different replication mechanisms for non-Oracle data (i.e. if you run SQL-Server, MySQL, IBM UDB/DB2, MS-Exchange or SharePoint, VMware, messaging middleware, Java applications and the like, then you need a different replication tool or mechanism for each of those). How do you manage – from an overall IT management (CIO) perspective – different replication tools, each with different versions and instances, and still make sure all data is consistent and available after a disaster?
The more methods and tools, and the more replication instances you have to manage using those heterogeneous tools, the more risk you introduce in the overall D/R strategy.
Using standby for reporting purposes
The two strategies use different methods to do this. Starting with Data Guard, practical use of the standby database is only possible as of version 11g with “Active” Data Guard, which is a licensed option from Oracle. It works by opening the standby database in read-only mode while still applying redo log files from the primary (I will skip the deep technical details).
As some applications need read/write access (sometimes even a user login causes a write transaction in the database), not all applications can be used unless they are modified to work with a read-only database. Oracle can possibly work around this by combining Active Data Guard with Oracle Flashback, so that you can write against a database “Flashback” logical copy. For this you need Flashback enabled and probably a lot of additional disk space for Flashback logs at the standby location – and there are consequences for the Disaster Recovery service levels (“RPO” and “RTO”) in doing so.
When using EMC replication, the approach is a bit different. As EMC storage replication’s architecture does not normally allow any reads or writes to the standby data disks (with the notable exception of EMC VPLEX, as I have described in earlier posts), the way to use this data anyway is to create a read/write enabled disk snapshot or clone, and mount that snapshot image to the remote database server. This replica is then a fully functional copy of the original database and is typically read/write enabled. Administrators must make sure no really important transactional data is written to the snapshot, as this data will be lost when the snapshot is refreshed from the production database. The refresh can be done as often as required, but the snapshot database has to be stopped for a few minutes each time during the refresh. To work around this, it is possible to have two snapshot copies available: one being refreshed while the other is kept open for reporting. It does not make sense to refresh more often than every hour or so, because then the overhead of restarting the database each time causes too many interruptions. Realistically, therefore, the maximum age of the snapshot data is somewhere between one and two hours. There is no need for Oracle Flashback or Active Data Guard to make this work.
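The alternating two-copy scheme can be sketched as a simple schedule. This is a simplified model under stated assumptions, not a description of any EMC tool: a new snapshot is taken every `interval_min` minutes, it takes `refresh_min` minutes before the refreshed copy is open for reporting, and reporting switches to the newest copy as soon as it is ready. The copy names "A" and "B" and the default values are hypothetical.

```python
def serving_copy(t_min, interval_min=60, refresh_min=5):
    """Return (copy_name, data_age_min) for a report started at minute t_min.

    Assumed model: snapshot number k is taken at minute k * interval_min and
    becomes available refresh_min minutes later; copies A and B alternate.
    """
    # Latest refresh cycle whose copy is already open at t_min.
    cycle = (t_min - refresh_min) // interval_min
    if cycle < 0:
        raise ValueError("no snapshot copy is available yet")
    snapshot_taken_at = cycle * interval_min
    copy_name = "A" if cycle % 2 == 0 else "B"
    return copy_name, t_min - snapshot_taken_at


for t in (5, 64, 65, 124, 125):
    print(t, serving_copy(t))

# Oldest data any report sees over several cycles (whole minutes).
max_age = max(serving_copy(t)[1] for t in range(5, 500))
print(max_age)
```

In this model reporting is never interrupted, and the data served is at most a little older than one refresh interval; longer refresh or restart times push that figure up.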
Sometimes it is mentioned that having to stop and restart the copy database during refresh is a disadvantage – which is partly true. But the advantage is that the snapshot is read/write, so it works with any application. Furthermore, the database (including datafiles, etc) can be renamed to avoid confusion for developers, testers and administrators (which is not really possible with a Data Guard standby database). Also, the database is application-consistent for every database connection (where an active standby database is only consistent per SQL transaction for as long as the transaction runs – the next query can run on different data).
It is possible to combine storage replication (SRDF or Recoverpoint) with database replication (Data Guard) to have the best of both worlds, although this requires some customization and is not out-of-the-box functionality. The advantage would be that it allows for very quick (incremental) refreshes of a standby database that otherwise would require a full re-sync over the network after testing, rolling disasters or other unexpected situations.
Data Guard is a fine tool to replicate one, or a few, not too large Oracle databases. From a large enterprise perspective, it’s just one tool, protecting only one piece of the application landscape – and you need a lot of other tools to manage all other (non-Oracle database) applications. Only Enterprise Disaster Recovery tools like EMC SRDF or EMC Recoverpoint can protect a whole application landscape – without having to worry about consistency, configuration errors or the complexity that is inherent in application- and host-based replication tools.
There is a lot more to be said about the subtle differences between application- and storage-based replication strategies. I don’t think a feature-by-feature comparison of one tool against the other helps a lot. I don’t really care if tool X uses a bit more or less bandwidth than tool Y, or that one tool can do a few extra tricks that it was never designed for, versus the nice tricks and features of the other tool. In the end you spend a lot of “Dirty Cash” and implementation effort on such tools, as an insurance policy in case something bad happens. These tools are your last line of defense if everything else fails. You had better make sure they are rock solid, reliable, simple and effective.
Using EMC storage replication for Disaster Recovery reduces implementation and maintenance overhead for servers, networks and applications, reduces cost and, most importantly, reduces risk. It lets the CIO sleep at night.