Eliminate Hot Backup with EMC consistency technology
August 26, 2011
For many years, EMC customers have been using storage replication technology to create copies of entire databases. Using storage cloning has many advantages over other mechanisms (file copy, tape restore, and the like). Most significant is that EMC storage can create near-instant copies of large applications without significant performance overhead. The reason is that the storage system is using its huge internal bandwidth and a couple of smart tricks to create the copy, therefore bypassing the host I/O layer.
In other words, a server running a database does not have to move a single bit of data for creating a copy of a multi-terabyte database.
At EMC we distinguish clones (full, 100% copies) from snapshots (more space-efficient virtual copies that logically represent the full dataset, but internally only need space for changed data). I will save technical details for another time.
For clarity in this article, I will use the name “snapshot” for any (logical or physical) copy of a database or application set.
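To make the clone versus space-efficient snapshot distinction a bit more tangible, here is a minimal sketch in Python. It assumes a simple copy-on-first-write scheme, which is a common way to build such snapshots but is not claimed to be how EMC implements them internally. The names Volume, Snapshot and full_clone are made up for this illustration.

```python
class Snapshot:
    """Virtual copy: stores only blocks that changed after it was taken."""

    def __init__(self, source):
        self.source = source
        self.saved = {}  # block number -> contents at snapshot time

    def preserve(self, block_no):
        # Save the old contents only the first time a block is overwritten.
        if block_no not in self.saved:
            self.saved[block_no] = self.source.blocks.get(block_no)

    def read(self, block_no):
        # Changed blocks come from the save area, the rest from the source.
        if block_no in self.saved:
            return self.saved[block_no]
        return self.source.blocks.get(block_no)

    def space_used(self):
        return len(self.saved)


class Volume:
    """Toy block device: block number -> contents."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, block_no, data):
        # Copy-on-first-write: keep the old block for every snapshot.
        for snap in self.snapshots:
            snap.preserve(block_no)
        self.blocks[block_no] = data


def full_clone(volume):
    """Full copy: needs as much space as the source itself."""
    return dict(volume.blocks)


# Hypothetical 1000-block volume with three writes after the copies are made.
vol = Volume({n: f"v0-{n}" for n in range(1000)})
snap = vol.snapshot()
clone = full_clone(vol)
vol.write(7, "v1-7")
vol.write(7, "v2-7")
vol.write(42, "v1-42")
print(snap.read(7), snap.space_used(), len(clone))
```

In this toy model both copies still show the data as it was when they were created, but the snapshot only holds the two blocks that changed, while the clone holds all 1000.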
What are the added benefits of creating storage-based snapshots?
“Server-less backups” and Quick restores
If you need to back up a large database frequently, the traditional method (backing up the data directly, either by copying the database data files or by using database tooling such as Oracle RMAN) puts a lot of extra workload on the database server, which hurts performance. How much depends on many factors, such as the backup method, the database version, and how much load the database is already processing without the backup. If you can offload the backup to storage, you can reduce the performance impact to practically zero. The idea is that you create a snapshot on disk within seconds or minutes, without I/O on the production server (hence the designation “server-less”, although of course a server is still involved somewhere in moving the data, just not the production host). Then you use a second server (the “mount host”, “proxy host”, “storage node” or whatever it is called in your backup tool) to back up this copy to physical tape (slow, error-prone), virtual tape (faster, fewer errors, but expensive) or, best case, to a de-duplication platform (very fast, very low capacity requirements) such as EMC Data Domain.
The backup from a static snapshot to tape may now take up to 24 hours (if your backup schedule is once per day) because during the backup, the production database is not impacted in any way.
Also, if your database goes up in smoke due to a server failure, a bug or an administrator error (‘rm -rf *’ experiences, anyone?), restoring from (virtual or physical) tape can be a long and tricky exercise. If you have a snapshot on disk, you can restore it in seconds and optionally replay the database journals (archive logs) to lose as few transactions as possible. The RTO (Recovery Time Objective) for backups can be close to zero (seconds to minutes) this way.
If the root cause is still there (e.g. a broken application module or an I/O driver bug) and your database becomes corrupted again directly after recovery, you can restore the clone again within minutes and give it another go. With tape you’d be down for another period of many hours (Murphy never sleeps). At EMC we call this mechanism “protected restores”, i.e. restoring the clone to production does not overwrite the clone, so you can use it as many times as you like.
EMC has a very rich feature set for manipulating such snapshots. If you have multiple snapshots, you can in many cases do out-of-order restores. So you can create a snap at 1am, then a new one at 3am, then another at 5am, and then restore the 1am snapshot to production without destroying the 3am and 5am snaps. This can be very important if you’re trying to do things like root cause analysis. You might want to take an extra snapshot of a failed database just before you restore it to a known-good state, so you can extract missing transactions or analyze it to find the cause of the data corruption.
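The restore behaviour described here (protected, repeatable and out-of-order restores) can be modelled in a few lines of Python. This is only a conceptual sketch: SnapshotCatalog and its methods are hypothetical names for this illustration and do not correspond to any EMC tool or API.

```python
class SnapshotCatalog:
    """Toy model of a production image plus a set of labelled snapshots."""

    def __init__(self, production):
        self.production = dict(production)
        self.snaps = {}  # label -> frozen image

    def create(self, label):
        self.snaps[label] = dict(self.production)

    def restore(self, label):
        # Protected restore: production gets a copy of the image. The
        # snapshot itself, and every other snapshot, is left untouched.
        self.production = dict(self.snaps[label])


cat = SnapshotCatalog({"orders": 100})
cat.create("1am")
cat.production["orders"] = 150
cat.create("3am")
cat.production["orders"] = 180
cat.create("5am")

cat.production["orders"] = None   # corruption strikes
cat.create("failed-state")        # keep the broken image for analysis
cat.restore("1am")                # out-of-order restore to a known-good state

cat.production["orders"] = None   # root cause still there: corrupted again
cat.restore("1am")                # the same snapshot can be reused
print(sorted(cat.snaps), cat.production)
```

After both restores, the 3am and 5am snapshots and the preserved failed image are all still available.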
EMC tooling is also flexible enough to create the snapshot on a remote replicated location and use that snapshot, many miles away, to do the restore locally, within minutes. You can start the application recovery even before the restore from the remote system is finished.
Let’s say you have an application problem and you need to get external support people to access your data. This is risky business. What if they, in trying to fix the problem, apply 10 different patches and modify many configuration settings before they find the problem? They might fix the initial issue but, to make matters worse, also cause a few other issues, and you cannot audit which changes were made to get to that point.
If you redirect external support people to a copy of the buggy application, they can apply different patches and configuration changes, and even accidentally destroy the whole (copy) database (you would just quickly make another snapshot), all without any risk to your production system. If they think they have found the fix, you apply only the fix to production and do not allow them to cause any more damage. If needed, you can restrict access to security-sensitive data (e.g. credit card numbers) in the copy database for external consultants. Or you can mangle this data before giving them full access, so they never get to mess around with your highly sensitive data.
By the way, do you plan to run Oracle database on VMware? Good idea, I will blog on this very soon! But then you probably know that Oracle might ask to reproduce the system on a physical server if they suspect problems with the hypervisor (Oracle support ID 249212.1). By using disk snapshots, recreating the database on physical is a trivial task.
Creating test, development or acceptance copies
Similarly, refreshing a test, development or acceptance environment from a disk snapshot (created in seconds) saves a lot of time.
Many organizations have very complex landscapes where one production application is supported by many replicas.
Oracle RMAN restores plus renaming a database take – according to my customers – anywhere between 24 and 48 hours and about 5 man-hours of work – all of which can be completely eliminated using snapshot technology. A click of a mouse is enough to refresh a test or development system from production. And the mouse click can be done by the developers or testers themselves – no database or storage guy required.
Creating copies for reporting, data warehouse loading or staging
In my old (pre-EMC) days (around 1998) we were running reporting on the primary financial database in my company that was meant to do transaction processing. Heavy reporting from analyst users could bring the performance to its knees and disk utilization over the edge.
We used Business Objects (BO) as reporting tool and we jokingly renamed it to “Blocks Oracle” ;-)
Which is not completely fair, because Business Objects by itself was not the problem, but the way non-technical users were using it for reporting.
I was one of the first Unix engineers in the Netherlands to use EMC snapshot technology (EMC TimeFinder) for creating a daily copy of production, on another server, and to redirect reporting there. It completely removed the performance problems caused by heavy reporting tasks on the production database. Running reporting on production was no longer allowed.
Similarly, if you run heavy ETL (Extract, Transform, Load) on production you might want to offload that to a production snapshot. Very effective.
Application or database Upgrades
Upgrading a database or application to a new version can be very tricky and time consuming. I’ve seen application teams delaying upgrades for months or even years because they cannot allow days of downtime to run through the whole process. I can’t go into too much detail, but you can speed up certain application or database upgrades using snapshot technology. It also allows you to create a few checkpoints along the way, so that if your upgrade fails on Sunday afternoon, you don’t have to go back to Friday evening’s full backup. Instead, you quickly go back to the last-known-good checkpoint and restart from there, probably in time to be back in business by early Monday morning. You can also make testing of upgrades easier (during the week, in normal office hours) to see if all procedures work, so that the chance of success in the critical weekend is increased.
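The checkpoint idea can be sketched as a small loop: take a snapshot after every successful upgrade step, and on failure go back to the last good one instead of the initial full backup. The Python below is a toy model under that assumption. The function run_upgrade and the example steps are hypothetical, and a plain dictionary stands in for the database.

```python
def run_upgrade(state, steps, max_retries=1):
    """Run upgrade steps in order, checkpointing after each success."""
    checkpoint = dict(state)  # snapshot taken before the upgrade starts
    log = []
    for name, step in steps:
        attempts = 0
        while True:
            try:
                step(state)
                break
            except Exception as exc:
                attempts += 1
                # Roll back to the last-known-good checkpoint only,
                # not all the way to the pre-upgrade backup.
                state.clear()
                state.update(checkpoint)
                log.append(f"{name}: failed ({exc}), rolled back")
                if attempts > max_retries:
                    raise
        checkpoint = dict(state)  # new snapshot after a successful step
        log.append(f"{name}: ok, checkpoint taken")
    return log


def upgrade_schema(state):
    state["schema"] = 2


calls = {"n": 0}


def migrate_data(state):
    state["rows_migrated"] = "partial"
    calls["n"] += 1
    if calls["n"] == 1:  # simulate a failure on the first attempt
        raise RuntimeError("disk full")
    state["rows_migrated"] = "all"


db = {"schema": 1}
log = run_upgrade(db, [("schema", upgrade_schema), ("data", migrate_data)])
for line in log:
    print(line)
```

In this run the data migration fails halfway, the state is reset to the checkpoint taken after the schema step, and the retry continues from there without redoing the schema upgrade.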
Hot backup mode
The early version of EMC technology I used in my 1998 project had a small problem that we had to deal with. If you had a database consisting of more than one storage volume (let’s say 10 volumes), then creating a snapshot of the full disk set caused an internal process to “split” the disks one by one – with a 10 to 20 millisecond interval. This causes serious database consistency issues, and it forced us to use Oracle’s “Hot Backup” mode in that process to manage the time differences. This is what hot backup is designed for: a classic data-file-to-tape backup typically takes many hours, and during that time the data files themselves are constantly changing, even while they are being written to tape. This required the database to be recovered after a restore, using archive logs, before the database could be restarted.
Hot backup mode causes significant performance impact. Oracle reduced the overhead drastically in more recent versions but it is still there. Fortunately, EMC technology only required a few minutes in hot backup mode (where traditional backups required many hours). But still customers asked if it was possible to get rid of it.
EMC developed consistency technology, so that a large set of volumes can be copied logically at the exact same point in time. It’s as if the snapshot image was frozen in time, and the image looks similar to an aborted database (which can be restarted after a simple, automated recovery). Using consistency, we no longer had to use hot backup mode to create database clones. However, Oracle still required hot backup if you wanted to use the clone image for “roll-forward” recovery using archive logs (a requirement for most Oracle backups, if you want to minimize the amount of lost transactions after a corruption or crash).
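Why a staggered split is a problem, and why a split at one point in time is not, can be shown with a toy simulation. The Python sketch below makes strong simplifying assumptions: there are only two volumes (“log” and “data”), every transaction first writes its log record and only afterwards its data block (a dependent write), and time is counted in abstract ticks. The function names are made up for the example.

```python
def run_workload(split_times, n_transactions=10):
    """Return the frozen image: volume name -> set of transaction ids.

    Transaction i writes its log record at tick 2*i and its data block
    at tick 2*i + 1. A write lands in the image of a volume only if it
    happened before that volume was split off.
    """
    image = {"log": set(), "data": set()}
    for txn in range(n_transactions):
        log_time, data_time = 2 * txn, 2 * txn + 1
        if log_time < split_times["log"]:
            image["log"].add(txn)
        if data_time < split_times["data"]:
            image["data"].add(txn)
    return image


def is_crash_consistent(image):
    # Every data block in the image must be covered by a log record.
    # A log record without its data block is fine: crash recovery can
    # roll forward or roll back using the log.
    return image["data"] <= image["log"]


# Volumes split one by one: log at tick 6, data a little later at tick 10.
staggered = run_workload({"log": 6, "data": 10})

# Consistent split: both volumes frozen at the same tick.
consistent = run_workload({"log": 7, "data": 7})

print(is_crash_consistent(staggered), sorted(staggered["data"] - staggered["log"]))
print(is_crash_consistent(consistent))
```

In the staggered case the image contains data blocks for transactions 3 and 4 without their log records, which is a state that can never exist on a crashed production system. In the consistent case the image is exactly what a power failure at that tick would have left behind.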
This seems to be a complex issue, and I have had many discussions with customers to explain it. So I am explaining it here again, as simply as possible.
- If your storage system has no consistency technology (which lets you make snapshots without any time dependencies or out-of-order writes), you NEED hot backup mode if you want to clone a database using any snapshot technology.
- If you have consistency technology (some non-EMC vendors do have it, but limited to single RAID groups or restricted in other ways) then you can clone a running database without hot backup mode. The copy database will look exactly like production (crash consistent) as it was at the moment of the snapshot. Any non-committed transactions will be rolled back. This has been supported by Oracle for years, fundamentally because the recovery mechanism for the snapshot database is exactly the same as for recovering an aborted production database (e.g. power failure, shutdown abort, CPU crash).
- If you want to use your snapshot as a source for tape, virtual tape, or disk based (de-duplicated) backups, then you HAD to use hot backups. Oracle did not guarantee that you could use archive logs on a database image that was created when NOT in hot backup mode.
I remember having those discussions with customers at least 5 years ago. Surprisingly, one of my customers told me they backed up snapshots to tape without using hot backup mode. They tested the restore and archive recovery and they claimed it worked every single time. My response was they were playing Russian Roulette with their backups, because Oracle claimed it can’t be done that way (or at least, not supported).
More customers reported to some of my EMC colleagues in Engineering that it worked even though Oracle did not support the method.
Oracle supporting backups without hot backup mode
EMC tested the same methodology in their labs and also found it to be working every time they tried.
For clarification, we are NOT talking about Oracle RMAN backups here. RMAN backups haven’t needed hot backup for a long time. The requirement for Hot Backup mode was for disk cloning – and only if the clone is used for backup purposes.
As a real-world illustration of the EMC/Oracle partnership, EMC engineering discussed this with the Oracle database engineers, and after long discussions about complex consistency, journalling and I/O dependency fundamentals, Oracle were convinced and they published a note on Oracle Support (aka Metalink) that this method actually works and has been working all along since Oracle version 18.104.22.168.
This support assumes that the following prerequisites are met:
- The third-party snapshot technology is integrated with Oracle’s recommended restore and recovery operations.
- The database image is crash-consistent at the point of the snapshot.
- Write ordering is preserved for each file within a snapshot.
In principle, EMC snapshot technologies fully meet these three prerequisites.
Effectively this means you can completely get rid of hot backup mode when using EMC consistency technology for database cloning (even as source for backups depending on archive logs).
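The rules discussed in this article can be condensed into a small decision function. This is just my reading of the text above, written as a Python sketch. The names Purpose and hot_backup_required are invented for the illustration, and the prerequisites_met flag stands for the three prerequisites from the Oracle Support note being satisfied.

```python
from enum import Enum


class Purpose(Enum):
    RESTARTABLE_CLONE = "restartable clone (test, reporting, etc.)"
    BACKUP_WITH_LOG_RECOVERY = "backup source, rolled forward with archive logs"


def hot_backup_required(consistent_split, purpose, prerequisites_met):
    """Return True if the database must be in hot backup mode for the snapshot."""
    if not consistent_split:
        # Volumes are split at slightly different times: always needed.
        return True
    if purpose is Purpose.RESTARTABLE_CLONE:
        # Crash-consistent image, restarted like an aborted database.
        return False
    # Backup with archive log recovery: only without hot backup mode if
    # the prerequisites from the Oracle Support note are met.
    return not prerequisites_met


for consistent in (False, True):
    for purpose in Purpose:
        for prereq in (False, True):
            needed = hot_backup_required(consistent, purpose, prereq)
            print(consistent, purpose.name, prereq, "->", needed)
```

The only combination that changed with the Oracle Support note is the last branch: a consistent snapshot used as a backup source no longer needs hot backup mode when the prerequisites are met.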
The Oracle Support note has ID 604683.1, and was published on Oct 14, 2010. Description:
Supported Backup, Restore and Recovery Operations using Third Party Snapshot Technologies [ID 604683.1]
I also found a presentation where this was discussed for users running SAP on Oracle as well. SAP seems to support the same mechanism in SAP note 105074. More info:
Maybe, easy database and application cloning, and this EMC supported Oracle feature can make your life a bit easier…