Oracle snapshots and clones with ZFS

Another Frequently Asked Question: Is there any disadvantage for a customer in using Oracle/SUN ZFS appliances to create database/application snapshots in comparison with EMC’s cloning/snapshot offerings?

Oracle marketing is pushing materials that promote the ZFS storage appliance as the ultimate method for database cloning, especially when the source database is on Exadata. Essentially the idea is as follows: back up your primary DB to the ZFS appliance, then create snaps or clones off the backup for testing and development (more explanation in Oracle’s paper and video). Of course it is marketed as being much cheaper, easier and faster than using storage from an Enterprise Storage system such as those offered by EMC.

Oracle Youtube video

Oracle White paper

In order to understand the limitations of the ZFS appliance you need to know the fundamental workings of the ZFS filesystem. I recommend you look at the Wikipedia article on ZFS (here http://en.wikipedia.org/wiki/ZFS) and get familiar with its basic principles and features. The ZFS appliance is based on the same filesystem but due to it being an appliance, it’s a little bit different in behaviour.

So let’s see what a customer gets when he decides to go for the Sun appliance instead of EMC infrastructure (such as the Data Domain backup deduplication system or VNX storage system).

Granularity

ZFS is actually a sort of combination of a volume manager and a file system. Compared to a classic volume manager, the concept of a ZFS “Zpool” is much like an LVM volume group. In the Zpool you have a default filesystem (which is named the same as the pool) and you can optionally create additional filesystems within the same pool. A ZFS file system cannot span multiple pools.

Now the ZFS snapshot happens on the ZFS filesystem level. So if you have multiple filesystems in the pool for a given database (say, one for data, one for logs, and one for indices) you cannot create a crash-consistent snapshot of that database using ZFS snaps [ update: slightly incorrect, see comments below ]. It gets even worse if your database spans not only multiple ZFS filesystems but also multiple pools. In those cases you need to fall back to Oracle’s Hot Backup methods and use a bunch of scripting to be able to recover the cloned database afterwards (EMC on the other hand offers technology to create snapshots for backups without even going into hot backup mode).
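To give an idea of what such scripting involves, here is a minimal sketch in Python that only builds the ordered list of steps for a database spread over more than one pool; it does not execute anything. The dataset names (`datapool/oradb`, `logpool/oradb`), the snapshot tag and the helper `hot_backup_snapshot_plan` are made up for illustration. The ordering (data files snapshotted inside hot backup mode, archive logs afterwards) is an assumption about a typical setup and would need checking against your own recovery procedure.

```python
from typing import List, Tuple

# One step = (where the command runs, the command text).
Step = Tuple[str, str]


def hot_backup_snapshot_plan(data_datasets: List[str],
                             log_datasets: List[str],
                             tag: str) -> List[Step]:
    """Return the ordered steps for snapshotting a database that spans pools.

    Hypothetical helper: it only describes the steps, it runs nothing.
    """
    # Put the datafiles in hot backup mode before touching the data pools.
    plan: List[Step] = [("sqlplus", "ALTER DATABASE BEGIN BACKUP;")]
    for dataset in data_datasets:
        plan.append(("shell", f"zfs snapshot -r {dataset}@{tag}"))
    # Leave hot backup mode as soon as the data snapshots exist.
    plan.append(("sqlplus", "ALTER DATABASE END BACKUP;"))
    # Force a log switch so the redo needed for recovery is archived,
    # then snapshot the pool(s) holding the archive logs.
    plan.append(("sqlplus", "ALTER SYSTEM ARCHIVE LOG CURRENT;"))
    for dataset in log_datasets:
        plan.append(("shell", f"zfs snapshot -r {dataset}@{tag}"))
    return plan


if __name__ == "__main__":
    for where, command in hot_backup_snapshot_plan(
            ["datapool/oradb"], ["logpool/oradb"], "clone1"):
        print(f"[{where}] {command}")
```

Even in this toy form the details matter: the clone still has to be recovered with the archived logs afterwards, and every extra pool adds another step that can fail halfway.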

One size fits all

In one Zpool all volumes must have similar behavior (in terms of performance and size). This means you cannot effectively mix & match multiple drive types in the pool. A customer looking for some kind of storage tiering needs to have a different Zpool for every tier – giving you the consistency problems mentioned. Automatic data movement across the tiers a la FAST-VP is not possible. Oracle is suggesting they have some kind of tiering (they call it “Hybrid storage pools”) but it’s nothing more than one disk type with different sorts of (dirty) cache (DRAM and Flash cache). Marketing ain’t reality.

Also if you want to use SATA (or other low-cost, high-capacity, low-iops) disks for backup then you must have SATA disks for the clone databases as well (remember you cannot create snaps from one zpool to another zpool). So how do you perform an acceptance performance stress test if the production database is on fast Fibre Channel or SAS disks (or even on Flash drives) when the acceptance database is on slow SATA? It just isn’t gonna work…

Unpredictable performance

If the creation of snaps drives up the storage utilization of the file systems beyond 70-80% then the performance of the appliance will slow down, or at least become very unpredictable (according to SUN/Oracle best practices you should not go over 80% allocation on the ZFS filesystems). Of course you can monitor the pool but we all know how that works – somebody kicks off the creation of another snap before leaving for home – or the acceptance test suddenly starts allocating huge chunks of new tablespace data at 3am – Murphy is always around. Note that the performance of backups (100% write) will suffer so the backup window of the primary (production) database might be seriously affected as well.

The snap space in ZFS is shared by the primary data on the same drives, so databases using snapshots cannot be isolated iops-wise from the spindles used for backup or other purposes. There is no way to isolate or prioritize I/O workloads within the zpool.

Availability

If, by even more Murphy intervention, the allocation reaches 100%, then the clone databases will abort – as will all running backup jobs writing to the same Zpool. Depending on the case the backup job could just hang forever or fail (I’m not sure which one is worse). If you had a separate snap pool (i.e. in a logically separate area) then the snapshots would be affected (in terms of availability and performance) but not the primary data (that’s why EMC uses separate snap areas).

Efficiency

To avoid filling the pool you therefore need lots of empty space hanging around (on energy-consuming, floorspace-hogging, pricey spinning disks) – bye bye TCO and efficiency.
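The arithmetic behind that empty space is easy to sketch. The Python below is a rough illustration only: the function names and the example figures (a 10 TB pool, 0.5 TB of changed blocks per day) are invented, and the 80% ceiling is simply the best-practice figure quoted earlier in this post, not a hard limit of ZFS.

```python
def snapshot_headroom_tb(pool_tb: float, used_tb: float,
                         ceiling: float = 0.80) -> float:
    """Space in TB that snapshot deltas may consume before the pool
    crosses the utilization ceiling (0.80 = the 80% guideline)."""
    if pool_tb <= 0 or not 0 < ceiling <= 1:
        raise ValueError("pool_tb must be positive and ceiling in (0, 1]")
    return max(0.0, pool_tb * ceiling - used_tb)


def days_until_ceiling(pool_tb: float, used_tb: float,
                       daily_change_tb: float,
                       ceiling: float = 0.80) -> float:
    """Days before retained snapshots push the pool over the ceiling,
    assuming a constant amount of changed data per day."""
    if daily_change_tb <= 0:
        raise ValueError("daily_change_tb must be positive")
    return snapshot_headroom_tb(pool_tb, used_tb, ceiling) / daily_change_tb


if __name__ == "__main__":
    # Invented example: 10 TB pool, 6 TB in use, 0.5 TB changed per day.
    print(round(snapshot_headroom_tb(10, 6), 2), "TB of headroom")
    print(round(days_until_ceiling(10, 6, 0.5), 1), "days until the ceiling")
```

With those example numbers a pool that is 60% full has about 2 TB to spare for snapshot deltas, and a pool that is already at 80% has none.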

HCC support

If you make a snapshot of (Exadata) HCC stored tables then the test and development environments they are talking about need to be on Exadata as well (otherwise they cannot use the HCC compressed data for testing purposes). No Virtual Machines for testing (not even on Oracle VM) unless you drop HCC on the primary DB or do a painful conversion each time after backup. But Oracle will happily sell you another Exadata so don’t worry.

Management

There is no GUI-driven tool such as EMC Replication Manager so everything needs to be scripted (and I know from past experience that in such scripts, the devil is in the details).

Risk on backups

You must have the test and development databases on the same storage system that you use for your backup data. By not physically isolating the backup target, not restricting user access to it, and using it for other purposes as well, you put your backups at risk (i.e. the ZFS appliance – holding your last-line-of-defence backup data – suddenly gets accessed by Unix/Database admins to mess around with security and FS and NFS export settings etc. You need to be aware of this risk.)

Snaps off primary database

You need to have at least one 100% full copy (i.e. RMAN backup or Data Guard standby) of the production database (if it’s on Exadata) before you can make snaps. No direct snaps off primary DB – unless you put your primary database completely on the ZFS appliance (I promise to write in a future post on why you might not want this)

Finally

And of course all other ZFS limitations for databases apply (fragmentation performance issues, deduplication doesn’t work well, etc) but I’ll leave that for a future post ;)

[update] Matt Ahrens pointed out (see comments below, thanks!) that it is possible to create consistent snapshots of multiple ZFS filesystems within a pool using the “-r” option (recursive) or using “projects”. If I find time I will test to see if that works. I still don’t see how you could set up a database across different storage tiers (thus multiple zpools) – i.e. FC and SATA disk and maybe even some Flash – and then create consistent snaps. I also failed to mention that ZFS snapshots are read-only so you first have to clone from a snapshot before you can use it to run Oracle databases. For me the capability of making snaps directly off a database and directly mounting those read/write on a test DB was such a no-brainer that I missed that one completely :-)
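For reference, the two-step flow described in this update (recursive snapshot first, then a writable clone per filesystem) can be sketched as follows. This is untested against a real appliance and only builds the command lines. The dataset names and the helper `snapshot_and_clone_cmds` are hypothetical, and it is assumed that the parent filesystem for the clones already exists in the same pool, since a ZFS clone has to live in the pool of its origin snapshot.

```python
from typing import List


def snapshot_and_clone_cmds(parent: str, children: List[str],
                            tag: str, clone_parent: str) -> List[List[str]]:
    """Build (not run) the zfs commands for one recursive snapshot
    followed by one writable clone per child filesystem."""
    # A single recursive snapshot covers the parent and all descendants.
    cmds = [["zfs", "snapshot", "-r", f"{parent}@{tag}"]]
    # Snapshots are read-only; each filesystem needs its own clone
    # before a test database can write to it.
    for child in children:
        cmds.append(["zfs", "clone",
                     f"{parent}/{child}@{tag}",
                     f"{clone_parent}/{child}"])
    return cmds


if __name__ == "__main__":
    for cmd in snapshot_and_clone_cmds(
            "pool1/proddb", ["data", "logs", "index"],
            "test1", "pool1/testdb"):
        print(" ".join(cmd))
```

Each inner list could be handed to `subprocess.run` on a host with ZFS, but note that every clone ends up in the same pool as the backup it was taken from.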

6 Responses to Oracle snapshots and clones with ZFS

  1. Matt Ahrens says:

    In my opinion, the benefits of using ZFS to create snapshots and clones of Oracle databases outweighs the concerns you outlined here — otherwise I wouldn’t be at Delphix.

    However I’d like to point out one factual inaccuracy in your post. You say “if you have multiple filesystems in the pool for a given database you cannot create a crash-consistent snapshot of that database using ZFS snaps”. In fact, ZFS can take consistent snapshots of multiple filesystems, by using “zfs snapshot -r”, “zfs snapshot …”, or with the ZFS Appliance by creating a snapshot of a Project. This behavior is documented in the zfs manpage.

    • Bart Sjerps says:

      Hi Matt,

      Thanks for pointing that out. The man pages that I have found were not very clear on whether a recursive snap (-r) was or was not creating consistent snaps across multiple zfs filesystems. I checked this again and the Oracle paper seems to indicate that you’re correct, it mentions:

      “Snapshots are point-in-time image copies of the file system from which clones will be created. The snapshot operation is performed at the project level so that snapshots are performed on all the file systems under the project in a consistent manner. Snapshots are read-only, so cannot be used for database operations.”

      I will update my post accordingly.

      That being said, what I forgot to mention is that EMC’s snap implementations allow for writeable snapshots. So you don’t have to clone from a snap first in order to mount the test database. We even have out-of-order deletes of snaps, and we can restore from snap to prod *without* destroying the original production data set (i.e. it becomes another snap) or destroying any other snap (this is what we call “protected restores”). Can be very handy if you need to quickly recover a database a few times in a short time.

      Back to your first point – I see benefits of the ZFS filesystem in some areas, mostly in classic file serving environments. However, due to the copy-on-write nature and the consistency limitations I am not a big fan of using ZFS for databases. I am tempted to say that Oracle agrees because they don’t use it for primary storage in their flagship product Exadata. Oracle ASM performs much better.

  2. Mertol says:

    I will try to point out a few mistakes. (I am carrying an Oracle badge if you ask)
    1- Snapshots are consistent in projects, which can be used as virtual containers for fileshares and LUNs. I think ZFS-SA is one of the very few systems that can do consistency between fileshares and LUNs.
    2- In ZFS systems a clone means a writable version of a snapshot; it takes no more space than a snapshot, and takes no time and no pre-setup to create. The same snap can be used for multiple clones and each clone can have a different set of settings, and in the meantime the root snapshot is protected as it’s read-only. That’s a lot better and more flexible than classic writable snapshots and requires no added administration.
    3- HCC and dNFS are supported on ZFS-SA systems and are free, and can be used with cloneDB functions, where it’s possible to open a DB directly from an RMAN backup image without a time-consuming restore operation. (great for partial restores and emergency or test and development)
    4- Efficiency of ZFS-SA snapshots is not that bad. It’s one of the very few systems where the granularity of snapshot space usage is just a single block, and it does not require pre-allocation as some of the systems on the market do.

  3. Bart Sjerps says:

    Hi Mertol,
    Few remarks and questions…

    1. Yes I am aware of this thanks to Matt who already pointed that out. So another question to you then, I appreciate your feedback on this: Say a customer wants to tier his database and wants to have, for example, 1 TB on fast FC or SAS disks, and 3 TB on SATA (I will give you a break and not ask for a Flash drive performance tier).

    Can the ZFS appliance support such a configuration (I think you need 2 Zpools each with different disk types) and still make consistent snaps?

    If your primary database is on Exadata (storage on Exadata storage cells) then how do you create a consistent application landscape snapshot off multiple databases – and maybe include an application server or 2 ?

    Are you aware that EMC had such consistency across multiple filesystems and LUNs (and even multiple storage boxes and/or operating systems together) for years? So if you say “One of the few” you probably mean EMC and ZFS-SA (although I think some other storage vendors can do similar things up to a certain level – I bet if you ask the guys at IBM or Netapp or HDS they will confirm they can do this as well :-)

    2. Not sure what you mean by “Classic” writeable snapshots. Which implementation are you referring to? EMC’s? We have had writeable snaps, clones and BCV mirrors since Oracle had version 7 of the database. We can do clones of clones, clones of snaps, snaps of clones and even remote snaps (snapshots on a remote replicated storage system). What’s the exact advantage you talk about of the ZFS-SA in this context? I’m missing it completely…

    In the whitepaper Oracle has published they mention that (especially with Exadata) you first have to replicate a full copy of the database to ZFS-SA. So a snapshot directly off the primary database is not possible? (Or is it? If I’m wrong here I’d like to know?)

    3. HCC would work on all storage systems from all vendors if Oracle hadn’t artificially disabled it on non-Oracle storage platforms. I find it hard to call this a benefit of Oracle over others. dNFS was developed with Oracle both by Netapp and EMC and we have been supporting this for many years. What’s the unique advantage for ZFS-SA?

    4. EMC has deliberately chosen an architecture where you reserve space for snaps (in all of our implementations, so CLARiiON/VNX FC and NFS, and Symmetrix). There are advantages and disadvantages to this approach as you mention. IMO the 2 biggest advantages are these: you can run out of storage on the snaps without primary impact on your production DBs, and I/O to primary volumes and snap space is not shared so you have *predictable* performance. The drawback is – yes – you have to pre-allocate storage and there is a little overhead in initially creating the snaps. Choose your potion :-)

    Regarding your point that efficiency is not bad – you need to reserve at least 20% free space, right? At least that’s what the best practices say. So if I have no snaps yet and my FS utilization is already approaching 80% then making snaps will quickly put me above 80% (if the DB is doing write transactions). So in order to make snaps and guarantee reasonable performance I have to start off with, say, 60% utilization (i.e. 40% free space).

    5. How do you restore a snapshot to production if the production database is using a different storage system (i.e. Exadata or other SAN/NAS storage box)?

    Looking forward to your thoughts,
    Bart

    • Cecil Jacobs says:

      Your question about tiering presupposes that one requires tiering to achieve the desired performance for the database. You would first need to establish that the hybrid storage pool (HSP) concept around which the ZFS-SA is built is unable to deliver the goods. There used to be some nice tools available to help with this for a proof of concept (SWAT, vdbench come to mind). The ZFS-SA HSP uses flash technology to accelerate writes and reads for “hot data,” and does so automatically. I do not see where the HSP concept is so different from EMC’s “FAST” technology.

      I happen to have multiple Oracle 11g databases running on top of ZFS (just ZFS, not on a ZFS-SA – though someday may move to that), and I scripted the consistent cloning of my databases even though I have *separate* pools for redo logs, for archive logs, and for the remainder of the database files (e.g. data files, index files, undo tables, temp tables, etc.) It would be a trivial matter to add a fourth separate pool if I needed to put my indexes on a Zpool built purely out of flash drives to further improve performance. The key: I leverage Oracle hot backup mode to achieve that consistent image, but I’m only in hot backup mode for the time that it takes to make the nearly instantaneous recursive snapshot for the data Zpool.

      What wows me about the ZFS-SA is the ability to really dig in and see where performance issues are with the Dtrace analytics. Having spent the last two years using a VNX, I can honestly say that performance analysis capabilities on it pale in comparison to what is achievable on the ZFS-SA. That’s one topic you seem to have neglected in your original post.

      • Bart Sjerps says:

        Hi Cecil,

        ZFS-SA does not do tiering. The flash is used as cache – meaning it holds a *copy* of data that primarily sits on disk somewhere. You could certainly argue whether it makes a difference but let’s get the definition straight: tiering is when you *move* data from one type (or configuration) of storage to another. Caching is when you hold a copy to speed up current or future I/O. That’s the difference – without making judgments on which one of the two methods is best (EMC uses both for that matter).

        Now with ZFS having database data in different Zpools you could of course make it work by using hot backups. Perfectly fine and this is how EMC did things 15 years ago (I implemented such stuff when I was an EMC customer back then). But hot backups have disadvantages and the ZFS promise was to make things simpler. EMC allows you to make a database snap or clone without hot backup – and use it for archive log roll forward if you like.

        An area where hot backup mode fails: what if you use ZFS to replicate data for disaster recovery? Now you have 2 or more ZFS update streams going to another location and if hell breaks loose you risk having data ahead of log issues. Russian Roulette.

        Haven’t looked at Dtrace yet – looking at all the noise on social media it’s supposed to be an awesome tool. I understand it is or will be available for Linux as well so it’s no longer a Solaris-only option. Will certainly take a look at it when I find the time, it sounds promising.

        That said, why would you need a performance tuning tool on an appliance that promises to offer great performance out-of-the-box? You cannot tune much on ZFS-SA. Only one type of storage per Zpool and ZFS claims no tuning *should* be needed (hence the existence of an *Evil* tuning guide).
