Oracle ASM vs ZFS on XtremIO
August 11, 2014
In my previous post on ZFS I showed how ZFS causes fragmentation for Oracle database files. At the end I promised (sort of) to come back to the topic and show how this affects database performance. In the meantime I have been busy with many other things, but ZFS issues still sneak up on me frequently. Eventually, I was forced to take another look at this because two separate customers asked for ZFS comparisons against ASM at the same time.
The account team for one of the two customers asked if I could perform some testing on their lab environment to show the performance difference between Oracle on ASM and on ZFS. As things happen in this business, things were already rolling before I could influence the prerequisites and the suggested test method. Promises were already made to the customer and I was asked to produce results yesterday.
All that without knowledge of the lab environment, the customer requirements or even the details of the test environment they had set up. Typical day at the office.
In addition to that, ZFS requires a supported host OS – so Linux is out of the question (the status on kernel ZFS for Linux is still a bit unclear and certainly it would not be supported with Oracle). I had been using FreeBSD in my post on fragmentation – because that was my platform of choice at that point (my Solaris skills are, at best, rusty). Of course Oracle on FreeBSD is a no-go so back then, I used NFS to run the database on Linux and ZFS on BSD. Which implicitly solves some of the potential issues whilst creating some new ones, but alas.
This time the idea was to run Oracle on Solaris (x86) that had both ZFS and ASM configured. How to perform a reasonable comparison that also shows the different behavior was unclear and when asking that question to the account team, the conference call line stayed surprisingly silent. All that they indicated up front is that the test tool on Oracle should be SLOB.
My first reaction was to ask whether they were aware that SLOB is designed to drive random I/O and thus, by nature, is not well suited to showing the performance effects of fragmentation, which would require sequential I/O. More silence. Sigh. To make matters worse, the storage platform on which the Solaris VM was configured (using VMware of course) was XtremIO. XtremIO is very different from every other EMC storage platform (as well as every competing platform, for that matter) in that it uses hashing of data blocks to determine data placement within the flash cells. So, a bit like ZFS itself, it has “fragmentation by design”, which makes the platform completely insensitive to random vs sequential I/O. In the XtremIO backend, everything is random anyway whatever you do, which allows the platform to scale up and out and avoid any kind of data hotspots. So given a Proof of Concept where the test tool generates random I/O, on a storage platform that converts all I/O to random again, how do you show the impact of fragmentation?
But the customer promise was made so I started a voyage to get Solaris moving with Oracle, ASM, ZFS and SLOB, and think of a reasonable way to test the two scenarios.
Configuring the environment
After getting access to the system, the first thing I needed to do was install Oracle and Clusterware / ASM. That was a challenge because the virtual machine was installed with pretty much default settings, had a 100% full root file system, and lacked paging space and some required software packages. But I will not go into details on what was needed to get the system going and stable (with a few exceptions that I will touch upon later).
The system had a bunch of XtremIO volumes configured, of which the most important are the ones I used for I/O testing. There were 10 volumes of 120GB each. 5 were already configured in a single ZFS pool. I checked disk alignment and some other settings and it turned out to be workable. I configured the other 5 volumes into an ASM (DATA) disk group, plus a few smaller ones for REDO.
On the ZFS pool I created a ZFS file system according to Oracle best practices (recordsize=8K, logbias=throughput, etc.).
Over the weekend I got some time to think on how to run such a test and I came up with the following test scenario:
- Create SLOB tablespace (“IOPS”) on ASM
- Create SLOB tables carefully sized – so that the data would fill up ZFS to exactly 80% (the limit according to best practices before you get serious performance issues).
- Run SLOB tests with different read/write ratios (on ASM)
- Do some sort of sequential IO on SLOB tables – so this must be some kind of (full) table scan scenario
- Bring IOPS tablespace offline and copy it to ZFS (using RMAN) – this way you get an initially unfragmented tablespace (hopefully)
- Change datafile location in Oracle to use ZFS datafile
- Run read-only SLOB tests
- Run table scans on SLOB data
- Run SLOB updates sized such that every SLOB block is updated at least once
- Re-run ZFS tests and note any difference
Full table scans
For running full table scans I wrote a pl/sql script that basically does the following:
- AWR snap
- Select random SLOB users (either all of them or limit the number of users)
- For each user, select count(*) from <user>.cf1 (full table scan)
- Record overall start/end time and per-user start/end time
- Calculate total data size and use time & data size to calculate scan bandwidth
- Report per-user and total statistics after completion (most notably, scan rate)
- AWR snap
Note that this procedure is single-threaded. It is not intended to drive maximum bandwidth; it is intended to give a predictable comparison of scan rate between tests. You could run a bunch of these in parallel to drive more bandwidth, and I have been toying with the idea of building that into the SQL code. Maybe another time :P
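The bandwidth arithmetic in the steps above is simple enough to sketch outside the database. The Python below is not the PL/SQL script itself, only a minimal illustration of the summary calculation; the `UserScan` and `summarize` names are made up for this example, and the per-user size assumes SCALE=900000 blocks of 8 KB.

```python
from dataclasses import dataclass

@dataclass
class UserScan:
    user: str        # SLOB schema name (illustrative)
    size_mb: float   # size of the scanned table in MB
    seconds: float   # elapsed time of this user's full table scan

def summarize(scans, runtime_s):
    """Total data scanned divided by overall wall-clock runtime."""
    total_mb = sum(s.size_mb for s in scans)
    return {
        "users": len(scans),
        "scanned_mb": total_mb,
        "runtime_s": runtime_s,
        "scan_rate_mb_s": total_mb / runtime_s,
    }

# Ten users of 7031.25 MB each (900000 blocks x 8 KB), scanned one after
# another; the elapsed times are placeholders of about 30 seconds per user.
scans = [UserScan(f"USER{i}", 900_000 * 8 / 1024, 30.0) for i in range(10)]
summary = summarize(scans, runtime_s=300.27)
print(f"Scan rate: {summary['scan_rate_mb_s']:.2f} MB/s")
```

Because the scans run serially, the overall runtime is roughly the sum of the per-user times, so the total scan rate is close to the per-user rate.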
A few notes on what I expected as a result. Obviously, XtremIO does not care about fragmentation, so you would initially expect similar results on ASM and on ZFS, for 100% read IOPS as well as for full table scans. But another ZFS issue is IOPS inflation (which is a side effect of fragmentation). Consider a full table scan that requests 128K read I/Os (because Oracle's db_file_multiblock_read_count is set to 16 with an 8K block size). Because ZFS has to get blocks from all over the place (after fragmentation), a 128K I/O might be chopped into a bunch of 8K, 16K and maybe a few larger pieces. So one 128K request results in at most 16 I/Os, or somewhat fewer if some blocks are still adjacent on disk (as seen by the OS at least). I would expect 128K I/Os to translate roughly into 10-12 smaller I/Os, but we will see.
So the full table scan bandwidth, according to my expectations, would drop a bit, not because of fragmentation (and thus excess disk seeks) directly (we’re on 100% flash) but because of the extra host IO overhead due to many smaller IOs instead of a few large ones.
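To make the IOPS inflation idea concrete, here is a toy model in Python. It is a simplification and not how ZFS actually schedules I/O: it assumes the only merging that happens is between blocks that sit next to each other on disk, and the block addresses are invented for illustration.

```python
def count_ios(block_addresses):
    """Count the I/Os needed to read logical blocks in order, assuming
    physically adjacent blocks are merged into a single request."""
    if not block_addresses:
        return 0
    ios = 1
    for prev, cur in zip(block_addresses, block_addresses[1:]):
        if cur != prev + 1:  # a gap on disk starts a new I/O
            ios += 1
    return ios

# One 128K multiblock read = 16 database blocks of 8K.
unfragmented = list(range(100, 116))          # all 16 blocks adjacent
scattered = [i * 1000 for i in range(16)]     # no two blocks adjacent
partial = [10, 11, 12, 50, 51, 90, 7, 8, 200,
           300, 301, 400, 500, 600, 601, 700]  # some runs survive

print(count_ios(unfragmented), count_ios(scattered), count_ios(partial))
```

In this model the unfragmented read is one I/O, the fully scattered read is 16, and the partly fragmented example lands at 10, in line with the 10-12 estimate.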
I also expected the random IOPS to be more or less equal.
For the techies who are interested in the details, a disclosure of the configuration can be found here. But a few highlights:
XtremIO volumes: 5x 120GB (ZFS), 5x 120GB (ASM)
SLOB tablespace: 450GB (80% of ZFS free space)
SLOB users: 64, SCALE=900000
Oracle: SGA 3G, multiblock read count 16, 8KB block size
Redo logs: on a separate ASM disk group (also during ZFS testing, although usually you would move redo to ZFS as well)
ZFS cache: 100MB (to avoid the Solaris system hang that happened to me initially, and to prevent ZFS serving RIOPS from OS memory)
- It seems near impossible to do an apples-to-apples performance comparison. So for ZFS worshippers out there: don’t complain to me that the test is wrong; instead do the test yourself the way you think it should be done and publish the result (with disclosure of configuration of course). There, got that one off my chest ;-)
- The test environment I was using was not tuned for the best possible performance. The results are therefore relative and do not reflect the maximum you can expect from an EMC XtremIO array.
Actually I’m running other tests in a better tuned lab on which I might blog as well very soon :-)
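The SLOB sizing in the highlights above can be checked with a few lines of Python. This is a sketch that assumes SCALE counts 8 KB database blocks per SLOB user; that assumption is consistent with the 7031 MB per user that the scan script prints later on.

```python
BLOCK_KB = 8  # database block size from the configuration highlights

def slob_data_mb(scale_blocks, users):
    """Return (MB per user, MB in total) for a SLOB data set,
    assuming SCALE is a number of blocks per user."""
    per_user_mb = scale_blocks * BLOCK_KB / 1024
    return per_user_mb, per_user_mb * users

per_user_mb, total_mb = slob_data_mb(scale_blocks=900_000, users=64)
print(f"{per_user_mb:.2f} MB per user, {total_mb:.0f} MB in total")
```

64 users at SCALE=900000 come to 450,000 MB, which is where the 450GB tablespace size comes from.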
Performing the test
I will skip the details of the SLOB runtime parameters here, but every SLOB run was done with 64 users (entire dataset) and a 5 minute runtime, with the exception of the update runs that drive fragmentation. Initially I ran the full table scans against all users, but later found that randomly picking 5 or 10 gives equally consistent results. All database results come from AWR reports and the scan rate as calculated by my script. In addition I have comments on the system level output (“iostat”).
Random read IO on ASM
This was after creating SLOB on ASM, nothing else happened to the data in between. But you get consistent results here even after messing with the data (I tried moving ZFS based tablespaces back to ASM and the results after that move are within 1%).
Physical reads per second: 44,391
Physical writes per second: 2
Bandwidth: 346 MB/s (roughly)
Note that a single XtremIO X-brick can handle much more than this if you configure more host volumes, correct I/O load-balancing and so on. I will blog on these details later.
Typical iostat looks like this:
device       r/s    w/s     kr/s    kw/s  wait  actv  svc_t  %w   %b
sd12         0.0    0.0      0.0     0.0   0.0   0.0    0.0   0    0
sd13      9958.9    0.3  79670.8     1.3   0.0  11.7    1.2   4  100
sd14      9978.9    0.0  79836.2     0.0   0.0  11.7    1.2   4  100
sd15      9996.9    0.0  79974.8     0.0   0.0  11.8    1.2   4  100
sd16      9931.9    0.0  79457.5     0.0   0.0  11.7    1.2   4  100
sd17      9830.9    0.7  78649.5    10.7   0.0  11.6    1.2   4  100
sd18         0.0    0.3      0.0     1.3   0.0   0.0    0.6   0    0
Note that this was when the IOPS were a bit above average. By dividing read bandwidth over reads you can estimate the IO size: 79670/9958 = 8K. Database I/O gets translated 1:1 into disk I/O. Service time a bit over 1ms (again, XtremIO can get service time much lower at a much higher IOPS rate – due to the better I/O load-balancing and overall better consistency with best practice and optimizations – you should expect well below 0.5 ms with this workload).
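The same I/O size estimate can be applied to any iostat row. The helper below is a small Python sketch, not part of the test tooling; it assumes the row layout shown above, with the device name in the first column followed by r/s, w/s, kr/s and kw/s.

```python
def parse_iostat_line(line):
    """Parse one iostat row with the device name in the first column."""
    fields = line.split()
    return {
        "device": fields[0],
        "r_s": float(fields[1]),    # reads per second
        "w_s": float(fields[2]),    # writes per second
        "kr_s": float(fields[3]),   # KB read per second
        "kw_s": float(fields[4]),   # KB written per second
    }

def avg_read_kb(row):
    """Average read size in KB; 0.0 for an idle device."""
    if row["r_s"] == 0:
        return 0.0
    return row["kr_s"] / row["r_s"]

sample = "sd13 9958.9 0.3 79670.8 1.3 0.0 11.7 1.2 4 100"
row = parse_iostat_line(sample)
print(f"{row['device']}: {avg_read_kb(row):.2f} KB per read")
```

For sd13 this gives 8 KB per read, confirming that each database read maps to exactly one disk read.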
Full table scan on ASM
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
USER28 7031 MB, Time: 29.64
USER5 7031 MB, Time: 30.01
USER30 7031 MB, Time: 29.96
USER40 7031 MB, Time: 30.04
USER1 7031 MB, Time: 30.05
USER29 7031 MB, Time: 30.11
USER34 7031 MB, Time: 30.04
USER20 7031 MB, Time: 30.06
USER7 7031 MB, Time: 30.12
USER31 7031 MB, Time: 30.2
---Summary---
Users: 10
Scanned: 70313 MB
Scan rate: 234.16 MB/s
Runtime: 300.27
PL/SQL procedure successfully completed.
You can see here that the scan rate is roughly 234 MB/s. The AWR report agrees:
Physical read bytes/s = 245MB/s. Physical reads/s = 1872. Divide and you get 245000/1872 = 131 (close to 128).
   0.0    0.0      0.0    0.0  0.0  0.0    0.0    0.0   0   0 c2t11d0
 575.9    9.0  73657.5   94.0  0.0  1.6    0.0    2.7   0  22 c3t0d0
 609.4   10.5  77745.1  196.0  0.0  1.7    0.0    2.7   0  23 c3t1d0
 640.9    6.5  81832.8   92.0  0.0  1.8    0.0    2.8   0  25 c3t2d0
 549.0   12.5  70145.8  212.0  0.0  1.5    0.0    2.7   0  21 c3t3d0
 590.4    5.0  75401.3  152.0  0.0  1.7    0.0    2.8   0  23 c3t4d0
See if it agrees on IO size: 73657/576=127.87. Close enough.
SLOB on ASM with 50% updates
For the record, another run with 50% update percentage. Not that it matters much.
Physical reads per second: 32,600
Physical writes per second: 15,687
Bandwidth: 267 MB/s read, write 134 MB/s.
Nearly all is 8K I/O (a few multiblock writes). Note here that for Oracle to write a block, it has to be read first, so for 50/50 SLOB reads/writes you get double the reads (the usual ones for select and the other part for updates). Hence the 2/1 ratio on OS level.
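The 2/1 ratio follows from a simple model, sketched here in Python. It assumes every SLOB operation misses the buffer cache (plausible with a 3G SGA against a 450GB data set) and that every update eventually causes exactly one block write; both are simplifications.

```python
def expected_io_per_operation(update_pct):
    """Physical reads and writes per SLOB operation in a simple model:
    every operation reads its block, and updates also write it back."""
    update_fraction = update_pct / 100.0
    reads = 1.0               # selects and updates both read the block
    writes = update_fraction  # only updates cause a write
    return reads, writes

reads, writes = expected_io_per_operation(50)
model_ratio = reads / writes
observed_ratio = 32600 / 15687  # reads/s and writes/s from the AWR report
print(f"model {model_ratio:.2f}, observed {observed_ratio:.2f}")
```

The observed ratio is slightly above 2, close to what the model predicts for a 50% update mix.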
Moving to ZFS
(after moving tablespace offline – note that the copy was done once before so the write to ZFS was actually an overwrite of the existing iops.dbf file)
RMAN> copy datafile '+DATA/xtremdb/datafile/iops.267.854808281' to '/data_pool/data/iops.dbf';
. .
input datafile file number=00005 name=+DATA/xtremdb/datafile/iops.267.854808281
output file name=/data_pool/data/iops.dbf tag=TAG20140806T100939 RECID=3 STAMP=854886495
channel ORA_DISK_1: datafile copy complete, elapsed time: 02:18:46

extended device statistics
   r/s    w/s     kr/s     kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
 516.5  751.1   2810.2  17611.3   0.0   2.3     0.0     1.8   1  35  c2t0d0
 528.5  851.1   3122.2  17617.3   0.0   2.4     0.0     1.7   1  36  c2t1d0
 422.0  743.1   2026.2  17503.3   0.0   2.2     0.0     1.9   0  34  c2t2d0
 442.5  854.6   2238.2  17485.3   0.0   2.3     0.0     1.8   1  34  c2t3d0
 505.0  444.0   2846.2  17737.3   0.0   2.0     0.0     2.1   0  33  c2t4d0
  48.0    0.5  12288.9      2.0   0.0   0.1     0.0     2.4   0  11  c3t0d0
  48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.6   0  13  c3t1d0
  48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.5   0  12  c3t2d0
  48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.3   0  11  c3t3d0
  32.0    0.0   8192.6      0.0   0.0   0.1     0.0     2.5   0   8  c3t4d0
Note: I left out all irrelevant lines from iostat.
Interesting: this is the copy from ASM to ZFS; the upper 5 lines are ZFS disks, the lower 5 are ASM. You can see that RMAN is reading roughly 256K I/Os from ASM. More interesting is that ZFS writes also involve reads (now why would you have to read lots of data from a file system just to overwrite a file? Think about it). Another finding here is that ZFS writes much more data than ASM reads from disk. And I’m using recordsize=8K with aligned disks so that should not be an issue. Probably an artifact of ZFS trying to bundle I/Os together a little too aggressively? Intent logging (shouldn’t be, because logbias=throughput)? You tell me.
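Adding up the iostat columns shows how much more ZFS writes than ASM reads. The Python below is a quick back-of-the-envelope sketch using only the single iostat interval shown above, so the ratio is indicative rather than an average over the whole copy.

```python
# (r/s, w/s, kr/s, kw/s) per device, taken from the iostat sample above
zfs_disks = [
    (516.5, 751.1, 2810.2, 17611.3),
    (528.5, 851.1, 3122.2, 17617.3),
    (422.0, 743.1, 2026.2, 17503.3),
    (442.5, 854.6, 2238.2, 17485.3),
    (505.0, 444.0, 2846.2, 17737.3),
]
asm_disks = [
    (48.0, 0.5, 12288.9, 2.0),
    (48.0, 0.0, 12288.9, 0.0),
    (48.0, 0.0, 12288.9, 0.0),
    (48.0, 0.0, 12288.9, 0.0),
    (32.0, 0.0, 8192.6, 0.0),
]

asm_read_kb = sum(d[2] for d in asm_disks)    # data RMAN reads from ASM
zfs_write_kb = sum(d[3] for d in zfs_disks)   # data ZFS writes to disk
zfs_read_kb = sum(d[2] for d in zfs_disks)    # extra reads on the ZFS side
asm_io_size_kb = asm_disks[0][2] / asm_disks[0][0]

print(f"ASM read size: {asm_io_size_kb:.0f} KB")
print(f"ZFS writes / ASM reads: {zfs_write_kb / asm_read_kb:.2f}")
print(f"ZFS reads during copy: {zfs_read_kb / 1024:.1f} MB/s")
```

In this interval ZFS writes about 1.5 times the data that RMAN reads from ASM, on top of roughly 13 MB/s of reads from the ZFS disks.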
Random read IO on ZFS
Physical reads per second: 22,231
Physical writes per second: 2
Bandwidth: 182 MB/s (roughly).
extended device statistics
device       r/s    w/s     kr/s    kw/s  wait  actv  svc_t  %w   %b
sd2       9170.4   11.6  73404.2   241.0   0.0   8.8    1.0   4  100
sd3       9174.4   12.6  73422.8   227.7   0.0   8.8    1.0   4  100
sd4       9123.5    3.7  73000.0   141.1   0.0   8.8    1.0   4  100
sd5       9181.4    3.3  73547.2    49.3   0.0   8.7    1.0   4  100
sd6       9229.0   10.3  73919.9   137.1   0.0   8.8    1.0   4  100
Notes:
This iostat snapshot is from a higher-than-average moment. Average read size (disk) = 8K. Service times are actually a bit lower than the same test on ASM. I expected similar RIOPS but only get half the rate of ASM here. ZFS kernel overhead? Not sure.
Full table scan on ZFS
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
PL/SQL procedure successfully completed.
USER58 7031 MB, Time: 52.58
USER64 7031 MB, Time: 55.17
USER61 7031 MB, Time: 54.14
USER0 7031 MB, Time: 52.36
USER29 7031 MB, Time: 51.62
USER10 7031 MB, Time: 52.28
USER32 7031 MB, Time: 51.63
USER57 7031 MB, Time: 53.69
USER14 7031 MB, Time: 52.7
USER6 7031 MB, Time: 55.89
---Summary---
Users: 10
Scanned: 70313 MB
Scan rate: 132.15 MB/s
Runtime: 532.08
PL/SQL procedure successfully completed.
You can see here that the scan rate is roughly 132 MB/s. The AWR report agrees again:
Physical read bytes/s = 138MB/s. Physical reads/s = 1075. Divide and you get 138000/1075 = 128.
Next I ran SLOB with update percentage 100 for a total of 3 hours. Based on the write rate, I calculated that after 3 hours statistically most database blocks would have been overwritten at least once.
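The "statistically most blocks" estimate can be sketched as follows in Python. The post does not state the write rate of the 100% update run, so the 15,000 writes per second below is a purely hypothetical figure, and the model assumes updates hit blocks uniformly at random.

```python
import math

def fraction_overwritten(total_blocks, random_writes):
    """Expected fraction of blocks hit at least once by uniformly random
    writes: 1 - (1 - 1/N)^W, computed via logarithms for large N."""
    never_hit = math.exp(random_writes * math.log1p(-1.0 / total_blocks))
    return 1.0 - never_hit

total_blocks = 64 * 900_000      # 64 SLOB users x SCALE=900000 blocks
assumed_writes_per_s = 15_000    # HYPOTHETICAL, not a measured value
run_seconds = 3 * 3600           # the 3 hour update run

covered = fraction_overwritten(total_blocks,
                               assumed_writes_per_s * run_seconds)
print(f"{covered:.1%} of blocks overwritten at least once")
```

With that assumed rate, roughly 94% of the blocks would be rewritten at least once in 3 hours; a lower real write rate would give a lower percentage.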
Random read IO on ZFS after updates
Physical reads per second: 22,111
Physical writes per second: 4
Bandwidth: 181 MB/s (roughly).
The iostat output also looks very similar compared to before the updates. This is what I expected as this is all random 8K I/O which should not be influenced by fragmentation or I/O inflation.
Full table scan on ZFS after updates
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
PL/SQL procedure successfully completed.
USER47 7031 MB, Time: 49.68
USER15 7031 MB, Time: 50.9
USER37 7031 MB, Time: 50.36
USER6 7031 MB, Time: 49.75
USER7 7031 MB, Time: 53.51
USER33 7031 MB, Time: 77.78
USER16 7031 MB, Time: 76.57
USER35 7031 MB, Time: 77.37
USER28 7031 MB, Time: 77.29
USER19 7031 MB, Time: 77.13
---Summary---
Users: 10
Scanned: 70313 MB
Scan rate: 109.80 MB/s
Runtime: 640.35
PL/SQL procedure successfully completed.
Scan rate after fragmentation dropped from 132 to 110 MB/s. Frankly I expected a steeper dive but it seems that the XtremIO box is holding up pretty well, and the IOPS inflation issue is probably not that bad on such an array.
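To put the measurements side by side, here is a short Python sketch that computes the ratios from the numbers reported above (AWR reads per second for the random read tests, script scan rates for the full table scans). Nothing in it is new data.

```python
results = {
    "asm":       {"random_reads_s": 44391, "scan_mb_s": 234.16},
    "zfs_fresh": {"random_reads_s": 22231, "scan_mb_s": 132.15},
    "zfs_aged":  {"random_reads_s": 22111, "scan_mb_s": 109.80},
}

def ratio(metric, a, b):
    """How many times higher metric is for configuration a than for b."""
    return results[a][metric] / results[b][metric]

random_ratio = ratio("random_reads_s", "asm", "zfs_fresh")
scan_ratio_fresh = ratio("scan_mb_s", "asm", "zfs_fresh")
scan_ratio_aged = ratio("scan_mb_s", "asm", "zfs_aged")
scan_drop = 1 - ratio("scan_mb_s", "zfs_aged", "zfs_fresh")

print(f"random reads, ASM vs ZFS:      {random_ratio:.2f}x")
print(f"scan, ASM vs fresh ZFS:        {scan_ratio_fresh:.2f}x")
print(f"scan, ASM vs fragmented ZFS:   {scan_ratio_aged:.2f}x")
print(f"ZFS scan drop after updates:   {scan_drop:.1%}")
```

The ZFS scan rate loses about 17% after the update runs, while random reads stay essentially unchanged.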
In this test comparing Oracle on XtremIO using both ASM and ZFS, the difference is significant: ASM performs roughly twice as well as ZFS (about 2.0x on random reads, and between 1.8x and 2.1x on full table scans depending on fragmentation).
One should take into consideration that this is on an extremely high performance flash array that is insensitive to workload types (hot spots, fragmentation, large vs small I/O).
- ASM seems to have roughly double the I/O performance compared to ZFS on the same system
- XtremIO completely solves (or avoids) ZFS fragmentation issues but IOPS inflation still occurs although with less dramatic effects than expected
- There are many parameters that can be tweaked so an apples-to-apples comparison is practically impossible. Mileage may vary.
I haven’t had the chance yet to also test on spinning disk. Will keep that for a future post.