ZFS and Database fragmentation

Disk Fragmentation – O&O technologies. Hope they don’t mind the free advertising.

Yet another customer was asking me for advice on implementing the ZFS file system on EMC storage systems. Recently I did some hands-on testing with ZFS as Oracle database file store so that I could get an opinion on the matter.

One of the frequent discussions that comes up is the fragmentation issue. ZFS uses a copy-on-write allocation mechanism, which basically means that every time you write to a block on disk (whether this is a newly allocated block or, very importantly, an overwrite of a previously allocated one) ZFS will buffer the data and write it out to a completely new location on disk. In other words, it will never overwrite data in place. A lot of discussions can be found in the blogosphere and on forums debating whether this is really the case, how serious it is, what the impact is on performance, and what ZFS has done to either prevent or mitigate the issue (e.g. by using caching, smart disk allocation algorithms, etc.).
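To illustrate the principle, here is a toy simulation (a sketch, not actual ZFS internals): a file starts out laid out sequentially on disk, and every overwrite of a logical block relocates it to a fresh physical location, as copy-on-write does.

```python
# Toy model of copy-on-write allocation (not ZFS code): a file of 1000
# logical blocks receives random single-block updates, and every update
# lands on a fresh physical location instead of overwriting in place.
import random

random.seed(42)
NBLOCKS = 1000          # logical blocks in the "datafile"
NUPDATES = 5000         # random single-block updates

# Initially the file is sequential: logical block i -> physical block i
cow_layout = list(range(NBLOCKS))
next_free = NBLOCKS     # next fresh physical block for copy-on-write

for _ in range(NUPDATES):
    lb = random.randrange(NBLOCKS)
    cow_layout[lb] = next_free   # CoW: the overwrite goes to a new location
    next_free += 1

def contiguous_runs(layout):
    """Count runs of physically adjacent blocks when read in logical order."""
    runs = 1
    for a, b in zip(layout, layout[1:]):
        if b != a + 1:
            runs += 1
    return runs

print(contiguous_runs(list(range(NBLOCKS))))  # update-in-place: always 1 run
print(contiguous_runs(cow_layout))            # CoW: close to NBLOCKS runs
```

Update-in-place keeps the file in one contiguous run forever; under copy-on-write the number of physical runs quickly approaches the number of blocks.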

In this post I attempt to show how quickly database files on ZFS file systems get fragmented on disk. I will not make any comments on how this affects performance (I’ll save that for a future post). I also deliberately ignore ZFS caching and other optimizing features – the only thing I want to show right now is how much fragmentation is caused on the physical disk by using ZFS for Oracle data files. Note that this is a lengthy, deeply technical article, so you might want to skip the details and jump right to the conclusion at the bottom :-)

My test environment is legacy hardware, so I am not even attempting to break performance records or come close. The performance I get is not impressive at all. I’m using single- and dual-core servers with limited memory and single SATA disks with a limited data size, which in no way represent modern datacenter equipment. But for showing the effects of fragmentation you don’t need fast hardware.

Now a bit of background on what I expected from my testing. First of all, database data files behave much differently from, say, office files (e.g. PowerPoint, PDF, Word docs) or other “file serving” style environments (e.g. content management, software development and many other things).

An office file typically gets written as one whole. On ZFS, I expect this to cause more or less sequential, lightly fragmented allocation even with larger files (5 to 10 MB). If you load the file in your editor and then save it, most office software (Microsoft Office, LibreOffice, …) will overwrite the whole file at once. Much the same goes for other regular file serving purposes. On ZFS, due to the copy-on-write mechanism, the new file version gets written out completely to a new disk location, but the new file is still largely unfragmented.

A database file behaves differently. Such files typically get created only once (when creating or extending tablespaces) and only grow if the database needs more space. Otherwise the file itself does not change (although the contents within the file change heavily – as we will see).

Especially if the workload is OLTP-like with many small random updates, every update will modify a single database block – or a few of them at a time. What’s the best tool to create random I/O on an Oracle database? Kevin Closson’s SLOB!

But I made a few modifications to SLOB: first of all to make it run on my very modest hardware (128 sessions on a single CPU is not a good idea), then to tweak the data size (82MB for a single user is likely to stick in cache and I don’t want that), and finally, because I needed a method to track database blocks on a physical disk.

So let’s go over the test details.


ZFS Server:

  • FreeBSD 9.1 (32-bit)
  • Dual processor
  • Memory: 6 GB (of which 3.5 GB usable without PAE)
  • Disks: 3x 73GB 10,000 rpm SCSI via Adaptec SCSI controller (not used in this test)
  • 1x SATA 400 GB via SATA PCI controller
  • Gigabit Ethernet (with Jumbo frames but set to MTU 7200 to match the DB server)

DB server:

  • CentOS 5.9 64-bit single CPU
  • Memory: 2GB
  • Disks: 1x SATA 400GB
  • Gigabit Ethernet (MTU size max 7200 due to low cost consumer hardware ;)
  • Database software: Oracle with Direct NFS enabled

On Linux, I created an LVM logical volume of 1GB in size and zeroed it out:

# lvcreate -Ay -L1GB -nslob data
# dd if=/dev/zero of=/dev/data/slob bs=65536

Then created a standard EXT3 file system on it:

# mkfs.ext3 /dev/data/slob
# mount /u02/slob/
# df | grep slob
                       1032088     34092    945568   4% /u02/slob
# chown oracle:dba /u02/slob

Then I created a tablespace for SLOB:

# ls -alh /u02/slob/slob.dbf ; df -h /u02/slob
-rw-r----- 1 oracle dba 2.1M Feb 25 11:55 /u02/slob/slob.dbf
Filesystem            Size  Used Avail Use% Mounted on
1008M   36M  973M   4% /u02/slob

Set up the SLOB table with only one user in the EXT tablespace:

$ ./setup.sh EXT 1

However, I modified SLOB a bit to be able to find row offsets in the tablespace later (via the MAGIC string). I also increased the default number of rows (12,000) to 100,000 so that it creates an 830MB tablespace (the default would only create about 93MB per user) – forcing 83% space allocation on the FS.

SQL> desc cf1
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
MAGIC                                              CHAR(32)
CUSTID                                             NUMBER(8)
C2                                                 VARCHAR2(128)
C3                                                 VARCHAR2(128)
C19                                                VARCHAR2(128)
C20                                                VARCHAR2(128)

The code to enter MAGIC values in CF1.MAGIC looks like this and is run once after loading the table:

declare
  i number := 1;
begin
  FOR p IN ( SELECT rowid FROM cf1 ) LOOP
    UPDATE CF1 SET MAGIC=chr(10) || 'SLOBROW ' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
    i := i+1;
  END LOOP;
end;
/

This runs sequentially through all rows of the table (starting from the first row), setting a searchable text string with the row number in ASCII. The chr(10) characters are newlines so that the output looks better later on (when using the “strings” command).

The table content looks like this:

> select magic,custid from cf1 where rownum < 10
MAGIC                                CUSTID
-------------------------------- ----------
SLOBROW  000001
SLOBROW  000002
SLOBROW  000008
SLOBROW  000009

9 rows selected.

Note that the newlines are caused by the chr(10), so this is on purpose.

We check the fragmentation within the datafile:

# strings -t x /u02/slob/slob.dbf |grep "SLOBROW" >outfile1.txt

The outfile1.txt looks like this (hex byte offset, followed by the MAGIC string):

109daf SLOBROW  000001
10bdaf SLOBROW  000002
10ddaf SLOBROW  000003
10fdb1 SLOBROW  000004
111daf SLOBROW  000005
113dad SLOBROW  000006
115daf SLOBROW  000007
117daf SLOBROW  000008
119daf SLOBROW  000009
11bdad SLOBROW  000010
11ddad SLOBROW  000011
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217dad SLOBROW  099999
31219dad SLOBROW  100000

So within the datafile there is no fragmentation.

Now we look at the 1GB block device that holds the /u02/slob filesystem:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile2.txt

The outfile2.txt looks like:

790adad SLOBROW  000001
790cdad SLOBROW  000002
790edad SLOBROW  000003
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914daf SLOBROW  000006
7916dad SLOBROW  000007
7918dad SLOBROW  000008
791adad SLOBROW  000009
791cdad SLOBROW  000010
791edad SLOBROW  000011
39a0ddaf SLOBROW  099995
39a0fdaf SLOBROW  099996
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15dad SLOBROW  099999
39a17dad SLOBROW  100000

Again, no fragmentation within the block device.

Now I ran SLOB with one user (writer) against the table.

One SLOB writer (lite) will update 1000 x 256 = 256,000 rows of one 8K block each, which is about 2 GB – more than the size of the table contents (830MB). After a few runs we can assume most blocks have been updated a few times.
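A quick sanity check of that arithmetic:

```python
# One SLOB "lite" writer: 1000 iterations of 256 single-block (8K) updates
blocks = 1000 * 256                 # 256,000 blocks touched
gigabytes = blocks * 8192 / 1e9
print(blocks, round(gigabytes, 2))  # 256000 blocks, ~2.1 GB written
```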

Note that I changed this line in SLOB’s writer.sql.lite:

v_r := dbms_random.value(257, 10000) ;

to:

v_r := dbms_random.value(257, 100000) ;

so that it generates writes against all table rows (830MB).

Note that my SGA is 500MB, buffer cache only about 184 MB, which forces most of the block writes to physical I/O.

$ ./runit.sh 1 0

Tm 1538

(1538 seconds to read and write 2GB worth of random 8K blocks – about 300 iops – I only use a single slow SATA disk so this is not even too bad ;)

From awr.txt FYI:

physical read bytes                   1,692,778,496    1,099,279.6   1,241,950.5
physical read total IO requests             207,392          134.7         152.2
physical read total bytes             1,722,757,120    1,118,747.6   1,263,945.1
physical write bytes                  1,860,108,288    1,207,942.5   1,364,716.3
physical write total IO requests            210,036          136.4         154.1
physical write total bytes            2,107,344,896    1,368,496.4   1,546,107.8

Let’s check fragmentation on the block device again now that we have updated many blocks:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile3.txt

The file looks like this:

790adad SLOBROW  000001
790cc71 SLOBROW  000002
790cdad SLOBROW  000002
790ec71 SLOBROW  000003
790edad SLOBROW  000003
7910c71 SLOBROW  000004
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914c74 SLOBROW  000006
7914daf SLOBROW  000006
7916c71 SLOBROW  000007
7916dad SLOBROW  000007
7918c71 SLOBROW  000008
7918dad SLOBROW  000008
791ac71 SLOBROW  000009
791adad SLOBROW  000009
791cc71 SLOBROW  000010
791cdad SLOBROW  000010
39a0dc74 SLOBROW  099995
39a0ddaf SLOBROW  099995
39a0fc74 SLOBROW  099996
39a0fdaf SLOBROW  099996
39a11c71 SLOBROW  099997
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15c71 SLOBROW  099999
39a15dad SLOBROW  099999
39a17c71 SLOBROW  100000
39a17dad SLOBROW  100000

So why the double entries? I guess Oracle modifies the block contents after updating a row, in such a way that offsets within the blocks change slightly. The double entries are just leftover artifacts of the original data, but within Oracle this is now unallocated content.

Notice that:
a) no fragmentation occurs, neither in the datafile nor on the block device;
b) the offsets of every row are mostly the same.

Let’s take a look at row 10, row 5000 and row 10000 as an example:

# cat outfile2.txt |grep 000010
791cdad SLOBROW  000010
# cat outfile3.txt |grep 000010
791cc71 SLOBROW  000010
791cdad SLOBROW  000010

# cat outfile2.txt |grep 005000
a307dad SLOBROW  005000
# cat outfile3.txt |grep 005000
a307c71 SLOBROW  005000
a307dad SLOBROW  005000

# cat outfile2.txt |grep 010000
ca8fdad SLOBROW  010000
# cat outfile3.txt |grep 010000
ca8fc71 SLOBROW  010000
ca8fdad SLOBROW  010000

Note that the offset of the magic string only changed a little bit: 316 bytes, well within a single 8K Oracle block. Again this shows that Oracle modified the contents of the blocks, but not the block offsets within the datafile.
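For those who want to double-check: the shift follows directly from the hex offsets reported by strings -t x.

```python
# Intra-block shift of row 10's MAGIC string between the two scans
before = int("791cdad", 16)  # offset in outfile2.txt (before the SLOB run)
after  = int("791cc71", 16)  # extra entry in outfile3.txt (after the run)
print(before - after)        # 316 bytes, well within one 8K Oracle block
```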

Let’s do another run…

$ ./runit.sh 1 0

Tm 1180

(a bit faster, probably because the ext3 file system cache had warmed up)

Check if anything changed on the layout since the last run:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile4.txt
# diff outfile3.txt outfile4.txt |wc -l
# diff outfile3.txt outfile4.txt |grep SLOB|wc -l
# diff outfile3.txt outfile4.txt |grep SLOB|grep -v ">"
<no output>

So we have a few changes (additions to the new file – reflecting 756 changes within Oracle blocks!)

# diff outfile3.txt outfile4.txt | head -6
> 7a94c71 SLOBROW  000196
> 7b28c71 SLOBROW  000268
> 7b86c71 SLOBROW  000315

Again Oracle moved some data within the blocks. But no single block moved positions within the datafile, and no blocks within the datafile moved offsets on disk.

We can conclude that on EXT3 there is no fragmentation other than that caused by initial file creation. We can update blocks as many times as we like, but the block offsets never change again. This also makes me confident that filling an ext3 file system up to 99% with large static datafiles will have minimal impact on fragmentation.

Let’s now do the same thing on ZFS.

On my BSD ZFS server:

root@zfs:/root # gpart show
=>       63  781422705  ada0  MBR  (372G)
         63        126        - free -  (63k)
        189   16776963     1  freebsd  [active]  (8G)
   16777152   16777215     2  freebsd  (8G)
   33554367    2097144     3  freebsd  (1G)
   35651511  745771257        - free -  (355G)

The disk slice I will put the zpool on is /dev/ada0s3, a 1GB partition (like the EXT3 logical volume). The other slices and disks will be used for performance testing later.

ada0 is also a 400GB SATA disk, like the one in the Linux DB server (not that this matters for this test). Let’s zero out the data device that we want to use for the zpool (so that we can use “strings” later on the device):

root@zfs:/root # dd if=/dev/zero of=/dev/ada0s3 bs=65536
dd: /dev/ada0s3: short write on character device
dd: /dev/ada0s3: end of device
16384+0 records in
16383+1 records out
1073737728 bytes transferred in 15.815460 secs (67891652 bytes/sec)

Now we create the zpool on the disk slice and a ZFS file system on it:

root@zfs:/root # zpool create data ada0s3
root@zfs:/root # zfs create -o recordsize=8K data/data
root@zfs:/root # zfs set sharenfs=on data/data
root@zfs:/root # zfs set logbias=throughput data/data
root@zfs:/root # zfs set sync=disabled data/data
root@zfs:/root # zfs set atime=off data/data
root@zfs:/root # chown oracle:dba /data/data/

Note that I used 8K recordsize because this is the Oracle recommendation for datafiles. It also prevents excess reads when updating a single 8K block.
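The read-amplification argument is easy to quantify (toy arithmetic, not ZFS code): with copy-on-write, updating a single 8K database block means reading and rewriting the whole ZFS record that contains it.

```python
# Write/read amplification for a single 8K database block update,
# as a function of the ZFS recordsize (simplified model)
db_block = 8 * 1024
for recordsize in (8 * 1024, 128 * 1024):   # 8K as tuned here vs the 128K ZFS default
    # copy-on-write rewrites the whole record holding the 8K database block
    print(recordsize, recordsize // db_block)
```

With the default 128K recordsize every 8K block update would touch 16x the data; matching the recordsize to the database block size avoids that.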

“logbias=throughput” is a tricky one as it might cause lost transactions or even corruption after a system crash. But I am not interested in that right now and I need this setting to get halfway decent performance. The same goes for “sync=disabled”. This is normally not recommended, but on my not-so-powerful system it boosts performance big time, and for this test it does not matter.

root@zfs:/root # zpool list
data  1016M   140K  1016M     0%  1.00x  ONLINE  -
root@zfs:/root # zpool status
pool: data
state: ONLINE
scan: none requested


data        ONLINE       0     0     0
ada0s3    ONLINE       0     0     0

errors: No known data errors
root@zfs:/root # zfs list
data        140K   984M    32K  /data
data/data    31K   984M    31K  /data/data

Going back to the Linux db host:

# mount -a
# df | grep zfs
zfs:/ufs               8106752         0   7458176   0% /zfs/ufs
zfs:/data/data         1007552        64   1007488   1% /zfs/data

(the UFS entry is there because I plan to test UFS against ZFS performance as well)
I have prepared Direct NFS and this is what the oranfstab looks like:

# cat /etc/oranfstab
server: zfs
export: /data/data mount: /zfs/data
export: /ufs mount: /zfs/ufs

Now we have an empty ZFS pool and file system. Let’s create a tablespace on it (after bouncing the database to re-enable Direct NFS). First, let’s drop the existing SLOB tablespace (on EXT3) and users:

$ sqlplus / as sysdba @drop_users

And create the SLOB table:

$ ./setup.sh ZFS 1
Setting up user 1
Waiting for background processes - Mon Feb 25 13:44:55 CET 2013
Table created.
PL/SQL procedure successfully completed.

So now we have an identical datafile as in the previous test but now on ZFS (via dNFS) instead of local EXT3 file system.

root@zfs:/root # df -h /data/data/ ; ls -alh /data/data/zfs.dbf
Filesystem    Size    Used   Avail Capacity  Mounted on
data/data     983M    841M    142M    86%    /data/data
-rw-r-----  1 oracle  dba   834M Feb 25 13:48 /data/data/zfs.dbf

Before even running the first SLOB test, let’s look at the MAGIC offsets again (first in the datafile, then on the raw disk):

root@zfs:/root # strings -t x /data/data/zfs.dbf | grep "SLOBROW" > zfsout1.txt ; cat zfsout1.txt
109daf SLOBROW  000001
10bdad SLOBROW  000002
10ddaf SLOBROW  000003
10fdad SLOBROW  000004
111dad SLOBROW  000005
3120fdad SLOBROW  099995
31211dad SLOBROW  099996
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217daf SLOBROW  099999
31219dad SLOBROW  100000

That looks very familiar. So the zfs.dbf file on ZFS itself is not fragmented (like in the initial test on ext3).

Now let’s inspect the raw block device:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout2.txt ; cat zfsout2.txt
4037ad SLOBROW  001087
4081ad SLOBROW  001088
40a1ad SLOBROW  001089
40cfaf SLOBROW  001090
410dad SLOBROW  001091
412dad SLOBROW  001092
3cbf7fad SLOBROW  084135
3cbf9faf SLOBROW  084136
3cbfbfad SLOBROW  084137
3cbfdfaf SLOBROW  084138

So you see the first entry is not row 1, and the last entries are not rows 99,990 through 100,000.

Let’s inspect a random area in the file:

28c01dad SLOBROW  095075
28c03dad SLOBROW  095081
28c05dad SLOBROW  095080
28c07dad SLOBROW  095077
28c09dad SLOBROW  095082
28c0bdad SLOBROW  095083
28c0ddad SLOBROW  095084
28c0fdad SLOBROW  095085
28c11dad SLOBROW  095086
28c13dad SLOBROW  095076
28c15dad SLOBROW  095088
28c17dad SLOBROW  095087
28c19dad SLOBROW  095090
28c1bdad SLOBROW  095092
28c1ddad SLOBROW  095089
28c1fdad SLOBROW  095093
28c21dad SLOBROW  095094
28c23dad SLOBROW  095078
28c25dad SLOBROW  095096
28c27dad SLOBROW  095097
28c29dad SLOBROW  095098
28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091
28c35dad SLOBROW  095104
28c37dad SLOBROW  095101
28c39daf SLOBROW  095106
28c3bdad SLOBROW  095103

So you see the rows are not linearly allocated, and there are some weird forward and backward jumps all over the place. That’s directly after creating the table and filling it with data – before throwing any block updates at it!

Let’s figure out how many sequential chunks we have. With ZFS copying blocks around this is less trivial than it seems. Consider this entry:

18e3afad SLOBROW  000879
27444bad SLOBROW  000879

How do we know which of these two reflects the real Oracle block, and which one is the leftover from a previous write? Ignoring such problems for a moment (there are only a few duplicates so far, so I will work this out later), let’s try to find out how much fragmentation we have. How do you define that?

Let’s define fragmentation as the total disk seek distance (in bytes) if we had to read the file sequentially. Every time the disk heads have to reposition, we measure the offset jump on the disk.

Let’s say a seek is a forward jump larger than 4 blocks (32768 bytes) or a backward jump larger than 4 blocks (ignoring data blocks that are not adjacent but very close, i.e. on the same disk track).

The script I used to calculate seek distance:

typeset -i off=0 oldoff=0 oldrow=0 distance=0
cat $1 | sort -n -k 3 | while read offset magic row
do
  off=$(printf "%d" 0x$offset)
  diff=$(expr $off - $oldoff)
  if [[ $oldoff -ne 0 ]] && [[ $diff -gt 32768 || $diff -lt -32768 ]]; then
    distance=$(expr $distance + $(expr ${diff#-} / 1024 ))
    rows=$(expr $row - $oldrow)
    printf "%8s %8s %11s %11s %8s\n" $row $rows $off $diff $(expr $distance / 1024)
    oldrow=$row
  fi
  oldoff=$off
done
Now the script is not perfect – you could argue whether it reflects real fragmentation and even propose better methods. But getting closer to perfection makes the thing very complex, and I like to keep it simple.
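For reference, here is a minimal Python equivalent of the same idea (same definition of a seek: a jump of more than 4 blocks, i.e. 32768 bytes, in either direction). It’s a sketch and ignores the duplicate-entry problem mentioned earlier.

```python
# Total seek distance of a file, given "hexoffset MAGIC rownumber" lines
# as produced by: strings -t x <device> | grep SLOBROW

def seek_distance(lines, threshold=32768):
    """Walk the entries in row (logical) order; every offset jump larger
    than the threshold counts as a seek and adds its absolute length."""
    entries = sorted((int(row), int(off, 16))
                     for off, _magic, row in (l.split() for l in lines))
    total, prev = 0, None
    for _row, off in entries:
        if prev is not None and abs(off - prev) > threshold:
            total += abs(off - prev)
        prev = off
    return total

# A sequentially laid out file (adjacent 8K blocks) yields zero seek distance:
print(seek_distance(["109daf SLOBROW 000001", "10bdaf SLOBROW 000002"]))  # 0
```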

Running the script against the output file:

[root@zfs ~]# ./countfragments zfsout2.txt | tee frags.txt

results in a total seek distance of 94 GB. Note that in the real world the physical disk seek distance is probably much less, because a) we have duplicates that are not real, and b) if you see something like this:

28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091

then the caching mechanism will most likely use prefetching and caching to handle this with only one or two seeks (instead of jumping up and down for each block). But I ignore that for now – we assume a dumb disk without any cache or I/O order optimizations, to make the point (otherwise the whole exercise would become much too complicated).

Now let’s kick off a slob run and see what happens.

$ ./runit.sh 1 0

Tm 3205

(note on ext3 it took 1180 seconds, but this delay could also be caused by the NFS network layer so I will not draw conclusions from this).

Some Awr.txt output again FYI:

physical write bytes                  1,910,071,296      595,624.9   1,057,040.0
physical write total IO requests            172,951           53.9          95.7
physical write total bytes            2,254,415,872      703,003.2   1,247,601.5
physical write total multi block              1,866            0.6           1.0
physical writes                             233,163           72.7         129.0

Created a new scan for SLOB blocks:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout3.txt

Comparing line counts against the scan taken before the SLOB run:

root@zfs:/root # wc -l zfsout2.txt zfsout3.txt
100056 zfsout2.txt
212729 zfsout3.txt
312785 total

There are too many duplicates now to make realistic estimations about fragmentation. So let’s update the MAGIC strings so that we can look for actual database block and exclude most of the leftovers from previous writes.


declare i number := 1;
begin
  FOR p IN ( SELECT rowid FROM cf1 ) LOOP
    UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
    i := i+1;
  END LOOP;
end;
/

Running it:

$ sqlplus user1/user1 @updatemagic

Enter value for magic: SLOB001
old   6:         UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
new   6:         UPDATE CF1 SET MAGIC=chr(10) || 'SLOB001' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;

PL/SQL procedure successfully completed.

(FYI, the runtime was 22 minutes – on EXT3 this runs in less than 1 minute – so you get some feeling for the delays here. I can tell you the network was not highly utilized at all; in fact it was hard to find any bottleneck, but it seemed to be caused by a high rate of CPU context switches on the BSD machine)

Here the output on the BSD machine during the MAGIC update from iostat against the ZFS device:

root@zfs:/root # iostat -x ada0 10
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     133.5   0.2   580.3     0.9    1   1.9   8 <-- it looked like this for a while, only reads but hardly any updates. 
ada0     133.4   0.2   581.8     0.9    0   2.0   9  -- guess this is due to delayed dbwriter activity
ada0     116.8  61.1  1618.8   358.9    3  22.8  94 <-- here it finally starts writing
ada0     113.0  84.6  1346.8   397.8   10  18.7  88
ada0     108.0  75.8  1344.9   443.2    7  20.3  90
ada0     109.8  73.4  1505.4   494.5    6  19.3  86
ada0      96.7 186.3  1409.4  1170.1    6  16.9  92
ada0      41.8  14.2   390.2   137.2    8  26.6  37
ada0       0.0   0.0     0.0     0.0    8   0.0   0 <-- here ZFS does not do any IO for about 20 seconds. Bottleneck assumed
ada0       0.0   0.0     0.0     0.0    8   0.0   0  -- to be completely CPU/memory based – probably in part because of
ada0      17.4 186.7   615.3  1334.9   10 126.4 329  -- my old hardware ;-)
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0      25.5  26.9   231.8   118.6    1 589.4 322
ada0     111.4  85.5  1145.5   630.2    1  17.4  94
ada0      98.4 156.5  1269.0  1074.7    1  16.2  89

Check for the results:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout4.txt
root@zfs:/root # wc -l zfsout4.txt
82841 zfsout4.txt

(Weird – there should be at least 100,000 rows.) The fix: flush Oracle’s “dirty cache” ;-)

SQL> alter system checkpoint;
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout5.txt
root@zfs:/root # wc -l zfsout5.txt
101223 zfsout5.txt

(much better, and only 1223 duplicates)

Let’s check for fragmentation in the new environment:

root@zfs:/root # ./countfragments zfsout5.txt | tee frags.txt
root@zfs:/root # tail -5 frags.txt
099996        1    95640177  -633761280  2433521
099997        1   729409649   633769472  2434125
099998        1    95918705  -633490944  2434730
099999        1   730164340   634245635  2435334
100000        1    95943281  -634221059  2435939

So the total seek distance, after we performed a SLOB write run followed by an update of the magic numbers, is now 2435939 MB, or about 2500 GB (up from 94 GB after initial creation). You can see fragmentation taking its heavy toll. Let’s also find out how large the largest contiguous sequence of blocks is:

root@zfs:/root # cat frags.txt | sort -n -k +2 | tail
040557       13   653829233      967168   184067
040570       13   180370545  -473556992   184519
067558       13    79769713  -408209920   510990
084989       13   205561969  -730385214   613430
031627       14    12778609  -165509632   151051
033906       14   855796337     3719168   159611
034090       15   264289393  -594196992   160196
085566       15     4227185  -605910016   614971
033971       31   858530417      166912   159618
085389       36   609938548       35331   614351

(note that the sorted second column indicates the number of adjacent blocks before jumping to another offset, so a value of 15 means the script found 15 adjacent 8K blocks)

Within the new layout of the datafile on ZFS, there are no sequential chunks larger than 36 Oracle blocks! And the vast majority of contiguous block areas are fewer than 4 blocks in size.

That means very heavy fragmentation!
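The adjacency count in the second column can be sketched as a simple run-length scan (simplified here: only offsets exactly 8192 bytes apart count as adjacent, while the script above tolerates jumps up to 4 blocks):

```python
# Longest run of physically adjacent 8K blocks, given one byte offset
# per row (e.g. the offsets of the MAGIC strings on the block device)
BLOCK = 8192

def largest_run(offsets):
    offsets = sorted(offsets)
    best = run = 1
    for a, b in zip(offsets, offsets[1:]):
        run = run + 1 if b - a == BLOCK else 1   # adjacent: exactly 8K apart
        best = max(best, run)
    return best

# Three adjacent blocks, a long jump, then two adjacent blocks:
print(largest_run([0, 8192, 16384, 500000, 508192]))  # 3
```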

For the record, let’s look at the datafile itself again (not the zfs block device):

root@zfs:/root # strings -t x /data/data/zfs.dbf  | grep "SLOB001" > zfsout6.txt
root@zfs:/root # cat zfsout6.txt | tail -5
31211c71 SLOB001 099996
31213c71 SLOB001 099997
31215c71 SLOB001 099998
31217c74 SLOB001 099999
31219c71 SLOB001 100000

That’s still 100% sequential. So within the datafile not a single block has changed offsets!

root@zfs:/root # ./countfragments zfsout6.txt
096804     1020   798006385       40957        3
097824     1020   806394993       40960        3
098844     1020   814783601       40960        3
099864     1020   823172212       40963        3

Total seek distance within the datafile: Only 3 MB! (as opposed to 2500 GB on the ZFS block device).


First a disclaimer: I used ZFS on FreeBSD 9.1. ZFS might behave differently on Solaris (though I don’t think so). ZFS is an OS feature that seems to be a very popular choice for some administrators, up to the point where I would say it is an almost “religious” choice. So I would not be surprised to get some flaming comments on this post. As long as it is honest and valuable feedback I will approve it, even if it proves me wrong.

Random(ish) workload against database tables on ZFS causes heavy fragmentation.

Whether this impacts performance is another discussion, and I will attempt to blog about that later. ZFS (AFAIK) does not have defragmentation capabilities today, so to fix this, some of Oracle’s documentation vaguely recommends copying complete datafiles around:

(from http://www.oracle.com/technetwork/server-storage/solaris10/documentation/wp-oraclezfsconfig-0510-ds-ac2-163550.pdf):

For a data warehouse database, keep 20% free space in the storage pool as a general rule. Periodically copying data files reorganizes the file location on disk and gives better full scan response time.

And that’s exactly my point. How should I do that in a mission-critical environment with 99.999% required availability? That would require taking the tablespace offline, copying the datafile(s) to a separate file system, and then copying them back?

Update: Interesting article on degrading ZFS write performance http://blog.delphix.com/uday/2013/02/19/78/


31 Responses to ZFS and Database fragmentation

  1. Pingback: Oracle databases on ZFS | portrix systems

  2. Maciej Przepiorka says:

    So, whatever your DB_FILE_MULTIBLOCK_READ_COUNT parameter may fix, your ZFS filesystem will make worse in reality :)

    • Bart Sjerps says:

      Hi Maciej!

      Yep it seems to be the case. Still I found some very short sequences left, like 2,3 or 4 adjacent blocks. The multiblock read count will still optimize. As long as Oracle does not read more than it really needs, multiblock I/O is good (even with ZFS).

  3. nice article – the duplicates that you see are the remnants leftover by the updated rows/blocks.
    That data is there physically, but if you attempted to do a block dump you would not see the duplicates only the active block the data dictionary is aware of.

    This ghosting effect is what is mentioned in the implementation guide for Oracle’s Transparent Data Encryption option. You can enable encryption of data in place – however, the ghost data will remain until it is overwritten.
    Oracle recommends you create a new encrypted tablespace/table(s) and move your data into the encrypted location.
    Then delete the old tablespace including contents and files to remove any sensitive data in those ghost blocks.

    • Bart Sjerps says:

      Hi Matthew,

      Yep, that confirms what I already thought. I was wondering if there’s another way to clear out unused data within the blocks, but I guess there isn’t.
      The “ghosting” effect is an artifact of both Oracle and ZFS (you can see that from the offset: if it’s small, i.e. less than 8K, it’s Oracle ghosting; if it’s large, i.e. many MB, it’s ZFS).
      I also get the impression that the Oracle ghosting only occurs if you write data to rows that had NULL values – not if you overwrite data (at least not for the fixed size strings).
      But I assume that as long as the ghosting doesn’t occur too often – like in my case, 1223 out of 100,000 – it’s not too bad and my method of counting fragmentation is relatively valid.
      The problem with ZFS is that if you update even a single Oracle block, you will find the new block at a completely new location, but the old one is still “ghosted” somewhere. That’s why I had to update the MAGIC string.

  4. Bjorn Naessens says:

    Great article.

    Recently had a client who had issues with ZFS and looked a lot like a fragmentation issue. Adding Luns to the pool solved the issue. Oracle should provide some way to easily monitor the fragmentation and/or make it possible to defragment without having an impact on your database system.

    I still prefer ASM over ZFS.

    • Bart Sjerps says:

      Hi Bjorn, thanks, same experiences here… adding LUNs helps if that means more physical spindles. But spinning disk is often already the bottleneck even without fragmentation. Why solve it by throwing more hardware at it when the root cause is the FS architecture…
      Totally agreed on ASM as ASM has no fragmentation (especially with larger AU sizes).

      I don’t expect Oracle to come up with a solution anytime soon because that would mean they admit the issue in the first place :-)

      • Robert Felber says:

        You forget that with more spindles this could be called distribution instead of fragmentation. Given the dbf is spread across 70 spindles and mirrors and ZFS being a Storage-FS you’ll understand why distribution on a single disk has symptoms of a fragmented flat-fs. As a Storage-FS ZFS _wants_ do distribute blocks in the first place (at least that’d be my explanation). Things get fragmentally worse if you bring ditto-blocks into the game. Yes, this distribution comes at a price, specially at a pool-space-usage >80%.

        For the performance-part:
        I’d like to see the results of a raid-10/ASM and zfs striped mirrors (SSD ZIL+L2ARC). From a RealWorld-DBA-PointOfView.

        • Bart Sjerps says:

          Hi Robert,
          Whether it’s called distribution or fragmentation – that’s just a matter of naming. The question is what the impact is on performance.
          Distribution is fundamentally a good thing – as long as it doesn’t mess up other optimizations. That’s why Oracle ASM (with 1MB AU size default but adjustable to larger sizes) and EMC FAST-VP (depends on implementation but can have about 7.5 MB chunk sizes) are, IMHO, much better at “distribution” than ZFS – because a) the element size is much bigger, allowing for some pre-fetching in cache, and b) they don’t move data offsets with every individual write. Data once in place stays there (guess FAST-VP is a bit of an exception, but it moves data based on I/O stats in order to get *better* performance, not randomly messing stuff around causing trouble).

          That ZFS wants to distribute blocks is not a big problem if you write big files just once (office files, compiler builds, etc) but it becomes a problem if you have a database writing 8KB blocks at a time and then expects to get decent sequential performance when reading it back :)

          The comparison you mention – I’d like to see that as well. When I get my hands on an EMC lab with fast storage, and I can find some time I might do the comparison and blog about it indeed…


          • Robert Felber says:


            ZFS aggregates and asynchronizes IO, so, if a db-file is spread across multiple disks the 8K blocks can be fetched (also prefetched) in parallel – minimizing latency. If it had to read the data from a flat, serialized file (either single-disk or normal RAID, usually with a stripe size of 64k – see also EMC, HP or NetApp volumes) then it couldn’t make much use of multiple disks – the only optimization would be: defragmentation. Block distribution _addresses_ the need for low-latency IO – a requirement for this is: multiple disks.

            See also: http://docs.oracle.com/cd/E26502_01/html/E29022/chapterzfs-db1.html

            Pay attention to “Consider tuning storage array I/O queues (for systems with HDS or EMC storage arrays)”

  5. Jan-Marten Spit says:

    Hi, great article.

    You’ll never guess why i found this in the first place :) Indeed, Oracle on ZFS and very, very heavy fragmentation. The database is OLTP and doing a lot of random writes to a wide range of data blocks.

    I tested, unlike you, with a small C program that
    1 – writes 1000 random 8KiB blocks with random content to random locations (on 8KiB boundaries) in a 10GiB file – OLTP load.
    2 – reads the full file in 1MiB IO’s (at least that is what i request with the pread call), say backups, full table scans, and so on.

    the read performance drops dramatically after only a few random write IO’s (after roughly 100,000 write IO’s the 1 MiB read latency rises from 11ms to 35ms on my system), which is peanuts for an OLTP db that runs for two years. the rate at which it drops depends on logbias and recordsize, but i did not find a ZFS configuration that does not suffer from the problem.

    the test also shows that ZFS gives very poor random write performance when compared to other filesystems. EXT3 and JFS, for example, give a steady 9ms no matter how many writes i make into the file.
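    For readers who want to try this themselves, here is a minimal Python sketch of the same two-step test. This is not Jan-Marten’s actual C program; the file name and sizes are placeholders (a 256 MiB file instead of 10 GiB), and as written it mostly illustrates the access pattern – for real measurements you would drop the page cache (or use O_DIRECT) between phases:

```python
import os
import random
import time

PATH = "frag_test.dat"          # hypothetical test file
FILE_SIZE = 256 * 1024 * 1024   # 256 MiB (the comment above used 10 GiB)
BLOCK = 8 * 1024                # 8 KiB "database block"
READ_SZ = 1024 * 1024           # 1 MiB sequential read request

def random_writes(fd, count):
    """Step 1: OLTP-style load -- random 8 KiB blocks with random
    content, written at random 8 KiB-aligned offsets."""
    nblocks = FILE_SIZE // BLOCK
    for _ in range(count):
        off = random.randrange(nblocks) * BLOCK
        os.pwrite(fd, os.urandom(BLOCK), off)
    os.fsync(fd)

def timed_sequential_read(fd):
    """Step 2: read the whole file back in 1 MiB preads, the way a
    backup or full table scan would, and time the pass."""
    start = time.monotonic()
    off = 0
    while off < FILE_SIZE:
        os.pread(fd, READ_SZ, off)
        off += READ_SZ
    return time.monotonic() - start

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, FILE_SIZE)
before = timed_sequential_read(fd)
random_writes(fd, 1000)
after = timed_sequential_read(fd)
os.close(fd)
print("sequential read before/after random writes: %.2fs / %.2fs"
      % (before, after))
```

    On a copy-on-write filesystem the second pass is where fragmentation should show up; on ext3 or JFS the two passes should stay roughly equal, matching the steady 9ms observed above.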

    so i wholeheartedly concur that ZFS is not suitable for databases because it fragments the datafiles. did you notice that Oracle states in an Oracle-on-ZFS whitepaper that moving datafiles around may help performance? Yep, we all know our clients love downtime, and we all love to work late.


    • Bart Sjerps says:

      Hi Jan-Marten,

      Great to hear my findings confirmed with a different testing method, appreciate your comment!

      Oracle will likely respond that things are not that bad, that databases also write sequential I/O which more or less causes (accidental) defragmentation, and that lots of mem/flash/etc will mitigate the problem – but I would not rely on that. I still plan sometime to analyze the performance impact on an OLTP and DWH workload on ZFS but haven’t found the time and resources to do that yet. ps. I have an earlier post on ZFS you might be interested in – although the statements are slightly outdated as Oracle has made some improvements to their appliance.

      Moving datafiles around – yes I have seen that one. Basically copying them means delete and re-write the entire file, sequentially, so that fragmentation disappears (for the time being). Downtime – in the era of big data, cloud and “the internet of things”. Sure… :-)

      Thanks for your insight!

  6. Richard Elling says:

    Today, many databases also COW, so do not expect them to perform better than ZFS on rotating media for the exact same reasons. If you want performance, you won’t bother with HDDs for this exact same reason.

    If you do use ZFS for a database, then you will also be happier using a separate log (slog) for the sync writes. You do this for the exact same reason that Oracle recommends putting the redo log on a separate device. If you do not do this, then your expectation of contiguous writing gets interleaved with the ZIL, leading towards more fragmentation, sooner.

  7. Dan Langille says:

    I’ll be interested to read your follow up article on how this fragmentation affects performance. Do you have an ETA for that?

    • Bart Sjerps says:

      Hi Dan,

      That’s still on my to-do list… no ETA at this point. I am still figuring out if it is possible to do a reasonably unbiased test to compare. Will probably compare classic ASM vs ZFS but the question then is, how much cache do you provide to ZFS, what kind of DB workload, what’s the impact of Flash tiers, etc…

      You can think of workloads that are reasonably ZFS friendly and workloads that are extremely ZFS hostile. That said, I think that ASM will outperform ZFS in all cases. But only testing will prove it.

      Let’s see if I can find time and lab resources to do the follow-up sometime.

      Thanks for your interest!

      • Dan Langille says:

        I hope you can, because at present, it’s, forgive the term, FUD. Yes, systems create fragmentation. Yes, systems deal with it. PostgreSQL on ZFS is not unheard of. It’s been done for years. Successfully.

        • Bart Sjerps says:


          Yes it can be done, for sure. I have customers running Oracle on ZFS with high workloads in production. Don’t ask what they needed to do to get decent performance (plus, it highly depends on their workload profile as well: read/write ratio, small/large blocks, random/sequential…) Their issues triggered me to start investigating ZFS in the first place, otherwise I don’t care what file system our customers use as long as it’s supported and reliable.

          But given that your standpoint seems to be etched in stone, then let me ask: what kind of test (or results) would be needed to convince you otherwise?

        • Dan Langille says:

          Solutions depend on the problem being solved. i.e. determine your workload, then design the ZFS solution accordingly.

          How did you conclude etched in stone?

          By ‘convince me otherwise’, I think you are referring to AFS vs ZFS. I’m always happy to see comparative results. The problem with comparing X and Y is both X and Y have to be used to their best abilities, which means a great learning curve for both products.

          For ZFS, I see ZIL, L2ARC, multiple vdev, etc. Without that, your critics will cry foul.

          I wish you well and I look forward to the results.

          • Bart Sjerps says:

            Etched in stone because I basically prove, with a fully disclosed method that anyone can repeat, how fragmentation occurs on ZFS. Proving something is not FUD. I think I also made it clear that I don’t currently know exactly how much this fragmentation affects performance – and agreed, it would depend on a lot of things. Very specific workloads (i.e. near 100% write) could even benefit from copy-on-write file systems.

            My job is to help customers optimize their Oracle workloads on EMC systems. Because I have a number of customers who ran into trouble with ZFS I decided to put it to the test and show where the pain is – hopefully it warns my readers to be careful with ZFS when deploying Oracle on it. I have no deep knowledge of PostgreSQL or most other database systems for that matter. I know Greenplum (now part of Pivotal, but a product of the EMC family) is based on Postgres and they have used ZFS in the past – and Greenplum is not known for bad performance (on the contrary). Why didn’t they run into fragmentation issues?
            Most likely because Greenplum is used for analytics: Write LOTS of data once (sequentially) then access it read-only. No problem for ZFS. The thing goes belly-up if you do lots of small random updates (i.e. Oracle with OLTP workloads).

            If with AFS you mean Andrew File System – that’s irrelevant to what I work with as AFS is not supported for Oracle anyway. Alternatives for ZFS are: ASM (Oracle’s volume manager, which would be my preferred alternative), directNFS (NFSv3 initiating from Oracle’s kernel, using a high-end NAS server – preferably from EMC ;) – as target), or ext3, or, on Solaris, UFS (granted, a bit limited in flexibility but performance is good) or VxFS.

            On the learning curve: interesting point. My opinion is that file systems should be simple. And I actually like the interface of ZFS, few commands, lots of control. But a file system should not depend on lots of RAM, flash, write intent log, lots of IO queues (LUNs/devs) etc to get good performance. A good file system is non-intrusive, i.e. it translates I/O requests from an application directly into disk I/O with as low overhead as possible. ZFS requires lots of RAM (I’d rather give that RAM to the database cache), a ZIL to make it perform (which, with RDBMS systems should already be handled by the redo logs, so it’s double work), a large cache to mitigate the fragmentation issues (they should not be there in the first place) etc.
            I would love to see a file system that combines volume manager capabilities but doesn’t fragment data (maybe a future version of ZFS could have a data placement scheme without copy-on-write but fixed-in-place. Could be very interesting).

            That said, if your database is 1TB and your Flash based L2ARC is 2TB then yes of course you have good performance – all read I/O is coming from fast Flash in any case. But then you might just as well buy an AFA (All Flash Array) in the first place, why bother with spinning rust :-)

            It’s still on my list to do a performance comparison sometimes. The big problem is with ZFS you can not compare apples to apples. Whatever I would use as a setup, there will always be readers who say the ZIL was too small or the config was wrong or I was running the wrong type of workload. Can’t avoid that. People who prefer to go down the ZFS route, because their vendor told them it’s the future, will not change their mind because a stubborn Dutch guy did a small performance test and posted the results on an unofficial blog. Like with operating systems, some file system discussions tend to be more driven by “religion” than facts…

          • Dan Langille says:

            ASM, I meant ASM, not AFS. Sorry for my typo.

            Ahh, yes, sorry, no, I am not casting doubt upon fragmentation. That’s not the FUD.

            The FUD is whether or not this fragmentation causes a performance hit.

          • rofel says:

            You’re claiming tiered storage is no step forward?
            It does matter whether I am able to provide plenty of high-demand data for a server farm from DDR3 or Flash. That’s how 1,000,000 IOPS at random 8k blocks are achieved.

            ZFS may not be suitable locally on the DB. But it is suitable as SAN-FS.

            Running ASM on a ZFS SAN yields no performance problems. Please, don’t use ZFS the wrong way. Use 24 disks, use a good SLC ZIL, use a good MLC L2ARC. Make it a 12-vdev mirror pool. Compare to EMC with NVRAM. Compare prices. Compare data integrity (hash tree). It’s about tradeoffs either way.

          • Jan-Marten Spit says:

            “Running ASM on a ZFS SAN yields no performance problems.”

            but the discussion is about using Oracle datafiles on a locally mounted ZFS filesystem.

            if a SAN uses ZFS internally below the block devices it presents, the issue of increased roundtrips against the SAN due to fragmentation of a locally mounted filesystem is absent. so your comment has no bearing on the discussion ‘Oracle datafiles on a locally mounted filesystem’.

          • Jan-Marten Spit says:

            Dan, all,

            i have done extensive testing on ZFS, observing how streaming reads (aka oracle scattered reads aka full scans) are affected by sustained random writes to a datafile. The results were appalling enough (the read bandwidth dropped to that of fully random reads) to advise the customer to either

            a) get rid of ZFS
            b) provide ZFS with a sufficient amount of SSD’s as cache to counter the fragmentation effect.

            obviously, when the bulk of database IO is random (typical OLTP), the negative effect may not be noticeable, although it will probably show up in RMAN (full/0) backup time – as that will do full scans on the datafiles too. In fact, the duration of the RMAN full backup was the customer’s main problem and was caused by ZFS – the read bandwidth was terrible due to years of sustained random writes to the datafiles. Copying the datafiles to a new location solved that problem, although it will slowly build up again.

            data warehouses typically do more full scans, so i would be very reluctant to use ZFS for that, unless provided with a big-ass SSD cache – or running on SSD only.

            that is the downside of ZFS, or any copy on write filesystem, it fragments on random writes. There is no need to prove that, because it is a logical consequence of the technique – it is a fact.

            That being said, the days of rotational disks are numbered. On SSD’s fragmentation hardly matters at all (although it does a little in my tests, but that may improve/depend on the SSD type). Moreover, i find ZFS very easy to use – there is no other filesystem that can utilize an SSD as a smart cache as easily as ZFS can – it’s just one line to create a cached pool.

            i forgot to point out one thing that is often overlooked, or mistaken for a valid argument:

            ‘you are talking about disks, but these are not real disks. they are luns, and there is no way of knowing if your data is adjacent/unfragmented or not, even if you filesystem believes it does. in fact, your lun is striped over many devices in the SAN. countering fragmentation does not solve any problem.’

            the last sentence is wrong. if the filesystem thinks blocks are adjacent, but they are really not, it will still submit a single larger IO. This still saves roundtrips to the block device (which is often remote eg a SAN), and still affects performance (significantly).

            kind regards, Jan-Marten

  8. Bart Sjerps says:


    OK, I see… Glad we agree on the fragmentation :)

    Whether it causes a performance hit, that’s the big question indeed. So here’s my take on the matter:

    ZFS will never be *faster* than ASM because of fragmentation. Best case the performance impact is near equal to ASM or other (good) file systems, worst case it brings your database to its knees (not FUD, one of my retail customers lost millions in revenue due to sudden performance issues caused by fragmentation – and for the record, it wasn’t on EMC). That said, with spinning rust set to be replaced by all-Flash in the future, the fragmentation thing becomes less of an issue (some problems will fade away without being solved) – it’s just a problem with mechanical rotating disk.

    Go figure – If you have a PC with an SSD the recommendation is to NOT defragment the drive anymore (it will only make the drive wear out sooner and not improve performance).

    But don’t take my word for it – check the TPC.org website and find benchmarks filed by Oracle (now owner of ZFS development). The most recent benchmarks (http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=113032601 – TPC-C) on Solaris are done using raw volumes striped with the classic Solaris volume manager (not even ASM, to my surprise). Raw volumes are not even supported anymore with version 11.2. Nowhere will you find a ZFS-based benchmark there.

    (The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)

    • Jan-Marten Spit says:


      “(The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)”

      as i already stated above, Oracle states in an Oracle-on-ZFS whitepaper that “moving datafiles around may help performance”

      Oracle is clearly fully aware of the downsides of Oracle on ZFS, but it is commercially reluctant to say it out loud. Also bear in mind my remark above that although SSD’s have a constant seek time regardless of the seek location, even on SSD’s filesystem fragmentation increases roundtrips (IO’s). So there will still be a performance hit, and how big this hit is depends on the latency towards the block device.

      • Bart Sjerps says:

        Right! Just trying to provide my customers the same insight. If for nothing else, it avoids fingers pointing to EMC systems when dealing with IO performance issues…

    • Dan Langille says:

      “(The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)”

      I think the history of ZFS since Oracle took over answers that question.

  9. Dan Langille says:

    BTW, ZFS continues outside Oracle. Look at Open-ZFS for starters. Oracle no longer owns ZFS.

  10. Bart Sjerps says:

    In reply to Jan-Marten,
    (note WordPress limits the nested levels of replies…)

    Agreed, fully consistent with my experiences.

    One reaction:

    > If the filesystem thinks blocks are adjacent, but they are really not, it will still submit a single larger IO. This still saves roundtrips to the block device (which is often remote eg a SAN), and still affects performance (significantly).

    If the filesystem is ZFS and it’s local then if it thinks the blocks are adjacent then they are (in most cases). There is no layer below ZFS that is causing fragmentation (ignoring the occasional RAID stripe boundaries and the like).

    If the filesystem is, say, EXT3 (or ASM) but the backend storage system is using ZFS then I think your comment applies. You would indeed reduce the number of IO requests going to the storage system. But that storage box will have to chop a large IO in smaller pieces and send them to different offsets of different disks. Still causing random IO (called IOPS inflation).

    Interesting insights BTW, thanks!

    • Jan-Marten Spit says:

      “If the filesystem is ZFS and it’s local then if it thinks the blocks are adjacent then they are (in most cases). There is no layer below ZFS that is causing fragmentation (ignoring the occasional RAID stripe boundaries and the like).”

      that’s not my point, or rather my point the wrong way around :)

      suppose a filesystem wants to read 64KiB, but it knows the 8KiB blocks are not adjacent on the local filesystem, then ZFS will issue 8 IO’s. Regardless whether these blocks are ‘really adjacent’ on the ‘real storage’ or not. If the filesystem thinks the 64KiB is adjacent, it will issue one 64KiB IO.

      It does not matter if the data on the real (remote) block device is really adjacent or not, it matters what the local filesystem (or rather, the IO scheduler) THINKS, because that is what determines the number of IO’s/roundtrips issued. Note that the IO scheduler will not merge IO’s if it thinks the blocks are not adjacent.
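      The merging Jan-Marten describes can be sketched with a toy routine – a simplification of an IO scheduler’s merge step, not actual ZFS or kernel code: whether 64 KiB goes out as one request or eight depends only on what the filesystem believes about adjacency.

```python
def coalesce(offsets, block_size=8192):
    """Merge 8 KiB block offsets into (offset, length) IO requests,
    combining blocks believed to be adjacent -- a toy version of an
    IO scheduler's merge step."""
    ios = []
    for off in sorted(offsets):
        if ios and ios[-1][0] + ios[-1][1] == off:
            # Adjacent to the previous request: grow it.
            ios[-1] = (ios[-1][0], ios[-1][1] + block_size)
        else:
            # Gap (real or believed): start a new roundtrip.
            ios.append((off, block_size))
    return ios

# A file the filesystem believes is contiguous: eight 8 KiB blocks
# collapse into one 64 KiB request...
print(coalesce([i * 8192 for i in range(8)]))             # [(0, 65536)]
# ...whereas the same blocks scattered by copy-on-write relocation
# stay eight separate roundtrips.
print(len(coalesce([i * 8192 * 100 for i in range(8)])))  # 8
```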

      “If the filesystem is, say, EXT3 (or ASM) but the backend storage system is using ZFS then I think your comment applies.”

      no, it applies always, and it is not limited to ZFS per se, although it is a much bigger issue on filesystems that fragment more.

      “You would indeed reduce the number of IO requests going to the storage system. But that storage box will have to chop a large IO in smaller pieces and send them to different offsets of different disks. Still causing random IO (called IOPS inflation).”

      true, but that chopping up is very low latency as that is all in the same box, whereas the host IO has a fabric to travel. So if the problem should be parked anywhere, this is the best place :)
