How to set disk alignment in Linux

As you might know, if disk partitions containing Oracle datafiles are not aligned with the underlying storage system, some I/O’s suffer extra overhead because each of them is effectively translated into two I/O’s.

If you want more info, google for “EMC disk alignment” and you’ll find plenty of information, explaining the issue.

Update 28-03-2013: I wrote a follow-up to this post describing the same thing for Linux (Red Hat / CentOS / OEL) version 6. You might want to jump straight to the new post as this one gets a bit outdated ;-)

One example is http://www.vmware.com/pdf/esx3_partition_align.pdf for VMware ESX version 3.x.

In short: if you create partitions on Intel-based operating systems, then by default the first partition will start at an offset of 63 x 512-byte sectors (equals 32256 bytes) – which does not match typical SAN storage systems that use 4K or 8K disk chunks. A write to a block crossing a chunk boundary causes two writes (plus some partial reads) in the disk backend (and in the remote copy if you use remote storage mirroring) and will sometimes cause an extra cache slot to be allocated. The performance improvement from correct alignment can be between 5 and 15%, depending on workload and other configuration settings.
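The boundary-crossing effect is easy to illustrate with a bit of shell arithmetic. This is just a sketch assuming 8K backend chunks, and `chunks_touched` is a name I made up: an 8K write starting at the legacy 63-sector offset lands in two backend chunks, while the same write at a 128-sector offset fits in exactly one.

```shell
# How many 8K backend chunks does an 8K write touch,
# given the partition's starting sector (512-byte sectors)?
chunks_touched() {
  start_byte=$(( $1 * 512 ))                 # byte offset of the write
  first=$((  start_byte          / 8192 ))   # chunk holding the first byte
  last=$((  (start_byte + 8191)  / 8192 ))   # chunk holding the last byte
  echo $(( last - first + 1 ))
}

chunks_touched 63    # legacy DOS offset: prints 2 (write is split)
chunks_touched 128   # 64K offset:        prints 1
```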

Recent Linux distributions will sometimes already align partitions by default. If that is the case, verify that it actually does so (see the end of this article) and you probably don’t have to change anything.

Now the way most documentation explains how to resolve this in Linux is, in my opinion, too complex: you have to manually enter “fdisk”, go into expert mode, change the starting block, and so on. Not nice if you have to configure a few hundred Oracle ASM disks at once.

There is an easier way.

Here goes…

(assuming you have a completely empty disk and you only want to create exactly one aligned partition, i.e. for Oracle ASM)

  • Check if your Linux system has the command “sfdisk”. I bet most Linux systems will have it installed by default.
  • Make sure you know the Linux device name of the disk (such as /dev/sdk).
  • Enter the command:
echo "128,," | sfdisk -uS /dev/sdk

Note the command will fail if there is already a partition (so it’s reasonably safe). This is what the output looks like on my system:

Checking that no-one is using this disk right now ...
OK
Disk /dev/sdk: 1044 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = sectors of 512 bytes, counting from 0
Device Boot    Start       End   #sectors  Id  System
/dev/sdk1             0         -          0   0  Empty
/dev/sdk2             0         -          0   0  Empty
/dev/sdk3             0         -          0   0  Empty
/dev/sdk4             0         -          0   0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0
Device Boot    Start       End   #sectors  Id  System
/dev/sdk1           128  16771859   16771732  83  Linux
/dev/sdk2             0         -          0   0  Empty
/dev/sdk3             0         -          0   0  Empty
/dev/sdk4             0         -          0   0  Empty
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

Explanation:

sfdisk reads the commands it has to perform from “stdin”. To avoid having to type everything in manually, we use “echo” to feed the commands directly into sfdisk. The man page of sfdisk tells us how it accepts commands:

sfdisk reads lines of the form <start> <size> <id> <bootable> <c,h,s> <c,h,s>

And with the -uS option we tell sfdisk to use units of sectors (512 bytes each) instead of cylinders or anything else.

As we want to use the full size of the disk, we leave the size field empty and let sfdisk figure it out. The id defaults to 83 (Linux partition); if you want something else, the man page will tell you. We also ignore the bootable and cylinders/heads/sectors parameters (they are optional).

The partition will then start at exactly a 64 KB offset (8 chunks of 8K, which fits nicely with either EMC CLARiiON or EMC Symmetrix).

Sometimes you might want another alignment value. A common choice is one megabyte (2048 sectors). The command would then be:

echo "2048,," | sfdisk -uS /dev/sdk
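The conversion is plain arithmetic: divide the desired offset in bytes by the 512-byte sector size. A throwaway helper makes that explicit (the name `align_sectors` is mine, not a standard tool):

```shell
# Convert a desired alignment in KB into the sector offset for "sfdisk -uS".
# One sector = 512 bytes, so sectors = KB * 1024 / 512.
align_sectors() {
  echo $(( $1 * 1024 / 512 ))
}

align_sectors 64     # 64 KB -> prints 128
align_sectors 1024   # 1 MB  -> prints 2048
```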

To verify disk alignment:

sfdisk -uS -l <disk>

Example:

Here is the partition overview of my small Oracle RAC cluster.

[root@oradb1 ~]# listasm
#dev     scsi lun ASMVol    SizeMB
/dev/sda    0   0 -            101
/dev/sdb    1   0 -              9
/dev/sdc    1   1 ASM1        8189
/dev/sdd    1   2 ASM2        8189
/dev/sde    1   3 -           1019
/dev/sdf    1   4 -           1019

sda is the boot disk, sdb contains Oracle binaries, sdc/sdd are ASM volumes and sde/sdf are cluster resources / voting disks.
Let’s look at the boot volume.

[root@oradb1 ~]# sfdisk -uS -l /dev/sda

Disk /dev/sda: 2088 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1   *        63    208844     208782  83  Linux
/dev/sda2        208845  33543719   33334875  8e  Linux LVM
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty

You can see that sda1 is mis-aligned at 63 sectors. I don’t really care as the boot (OS) disk in Linux will not cause much I/O anyway. The LVM volume is also misaligned at 208845 sectors. I only keep OS stuff in there so don’t care.
Now let’s check the ASM disks.

[root@oradb1 ~]# sfdisk -uS -l /dev/sdc

Disk /dev/sdc: 1044 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdc1           128  16771859   16771732  83  Linux
/dev/sdc2             0         -          0   0  Empty
/dev/sdc3             0         -          0   0  Empty
/dev/sdc4             0         -          0   0  Empty

Nicely aligned at 64K (128 sectors) !

Let’s take a look at another Linux server that I installed with Ubuntu Server 10.10 recently.

root@silverstone:~# sfdisk -uS -l /dev/sda

Disk /dev/sda: 48641 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1   *      2048    499711     497664  83  Linux
/dev/sda2        501758 781422591  780920834   5  Extended
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty
/dev/sda5        501760 781422591  780920832  8e  Linux LVM

You can see that on this system, even the boot volume is aligned at 1 Megabyte (2048 sectors). So some modern Linux distros will remove the burden of doing this yourself.
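If you have many disks to verify, you can parse the “sfdisk -uS -l” output instead of eyeballing it. Here is a sketch (the `check_alignment` name is made up) that reads the listing on stdin and flags any partition whose starting sector is not a multiple of the given value:

```shell
# Flag partitions whose starting sector is not a multiple of the given
# alignment (in sectors). Reads "sfdisk -uS -l" output on stdin and
# handles the optional "*" boot flag that shifts the columns.
check_alignment() {
  align=${1:-128}   # default: 128 sectors = 64 KB
  awk -v a="$align" '
    $1 ~ /^\/dev\// {
      s = ($2 == "*") ? $3 : $2        # start sector (skip boot marker)
      if (s + 0 > 0 && s % a != 0)
        print $1 " misaligned: starts at sector " s
    }'
}

# Example usage:
#   sfdisk -uS -l /dev/sdc | check_alignment 128
```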

Let’s see what happens if I accidentally try to overwrite an existing partition.

[root@oradb1 ~]# echo "128,," | sfdisk -uS /dev/sdc
Checking that no-one is using this disk right now ...
BLKRRPART: Device or resource busy

This disk is currently in use - repartitioning is probably a bad idea.
Umount all file systems, and swapoff all swap partitions on this disk.
Use the --no-reread flag to suppress this check.
Use the --force flag to overrule all checks.

For those geeks who still think this is not enough, here is the real proof.

 dd if=/dev/sdc bs=512 count=130 | xxd -c 32

000ffc0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
000ffe0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
0010000: 0182 0101 0000 0000 0000 0080 bc9c b1ab 0000 0000 0000 0000 0000 0000 0000 0000  ................................
0010020: 4f52 434c 4449 534b 4153 4d31 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ORCLDISKASM1....................
0010040: 0000 100a 0000 0103 4441 5441 5f30 3030 3000 0000 0000 0000 0000 0000 0000 0000  ........DATA_0000...............
0010060: 0000 0000 0000 0000 4441 5441 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ........DATA....................
0010080: 0000 0000 0000 0000 4441 5441 5f30 3030 3000 0000 0000 0000 0000 0000 0000 0000  ........DATA_0000...............

You can see that the ASM volume starts at offset 0x10000 which equals 65536.
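A quick sanity check on that arithmetic (plain shell, nothing EMC-specific): 128 sectors of 512 bytes each is 65536 bytes, i.e. 0x10000.

```shell
# Byte offset of a partition starting at sector 128 (512-byte sectors),
# in decimal and hex.
offset=$(( 128 * 512 ))
echo "$offset"                 # prints 65536
printf '0x%x\n' "$offset"      # prints 0x10000
```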

Hope this makes your life a bit easier! Needless to say that you can put the given commands in a simple script to make it even easier :-)
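For completeness, here is what such a script could look like. This is strictly a sketch (the `align_disks` name and the DRYRUN convention are my own inventions), and it relies on sfdisk’s refusal to touch disks that already contain a partition as the only safety net:

```shell
# Create one aligned partition on each given (empty) disk.
# OFFSET is the starting sector; 128 sectors = 64 KB.
align_disks() {
  offset=${OFFSET:-128}
  for disk in "$@"; do
    if [ -n "$DRYRUN" ]; then
      # Only show what would be executed.
      echo "would run: echo '$offset,,' | sfdisk -uS $disk"
    else
      echo "$offset,," | sfdisk -uS "$disk"
    fi
  done
}

# Review first, then run for real:
#   DRYRUN=1 align_disks /dev/sdc /dev/sdd
#   align_disks /dev/sdc /dev/sdd
```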

Update 1

My colleague Erik Zandboer has an excellent explanation of the alignment problem on his blog. You can find it here and here. Or search for keyword “alignment” on his site: http://www.vmdamentals.com/?tag=alignment

Also, I found that the “sfdisk” command shows weird behavior in CentOS 6.0 (probably also in Red Hat version 6). You might have to use the “--force” option to make it work in those Linux distributions. The drawback is that this option no longer prevents overwriting existing partitions. Be careful! (Or write a script to prevent mistakes.)


23 Responses to How to set disk alignment in Linux

  1. donikatz says:

    Nice post, thanks! Of all the different methods, this is the easiest I’ve seen.

  2. Bart Sjerps says:

    You’re welcome!

    Wrapping it up in a script to align a dozen volumes at once should not be too hard, either. Try that with the standard “fdisk” method ;-)

  3. dashesy says:

    Thanks, that was helpful
    Do you know where those seemingly constant numbers come from?
    I mean 1 Megabyte (2048 sectors), 1 block (1024 bytes), …
    “sfdisk -g” only returns the number of cylinders which is not of any help

    • Bart Sjerps says:

      Sure!

      When magnetic disks were invented in the 1950s, the industry standardized on 512-byte block sizes. Only much later did some disk vendors start developing disks with larger block sizes (i.e. 1024 or 4096 byte blocks).

      As disk blocks are (more or less) standardized on 512 bytes, the direct result is that 1 megabyte equals 2048 sectors, etc.
      The reason for EMC to recommend 64K as starting offset is that all our storage systems work nicely with that value, including the state-of-the-art V-max (with cache slots of 64K).
      Oracle sometimes recommends 1 MB which is also fine (as it is a multiple of 64K).

      I guess the weird 63-block default offset for the first partition on an Intel architecture originated in the MS-DOS days (maybe even before) when many (IDE) disks had 63 sectors per track. In those days, starting the first partition on the 63-block offset was beautifully aligned. Soon after, when larger disks evolved, the real geometry of drives could not be represented anymore with the old CHS (Cylinders-Heads-Sectors) method and they had to translate (emulate) weird formats (think of it as some sort of virtualization ;)

      Misalignment on plain (simple) disks (we call this JBOD – just a bunch of disks) was never a big problem as I/O was 512-byte anyway. But intelligent disk arrays (like EMC’s) are optimized for much larger I/O and use 4096 or even 8192 bytes per block internally, to make better use of cache and other internal resources. The drawback is that misaligned partitions (Intel platforms) cause a noticeable performance drop. The issue was never on high-end UNIX (AIX, SPARC Solaris, HP-UX) because they use different partitioning methods.

      • Jack says:

        Dude, the reason is that 64k aligns perfectly with the track size. Nothing to do with the cache slot. The subsystem will prefetch at a multiple of 4K. As 64K is the track size and it is also the element size in all RAID systems, you optimize the efficiency of the IO.

        • Bart Sjerps says:

          Dude…

          I wonder who told you this (I certainly hope it’s not one of my EMC colleagues)… I am talking about EMC (Symmetrix or CLARiiON) storage systems. Maybe you are correct for simple RAID controllers or JBOD (just a bunch of disks).

          For EMC:
          – Cache slot sizes DO matter. If you have misaligned I/O then some of them will cause two I/O’s in the backend and if the two cross a cache slot boundary they will require two cache slots.
          – Subsystems will prefetch with 8K increments in more recent Symmetrix systems. Actually it’s dynamic so you will also see larger prefetch sizes (always multiples of the disk block size). With modern architectures that can easily handle larger blocks, using legacy 4K size is causing more overhead in memory management. Going to 16K would be even more efficient if it wasn’t that many databases and filesystems still use 8K block sizes.

          I wonder what you mean by “element size”. Seems weird that (according to you) with many different RAID architectures they all – without exception – have the same element size (whatever that means). Maybe you mean “stripe size” (sometimes called “stripe element size”) but that is different for many storage systems. For example, EMC VMAX uses 256K.

          But feel free to ignore all this and do things differently :-)

  4. Alexei says:

    Hi Bart,

    Can you kindly respond for couple of questions:
    1) Is it necessary to use partition alignment also for DM multipath Linux native devices?;
    2) does offset correspond to ASM AU size?
    Thank you

  5. Bart Sjerps says:

    Hi Alexei,

    >> Is it necessary to use partition alignment also for DM multipath Linux native devices?

    Yes. Multipath does not change the layout of partitions. The exception is when you don’t use (Linux) partitions at all and present whole Linux volumes directly to ASM – so the ASM volume is something like /dev/sdk (full disk) instead of /dev/sdk1 (first partition).

    The rule is: If you use old-style Intel (PC) partitioning (the one using 4 primary partitions where 1 of them can be an extended partition holding more logical partitions) then you need to be aware of alignment. If you use another partition method (such as non-Intel or the newer GPT partitioning) then there are no issues.

    Beware of VMware VMFS by the way… You could have alignment issues on the VMFS partition but none on the VMDK virtual LUN presented to the virtual machine. You need to make sure *both* are aligned.

    >> does offset correspond to ASM AU size?

    No. The default Oracle AU size is 1MB but I/O’s to an ASM disk group will be much smaller. As long as the Oracle blocks (typically 8K by default) are aligned then you’re fine.

    For current EMC storage, any offset that is a multiple of 8K is fine. So the very first byte of the partition may be at 64K, 100K, 128K, 1024K (1MB) or whatever. But (techie) people tend to use powers of 2 so that’s why it will typically be something like 64K, 128K, 1024K.
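That “multiple of 8K” rule is trivial to check from the partition’s starting sector. A hypothetical helper (the `ok_for_8k` name is just illustrative):

```shell
# Is a given starting sector (512-byte units) a multiple of 8 KB?
ok_for_8k() {
  if [ $(( $1 * 512 % 8192 )) -eq 0 ]; then
    echo "$1: OK (multiple of 8K)"
  else
    echo "$1: NOT a multiple of 8K"
  fi
}

ok_for_8k 128    # 64 KB offset -> OK
ok_for_8k 2048   # 1 MB offset  -> OK
ok_for_8k 63     # legacy DOS offset -> NOT a multiple of 8K
```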

    Hope this helps, let me know if you have more questions!

  6. Hans says:

    Can you please clarify your response to the previous comment?

    If a whole, unpartitioned volume e.g. /dev/sdk is presented to the Linux kernel’s DM Multipath or LVM is alignment still an issue ?

    i.e. do DM multipath or LVM somehow offset the beginning of the data by an odd number of sectors/segments that is not readily obvious or are all sectors/segments used from the beginning of the volume and alignment is not a problem ?

    Unless you want multiple partitions per disk, is there any value in partitioning the disk at all ? Using a whole volume would seem to save the trouble of deciding how much to offset by.

    • Bart Sjerps says:

      Hi Hans,

      The Linux kernel multipath or LVM do not change anything. They neither introduce a new alignment offset nor get rid of an already existing one.

      The multipath feature gives you two paths to the same (partitioned or not) volume.
      LVM is one level higher in the stack and works with whatever you give it as an LVM physical volume (i.e. a whole disk a la /dev/sdk, a partition a la /dev/sdk1, or whatever). If the physical volume is misaligned then every LV in the volume group (using the same PV) will be misaligned…

      If you present an empty unpartitioned disk (/dev/sdk) to the kernel then there is not (yet) an alignment issue. The issue is introduced by ONE factor only, and that is the (legacy) PC (MSDOS) partitioning method. If you have no partitioning at all (i.e. you create an LVM physical volume directly on the whole disk) then you don’t have to worry. The same goes if you use alternative partitioning (i.e. GPT) or an alignment-aware partitioner (i.e. the most recent Ubuntu Linux distro).

      To be more clear: every time you encounter a “DOS” partitioning style (the one having max. 4 primary partitions, of which one can be an extended partition holding more logical partitions) you should be cautious.

      Hope this helps clearing things up!

  7. alecsyz says:

    Hi, quick question, I have the following scenario
    vmware 4.1 / netapp nfs / rhel 5.5 vm
    Inside the vm I have 10 ms-dos partitions, no lvm.
    I have created the first one /boot, aligned, from sector #64, do I need to align the 9 remaining ones ? to have every partition starting and ending with a /64 number of sectors ? thanks

  8. Aaron says:

    Quick question: We are using a Symmetrix VMAX 20K with virtual provisioning on RAID5 TDATs and using 4 and 8 member Striped METAs allocated to Linux RHEL v6. We are experiencing high write response times (above 64ms in SPA) from the array during heavy random write loads. We see high queuing at the FA ports, but are still getting 100% write hit to cache. No device pending event. Is an offset of 2048 aligned?

    • Bart Sjerps says:

      Hi Aaron,
      Yep, 2048 blocks of 512 bytes equals exactly 1 Megabyte (1048576 bytes). This is divisible by 8192 (128 x 8K) so you’re good. By the way, I noticed that RHEL aligns correctly by default since version 6. The high response time must be caused by something else. Or maybe you’re just pushing too much IO for the # of FA ports to handle (in that case, go wider across more FA ports). Write resp. time should be closer to 1-5 ms if you get 100% write hit. 64ms is way too high.
      Let me know if you want me to take a look at it…

  9. Pingback: Linux Disk Alignment Reloaded | Dirty Cache

  10. SatishY says:

    excellent stuff.

  11. Andreas Lund says:

    Unfortunately, this solution is incomplete because for some reason it only solves the problem in some cases. This was confirmed by hours and hours of testing with NetApp’s “nfsstat -d” output which shows misaligned IO per file.

    To reliably solve the problem and prevent the underlying storage system from doing partial reads and writes, I had to use “advanced mode” and adjust the “offset” of each partition up to the nearest multiple of 8 before creating the file system. (Note: This is a different value than the starting sector)

    In fdisk: Use the command “x” to enter “expert mode”, then “p” to show the current offset and “b” to adjust it.

    • Bart Sjerps says:

      Always interesting if people claim something doesn’t work for them and then come up with insufficient, vague information on why it doesn’t – and proving it with a claim based on a proprietary tool bundled with a competitive product but without giving full insight in what’s happening… Way to go…

      Granted, in some cases you might have more than one problem. If you create misaligned vmfs file systems and then again misaligned file systems on top in the guest OS, then you may find yourself indeed spending many hours finding a root cause of performance problems. Or you might have an application (not being Oracle DB but another obscure DBMS) with bad manners that causes misaligned IO within the file (which has nothing to do with misaligned partitions).  Or you’re using a state-of-the-art geeky cool new file system or innovative volume manager which hasn’t been used much in mission critical computing yet – just because you can…

      FWIW you can get exactly what you need with the method I described (sfdisk one-liner) without going through manual fdisk menus as you say. Just use a different value than 128 if that works for you. Also in my follow-up post I describe how sfdisk is causing some trouble in RHEL 6 and “parted” might serve you better.

      Hope this helps.

  12. Pingback: Poor Random Read SSD Performance On Linux | Click & Find Answer !

  13. Pingback: NFS tuning and disk alignment – Linux, Oracle and Netapp | rsr72
