Managing REDO log performance
May 20, 2012
I have written before about managing database performance issues, and the topic is as hot and alive as ever, even with today’s fast processors, huge memory sizes and enormous bandwidth to storage and networks.
Warning: Rated TG (Technical Guidance required) for sales guys and managers ;-)
A few recent conversations with customers showed other examples of miscommunication between IT teams, resulting in problems not being solved efficiently and quickly.
In this case, the problem was around Oracle REDO log sync times. Some customers had a whole bunch of questions for me on what EMC’s best practices are, how they enhance or replace Oracle’s best practices, and in general how they should configure REDO logs in the first place to get the best performance. The whole challenge is complicated by the fact that more and more organizations are using EMC’s FAST-VP for automated tiering and performance balancing of their applications, and some of the questions were around how FAST-VP improves (or messes up) REDO log performance.
So here is a list of guidelines and insights on REDO logs in general and on FAST-VP in particular – but first a tricky statement (which may get me into serious trouble – but heck, I’m Dutch, stubborn and I do it anyway).
Statement: On EMC storage, you should always have redo writes around 1 millisecond or less.
There. I said it. Any exceptions? Yes, a few, the most important one being: if you are using synchronous replication you have to add the round-trip latency to that 1ms. But then still you should see write response times below, let’s say, 3 ms or so.
The other notable exception is if you are really hammering the system with (large) writes (i.e. a data warehouse load or other sort of bulk load action).
Why do I say this? Because on EMC you should always write to storage cache. The resulting disk I/O is a background process and should not influence the write at all. Well, I heard some of my customers respond, what happens if your cache cannot flush writes to disk fast enough? The answer is that it can happen, but it should not happen in normal circumstances. If it does, something is wrong and should be fixed.
Strangely enough, when I made this statement in presentations for Database Admins (DBAs), they looked pleasantly surprised on all occasions, some have thanked me for it, and it seemed as if someone had finally listened to their prayers… Maybe at EMC we are normally too cautious to make such comments?
So, what kind of stuff challenges our targeted millisecond?
Let’s say you create one large filesystem and put all your database files in there. Redo logs, data files, indexes, temp tables, rollback and all of the other stuff. The filesystem is using a set of disk volumes (LUNs) but without any separation for the different data types. Now assume your database is hammered with mixed workloads. A reporting user starts a heavy query – resulting in a full table scan – and the query puts a bunch of large-IO read requests on the disk queues. Let’s say a certain volume in the filesystem has 10 outstanding large read requests (say, 128K each) on the queue, totaling over 1 MB. Now another user is entering customer information on a web form in the application and submits the request (“save”). The save results in a commit for the updated small piece of data. We said a REDO write should be around 1ms or less. A REDO log sync (the result of a commit) generates a set of these writes, so the total redo sync time to be expected is maybe 5 ms. But the redo writes are queued behind the 10 outstanding large read I/Os. So before the REDO writes are processed, they have to wait for the ~ 1MB (or much more) of reads to be completed. So the redo log sync time in the database might be reported as 50 milliseconds. The storage guys report “all quiet on the western front” because in their view of the world the REDO writes were all serviced in around 1ms or less.
Solution: Create dedicated volumes (LUNs at the host level) at least for redo logs. This gives dedicated queues for the redo I/O, so data I/O will not interfere. The I/Os will now be queued at HBA level, so if that’s a concern, use dedicated HBAs (but only if needed, as this adds cost and complexity). I was told that on the HBA and storage the IOs are not processed strictly sequentially; instead, the FA processes will pick the right IOs from the queue at will, so the queuing as described is less of an issue there.
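To make the queuing example above a bit more tangible, here is a minimal back-of-the-envelope sketch in Python. It assumes a strict first-in, first-out host queue and a service time of 4.5 ms per 128K read; that service time is my own illustrative assumption, picked so that the result matches the 50 ms from the example. It is not a measured value and the function name is made up for this sketch.

```python
KIB = 1024


def queued_sync_time_ms(reads_ahead, read_service_ms, redo_sync_ms):
    """Redo sync time as seen by the database when the redo writes wait
    behind `reads_ahead` reads on a strict first-in, first-out queue."""
    return reads_ahead * read_service_ms + redo_sync_ms


reads_ahead = 10        # outstanding large reads (from the example)
read_size = 128 * KIB   # bytes per read (from the example)
read_service_ms = 4.5   # ASSUMPTION: time to service one 128K read
redo_sync_ms = 5.0      # expected redo sync time without queuing

queued_bytes = reads_ahead * read_size
shared_volume = queued_sync_time_ms(reads_ahead, read_service_ms, redo_sync_ms)
dedicated_volume = queued_sync_time_ms(0, read_service_ms, redo_sync_ms)

print(f"data queued ahead of redo: {queued_bytes / (KIB * KIB):.2f} MiB")
print(f"redo sync on shared volume:    {shared_volume:.1f} ms")
print(f"redo sync on dedicated volume: {dedicated_volume:.1f} ms")
```

The storage array only sees the last 5 ms of this; the other 45 ms is spent waiting in the host queue, which is why the two teams end up reporting different numbers.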
I said before that if the disks cannot keep up with redo writes then something is wrong. When does this happen? REDO I/O is normally mostly-write (near 100% writes) and sequential. So if you create a RAID group (at any RAID level) that holds only the REDO logs then the normal IOPS sizing for the disk type is no longer valid. For example, a 10,000 rpm drive can do about 150 random IOPS. Note the word random! If you do sequential writes then the number of IOPS can be much, much higher. A RAID-5 3+1 set has 4 spindles and can normally handle either 4 x 150 random read IOPS or somewhat less than 2 x 150 random write IOPS (as every write has to be written twice, plus the overhead on disks for parity calculations). Now if you only have a sequential write workload on the disk, there are hardly any seeks and the disk is limited by pure throughput. A 10K rpm disk can do maybe 50MByte/s, which is theoretically about 6000 8K writes per second. No sweat. As in RAID-1 both drives have to do the same work, a RAID-1 disk set can also do the 6000 8K IOPS. A RAID-5 3D+1P set hammered with pure sequential writes will – at least with EMC – keep all data for a RAID stripe in cache until it can calculate the parity in memory, and then write out the whole stripe at once to all disks. So the 3+1 group can do 3 x 6000 8K IOPS – the “4th” spindle will handle the parity (note that the actual workload will probably be much lower with 8K writes due to other bottlenecks; the numbers here are purely for illustration and learning).
Now consider the same RAID group where all data is shared on the disk. Indexes, data, TEMP, REDO, etc. A bunch of REDO log flushes come in and write megabytes of data to the storage cache, to be flushed later. The disk starts writing the first pieces of data – only to be interrupted by a read request for an INDEX table – followed by a few more for data rows in the table – before the rest of the REDO data can be flushed. But because of the read, the physical disk had to move the heads, and now, to complete the REDO flush, another seek is required to reposition the disk heads. But before finishing the REDO flush, a random write comes in for TEMP followed by a bunch of random TEMP reads. REDO has to wait and the disks are moving the heads again. The mix of data types at the physical disk level therefore messes up the nice spinning-rust friendly workload, and the number of “dirty” redo blocks in storage cache starts to add up. At a certain point some high water mark is reached (“Write Pending limits” in Symmetrix) and REDO sync times start to suffer. But even if this does not happen, the redo writes also generate disk seeks that interfere with random read requests. It seems as if REDO writes are doing fine but under the covers they mess up random read performance for data files.
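The arithmetic in this paragraph can be written down as a small Python sketch. The constants are the illustrative numbers from the text (150 random IOPS and 50 MByte/s for a 10K rpm drive); the function name is made up for this sketch, and I assume 1 MByte = 1024 KByte, which gives 6400 rather than the rounded "about 6000".

```python
def sequential_iops(throughput_mb_s, io_kb):
    """I/Os per second for a purely throughput-bound sequential stream."""
    return throughput_mb_s * 1024 // io_kb


RANDOM_IOPS_10K = 150   # random IOPS of one 10K rpm drive (from the text)
SEQ_MB_S_10K = 50       # sequential MByte/s of one 10K rpm drive (from the text)
IO_KB = 8               # redo write size used in the text

# RAID-5 3+1 under random I/O
raid5_random_reads = 4 * RANDOM_IOPS_10K          # all four spindles serve reads
raid5_random_writes_cap = 2 * RANDOM_IOPS_10K     # upper bound; real value is lower

# Pure sequential writes
per_disk_seq = sequential_iops(SEQ_MB_S_10K, IO_KB)
raid1_seq = per_disk_seq                          # both mirrors write the same data
raid5_full_stripe_seq = 3 * per_disk_seq          # 3 data spindles, 1 for parity

print("RAID-5 3+1 random reads :", raid5_random_reads)
print("RAID-5 3+1 random writes: <", raid5_random_writes_cap)
print("single disk sequential  :", per_disk_seq)
print("RAID-1 sequential       :", raid1_seq)
print("RAID-5 3+1 full stripe  :", raid5_full_stripe_seq)
```

The gap between a few hundred random IOPS and many thousands of sequential IOPS on the same spindles is the whole point: the sizing rule of thumb changes completely once the seeks disappear.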
Solution: Create dedicated disk groups for REDO logs (note that not all of my colleagues might agree and I bet the last word hasn’t been spoken on this discussion ;-)
If you use synchronous replication (most of my customers have SRDF for this) then the bottleneck for writes will quickly move to the replication link. In SRDF, a LUN (volume) can accept (depending on microcode level and other parameters) 4 concurrent writes and no more. If the database throws 10 small (sequential) writes against a REDO volume, then this volume can accept and service only 4 (not sure if EMC Symmetrix Engineering increased the concurrency in recent microcode versions so I might not be accurate here – again, read my numbers as an example and verify the real numbers with your EMC technical support guys). The other 6 are kept on the queue until the remote storage system has responded that it has accepted the writes in good state. So even if the storage cache could accept hundreds of writes without waiting for disk flush, the limitation is now the logical SRDF link. The problem can be very subtle, as I have seen customers increasing the number of REDO volumes in an attempt to increase concurrency without results. This can happen because in Oracle there is always one redo log (group) active and within the redo log all writes might still go to one small part of a volume – even if there are many volumes configured. Note that performance analysis tools are notoriously unreliable here because within their interval of, say, 10 seconds, the hotspot might have moved hundreds of times and the workload averages out to acceptable levels, making you think the problem is elsewhere.
Solution: Switch to Asynchronous replication (my favourite). Good enough for 98% of all applications (with the notable exception of financial transaction processing where a second of data represents large amounts of Dirty Cash ;)
Alternative solution: use striping or other means to drive more parallelism, use priority controls, try to issue fewer, larger IOs instead of many small ones, etc.
Bit of a no-brainer, but if you share logical or physical resources (such as host-bus adapters, front-end ports, SAN, etc.) with other workloads (other servers, other databases, etc.) then the REDO writes might end up further back in the queue before getting served. The idea is similar to what I said above about mixing data/index/log on single devices, but this relates to complete databases or applications. For best redo log performance, make sure no other processes can interfere.
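A very simplified Python sketch of the synchronous replication limit described above. It assumes writes are sent in "waves" of at most `concurrency` writes, and that each wave costs one local cache write plus one round trip. The 2 ms round trip is an assumption for illustration, the limit of 4 is the example number from the text (verify the real one with EMC), and the function name is made up.

```python
import math


def sync_batch_time_ms(writes, concurrency, round_trip_ms, cache_write_ms=1.0):
    """Time to complete `writes` redo writes on one replicated volume when
    at most `concurrency` writes can be in flight at the same time."""
    waves = math.ceil(writes / concurrency)
    return waves * (cache_write_ms + round_trip_ms)


writes = 10           # small sequential redo writes (from the text)
round_trip_ms = 2.0   # ASSUMPTION: link round-trip latency

limited = sync_batch_time_ms(writes, concurrency=4, round_trip_ms=round_trip_ms)
unlimited = sync_batch_time_ms(writes, concurrency=writes, round_trip_ms=round_trip_ms)

print(f"4 concurrent writes per volume: {limited:.1f} ms")
print(f"no concurrency limit:           {unlimited:.1f} ms")
```

In this model, adding more REDO volumes only helps if the writes actually spread across them, which is why simply configuring more volumes may show no improvement when all writes of the active redo log land on one volume.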
Solution: Isolate competing resources and give dedicated resources (logical or – if needed – physical) where required. Make sure you understand the performance tools and don’t let them fool you.
What happens if you stripe in storage (striped metavolumes, RAID-10, etc), then stripe on the volume manager (Linux/Unix LVM striped logical volumes or striped “md” multi-devices), and then on top of that use ASM fine striping or file system striping? The sequential write will be chopped in pieces and offered to storage as a bunch of seemingly unrelated small writes, causing many random seeks. The storage system will have no algorithm to detect and optimize for sequential streams. EMC recommends striping at no more than two levels (not including the implicit RAID-5 striping). My personal view is to move away from striping completely, especially if you use ASM, FAST-VP (on Symmetrix) or both.
Solution: No – or limited – striping, and maybe use larger stripe sizes.
A term I invented myself related to any breaking of large I/Os into smaller pieces. I have seen that both the Linux OCFS2 and EXT3 file systems, as well as the Linux I/O multipath feature, can break large I/Os into smaller pieces (tip: EMC’s PowerPath does not do this). A single 1MB write could be carved up into 256 x 4K pieces. Needless to say this causes huge and unnecessary overhead. Not sure if these were Linux bugs or features (meaning they work as designed). Just something to verify in case of suspicion. Also be aware that wrong disk alignment can cause similar problems for some (not all) of the writes.
Solution: Ditch any layer that chops IO and use an alternative that doesn’t (personal experience: replace EXT3 with ASM, replace the Linux IO balancer with PowerPath)
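The 256-pieces figure, and the effect of misalignment, can be illustrated with a small Python sketch. It models a hypothetical layer that never lets a single I/O cross a `max_piece` boundary; this is an assumption for illustration and not a description of how any particular file system or multipath driver works internally.

```python
KIB = 1024
MIB = 1024 * KIB


def split_io(offset, length, max_piece):
    """Split one I/O into (offset, length) pieces so that no piece
    crosses a multiple of `max_piece` bytes."""
    pieces = []
    end = offset + length
    while offset < end:
        boundary = (offset // max_piece + 1) * max_piece
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces


aligned = split_io(0, 1 * MIB, 4 * KIB)          # 1 MB write, 4K pieces
misaligned = split_io(512, 1 * MIB, 4 * KIB)     # same write, shifted by 512 bytes
untouched = split_io(0, 1 * MIB, 1 * MIB)        # layer that passes 1 MB through
straddling = split_io(512, 1 * MIB, 1 * MIB)     # misaligned write crossing a boundary

print(len(aligned), len(misaligned), len(untouched), len(straddling))
```

In this model one large write becomes hundreds of small ones as soon as a layer with a small maximum I/O size sits in the path, and a misaligned write picks up an extra partial piece on top of that.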
RAID disk failures/Rebuilds
Especially in RAID-5, if a disk in the group is broken, then the rebuild or invoking of the hot spare might cause serious overhead. You cannot really avoid this except by moving to RAID-1 or RAID-6. As said in earlier posts, EMC’s hot spare and disk scrubbing technologies attempt to isolate the failing disk and invoke the hot spare before the disk fails, thereby avoiding most of the overhead. If this is still a concern, use RAID-1.
Solution: Use EMC instead of cheapie-cheapie gigabytes ;-)
Additional solution: Use RAID-1 if you are concerned about this.
Well, every system has a breaking point. If your system can handle 100 but you give it 150, it will slow down. Even if you configured everything by the book.
A few notes on FAST-VP (Symmetrix)
If you use (or plan to use) FAST-VP on Symmetrix (VMAX) then you need to be aware of how this works. Without going into too much detail on the FAST-VP algorithms, it works by measuring performance statistics on chunks of data either 768K or 7.5 MB in size. If you do not separate data types at the host level, then Oracle ASM, Unix/Linux filesystems and additional features (striping, etc) will potentially store REDO log data as well as other data in the same 768K or 7.5 MB chunks. Now if one of these chunks is hammered by redo writes as well as other random/sequential, large/small block, read/write workloads, then the FAST-VP algorithms will have a very hard time figuring out what workload profile this chunk of data has. It will probably move the whole chunk to flash drives if there is heavy data I/O, and it takes some of the REDO blocks along – resulting in the wrong type of data clogging the expensive flash drives.
My recommendation would be to always separate LUNs on the host level for at least REDO, DATA, ARCH and – depending on how much you want to optimize – also for DATA vs INDEX, separate TEMP, separate rollback/UNDO etc. The FAST-VP algorithms are best in breed and EMC has spent a lot of R&D to make it work, but a little help up front from the database engineers at our customers will not hurt ;-)
Another recommendation is to increase the default ASM AU (Allocation Unit) size from 1MB to at least 8MB (preferably 16MB or even higher). This forces the database to put similar data and hot spots together, allowing FAST-VP to make even better decisions about what chunks of data to move around and where.
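To see why a larger AU helps, here is a rough Python sketch that counts how many different allocation units can end up inside one 7.5 MB FAST-VP chunk. It assumes that chunks and AUs are laid out back to back from the same starting offset on the LUN and that every AU may hold a different kind of data; both are simplifying assumptions of mine, and the function name is made up.

```python
def max_aus_per_chunk(chunk_kb, au_kb, chunks_to_check=1000):
    """Largest number of distinct allocation units overlapped by any of
    the first `chunks_to_check` chunks, with both laid out from offset 0."""
    worst = 0
    for k in range(chunks_to_check):
        first_au = (k * chunk_kb) // au_kb
        last_au = ((k + 1) * chunk_kb - 1) // au_kb
        worst = max(worst, last_au - first_au + 1)
    return worst


CHUNK_KB = 7680   # 7.5 MB chunk size (from the text)

for au_mb in (1, 8, 16):
    aus = max_aus_per_chunk(CHUNK_KB, au_mb * 1024)
    print(f"AU size {au_mb:2d} MB: up to {aus} different AUs in one chunk")
```

With 1MB AUs, one chunk can mix up to eight unrelated pieces of data, while with 8MB or 16MB AUs it holds at most two, so the statistics FAST-VP collects per chunk describe a much more uniform workload.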
I also got the question whether to create separate FAST-VP pools for these different data types within the same database. Honestly I cannot tell, I bet it depends again on how much effort you are willing to spend and how much additional benefit you will get from it. YMMV ;-)
That said, if many more customers struggle with this, I will pick it up with engineering and see if we can create some guidelines on this. My intuitive answer is that FAST-VP is designed to make the life of admins as easy as possible (note the A in FAST stands for “Automated”) – which means not too many knobs should need tweaking.
I cannot think of other problems in the I/O stack but there will probably be more. If you use EMC, have followed all my guidelines and still see high redo writes, let me know and I will try to help out (or throw the problem over the wall, to be picked up by one of my colleagues)…