Application processing at lightning performance – The hourglass view of access times
December 8, 2011
Even in these modern times, when lots of things are changing in the ICT world, some lessons from the past still hold true.
Previously, I discussed the I/O stack in a typical database environment. Although virtualization has complicated things a bit, the fundamental principles of performance tuning stay the same.
Recently I was browsing through old presentations of colleagues and found another interesting view on response times in an application stack. Again, I polished it up a bit and modified it to reflect a few innovations and personal insights.
The idea is as follows. We as humans have trouble getting a feel for how fast modern microprocessors work. We talk in milliseconds, microseconds, nanoseconds. So in the comparison we assume a 1 Gigahertz processor, where one clock cycle takes one nanosecond, and then scale up one nanosecond to match one second – because this fits better in a human’s view of the world. Then we compare various sorts of storage on the “indexed” timescale and see how they relate to each other.
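As a rough sketch of that scaling (the names `NS_PER_CYCLE` and `to_human_scale` are mine, purely for illustration): at an assumed 1 GHz, one clock cycle lasts one nanosecond, and every real nanosecond is stretched into one second on the scaled clock.

```python
from datetime import timedelta

# Assumption from the text: a 1 GHz processor, so one clock cycle = 1 nanosecond.
NS_PER_CYCLE = 1


def to_human_scale(real_ns: float) -> timedelta:
    """Map a real duration in nanoseconds onto the scaled clock,
    where each clock cycle (1 ns) is stretched to one second."""
    cycles = real_ns / NS_PER_CYCLE
    return timedelta(seconds=cycles)


if __name__ == "__main__":
    for label, real_ns in [("1 nanosecond", 1),
                           ("1 microsecond", 1_000),
                           ("1 millisecond", 1_000_000)]:
        print(f"{label:>14} becomes {to_human_scale(real_ns)}")
```

Returning a `timedelta` keeps the result readable: Python prints it as days plus hours, minutes and seconds, which is the kind of unit the rest of this post talks in.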
Note that in this view, I ignore the fact that different components are part of different physical boxes (the application does not know this either) and (like in previous discussions) I also ignore parallelism and things like throughput and utilization. I consider an idle system that then has to perform just one access operation. Furthermore, I simplified a few things and the exact numbers might be off a bit – it’s not meant to be an exact comparison, merely food for thought.
So what can we learn?
In the world where each clock cycle takes one second, a typical access by a processor to its internal registers takes only a few clock cycles (so a few seconds). So if the data you need happens to be in a CPU register, you get it really, really fast. The problem is that processor registers are very limited in number, so you can only store a tiny bit of information there (a few bytes). In today’s world of “Big Data” that is really nothing. Registers are there for processing data, not for storing it.
The next level is the CPU cache – the L1, L2, L3 cache that processor vendors are talking about. L1 is typically closest to the CPU and fastest but limited in capacity (typically something like 32 KB or 64 KB, and sometimes separate caches exist for code and data). My table shows access times of 100 nanoseconds (translating into 100 seconds on our virtual timescale) but I bet it’s a bit faster than that in reality. I’m not a microprocessor expert – so don’t pin me down on actual numbers – but you get the idea. Level 2 and 3 caches are a bit slower but also have a bit more capacity.
Here it starts to get interesting. What if I have a very computation-intensive algorithm that requires very little storage? Say, for example, I am a math geek and want to calculate as many decimal places of π as possible. That is very computationally intensive, but I don’t need a lot of raw source data to get started. Probably a lot of this job can be processed entirely in the CPU and its L2/L3 cache. Technically I don’t need any main memory for this. In reality I only need it to store the operating system kernel and the rest of the OS, and the executable binary for running the calculation program – but once the binary is cached in the L1 (code) cache, I can start rolling.
In the past (a few years ago) an EMC competitor showed off interesting storage IOPS (I/Os per second) benchmark numbers, and our EMC performance engineers had a hard time figuring out how the competition had done it – we could not even get close to these impressive claims with our state-of-the-art systems. They performed competitive analysis in the labs and could not reach the IOPS number even on the competitor’s equipment. Until they figured out the trick: limiting the dataset to just a few megabytes, so that the entire workload fit in the L2 cache of the competitor’s front-end (adapter) processors. The benchmark did not even need the main (shared cache) memory, physical disks or anything else, and could therefore perform I/O really, really fast…
Then again – who would buy an expensive enterprise storage box to run an application dataset of only a few megabytes? Watch out, some vendors are playing tricks with numbers.
The next level is main memory. This is where a database keeps its cache buffers (like Oracle’s SGA). On our timescale, we can get access to data in about 17 minutes (roughly 1 microsecond in real time) – who said RAM was fast? Without CPU cache, performance would be lousy and our Gigahertz processor would spend most of its time spinning cycles waiting for data from main memory (like the original 8-bit processors of the seventies, such as the Z80 and 6502).
This gap is overcome by prefetching algorithms on the processor and by loading large blocks of data from memory into CPU cache – where it can then be processed much quicker.
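To put a number on why the cache matters, here is a small sketch of the usual weighted-average estimate of access time. It is a deliberately simple model (a hit costs the cache latency, a miss costs the memory latency, nothing else is counted). The hit ratios are illustrative assumptions, not measurements; the latencies are the ones used in this post: 100 ns for L1 and roughly 1 microsecond for main memory, which is what "17 minutes" corresponds to on the scaled clock.

```python
def effective_access_ns(hit_ratio: float, cache_ns: float, memory_ns: float) -> float:
    """Average access time when a fraction `hit_ratio` of accesses is
    served from cache and the remainder has to go to main memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * memory_ns


if __name__ == "__main__":
    CACHE_NS = 100      # L1 figure from this post
    MEMORY_NS = 1_000   # assumption: ~17 scaled minutes, about 1 microsecond
    for hit_ratio in (0.0, 0.5, 0.9, 0.99):   # hypothetical hit ratios
        avg = effective_access_ns(hit_ratio, CACHE_NS, MEMORY_NS)
        print(f"hit ratio {hit_ratio:4.0%}: about {avg:.0f} ns per access")
```

Even in this crude model, the average only gets close to cache speed when nearly every access is a hit, which is exactly what prefetching and block loading try to achieve.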
Just a few years ago, memory was often a limiting factor – being expensive and not very scalable. These days we see high-end servers with a terabyte of RAM (!) or more. Database vendors have started talking about in-memory databases, which can, for obvious reasons, be much faster than disk-based RDBMSs.
Typical medium-sized databases today are often smaller than 1 Terabyte. So watch out for benchmarks again – if your 500GB database runs on one of these high-end servers you will get very impressive performance (even with disk-based, cache-optimized RDBMSs). But larger databases or consolidated environments of many average-sized databases will behave very differently. I discussed this in my post POC: proof of concept or proof of contradiction – the bottom line is to watch out for single (small) database benchmarks on large high-end systems. Vendors play tricks here as well.
Talking about real in-memory databases (SAP HANA for example, or Oracle TimesTen), you might be able to drive more performance than a classic RDBMS on the same hardware. But there is one caveat. Full in-memory databases do not always comply with ACID because they commit transactions to memory, not disk. If you experience a power failure or kernel panic, you will lose transactions. As soon as you make those in-memory databases (partly) ACID compliant by waiting for data to be committed to disk, write performance will suffer. It’s a tradeoff. But stay tuned, help is on the way…
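The cost of that tradeoff comes from waiting for the storage device on every commit. The sketch below is not how HANA or TimesTen work internally; it is a generic illustration, using made-up helpers `append_record` and `time_commits`, of the difference between a write that only reaches operating system buffers and one that is forced to stable storage with `os.fsync`.

```python
import os
import tempfile
import time


def append_record(path: str, record: bytes, durable: bool) -> None:
    """Append one record. If `durable`, block until the OS reports
    that the data has been handed to the storage device."""
    with open(path, "ab") as f:
        f.write(record)
        if durable:
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to flush its cache to the device


def time_commits(durable: bool, count: int = 200) -> float:
    """Return the elapsed seconds for `count` single-record commits."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        for i in range(count):
            append_record(path, b"txn %d\n" % i, durable)
        return time.perf_counter() - start
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(f"buffered commits: {time_commits(durable=False):.4f} s")
    print(f"durable commits:  {time_commits(durable=True):.4f} s")
```

How large the difference is depends entirely on the device and file system underneath, so treat any timing from this sketch as an indication, not a benchmark.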
Until recently, the next level down was rotating disk. As spinning disks obey the laws of Newtonian physics, the time to access data on disks with magnetised rust can be relatively high. In the storage arena, we talk milliseconds – but translated to our self-invented timescale this is suddenly a whopping four months!
You can see where this goes. Time to access L1 cache: 100 seconds; time to access disk: 4 months. Big gap. And as disk capacity keeps growing, following its own version of Moore’s law, each drive holds more data and has to handle more I/O requests, which easily drives the 4-month response time up to years. Something had to be done, and that something is Flash.
Flash comes in two flavors: Flash disk and Flash memory (Flash disk actually being the same fundamental technology as Flash memory – but accessed using legacy disk I/O protocols like SCSI and Fibre Channel). Needless to say, skipping the SCSI stack legacy and accessing Flash as if it were memory is faster than making it behave like a spinning disk drive. The access time of Flash memory is in the range of 100 microseconds (conservative), which translates to 28 hours on our timescale. Much better than the 4 months of physical disk, but not as good as RAM.
Flash disk is about a factor of 10 slower than Flash memory (due to protocol overhead, among other things) and sits at 1 millisecond (278 hours, or about eleven and a half days).
As discussed before, EMC was the first to bring enterprise-level Flash disks to market (“Enterprise” meaning: we can drive performance and reliability to make the technology more than good enough for mission-critical applications). But we see the gap between Flash disk and memory being addressed with Flash memory. EMC’s Project Lightning is addressing this gap and will bring the best of both worlds together: Flash memory performance with Flash disk behaviour (i.e. persistent shared storage behaviour – so you don’t lose any transactions). Would that help solve the consistency issue with in-memory databases?
The picture also makes it very clear why physical tape has died out completely – except maybe for long-term archival stuff: with access times of 100 (real) seconds or more (many centuries of waiting on our pseudo-timescale) it is simply not good enough for application processing. Whether there is still a place for tape as a backup or archival medium is under debate (I don’t think so, as SATA-disk based storage innovations with effective de-duplication algorithms are much better at storing bulk data efficiently).
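Putting the figures from this post side by side shows how uneven the steps between the layers are. The sketch below only does that arithmetic. Two of the values are my own back-calculations and should be read as assumptions: about 1 microsecond for main memory (from "17 minutes") and about 10 milliseconds for spinning disk (from "four months"). The names `LAYERS` and `gaps` are made up for this example.

```python
# (layer, real access time in nanoseconds), fastest first
LAYERS = [
    ("L1 cache", 100),
    ("Main memory", 1_000),           # assumption: ~17 scaled minutes
    ("Flash memory", 100_000),
    ("Flash disk", 1_000_000),
    ("Spinning disk", 10_000_000),    # assumption: ~4 scaled months
    ("Tape", 100_000_000_000),        # 100 real seconds
]


def gaps(layers):
    """For each pair of adjacent layers, return
    (slower layer, faster layer, slowdown factor)."""
    return [
        (slow_name, fast_name, slow_ns / fast_ns)
        for (fast_name, fast_ns), (slow_name, slow_ns) in zip(layers, layers[1:])
    ]


if __name__ == "__main__":
    for slow, fast, factor in gaps(LAYERS):
        print(f"{slow:<14} is about {factor:>8,.0f}x slower than {fast}")
```

With these inputs most steps are a factor of 10 or 100, while the step from disk to tape is four orders of magnitude, which is the point the tape paragraph makes.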
You might have noticed that I missed the top layer in the picture, the “Avoided I/O”. As said before (and thanks to my colleague Vince for pointing this out) – the fastest I/O is the one you don’t need to do. So how fast is this I/O on our timescale (or on any timescale for that matter)? Infinitely fast! Response time? Zero! Remember that before trying to optimize any of the other layers :-)
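One common way (certainly not the only one) to avoid I/O is simply not to ask for the same data twice. As a minimal sketch, assuming a hypothetical `read_block_from_storage` function that stands in for a real physical read, Python's standard `functools.lru_cache` can serve repeated requests from memory, and a counter shows how many physical reads were actually issued.

```python
from functools import lru_cache

# Counter for simulated physical reads (hypothetical, for illustration only).
stats = {"physical_reads": 0}


def read_block_from_storage(block_id: int) -> bytes:
    """Stand-in for an expensive physical read from disk or Flash."""
    stats["physical_reads"] += 1
    return b"data for block %d" % block_id


@lru_cache(maxsize=1024)
def read_block(block_id: int) -> bytes:
    """Logical read: only goes to storage on the first request for a block."""
    return read_block_from_storage(block_id)


if __name__ == "__main__":
    requests = [7, 7, 7, 42, 7]
    for block_id in requests:
        read_block(block_id)
    print(f"logical reads: {len(requests)}, "
          f"physical reads: {stats['physical_reads']}")
```

The same idea applies higher up the stack: a query that is never issued, or a report that is not regenerated, costs nothing at any layer of the hourglass.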
When trying to get the most performance out of your application, keep the scaled hourglass view in mind when reasoning about access times.