Managing database performance SLAs with quality of service
May 11, 2011
A guy walks into the showroom of a Porsche dealer. He wants to buy a new set of hot wheels. The sales guy tells him about the latest technology in sports car design. This year’s model has active 4-wheel drive traction control, a very powerful engine (over 500 horsepower) with direct fuel injection, semi-automatic dual-clutch with seven speeds, and the whole car is weight-balanced to offer the best handling and cornering speeds. At the same time, carbon emissions per kilometer are the lowest in years and the car actually has green labels, so at least you can make yourself believe that it does not ruin the environment too much ;-)
“Great,” says the customer. “What’s the acceleration and top speed?”
The sales guy tells him the impressive performance claims of the car. The customer is almost convinced, but just as he is about to spend a lot of money, he raises another question.
“When I run this car at full speed, it consumes a lot of petrol. Therefore I want to limit top speed so I don’t go over my fuel budget, because I have another car (a big pickup truck) that I need to fill up as well. Is there an automatic system that gently pulls the handbrake if I go over 160 km/hour? Or if it uses more than a certain number of liters per 100 kilometers? So that my fuel consumption is capped and I have enough left for my other car?”
“Not sure about that one,” the sales guy says. “Let me ask my mechanic, he knows the technology better than I do.”
The mechanic comes in after washing his oily hands from doing some oil changes and muffler replacements.
“Well,” says the mechanic, “I’m not sure why you want to manage fuel consumption this way. It is certainly possible to have the brakes applied automatically but maybe it’s better to just drive a bit slower than the car’s top speed. Maybe a cruise control system is what you need? So the car can be set to hold a fixed, more economic cruising speed and only apply more engine power when going uphill, for example? Or when you override the cruise control and just push the accelerator?”
“No, no, no,” says the customer. “I’ve read the specifications guide from your supplier of traction controls and it says the computerized system could be set to apply the brakes based on various input values (such as fuel consumption).”
“Yes,” says the mechanic. “It’s technically possible. But it means that when the brakes are applied by the system, you will choke the engine, it will not run very efficiently, and you will no longer be able to accelerate quickly when needed. And you will not be able to run the car at maximum performance because the brakes will kick in at 160, which is only half the top speed. I really think it is a better idea to use cruise control. That way you can still accelerate beyond 160 when you want.”
The customer looks really unhappy. He has had this idea of quality of service in his mind for months, and now that he has saved enough money to buy his dream car, he cannot use it the way he had in mind. Maybe he will go to another sports car vendor where they will happily sell him a car that continuously pulls the handbrake by itself when he goes too fast...
The customer is king, right?
By the way, what happens to a car when the wheels overheat because the handbrake is applied all the time?
Back to the application world. At EMC, we are highly focused on storage systems. We try to manage many things through storage. Our systems are market leading in performance, features, reliability, data integrity and so on. To be able to compete with our friends from other vendors, we develop a lot of technology that works very well if applied correctly. You could compare a car to an enterprise application. The engine in my example represents the database server. The wheels, with the traction system, brakes, suspension etc., represent the storage system. The whole car is the business application.
You could argue about this sports car comparison being valid, but I think you get the point ;-)
Traction control is comparable to Quality of Service (QoS) features. We can limit cache allocation in the storage, we can set priority controls etc. This all works fine on the storage level (within specifications) and it is more than just a bunch of marketing features. You need QoS to make sure Flash drives don’t choke I/O to classic Fibre Channel disks, for example. Or to limit system resources (fuel consumption?) for application cloning, so that the performance of the production system does not suffer from snapshot backups.
But customers often try to apply features in new, creative ways. Sometimes these are very well thought out, sometimes less so. Recently I have had discussions with a few customers who wanted to limit database performance by throttling the I/O subsystem, so that another database sharing the same storage box would not suffer from peak loads in the first one.
For example, a data warehouse (a pick-up truck, has limited top speed but can carry a lot of junk around) and a transactional system (sports car, no place to put your suitcase but you get there fast) are sharing a storage box. The data warehouse can be idle for hours, until a business analyst has a brilliant marketing idea, creates a monster query from hell, and runs it during business hours. It causes massive I/O for the data warehouse, so much that the transactional system suffers performance issues. The transactional system is part of the core business process of the company so this should never, ever happen.
So these customers asked whether it is possible to limit I/O bandwidth using storage QoS features on our EMC systems. Is this possible? Sure. Is it a good idea?
I don’t think so. Why? Because, as in my example, it’s like driving a sports car with the handbrake on. Customers spend most of their money on expensive database and application licenses. Limiting the performance by throttling I/O is then like throwing money in the fireplace. You just spent thousands of dollars on enterprise licenses and licensed options for a many-core database server, to run massive business intelligence as well and as economically as possible, only to find that most of the time the CPUs are just wasting time waiting for I/O.
On the data warehouse system, administrators will experience increased response times and lower megabyte-per-second numbers and will treat this as a performance problem (and actually, they are right). If response times get too high, the database management tools will start complaining about I/O performance bottlenecks. If they climb even higher, you might see I/O errors due to timeouts.
Did I mention overheated brakes?
So what is the right solution?
There is no single answer to that question. But in general, here’s my advice:
- Manage workloads as high as possible in the application stack. QoS on the OS is better than on storage. QoS at the database level is better still (if it works), and managing the application itself is better again. Analogous to the cruise control in the car, the best you can do is manage processing requests before they even enter the application stack. An example is a batch scheduler that allows no more than a set number of heavy jobs to run at the same time. It submits another job to the run queue only if enough resources are available, but the system itself runs without artificial restrictions.
- Note that the most effective component by far, in any car, that limits fuel consumption, is the nut that connects the seat to the steering wheel. Same for users in the application landscape.
- Make sure no CPU resources are wasted. Server hardware is relatively cheap, but the licenses that allow the processors to do work are not. Drive up your return on investment (ROI) by allowing these CPUs to do as much work as possible, reducing the car’s wheel friction and aerodynamic drag (i.e. the wait for other resources such as storage) to a minimum.
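To make the batch scheduler from the first point a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any real scheduler product works: the names `run_batch`, `heavy_job` and `ConcurrencyMeter` are made up, the limit of two jobs is arbitrary, and a short sleep stands in for a monster query. The idea is simply that the limit sits at admission time, so a job that has been admitted runs at full speed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyMeter:
    """Records how many jobs run at once, so the cap can be checked."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.running -= 1
        return False


def heavy_job(name, meter, seconds=0.05):
    """Hypothetical stand-in for a heavy query; unthrottled once admitted."""
    with meter:
        time.sleep(seconds)
    return f"{name} finished"


def run_batch(job_names, max_heavy_jobs=2):
    """Run all jobs, never more than max_heavy_jobs at the same time."""
    meter = ConcurrencyMeter()
    # The pool size is the admission limit: extra jobs wait in the queue
    # until a running job finishes and frees a worker.
    with ThreadPoolExecutor(max_workers=max_heavy_jobs) as pool:
        futures = [pool.submit(heavy_job, name, meter) for name in job_names]
        results = [future.result() for future in futures]
    return results, meter.peak


if __name__ == "__main__":
    results, peak = run_batch(["q1", "q2", "q3", "q4", "q5"], max_heavy_jobs=2)
    print(results)
    print("peak concurrent heavy jobs:", peak)
```

Nothing here slows down a job that is already running. The only thing being managed is how many are allowed onto the road at once, which is the cruise control rather than the handbrake.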
I know there are many OS, database and application workload managers (and don’t forget the impressive workload management features of modern hypervisors such as VMware), but I haven’t done hands-on work with these myself, and I bet the application, database and OS vendors know better how to implement them than I do. I just think that this is the best place to perform workload management.
And don’t forget to educate your users…
Optimizing license cost
Let’s assume a customer spends a million euros on an application stack. I bet more than 80% of the total cost is database and application licenses. The remaining 20% is server, storage, some networking gear and some other stuff. Even though my customers sometimes complain that EMC is not cheap (for cheap storage, you might want to go to your local computer shop and buy a bunch of cheap SATA disks), the cost of our stuff compared to the rest of the stack is relatively limited. Of the million euros spent on the total stack, I bet that much less than 100,000 is spent on storage.
Instead of implementing QoS on storage to limit the server CPU’s workload, you could also spend a little bit more on storage to add some flash disks, a few extra front-end ports, and maybe some storage cache so that you don’t HAVE to use QoS on storage. The minor extra investment will pay for itself big time as ROI on your database and application CPU licenses. It allows the sports car to run at full speed whenever needed, without artificial limitations, and you will have budget left for the pick-up truck to carry enough garbage from the local supermarket to the house.
Although hard to justify, investing a bit in storage might let you delay the purchase of new server equipment and database CPU licenses by a few months or even years.
Besides, most of us techies forget to figure out how much more revenue or profit our company (our internal customer, the application owner!) could make if their applications ran faster. We tend to think that if there are no visible bottlenecks in our ICT equipment and none of our performance tools are complaining, then everything is fine. But I’ve spoken to business guys who say that, even if their application has no real performance issues (which is rare, by the way), they would happily invest a lot of money to make it run 20% quicker (that’s one of the reasons why they are often willing to spend on overpriced database appliances, without thinking about the consequences).
In this example, spending 10% more on storage (10,000 euros on a storage budget of 100,000, for example for a few extra Flash drives) might save 10% (100,000 euros, or much more) of the total cost of the application stack.
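For those who like to see the numbers, the arithmetic of this example can be written down as a small Python sketch. The function name and parameters are hypothetical, the 10% storage share is the upper bound of my guess above (much less than 100,000 out of a million), and the 10% saving on the stack is my illustrative estimate, not a measured result.

```python
def storage_investment_case(total_stack_eur=1_000_000, storage_share_pct=10,
                            extra_storage_pct=10, stack_saving_pct=10):
    """Compare extra storage spend with the assumed saving on the stack.

    All percentages are illustrative assumptions, not measurements.
    Integer arithmetic keeps the euro amounts exact.
    """
    storage_eur = total_stack_eur * storage_share_pct // 100
    extra_storage_eur = storage_eur * extra_storage_pct // 100
    stack_saving_eur = total_stack_eur * stack_saving_pct // 100
    return {
        "storage_eur": storage_eur,
        "extra_storage_eur": extra_storage_eur,
        "stack_saving_eur": stack_saving_eur,
        "net_benefit_eur": stack_saving_eur - extra_storage_eur,
    }


if __name__ == "__main__":
    for item, amount in storage_investment_case().items():
        print(f"{item}: {amount:,} euros")
```

With the default assumptions, the extra spend is 10,000 euros against an assumed saving of 100,000 euros. Change the percentages to match your own stack and see whether the case still holds.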
Even though competition in storage land is tough and customers push us for the lowest gigabyte prices, maybe it is a good idea to consider the whole application stack and spend the budget a bit more wisely.