Tales from the past – Overheated Datacenter
May 21, 2015
A long time ago in a datacenter far, far away….
It is a period of digital revolution.
Rebel Dot Com companies, striking from hidden basements and secret lofts,
have won their first fights against long-standing evil corporate empires.
During the battles, rebel geeks have managed to invent secret technology to
replace the corporations’ old ultimate weapons,
such as snail mail and public telephone networks currently powering the entire planet.
Contracted by the Empire’s sinister CIOs, the UNIX Engineer and author of this blog
races against the clock across the UNIX root directories,
to prepare new IT infrastructure for the upcoming battle –
while at the same time, trying to keep the old weapons of mass applications available and running
as best as he can to safeguard the customers’ freedom in the digital galaxy.
In the late nineties, before I switched to the light side of the Force and joined EMC, I was a UNIX engineer working as a contractor for financial institutions. This is the first in a series of stories from that period and later. I obfuscated all company and people names to protect their reputation and to avoid disclosing sensitive information, but former colleagues might recognize parts of the stories or maybe everything. Also, some of it happened a long time ago and I cannot be sure everything I say is factually correct. Human memory is notoriously unreliable.
It was late on a Friday afternoon.
Everyone in my department had already left for the weekend, but I was working on a critical infrastructure project with a tight deadline; otherwise I guess I would have left already, too.
At some point I needed to re-install a UNIX server, which in those days was done by physically booting it from an install CD – so I needed to go to the datacenter room and get physical console access to get that going. I walked to the datacenter floor, which hosted several large UNIX systems, a mainframe, a number of EMC Symmetrix storage systems, network gear, and lots of Intel servers, mostly running Windows NT and maybe a few running Novell.
There were large tape libraries for backup, lots of server racks, fire extinguishers and whatever else you typically find on a large datacenter floor like that. I used my keycard to open the door to the datacenter and stepped in… The first thing I thought was, wow, it’s warm in here…
After a few seconds I realized that the temperature was way above normal. I unbuttoned my shirt, rolled up my sleeves, and walked straight to one of the large cooling systems against one of the side walls to check whether it was working correctly.
HA! This one showed an error code and a high temperature (I believe it was something like 30 Celsius, which equals 86 Fahrenheit… way too hot for a datacenter). Maybe the other aircos could not handle the required cooling, as it was a summer day and the units had to work hard to keep the temperature at acceptable levels.
I checked out the next unit… Also in error! Strange! Next one… Error… All the airco units were showing an error code. How was that possible? Firmware bug? Date issues? The millennium bug had not arrived yet… (more on that in a future story). I tried to reset one of them, and it seemed to work: it started running again. But after a minute or so, it stopped and complained with the same error code. The datacenter was quickly getting hotter and hotter. By now it must have been 35 Celsius. I realized that if this kept going, many servers would soon overheat and go down, either gracefully or worse.
A server that crashes because it’s too hot might actually corrupt data before it shuts down. If a few hundred servers did that, we would face a real disaster trying to get that stuff online again within the recovery time objective defined in the service level agreement.
I had to think quickly. As all my colleagues had already left, I called the operations manager and asked for permission to do an emergency shutdown of all the servers except maybe the mainframe and the primary transaction processing system (not that I would have known how to properly shut down a mainframe, for that matter). He agreed – and I went to a central console and started to issue shutdown commands on nearly all hosts I could get access to. Doing that manually on nearly a hundred UNIX boxes, a few hundred Windows machines and some other stuff takes a while – and in the meantime, it got so hot that some of the servers already showed error messages on their status displays or were simply unreachable on the network. One of the operators walked in to see if anyone was around – and he told me that EMC customer support had called him with the message that all of the Symmetrix boxes had dialed home with a warning that the temperature sensors showed a very high temperature, asking whether he could check what was going on…
Fortunately the EMC systems had been tested in much harsher conditions before ever being sent to customers, so even with the temperature warning, they happily kept humming along without errors. For now.
I powered off many of the Intel boxes the hard way because they had already kernel-panicked. Probably a good part of an hour later the whole datacenter was much quieter than usual, as I had managed to power down almost all servers except a few critical ones. The temperature had risen to Sahara levels, but we had opened all the windows and doors and the temperature was slowly dropping again. The key transaction systems were still operational. I had just single-handedly fended off a major disaster! Phew…
The airco maintenance guy arrived not long after that. While I was still wondering how it was possible that multiple separate air conditioning units had all gone into panic at the same time, he asked whether I was interested in climbing the stairs to the roof, where he wanted to show me something.
On the roof we walked to a small area where I could see some tubes and other equipment, not far from the heat exchangers for the air conditioners. When we came closer there was a loud rattling noise. Slowly it became clear to me what had happened – there was a set of pumps to circulate the cooling fluid for the air conditioners. The pumps were working in parallel, with shutoff valves in case one failed – so the other would do the job alone. One of the pumps was making the rattling noise and was obviously completely broken. “What about the other one?” I asked. “Broken a long time ago,” the man answered…
So the root cause was found – we had redundant pumps for the cooling fluid, but one pump had given up at some point, and today the second one kicked the bucket. Obviously there was no mechanism in place to detect the first failure – so even if there was no Single Point of Failure (SPOF) in the design (not true by the way, I’ll come to that later), if you’re not able to detect and repair broken components, then sooner or later even a redundant architecture can ruin your whole day. The pumps were replaced after a few hours (I don’t remember exactly how long it took) and we could restart the air conditioners – followed by powering the entire server farm back on.
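The missing piece in the pump story was an alarm that fires when redundancy is lost, not only when the service itself is down. Below is a minimal Python sketch of that idea. It is purely illustrative and not what was running at that site: the `Component` type, the state names and the pump names are all made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class GroupState(Enum):
    HEALTHY = "healthy"    # every member is working
    DEGRADED = "degraded"  # service is still up, but redundancy is reduced or gone
    DOWN = "down"          # no working member left


@dataclass
class Component:
    name: str       # hypothetical identifier, e.g. "pump-a"
    healthy: bool   # result of some health probe (assumed to exist)


def group_state(members: list[Component]) -> GroupState:
    """Classify a redundant group by how many members still work."""
    if not members:
        raise ValueError("a redundant group needs at least one member")
    working = sum(1 for m in members if m.healthy)
    if working == len(members):
        return GroupState.HEALTHY
    if working > 0:
        return GroupState.DEGRADED
    return GroupState.DOWN


def alerts(members: list[Component]) -> list[str]:
    """Return alert messages; a degraded group alerts even though service is up."""
    state = group_state(members)
    failed = ", ".join(m.name for m in members if not m.healthy)
    if state is GroupState.DEGRADED:
        return [f"WARNING: redundancy lost, failed: {failed}"]
    if state is GroupState.DOWN:
        return [f"CRITICAL: all members failed: {failed}"]
    return []


if __name__ == "__main__":
    # The situation on the roof before that Friday: one pump dead, one still running.
    pumps = [Component("pump-a", False), Component("pump-b", True)]
    for message in alerts(pumps):
        print(message)
```

The point of the `DEGRADED` state is that the cooling would still have looked fine from inside the datacenter, yet somebody would have been told that the next pump failure would take everything down.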
A few things I learned from this adventure:
- Even if you think there’s no SPOF, you probably missed something and it’s probably somewhere not on your radar screen
- Redundant components only help if you can detect (in time) that one of them is broken
- Passive components can be a SPOF too (coming back to the cooling architecture: we had multiple airco units but they were all fed by one pipe for coolant)
- EMC Customer Support rocks big time (they warned that something was going on at roughly the same time I discovered it – but what if I had gone home, or had not needed to go to the datacenter?)
- Rock-solid Disaster Recovery is often a good idea (we introduced SRDF replication at that customer not much later)
But most notably… If you think you kicked Murphy out of your building, don’t think he’s completely gone. He’s wreaking havoc on the roof instead.
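For the SPOF hunters: one simple way to look for a hidden shared component is to write down what each redundant unit depends on and check what they all have in common. The following is a minimal Python sketch of that check. The unit and dependency names are hypothetical (the power feeds in particular are invented for the example); only the single shared coolant pipe comes from the story above.

```python
def shared_dependencies(units: dict[str, set[str]]) -> set[str]:
    """Return the dependencies that every redundant unit relies on."""
    if not units:
        return set()
    # Anything present in all dependency sets is a single point of failure
    # for the group, no matter how many units there are.
    return set.intersection(*units.values())


if __name__ == "__main__":
    # Hypothetical inventory: three airco units, two power feeds, one coolant pipe.
    aircos = {
        "airco-1": {"power-feed-a", "coolant-pipe"},
        "airco-2": {"power-feed-b", "coolant-pipe"},
        "airco-3": {"power-feed-a", "coolant-pipe"},
    }
    print(shared_dependencies(aircos))
```

With this inventory the only common element is the coolant pipe, which is exactly the kind of passive component that is easy to leave off the list in the first place. The check is only as good as the inventory you feed it.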
This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2015. All rights reserved. Not to be reproduced for commercial purposes without written permission.