Lessons Learned From A Data Center Meltdown

By Al Crowley, Senior Software Engineer at TCG

Recently, a TCG project, the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC), suffered a week-long outage. It was painful, nerve-wracking, and frustrating for everyone involved. NITRC is hosted in a large third-party academic data center on a managed server. There were redundant systems in place to prevent long-term outages. All of the textbook configurations and protocols had been followed to prevent any standard failure. However, an extremely unlikely chain of events brought down the whole data center.

Here’s what happened:

It started with a small water leak elsewhere in the building. When maintenance started repairs, they found that it was close to some power lines that would have to be deactivated. The staff knew those power lines were supplying the data center air conditioner, but they also knew there was a redundant AC unit that should automatically start when the primary unit lost power. What maintenance didn’t know was that the support contract on the air conditioners had lapsed and was still pending re-approval by accounting. Without the proper routine maintenance and testing, the secondary unit did not power on as expected. The temperature quickly rose in the machine room, causing a failure in a number of hard drives. Normally, losing a few drives would not be a problem due to the redundancy in the system, but at the same time, components of the disk array controller were overheating which blocked the normal recovery process. Finally, a temperature-related failure of the battery/power supplies prevented any chance of a clean shutdown. This left the disk array in an unknown — but surely bad — state.

Once the server farm went down, of course, the data center took steps as quickly as it could to bring things back up. It took a week to get back to normal, though.

As demonstrated by this failure, if you stay in the business long enough you will eventually hit some combination of rare events and human error for which you haven’t prepared.

I’ll try to condense the lessons we learned while resolving our problems:

Getting everyone involved on the phone at the same time will take longer than you think. It’s impossible to make good decisions without information. When things are going horribly wrong, scheduling conference calls isn’t high on everyone’s priority list. Get commitments from key players in advance regarding their availability during an outage. Even so, be ready to make decisions with incomplete information.
Once you do get everyone talking, you may learn that the reality of your architecture doesn’t quite match expectations. Even if you did get perfect documentation of the network, backup plan, etc. when you launched your site, it’s unlikely to have stayed accurate over the course of years. So periodically verify and test your hosting and backup information.
You might be thinking about your continuity of operations plan (COOP) right now. Have you tested your COOP recently? In my experience, when I’ve tested COOPs, they mostly worked but didn’t quite get everything working as expected. So regular updates to the plans are a must.
Don’t just do something, stand there! When you are trying to fix things fast, it is easy to make the situation worse in the long run by diving in and changing things. Remember that any temporary changes you make will have to be reversed or incorporated into your final solution.
Communication is key. In our situation, we were frustrated by the pace at which information was coming from our providers. Our users and stakeholders were feeling the same angst. Not only is it hard to decide what to say, it can be hard to say anything at all if your hosting provider is offline. Have some secondary channels of communication with your users, maybe Twitter, maybe a dedicated status dashboard, whatever works for your situation.
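
The advice above about periodically verifying your backup information can be partly automated. Here is a minimal sketch in Python, assuming your backups land as files on a disk you can read. The function name `stale_backups`, the 24-hour threshold, and the example paths are all hypothetical and are not a description of NITRC's actual setup.

```python
import os
import time


def stale_backups(paths, max_age_hours=24.0, now=None):
    """Return the paths that are missing or older than max_age_hours.

    `now` is a Unix timestamp; it defaults to the current time and is
    only a parameter so the check can be tested deterministically.
    """
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            # Age of the file based on its last-modified time.
            age_hours = (now - os.path.getmtime(path)) / 3600.0
        except OSError:
            # A backup that is missing or unreadable counts as stale.
            stale.append(path)
            continue
        if age_hours > max_age_hours:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Hypothetical backup locations; replace with your own.
    expected = ["/backups/site-db.dump", "/backups/site-files.tar.gz"]
    for path in stale_backups(expected):
        print(f"WARNING: backup missing or out of date: {path}")
```

A fresh timestamp only tells you that a backup job ran, not that the backup can be restored, so a check like this complements a periodic test restore rather than replacing it.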

Outages are inevitable. No matter how good your planning, there will be unexpected “gotchas”. According to some reports, human error accounts for 24% to 70% of downtime. That means that in many outages nothing went wrong with the redundant hardware, and auto-failover was working perfectly, but the service still managed to go dark.

How did it all work out for us in the end? About two days into the outage, we decided on-site recovery was going to take a very long time. We then went to our Plan C and deployed a read-only version of NITRC so that our users’ neuroscience research could continue. I hope that you can learn from our misadventure without having to live through it yourself.