Monday, July 6, 2009

Data center issues: Now is a good time for a pre-postmortem review!

In the past week there have been a handful of significant issues at a few large data centers, including an 11-hour outage at Authorize.NET that left thousands of websites unable to process credit card transactions. The company that I work for narrowly missed the 45-minute outage at Rackspace, but it looks like Justin Timberlake wasn’t quite so lucky.

Since it is my job to make sure that the servers that run our software stay online as much as possible, I enjoy reading about problems with hosting providers and making sure that I would know what to do if/when it happens to us. The new cloud services offered by Amazon, Google, and Microsoft aim to give everyone the scalability and reliability that used to be available only to large companies with thousands of servers, but these new computing services are still in their infancy and have had more than their fair share of downtime recently as well.

I very much enjoyed reading Dyn Inc’s analysis of the Authorize.NET outage, as it gave a good mix of the gory details of the failure and the best-practice techniques that could be used to prevent or mitigate this type of extended outage. Usually you have to learn these kinds of things firsthand, having gotten burned once and swearing to never let it happen again. Instead I prefer to try to learn from other people’s mistakes, as you usually spend much less time in the hospital that way!

1 comment:

Greg Bray said...

I almost forgot! If you want to see the gory details from the Rackspace incident, they posted a public timeline to show all the events that led up to the power failure. The actual downtime occurred between 3:15 p.m. CDT and 3:58 p.m. CDT on June 29th and only affected half of the data center. I seriously wish I could have been a fly on the wall to watch the investigation roll out amid the chaos :-P

