Everything is broken... still

Ned Bellavance
4 min read

Earlier this week Amazon Web Services’ Simple Storage Service, better known as S3, was experiencing higher than normal error rates in the us-east-1 region; in other words, S3 was down.  To put that into perspective, it also means that several high-profile websites and applications were experiencing major issues.  As you may know, S3 carries a 99.9% uptime SLA, and it’s been down for a couple of hours now.  I’m no math genius, but that’s well past the roughly 44 minutes of downtime per month that 99.9% uptime allows.
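
If you want to sanity-check that downtime math yourself, here’s a quick back-of-the-napkin sketch (assuming a 30-day month; the SLA percentages are the only inputs):

```python
# Rough downtime budget for a given uptime SLA, assuming a 30-day month.
def allowed_downtime_minutes(sla: float, days_in_month: int = 30) -> float:
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - sla)

for sla in (0.999, 0.99):
    print(f"{sla:.1%} uptime -> {allowed_downtime_minutes(sla):.1f} minutes of downtime per month")
# 99.9% uptime -> 43.2 minutes of downtime per month
# 99.0% uptime -> 432.0 minutes of downtime per month
```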

I don’t want to beat up on AWS though; that’s not the intention of this article.  But it is an excellent example of the fact that everything breaks.  Everything.  As technologists we know this and must plan for the worst.  A well-architected solution should be able to withstand outages anywhere in its infrastructure and keep running.  We talk a lot about not leaving a single point of failure, and that includes a data center or a cloud service in a specific region.  Not all of S3 was down, just the us-east-1 region.  There were still four other regions in North America alone with a functioning S3 service.  A well-architected solution should be able to fail over seamlessly to one of those other regions without interrupting the end user.  Sadly, it appears that many services have not chosen to build their solutions that way.
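
As a rough illustration of what that kind of failover can look like from the client side, here’s a minimal sketch that reads from a primary bucket in us-east-1 and falls back to a replica in another region when the call fails.  The bucket names and key are made up, and it assumes cross-region replication is already keeping the replica in sync; it’s a sketch of the idea, not a production pattern.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical primary/replica buckets; assumes cross-region replication
# is already keeping the replica in sync with the primary.
BUCKETS = [
    ("us-east-1", "my-app-assets"),          # primary
    ("us-west-2", "my-app-assets-replica"),  # replica in another region
]

def fetch_object(key: str) -> bytes:
    """Read an object from the primary bucket, falling back to the replica."""
    last_error = None
    for region, bucket in BUCKETS:
        try:
            s3 = boto3.client("s3", region_name=region)
            response = s3.get_object(Bucket=bucket, Key=key)
            return response["Body"].read()
        except (BotoCoreError, ClientError) as err:
            last_error = err  # remember the failure and try the next region
    raise RuntimeError(f"All S3 regions failed for {key}") from last_error
```

In practice you’d also want timeouts and retries tuned so the fallback kicks in quickly instead of hanging on the dying region.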

According to TechCrunch, about 0.8% of the top 1 million websites use S3 for storage, which works out to roughly 8,000 sites.  Not an insignificant amount.  Other services also appeared to be impacted, such as IoT devices and Amazon Fire TV.  Even the AWS health dashboard was not functioning properly at first, showing all green when things clearly were not.

The irony here is that even AWS and Amazon didn’t follow their own best practices and design a fully resilient architecture.  It turns out this stuff is hard, which is why we need to know how to do it and help our customers do the same.

Eventually AWS will recover from this snafu.  It turns out the whole issue was caused by a mistyped command, which is frankly amazing and completely unsurprising at the same time.  Some websites are likely to decry the cloud and curse its name to the heavens.  Cooler heads shall prevail, and in a week the churn cycle of the internet will make this barely a blip on the radar.  For the rest of us, I think there are some lessons to be learned here:

  1. Design your solution such that there are no single points of failure
  2. Make changes easy to roll back
  3. Plan for outages, and test those outages in production (a minimal failure-injection sketch follows this list)
  4. Try not to rely on a single vendor or service for key components of your solution
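
On point 3, you don’t have to wait for AWS to have a bad day to find out whether your fallback actually works.  Here’s a minimal sketch of a failure-injection test for the fetch_object example above, assuming it lives in a hypothetical module called storage: stub out the S3 clients so the primary region fails and the replica answers, then assert that the fallback path returns the data.

```python
from io import BytesIO
from unittest import mock

from botocore.exceptions import EndpointConnectionError

import storage  # hypothetical module holding the fetch_object sketch above


def test_fallback_survives_primary_outage():
    # The primary region's client raises on every read, simulating the outage.
    down = mock.Mock()
    down.get_object.side_effect = EndpointConnectionError(
        endpoint_url="https://s3.us-east-1.amazonaws.com"
    )

    # The replica region's client answers with canned data.
    healthy = mock.Mock()
    healthy.get_object.return_value = {"Body": BytesIO(b"hello from the replica")}

    # boto3.client is called once per region, in order: primary first, then replica.
    with mock.patch.object(storage.boto3, "client", side_effect=[down, healthy]):
        assert storage.fetch_object("index.html") == b"hello from the replica"
```

Run it with pytest.  The same idea scales up to game-day exercises where you deliberately black-hole a region in a controlled window and watch what actually happens.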

When I said plan for the worst, that was a little inaccurate.  We have to be prepared for the worst that we can reasonably afford to mitigate.  When I am architecting a solution, sometimes I will start going down the rabbit hole of “what if” until I’ve over-architected the hell out of something.  Just as important as understanding the worst-case scenario is understanding what that scenario will cost you and spending a commensurate amount of money to prevent it.  If your project management website being down for a day costs you $1,000 in productivity, and you could prevent it for an extra $50 a day, then it’s probably worth it.  If it’s an extra $1,000 a day, then it probably is not.  You must look at the likelihood of a failure, the cost of the failure, and the cost of preventing it.  Run the numbers and you’ll know how much work to put into prevention versus simply setting a reasonable SLA.
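
Here’s what “run the numbers” looks like in its simplest form, with completely made-up figures: compare the expected annual loss (how often the failure happens times what it costs) against the annual cost of preventing it.  The verdict flips entirely depending on how often you think the failure will actually occur.

```python
# Back-of-the-napkin risk math with made-up numbers.
cost_per_outage = 1_000              # productivity lost per day-long outage ($)
expected_outages_per_year = 12       # how often you think that failure will happen
prevention_cost_per_year = 50 * 365  # extra redundancy at $50/day ($)

expected_annual_loss = cost_per_outage * expected_outages_per_year

print(f"Expected annual loss:   ${expected_annual_loss:,}")      # $12,000
print(f"Annual prevention cost: ${prevention_cost_per_year:,}")  # $18,250
# At these invented numbers the mitigation costs more than the risk it removes;
# double the outage frequency or the cost per outage and the answer flips.
```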

The outage means AWS will owe a 10% service credit to the S3 customers impacted by it.  Another hour or two and that would have been 25%.  If AWS is raking in $10 million a month from those customers, it is about to hand back $1 million in credits.  That hurts, but it probably doesn’t come close to the financial impact on the sites that were down.