…Not five nines,
…or four nines,
…not even three nines (99.9% uptime)!
If you tried to visit some of the Web’s most popular sites for a good part of the day yesterday, July 20, 2008, you were likely disappointed. Sites like WordPress (where this blog is hosted), Twitter, SmugMug and others were impacted for hours because they depend on Amazon’s S3 (Simple Storage Service), which went down. Apparently, even some Apple iPhone applications were affected by the S3 outage. It was the second time in less than six months (the previous outage occurred on February 15) that AWS (Amazon Web Services) has experienced a major failure.
Based on what we’ve learned so far about S3, our best guess is that yesterday’s outage was caused by a software bug, a human error of some sort, or, as was the case in the February outage, some set of conditions within their system that overwhelmed its ability to handle traffic (interestingly, the latest problem occurred early on a Sunday morning… not exactly a time when you would expect peak load on their system). We view a malicious attack on the service as a less likely cause, and hardware or connectivity problems as a very unlikely cause. S3 is a decentralized system designed to survive the loss of some of its components and still operate normally. In many widespread telecom or network failures suffered by providers and carriers in the past few years, the cause has often been determined to be software related or human error (like a construction crew cutting a fiber optic cable they didn’t know was buried there).
As an aside, here are some articles about human error that has caused some major outages…
Optus cable culprit found
The Backhoe, The Internet’s Natural Enemy
Cut in Fiber Cable Disrupts Internet Traffic Nationwide
The Backhoe: A Real Cyberthreat
The S3 outages bring to mind another concern among people responsible for the operation of the Internet itself. One of the services that the Internet is built on is DNS (the Domain Name System). DNS is what allows your computer to find a website such as this one from among the millions of computers and websites on the Internet. Even though DNS functionality is spread across many servers in a hierarchical system, there is concern among some that a widespread DNS failure could still occur. This would cripple almost all Internet traffic. Worst of all, if there were a major DNS failure, you might not be able to get to this blog! Heaven forbid.
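To see why that hierarchy matters, here is a toy model of DNS resolution in Python (all server names and addresses are hypothetical, and real DNS involves caching, redundant servers, and more record types than this sketch shows). A resolver walks from the root down to an authoritative answer, which is why losing a whole layer of the hierarchy at once would be so damaging:

```python
# Toy model of hierarchical DNS resolution. The data below is entirely
# made up for illustration; real resolution queries actual root, TLD,
# and authoritative name servers over the network.

ROOT = {"com.": "tld-com"}                        # root zone delegates .com
TLD = {"tld-com": {"example.com.": "ns-example"}} # .com delegates example.com
AUTH = {"ns-example": {"www.example.com.": "192.0.2.10"}}

def resolve(name):
    """Follow the delegation chain: root -> TLD -> authoritative server."""
    labels = name.split(".")
    tld = labels[-2] + "."                 # e.g. "com."
    tld_server = ROOT[tld]                 # step 1: ask a root server
    zone = ".".join(labels[-3:])           # e.g. "example.com."
    auth_server = TLD[tld_server][zone]    # step 2: ask the TLD server
    return AUTH[auth_server][name]         # step 3: ask the authoritative server

print(resolve("www.example.com."))         # -> 192.0.2.10
```

If the root or TLD layers became unreachable, every lookup that wasn’t already cached would fail at step 1 or step 2, no matter how healthy the authoritative servers were.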
S3 is a “cloud” storage service. Internet-based computing resources are collectively referred to as cloud computing (see this Businessweek article on cloud computing). In cloud computing, resources that were traditionally located, say, in a company’s data center (disk storage, application software, servers, etc.) are offered by service providers via the Internet. Cloud computing is a relatively new paradigm, and problems similar to what Amazon has experienced are sure to make CIOs and IT managers hesitant to rely on the cloud when they can provide computing resources locally and have greater control over them.
Almost by definition, services offered in the cloud must offer high availability. The uptime standard generally used in the telecommunications and computing industries for critical systems is “five nines”, or 99.999% availability. That translates (approximately) to less than five and a half minutes of downtime a year, and generally does not include scheduled service outages. In the United States, the public telephone network operated by the Bell System was consistently able to achieve five nines reliability (so Ma Bell wasn’t that bad to us after all, may she rest in peace). Clearly, Amazon’s S3 service has failed this benchmark. It doesn’t even appear that AWS has achieved two nines availability (less than about seven hours of downtime per month) this month. That’s utterly dismal performance that is unacceptable for critical systems, and it does not bode well for Amazon’s future in the cloud, or for cloud computing in general.
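The arithmetic behind those downtime budgets is easy to check. Here is a quick back-of-the-envelope sketch in Python (the helper function and the 365-day/30-day period assumptions are ours, not from any standard):

```python
# Back-of-the-envelope downtime budgets for various "nines" of
# availability, assuming a 365-day year and a 30-day month.

def downtime_allowed(availability_pct, period_hours):
    """Maximum downtime (in hours) permitted over a period at a given availability."""
    return period_hours * (1 - availability_pct / 100)

HOURS_PER_YEAR = 365 * 24   # 8760
HOURS_PER_MONTH = 30 * 24   # 720

# Five nines: a little over five minutes per year.
print(downtime_allowed(99.999, HOURS_PER_YEAR) * 60)  # ~5.26 minutes

# Two nines: a little over seven hours per month.
print(downtime_allowed(99.0, HOURS_PER_MONTH))        # ~7.2 hours
```

Yesterday’s multi-hour outage alone blows through the five-nines budget for more than a decade.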
Interestingly, Amazon’s S3 SLA (Service Level Agreement) states that users are not entitled to a service credit unless their uptime drops below three nines (99.9%) in any month, and even if Amazon fails to achieve two nines (99% uptime) in a month, they will only give users a 25% credit. They must not have a lot of confidence in their ability to provide four nines availability (less than one hour a year of downtime), which Amazon states is one of S3’s design requirements. And if they don’t meet their service levels, will they give their customers a refund? No. It appears all they will offer is a credit to be applied to future service. Not good.
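The credit schedule, as we read it, amounts to a simple tiered lookup. A minimal sketch follows; note that only the no-credit threshold (99.9%) and the 25% tier (below 99%) are described above, and the 10% figure for the middle tier is our assumption based on the published SLA, so check Amazon’s actual agreement before relying on it:

```python
# Sketch of the S3 service-credit tiers as described in the post.
# The 10% middle tier is an assumption (from the published SLA),
# not something stated in the post itself.

def s3_service_credit(monthly_uptime_pct):
    """Return the service-credit percentage for a given monthly uptime."""
    if monthly_uptime_pct < 99.0:    # below two nines
        return 25
    if monthly_uptime_pct < 99.9:    # below three nines (assumed 10% tier)
        return 10
    return 0                         # three nines or better: no credit

print(s3_service_credit(99.95))  # 0 -- an outage this month may earn nothing
print(s3_service_credit(98.0))   # 25 -- the maximum, and it's only a credit
```

Notice the asymmetry: a month at 98% uptime means roughly 14 hours of downtime, yet the customer’s only remedy is a partial credit toward future service.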
But don’t expect disgruntled S3 customers who have been impacted by Amazon’s Simple Storage Service outages to issue press releases critical of Amazon. Paragraph 4.2.4 of their customer agreement specifically prohibits that unless you get their permission first. Incredible.
With an SLA like Amazon’s, and especially because of their outages in the past few months, we might be inclined to use a service such as S3 only to store backup files. We don’t feel that the service is reliable enough to be used to support a live website or other mission critical systems. And even if Amazon had a 100% uptime record, there’s always this to worry about when deciding if you want to depend on services in the cloud (and to think that you were worried about the Y2K problem!).
Perhaps cloud computing is an idea whose time has not yet come.
– Routing By Rumor