Amazon Explains How and Why S3 Storage Service Wobbled
S3 failed for around 8 hours last Sunday

The S3 storage service failed for around 8 hours last Sunday, July 20th. Now the Amazon S3 Team, saying "we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect," has published some additional detail about the problem it experienced.

Here Cloud Computing Journal brings you the explanation in full:

Amazon S3 Availability Event: July 20, 2008

We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.

At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.

At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests.
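The gossip mechanism described above can be illustrated with a minimal sketch. This is not Amazon's implementation: the version-numbered state entries, the random peer selection, and every function name below are assumptions made for illustration only.

```python
import random

def merge_state(local, remote):
    """Merge a peer's view into the local view, keeping the newer entry per server.

    Each view maps server name -> (version, status). The versioning scheme
    is a hypothetical stand-in for whatever S3 actually uses.
    """
    for server, (version, status) in remote.items():
        if server not in local or version > local[server][0]:
            local[server] = (version, status)

def gossip_round(views, rng):
    """Every server exchanges state with one randomly chosen peer."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge_state(views[name], views[peer])
        merge_state(views[peer], views[name])

def rounds_until_converged(views, rng, max_rounds=100):
    """Run gossip rounds until all servers hold the same view; return the round count."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        first = next(iter(views.values()))
        if all(view == first for view in views.values()):
            return round_number
    return None

# Eight servers, each initially aware only of itself.
views = {f"s{i}": {f"s{i}": (1, "up")} for i in range(8)}
# Server s0 has observed that s3 is unreachable (a newer version of s3's entry).
views["s0"]["s3"] = (2, "down")
rounds = rounds_until_converged(views, random.Random(0))
```

After convergence every server knows that `s3` is down and can route around it. The same property explains the failure mode in the letter: whatever state a server gossips, correct or not, reaches the whole system quickly.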

At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system's state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared. By 2:20pm PDT, we'd restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.

At 2:57pm PDT, Amazon S3's EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3's US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.
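As a quick check of the "around 8 hours" figure, the elapsed times can be computed from the timestamps in the letter. The helper name `at` is illustrative; the times themselves are taken from the text above.

```python
from datetime import datetime

def at(hour, minute):
    """A PDT wall-clock time on the day of the event, July 20, 2008."""
    return datetime(2008, 7, 20, hour, minute)

errors_began = at(8, 40)    # 8:40am PDT: error rates began to climb
eu_recovered = at(15, 10)   # 3:10pm PDT: EU request and error rates back to normal
us_recovered = at(16, 58)   # 4:58pm PDT: US request and error rates back to normal

eu_outage = eu_recovered - errors_began
us_outage = us_recovered - errors_began

print(eu_outage)  # 6:30:00
print(us_outage)  # 8:18:00
```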

We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
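Action (d) can be sketched as follows. The letter does not say which checksum or message format Amazon chose for state messages, so the MD5 digest prefix, the JSON payload, and all function names here are assumptions for illustration.

```python
import hashlib
import json

def encode_state(state):
    """Serialize a state dict and prefix it with an MD5 hex digest of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b":" + payload

def decode_state(message):
    """Return the state dict, or None if the checksum does not match.

    A real receiver would also log the rejected message, as the letter describes.
    """
    digest, separator, payload = message.partition(b":")
    if not separator or hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        return None
    return json.loads(payload)

def flip_bit(message, bit_index):
    """Simulate single-bit corruption in transit."""
    data = bytearray(message)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)

message = encode_state({"s3": "down", "s4": "up"})
intact = decode_state(message)
# Corrupt one bit in the payload, near the end of the message.
corrupted = decode_state(flip_bit(message, (len(message) - 3) * 8))
```

Here `intact` is the original dict while `corrupted` is `None`: the receiver drops the message instead of merging incorrect state and gossiping it onward.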

Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.

Sincerely,

The Amazon S3 Team

About Cloud News Desk
Cloud Computing News Desk brings you the latest industry news on the Cloud paradigm of massively scalable IT resources and capabilities delivered as a service using Internet technologies. For up-to-date news on the International Cloud Computing Conference & Expo series, follow it on Twitter.
