Amazon explains big AWS outage (http://www.geekwire.com)
Amazon explains big AWS outage, says employee error took servers offline, promises changes.
Amazon has released an explanation of the events that caused Tuesday's big outage of its Simple Storage Service, also known as S3, which crippled significant portions of the web for several hours.
Amazon said the S3 team was working on an issue that was slowing down its billing system. Here’s what happened, according to Amazon, at 9:37 a.m. Pacific, starting the outage: “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Removing those servers affected other S3 subsystems, one of which was responsible for all metadata and location information in the Northern Virginia data centers. Amazon had to restart these systems and complete safety checks, a process that took several hours. In the interim, S3 could not service requests, and other AWS services that relied on S3 for storage were also affected.
About three hours after the issues began, parts of S3 started to function again. By about 1:50 p.m. Pacific, all S3 systems were back to normal. Amazon said it had not had to fully restart these S3 systems for several years, and the service has grown extensively since then, which caused the restart to take longer than expected.
Amazon said it is making changes as a result of this event, promising to speed up recovery time of S3 systems. The company is also adding new safeguards to ensure that teams don't take too much server capacity offline when working on maintenance issues like the S3 billing system slowdown.
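Amazon hasn't published what those safeguards look like, but the idea is straightforward: refuse any removal request that would drop a subsystem below a minimum capacity floor. A minimal sketch, with entirely hypothetical names:

```python
# Hypothetical sketch of a capacity-removal safeguard: refuse any
# operation that would leave a subsystem with too few servers online.
class CapacityGuardError(Exception):
    pass

def remove_servers(active_servers, to_remove, min_capacity):
    """Return the remaining server set, or raise if the removal
    would leave fewer than min_capacity servers online."""
    remaining = active_servers - to_remove
    if len(remaining) < min_capacity:
        raise CapacityGuardError(
            f"refusing removal: only {len(remaining)} servers would remain, "
            f"minimum is {min_capacity}"
        )
    return remaining
```

With a check like this in the path of the playbook command, a mistyped input that names too many servers fails loudly instead of taking the subsystem down.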
Amazon is also making changes to its service health dashboard, which is designed to track AWS issues. The outage knocked out the dashboard for several hours, forcing AWS to distribute updates via its Twitter account and by hard-coding text at the top of the page. In its message, Amazon said it has changed the dashboard to run across multiple AWS regions.
Continue reading at http://www.geekwire.com
My Two Cents:
We were working with the ESRI ArcGIS web services API when it went down. I was not aware that ESRI leveraged the Amazon S3 cloud, and I was surprised. If you are going to run API services, make sure you have redundancy. The old saying “do not put all your eggs in one basket” is obviously alive and well at some tech corporations.
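The redundancy point can be made concrete with a simple client-side pattern (a hypothetical sketch, not tied to any real ESRI or AWS API): keep a list of redundant endpoints and fall back to the next one when a call fails.

```python
# Hypothetical sketch: try redundant service endpoints in order,
# falling back to the next one if a call fails.
def fetch_with_fallback(endpoints, fetch):
    """Call fetch(endpoint) on each endpoint in turn; return the
    first successful result, or raise if every endpoint fails."""
    last_error = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except Exception as err:
            last_error = err  # remember the failure, try the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

If the primary endpoint lives in one region (or on one provider) and the backup lives somewhere else, a single outage like this one degrades the service instead of taking it down entirely.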