According to reports, more than 121,000 companies were affected today when Amazon Web Services ran into a major problem within its infrastructure. Major sites and services including Airbnb, FreshBooks, Twilio, Zendesk, Pinterest, Lonely Planet, MailChimp, Citrix, and even Apple’s iCloud experienced issues, along with many more. Some servers hosted in Amazon’s cloud were completely offline, while others showed no visible impact.
Among the casualties was the Amazon Status Dashboard itself. Although AWS fixed this quickly, the first moments of the outage caused quite a bit of confusion because status information was not reaching customers.
Update March 3rd:
On Tuesday morning, members of the S3 team were debugging the billing system. As part of that work, the team needed to take a small number of servers offline. However, the command was entered incorrectly and took down supporting subsystems as well. Without those subsystems, services that depend on S3 couldn’t perform basic data retrieval and storage tasks. Amazon described the error in its summary of the incident:
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.
While S3 was down, a variety of other AWS services stopped functioning, including Amazon’s Elastic Compute Cloud (EC2), which is popular with internet companies that need to rapidly expand their computing capacity. The outage had a significant impact on the Internet on Tuesday, primarily in the eastern portion of the United States. Apple relies on AWS for some of its iCloud operations, so iCloud performance was slowed for some users as well. Amazon added:
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
As a result, Amazon said it is making changes to S3 to enable its systems to recover more quickly. It’s also declaring war on typos. In the future, the company said, engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
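The kind of safeguard described here can be pictured as a simple pre-check that runs before any servers are removed. The following is only an illustrative sketch, not Amazon’s actual tooling: the `Subsystem` class, the `min_servers` floor, the `remove_capacity` function, and the server counts are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem under its minimum."""


@dataclass
class Subsystem:
    # All fields are hypothetical; real S3 internals are not public.
    name: str
    active_servers: int
    min_servers: int  # assumed safety floor for this subsystem


def remove_capacity(subsystem: Subsystem, count: int) -> int:
    """Remove `count` servers, refusing if that would breach the floor.

    Returns the number of servers still active after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")

    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_servers:
        # Reject the whole request instead of partially applying it,
        # so a mistyped input cannot drain the subsystem.
        raise CapacityError(
            f"refusing to remove {count} servers from {subsystem.name}: "
            f"{remaining} would remain, minimum is {subsystem.min_servers}"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a check like this, a request to remove 10 servers from a 100-server subsystem whose floor is 80 would go through, while a mistyped request for 50 would be rejected and leave the subsystem untouched.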
It’s also making a change to the AWS Service Health Dashboard. During the outage, the dashboard embarrassingly showed all services running green, because the dashboard itself was dependent on S3. The next time S3 goes down, the dashboard should function properly, the company said.