After hours of disruptions for services such as project management tool Trello, news website Business Insider and image hoster Giphy, it turns out Amazon Web Services’ (AWS) outage on Tuesday was caused by the simplest of errors: a typo.
S3, Amazon’s popular web hosting and storage platform, crashed on Feb. 28 due to what the company called “high error rates,” but according to new information, an Amazon employee accidentally entered an incorrect input to a command and took more servers offline than intended.
“An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” explains a Mar. 1 post on the AWS website. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
The removed servers supported two other S3 subsystems: the index subsystem, which manages the metadata and location information of all S3 objects in the region and is necessary for processing GET, LIST, PUT, and DELETE requests, and the placement subsystem, which relies on the index subsystem to allocate new storage.
Essentially, both subsystems had to be restarted, and because they took longer than expected to get back online, S3 in the eastern region could not serve requests in the meantime.
Amazon says that it will make several changes to its operations as a result of this incident, including limiting how much capacity the tool that took down the servers can remove and improving the recovery time of key S3 subsystems.
“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level,” the company says. “This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.”
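To make the safeguard Amazon describes more concrete, here is a minimal Python sketch of the general idea: reject any removal that would leave a subsystem below a minimum capacity, and remove the rest in small batches. This is not Amazon's tool. The names `Subsystem`, `remove_capacity`, `CapacityGuardError`, `min_required` and `max_per_step` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would breach the minimum capacity."""


@dataclass
class Subsystem:
    # All fields are illustrative assumptions, not Amazon's data model.
    name: str
    active_servers: int
    min_required: int  # hypothetical floor below which removal is refused


def remove_capacity(subsystem, requested, max_per_step=2):
    """Remove `requested` servers in batches of at most `max_per_step`.

    The whole request is rejected up front if it would take the
    subsystem below its minimum, so a mistyped number cannot drain it.
    Returns the list of batch sizes that were removed.
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name}: removing {requested} would leave {remaining}, "
            f"below the minimum of {subsystem.min_required}"
        )

    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        subsystem.active_servers -= step  # a real tool would pause and check health here
        left -= step
        batches.append(step)
    return batches
```

In this sketch, a request to remove 2 servers from a 10-server subsystem with a floor of 8 goes through in small steps, while a request that would drop it below 8 is refused before any server is touched.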
Amazon apologizes for the inconvenience the outage caused for its customers, adding “we will do everything we can to learn from this event and use it to improve our availability even further.”