Recently, both Amazon Web Services and Microsoft Azure suffered severe outages that affected customers and clients across the world, caused by human error and a brownout respectively. System downtime is one of the biggest issues a company can face, and its very real costs range from lost income through to the impact of a negative experience for customers. Financial loss and loss of reputation are both difficult to recover from, and in the case of Amazon and Azure, the impact was significant and required a public statement and damage control to mitigate the effects.
What happened with Amazon?
According to a company update posted to explain the outage, the problem was caused by a simple debugging procedure that inadvertently took down several servers. With those servers down, other subsystems were taken down too, and this cascading effect disrupted many cloud-based businesses. Xero, Expensify, Slack, Trello and Medium all experienced some disruption due to the partial failure at Amazon’s data centres. The issue affected businesses that used Simple Storage Service (S3) and was a critical concern for those affected. In practice, websites could still function, but they couldn’t access their backend storage and had trouble displaying stored images or sharing files.
What was the fallout?
Amazon was quick to say that it would learn from this and make changes to ensure that human error couldn’t have such a significant impact in future.
Amazon’s Outage Down to Human Error
