Have you heard of Netflix and Chaos Monkey?
Netflix is one of the most popular paid video streaming services in the world. Netflix launched its streaming service in early 2007 as a complimentary add-on for its existing DVD-by-mail subscribers. The streaming library contained only about 1,000 titles at launch, but popularity and demand kept growing, so Netflix continually added to the library and reached over 12,000 titles by June 2009. As its catalog and audience grew, Netflix found that it needed to deliberately create failures in its systems in order to improve its service and customer experience.
The Netflix streaming service was originally built by Netflix engineers on top of Microsoft software and housed in vertically scaled server racks. This single point of failure came back to bite them in August 2008, when a major database corruption caused three days of downtime during which DVDs could not be shipped to customers. After this event, Netflix engineers began migrating the entire Netflix stack from a monolithic architecture to a distributed cloud architecture deployed on Amazon Web Services.
Netflix designed Chaos Monkey to test system stability by forcing the pseudo-random termination of Netflix instances and services. After migrating to the cloud, Netflix was newly dependent on Amazon Web Services and needed a technology that could show how its system reacted when critical components of its production infrastructure were taken down. The intention was to move from a development model that assumed no failures to one where failures were considered inevitable, and to encourage developers to treat built-in resilience as an obligation rather than an option.
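To make the idea of pseudo-random termination concrete, here is a minimal sketch in Python. It is a toy simulation, not Netflix's implementation: the `Instance` class, the `terminate_random_instance` and `service_is_available` helpers, and the three-instance fleet are all hypothetical names invented for illustration, and nothing here talks to a real cloud provider.

```python
import random


class Instance:
    """Hypothetical stand-in for a running service instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def terminate_random_instance(instances, rng):
    """Pick one live instance pseudo-randomly and mark it as terminated.

    Returns the terminated instance, or None if nothing was running.
    """
    live = [inst for inst in instances if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def service_is_available(instances):
    """Assumption: the service is up as long as one instance survives."""
    return any(inst.alive for inst in instances)


# A seeded generator keeps the "chaos" reproducible between runs.
rng = random.Random(42)
fleet = [Instance(f"web-{n}") for n in range(3)]

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim.name}; "
      f"service available: {service_is_available(fleet)}")
```

With three redundant instances, losing one should leave the service available. Running the same sketch against a fleet of one would show the single point of failure that the article describes.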
This major shift towards a distributed architecture of hundreds of microservices brought a great deal of additional complexity. That level of complexity and interconnectedness made the system's behavior intractable to reason about, and it required a new approach to prevent seemingly random outages. One of the most important lessons was that the best way to avoid failure is to fail constantly. The engineering team needed a tool that could actively inject failures into the system. This would show the team how the system behaved under abnormal conditions and teach them how to change it so that other services could easily tolerate future unplanned failures. The Netflix team had begun its journey into Chaos.
Chaos Monkey helped kick-start Chaos Engineering as a new engineering practice. Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds to failure conditions, you can find and fix weaknesses before they turn into public-facing outages. Chaos Engineering lets you compare what you think will happen with what actually happens in your systems. By running the smallest experiments you can measure, you deliberately break things in order to learn how to build more robust systems.
In 2011, Netflix announced the evolution of Chaos Monkey into a set of additional tools known as The Simian Army. Inspired by the success of the original Chaos Monkey tool, which randomly disables production instances and services, the engineering team developed additional "simians" designed to cause other types of failure and to induce abnormal system conditions. For example, Latency Monkey introduces artificial delays in RESTful client-server communication, allowing the Netflix team to simulate the unavailability of a service without actually taking that service down.
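The latency-injection idea can be sketched as a thin wrapper around a service call. This is only an illustration of the technique under stated assumptions, not how Latency Monkey is actually built: `with_injected_latency`, `get_title`, and the `max_delay_s` and `probability` parameters are hypothetical names, and the "service" is a local function rather than a real REST endpoint.

```python
import random
import time


def with_injected_latency(func, max_delay_s, probability, rng,
                          sleep=time.sleep):
    """Wrap func so that a fraction of calls is delayed before running.

    probability is the chance (0.0 to 1.0) that a given call is delayed;
    the delay is drawn uniformly between 0 and max_delay_s seconds.
    The sleep parameter can be replaced in tests to avoid real waiting.
    """
    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            sleep(rng.uniform(0, max_delay_s))
        return func(*args, **kwargs)
    return wrapper


def get_title(title_id):
    """Hypothetical service call; a real one would go over HTTP."""
    return {"id": title_id, "status": "ok"}


# Delay every call by up to 50 ms and measure what the caller sees.
slow_get_title = with_injected_latency(
    get_title, max_delay_s=0.05, probability=1.0, rng=random.Random(0)
)
start = time.monotonic()
response = slow_get_title(7)
elapsed = time.monotonic() - start
print(response, f"took {elapsed:.3f}s")
```

The service still answers correctly, only more slowly, which lets callers exercise their timeout and fallback handling without the dependency ever being removed.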
In conclusion, you may wonder how the name Chaos Monkey originated.
This is explained in the book Chaos Monkeys by Antonio Garcia Martinez, who asks the reader to imagine monkeys let loose in the data centers (server farms) that host all the critical functions of our online activities. The monkeys rip out cables at random, smash devices, and upend everything they can get their hands on. The challenge for IT managers is to design an information system that keeps working despite these monkeys, since no one ever knows when they will arrive or what they will destroy.