# Netflix/chaosmonkey

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/netflix-chaosmonkey).**

16,597 stars · 1,271 forks · Go · apache-2.0

## Links

- GitHub: https://github.com/Netflix/chaosmonkey
- awesome-repositories: https://awesome-repositories.com/repository/netflix-chaosmonkey.md

## Description

Chaos Monkey is a chaos engineering tool that verifies the resilience of distributed systems by intentionally terminating production instances. It acts as a fault injection service that exposes weaknesses in cloud-based architectures by simulating real-world hardware and software outages.

The platform runs through a centralized orchestration engine that executes periodic disruption cycles driven by predefined configuration rules. In each cycle, a rule-based selection process evaluates instance metadata against safety constraints so that only eligible targets are disrupted. A persistent data store records execution history so that the same services are not disrupted too often.

The system integrates with cloud environments through a plugin-based abstraction layer that translates generic termination commands into provider-specific API calls. It monitors infrastructure lifecycle events to ensure that disruption actions remain aligned with current service health and deployment status, supporting automated site reliability engineering workflows.

## Tags

### DevOps & Infrastructure

- [Fault Injection Testing](https://awesome-repositories.com/f/devops-infrastructure/fault-tolerance/kernel-fault-injection/fault-injection-testing.md) — Tests the reliability of microservices by intentionally terminating instances to verify that the overall architecture remains operational.
- [Failure Simulation Tools](https://awesome-repositories.com/f/devops-infrastructure/resilient-infrastructure/failure-simulation-tools.md) — Simulates random failures in production environments to ensure that distributed systems can automatically recover from unexpected outages.
- [Resilient Infrastructure](https://awesome-repositories.com/f/devops-infrastructure/resilient-infrastructure.md) — Identifies weaknesses in distributed systems by simulating real-world outages and hardware disruptions in production environments.
- [Automated Service Reliability](https://awesome-repositories.com/f/devops-infrastructure/devops/operational-reliability/automated-service-reliability.md) — Implements automated chaos experiments to validate system stability and improve incident response readiness.
- [Cloud Infrastructure Management](https://awesome-repositories.com/f/devops-infrastructure/cloud-infrastructure-management.md) — Monitors infrastructure state changes to ensure that automated disruption actions align with current service health and deployment status.
- [Infrastructure Abstraction Layers](https://awesome-repositories.com/f/devops-infrastructure/infrastructure/infrastructure-as-code/iac-providers-and-cloud/cloud-provider-integrations/infrastructure-abstraction-layers.md) — Translates generic termination commands into provider-specific API calls through a modular interface layer.
- [Target Selection Rules](https://awesome-repositories.com/f/devops-infrastructure/label-based-selection/target-selection-rules.md) — Evaluates instance metadata against safety constraints to identify eligible infrastructure components for disruption.
- [Orchestration Engines](https://awesome-repositories.com/f/devops-infrastructure/distributed-task-orchestrators/orchestration-engines.md) — Coordinates periodic execution cycles to trigger failure events based on predefined schedules and configuration rules.
- [Task Schedulers](https://awesome-repositories.com/f/devops-infrastructure/task-schedulers.md) — Triggers periodic execution cycles to select and terminate infrastructure targets based on predefined configuration rules.

### System Administration & Monitoring

- [Instance Termination Tools](https://awesome-repositories.com/f/system-administration-monitoring/instance-administration-tools/instance-operational-metrics/instance-termination-tools.md) — Stops production infrastructure components randomly to verify that services remain operational during unexpected failures. ([source](https://netflix.github.io/chaosmonkey))

### Software Engineering & Architecture

- [System Reliability](https://awesome-repositories.com/f/software-engineering-architecture/performance-reliability/system-reliability.md) — Identifies weaknesses in cloud-based architectures by simulating real-world outages and hardware disruptions in production.
- [Distributed Coordination Systems](https://awesome-repositories.com/f/software-engineering-architecture/distributed-coordination-systems.md) — Tracks execution history and active disruption windows to prevent overlapping or excessive infrastructure instability.
- [Plugin-Based Architectures](https://awesome-repositories.com/f/software-engineering-architecture/software-architecture/architectural-patterns/plugin-module-systems/modular-plugin-architectures/plugin-based-architectures.md) — Translates generic termination commands into provider-specific API calls through a modular interface layer.
