Curated guides, roadmaps, and technical documentation for mastering site reliability engineering and infrastructure management practices.
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the study of system design, resource estimation, and the elimination of single points of failure. The material extends into broad operational capabilities, including container orchestration, continuous integration and delivery pipelines, layered observability, and network routing. It also provides detailed instruction on Linux system administration, database management, security auditing, and the implementation of service level indicators and objectives.
This repository provides a comprehensive, structured curriculum and technical documentation covering the core pillars of SRE, including observability, service level objectives, and infrastructure operations.
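The service level indicators and objectives mentioned in this entry can be illustrated with a small calculation. This is only a sketch, assuming a request-based availability SLI; the names `SloReport` and `evaluate_slo` and the sample numbers are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # observed fraction of good events
    error_budget: float     # fraction of events allowed to fail (1 - target)
    budget_consumed: float  # share of the error budget already used


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare an availability SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")

    sli = good_events / total_events
    error_budget = 1 - slo_target
    allowed_bad = error_budget * total_events
    actual_bad = total_events - good_events
    return SloReport(sli=sli, error_budget=error_budget,
                     budget_consumed=actual_bad / allowed_bad)


if __name__ == "__main__":
    # Illustrative numbers only: 500 failures out of one million requests
    # against a 99.9% target.
    print(evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999))
```

A `budget_consumed` value above 1.0 would indicate that the objective has been missed for the measurement window.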
OpenObserve is a unified observability data platform designed to ingest, store, and analyze logs, metrics, and traces. It functions as a cloud-native monitoring tool that centralizes telemetry from diverse sources, including standard collectors and cloud service providers, into a single, scalable system. By utilizing a columnar storage engine backed by object storage, the platform enables efficient long-term data retention and high-performance analytical querying. The platform distinguishes itself through deep integration with artificial intelligence, allowing users to query data using natural language, generate dashboards via prompts, and automate incident analysis. It provides specialized monitoring for language model pipelines, including token usage cost analysis and performance tracking for AI agents. Furthermore, the system enforces strict multi-tenant resource isolation and zero-trust access, ensuring that organizational data remains secure and independent within shared infrastructure. Beyond its core storage and AI capabilities, the platform includes a comprehensive suite of tools for incident management, infrastructure monitoring, and data pipeline orchestration. It supports real-time stream processing, schema-agnostic indexing, and automated data enrichment, allowing for flexible telemetry management without rigid pre-defined structures. The system also provides advanced diagnostic features such as production error deobfuscation, service dependency mapping, and user journey analysis to accelerate root cause investigation. The software is designed for flexible deployment, running as a stateless, containerized service that supports high availability and horizontal scaling. It is distributed as a single binary or container image, with configuration managed through infrastructure-as-code templates.
This is a comprehensive observability and monitoring platform that directly addresses key SRE requirements like incident management, telemetry analysis, and infrastructure monitoring within a single, cloud-native tool.
This project is a centralized library of community-contributed, declarative configuration files designed for automating the deployment of cloud infrastructure and services. It serves as a repository of machine-readable templates that define the desired state of cloud environments, enabling consistent and repeatable resource provisioning. The collection provides pre-configured scripts that streamline the setup of virtual machines, databases, and networking components. By utilizing these templates, users can standardize the deployment of cloud services and automate the creation of development, testing, and production environments. These templates leverage infrastructure-as-code practices to define resource topologies, ensuring that cloud environments are configured through structured schemas. The repository supports the automation of complex cloud environments by providing verified configurations that reduce manual setup time and configuration errors.
This repository provides infrastructure-as-code templates for cloud provisioning, which is a useful building block for SRE automation but does not offer the comprehensive learning paths or documentation required to master SRE practices.
Cachet is a self-hosted, open-source status page system designed to communicate service uptime, incident history, and infrastructure performance to end users. It provides a centralized dashboard for managing the operational lifecycle of system components, tracking service disruptions, and scheduling maintenance windows. The platform distinguishes itself through a comprehensive RESTful API that enables programmatic status page management and automated incident reporting. It supports deep integration with external monitoring tools, allowing for the synchronization of performance metrics and the automated triggering of status updates. Administrators can standardize communication using reusable incident templates and maintain system integrity through event-driven webhook notifications that include payload signing for authenticity. Beyond core reporting, the system offers extensive customization options for the public-facing interface, including branding, layout adjustments, and custom asset injection. It manages administrative access through team-based permissions and protects service availability using request throttling and token-based authentication. The platform also includes built-in telemetry for usage reporting and tools for visualizing quantitative performance data over time. The software is built using a model-view-controller pattern and relies on a relational database for state persistence. It is distributed as a web-based application that can be installed and configured to match specific organizational branding requirements.
This is a status page system for communicating service uptime and incident history to users, which serves as a specific operational tool rather than a comprehensive learning resource or curriculum for mastering SRE practices.
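The payload signing mentioned for webhook notifications can be sketched as follows. The description does not say which scheme Cachet uses, so this assumes a common convention: an HMAC-SHA256 over the raw request body with a shared secret, sent as a hex string. The function names, the secret, and the payload are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw payload (assumed scheme)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, received_signature)


if __name__ == "__main__":
    secret = b"example-shared-secret"          # hypothetical secret
    body = b'{"event": "incident.created"}'    # hypothetical payload
    signature = sign_payload(secret, body)
    print(verify_signature(secret, body, signature))
```

Comparing with `hmac.compare_digest` rather than `==` avoids leaking timing information about how many leading characters matched.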
Terraform is a declarative infrastructure-as-code tool designed to manage the lifecycle of cloud and on-premises resources. It functions as a workflow engine that reconciles a defined desired state against real-world infrastructure, using a persistent state-tracking layer to maintain consistency and visibility across distributed environments. By mapping infrastructure components into a directed acyclic graph, the system calculates the optimal order for provisioning, updating, or destroying resources. The platform is distinguished by its extensible plugin-based architecture, which decouples core orchestration logic from vendor-specific service APIs. This allows users to manage diverse infrastructure across multiple providers through a unified workflow. The system enforces predictability by separating operations into a three-stage lifecycle—planning, applying, and state-updating—and supports policy-as-code evaluation to validate changes against security and compliance rules before any modifications are executed. Beyond core orchestration, the tool provides robust support for collaborative management, including workspace isolation for environment separation and module sharing for distributing standardized infrastructure patterns. It integrates into broader development ecosystems through support for programmatic definition in various languages, external system hooks, and comprehensive tooling for configuration debugging and editor assistance.
This is a specialized infrastructure-as-code tool used to implement SRE practices, but it is a specific technical utility rather than a comprehensive learning resource or curriculum for mastering the SRE discipline.
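The idea of reconciling a desired state against real infrastructure during a planning stage can be illustrated with a small diff. This is a simplified sketch, not Terraform's actual algorithm or API: resources are modelled as plain dictionaries keyed by a hypothetical resource name, and `plan` is an illustrative helper.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compute which resources to create, update, or delete (illustrative only)."""
    desired_names = desired.keys()
    actual_names = actual.keys()
    return {
        # Declared but not yet present in the tracked state.
        "create": sorted(desired_names - actual_names),
        # Present on both sides but with differing attributes.
        "update": sorted(
            name for name in desired_names & actual_names
            if desired[name] != actual[name]
        ),
        # Present in the tracked state but no longer declared.
        "delete": sorted(actual_names - desired_names),
    }


if __name__ == "__main__":
    desired = {"vm-a": {"size": "large"}, "db-a": {"engine": "postgres"}}
    actual = {"vm-a": {"size": "small"}, "cache-a": {"nodes": 1}}
    print(plan(desired, actual))
```

Keeping the plan as pure data, separate from any code that applies it, mirrors the separation of planning from applying that the description highlights.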
Dispatch is an incident response orchestration platform that automates the coordination of detection, participant assembly, and task tracking across existing communication and project management tools. It provides a web-configurable state machine to manage incident lifecycle transitions, with template-driven incident models that define types, priorities, and severity levels. The platform enforces role-based access control to map user roles to specific actions and data access, while maintaining a database-backed audit trail of all incident events and system changes for compliance and post-incident review. The platform distinguishes itself through an event-driven workflow engine that emits and consumes events to trigger automated resource creation, notifications, and task tracking across integrated tools. Its plugin-based integration architecture connects to external platforms via standardized adapters, while an API-first extensibility layer allows customization of workflows and integration with tools beyond the plugin system. A web administration interface enables configuration of incident types, notification rules, and escalation policies without manual scripting, and supports assigning incident commanders with decision authority and delegation capabilities. The system covers the full incident lifecycle, including automated timeline tracking so responders can focus on resolution without manual logging, task management to ensure follow-through on required actions, and post-incident review management that collects and organizes incident data for analysis and improvement. Participant roles can be customized through the web interface to control access and responsibilities during active incidents.
This is a specialized incident response orchestration platform rather than a comprehensive SRE learning resource, though it serves as a practical tool for implementing the incident management component of SRE practices.
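The incident lifecycle state machine described in this entry can be sketched as a table of allowed transitions. The state names and the `transition` and `replay` helpers are hypothetical and chosen for illustration; the description does not list Dispatch's actual states.

```python
# Hypothetical lifecycle states; a real deployment would load these from configuration.
ALLOWED_TRANSITIONS: dict[str, frozenset[str]] = {
    "reported": frozenset({"active", "closed"}),
    "active": frozenset({"stable", "closed"}),
    "stable": frozenset({"active", "closed"}),
    "closed": frozenset(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not permitted."""


def transition(current: str, target: str) -> str:
    """Return the new state if the move is allowed, otherwise raise."""
    allowed = ALLOWED_TRANSITIONS.get(current)
    if allowed is None:
        raise InvalidTransition(f"unknown state: {current!r}")
    if target not in allowed:
        raise InvalidTransition(f"cannot move from {current!r} to {target!r}")
    return target


def replay(start: str, targets: list[str]) -> list[str]:
    """Apply a sequence of transitions and return the visited states."""
    history = [start]
    for target in targets:
        history.append(transition(history[-1], target))
    return history


if __name__ == "__main__":
    print(replay("reported", ["active", "stable", "closed"]))
```

Because the transitions live in a data table rather than in branching code, they could be edited through an administration interface without changing the logic.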
OneUptime is an open-source observability platform designed for monitoring service availability, infrastructure health, and application performance. It functions as a comprehensive system for tracking uptime and managing the end-to-end lifecycle of production incidents. The platform distinguishes itself through automated root cause analysis agents that identify failure triggers and generate code fixes via pull requests. It also provides branded public status pages to communicate real-time service availability and historical uptime data to end users. The system covers a broad range of operational capabilities, including global multi-location probing, centralized log aggregation, and infrastructure monitoring for servers and containers. It integrates incident coordination tools such as on-call rotation scheduling and escalation-based notification routing, alongside software error tracking and event-driven workflow automation.
This is a comprehensive observability and incident management platform, but it is a functional tool for performing SRE tasks rather than a curated learning path or educational resource for mastering SRE practices.
OpenTofu is a declarative infrastructure orchestrator that automates the provisioning and management of cloud resources. It functions as a platform-agnostic interface, allowing users to define their desired environment state in configuration files, which the system then reconciles against live infrastructure to calculate and execute necessary updates. The project utilizes a graph-based execution engine to determine the optimal sequence for resource operations, enabling the parallel processing of independent components to reduce deployment times. To support complex, multi-platform environments, it employs a provider-based plugin architecture that translates generic configuration definitions into specific API calls for various cloud services and third-party providers. Beyond core provisioning, the system facilitates infrastructure lifecycle management through reusable configuration modules that standardize deployments and enforce consistent patterns. It also provides a synchronization layer for state metadata, enabling distributed teams to coordinate changes and maintain consistent environment status across collaborative workflows.
This is an infrastructure-as-code tool used to automate cloud provisioning, which serves as a foundational building block for SRE workflows rather than a comprehensive learning resource or documentation hub for SRE practices.
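The graph-based ordering with parallel processing of independent components can be sketched with Python's standard `graphlib` module. This is an illustration of the general technique, not OpenTofu's engine; the resource names are hypothetical, and each resource maps to the set of resources it depends on.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group resources into batches whose members do not depend on each other.

    Each batch can only start once every earlier batch has finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    deps = {
        "network": set(),
        "database": {"network"},
        "vm": {"network"},
        "dns": {"vm", "database"},
    }
    for batch in parallel_batches(deps):
        print(batch)
```

In this example `database` and `vm` land in the same batch because neither depends on the other, which is where the reduction in deployment time would come from.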
SigNoz is a full-stack observability platform designed to collect, store, and visualize metrics, logs, and distributed traces in a unified environment. It leverages OpenTelemetry-based data collection to ingest telemetry from diverse sources using vendor-neutral protocols, ensuring interoperability across complex microservices architectures. The platform utilizes a high-performance columnar storage engine to enable rapid aggregation and filtering, providing a centralized backend for monitoring application health and performance. What distinguishes the platform is its focus on automated instrumentation and semantic correlation. It allows users to capture telemetry data across various programming languages and frameworks without manual code changes, often requiring only simple environment variable updates. Once ingested, the system automatically links logs, metrics, and traces through shared identifiers, enabling seamless navigation between different telemetry types during root cause analysis. The frontend further supports this by using virtualized rendering to efficiently display complex distributed traces containing millions of spans. The platform provides a comprehensive suite of tools for infrastructure monitoring, application performance tracking, and log management. Users can define complex alert conditions and manage monitoring configurations as version-controlled resources, ensuring consistency across deployment environments. Additionally, the system includes specialized support for monitoring large language model applications and provides visual query pipelines that translate user-defined filters into optimized database queries for real-time dashboard generation. The entire observability stack can be deployed using container orchestration tools, with built-in utilities for verifying service status and managing data retention.
This is a comprehensive observability and monitoring platform that provides the tooling necessary for SRE practices, but it is a functional software tool rather than a curated learning path or documentation repository.
This project is a comprehensive software observability suite and application performance monitoring platform designed to track runtime errors, performance bottlenecks, and system health. It functions as a centralized diagnostic service that aggregates and categorizes exceptions, providing the infrastructure necessary to visualize complex execution paths across distributed systems and microservices. The platform distinguishes itself through a high-throughput distributed event ingestion pipeline and a columnar storage analytics engine that enables rapid aggregation of large-scale performance metrics. It utilizes runtime-level instrumentation hooks to capture execution data directly from the host environment and employs symbolication-based stack trace resolution to map minified code or raw memory addresses back to original source files. Furthermore, the system includes specialized capabilities for monitoring the operational performance of AI agents and ensuring sensitive data compliance through schema-driven scrubbing of incoming event payloads. Beyond core error tracking and tracing, the platform supports a wide range of programming languages and frameworks, allowing for consistent visibility across diverse software architectures. It integrates with external services to automate incident response workflows and provides a command-line interface for managing releases, debug symbols, and project configurations. The system also features a modular, plugin-based architecture that facilitates connectivity with third-party tools for issue tracking and alerting.
This is a powerful observability and error-tracking platform that provides the monitoring tools necessary for SRE workflows, but it is a specific software product rather than a curated learning path or comprehensive SRE educational resource.
Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project distinguishes itself through a collaborative ecosystem that enables users to contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical proficiency, helping developers identify knowledge gaps and prepare for professional interviews through targeted learning sequences. Beyond its core mapping capabilities, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a centralized space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
This repository provides a structured, visual learning path for DevOps and SRE roles, offering a comprehensive roadmap that covers essential domains like observability, infrastructure as code, and automation.
Kubernetes The Hard Way is an educational curriculum designed to teach the fundamental architecture and operational requirements of container orchestration platforms. It provides a structured, hands-on learning path that guides users through the manual bootstrapping of a multi-node cluster from scratch, intentionally avoiding automated installers to ensure a deep understanding of how individual control plane and worker node components interact. The project distinguishes itself by requiring the manual configuration of every layer of the infrastructure, including the generation of cryptographic identities for mutual authentication and the establishment of encrypted communication channels between distributed components. Participants gain practical experience in managing distributed key-value consensus, configuring network-overlay routing for pod communication, and handling the lifecycle of system services through manual configuration files. This guide covers the entire provisioning process, from setting up compute resources to implementing security protocols and managing binary-based service deployments. By building the system piece by piece, users develop the operational knowledge necessary to troubleshoot complex failures in production environments. The tutorial requires four virtual or physical machines and provides a comprehensive walkthrough of the steps needed to establish a functional cluster environment.
This repository provides a rigorous, hands-on curriculum for mastering the operational fundamentals of distributed systems and infrastructure provisioning, which serves as a foundational learning resource for SRE practitioners.
DevOps-Roadmap is a comprehensive educational repository and knowledge base designed to guide technical professionals through the complexities of modern software engineering. It functions as a structured curriculum and reference library, covering the full spectrum of skills required to master system architecture, infrastructure management, and cloud operations. The project distinguishes itself by bridging the gap between high-level architectural design and the practical realities of engineering leadership. It provides curated insights into distributed systems, data consistency, and scalable design patterns, while simultaneously offering frameworks for managing high-performing teams, navigating corporate dynamics, and fostering psychological safety within technical organizations. Beyond core architecture, the repository encompasses a broad capability surface that includes professional development, productivity optimization, and the integration of emerging technologies. It offers guidance on implementing AI-driven workflows, managing large-scale machine learning lifecycles, and applying evidence-based metrics to track team performance and development health. The repository serves as a centralized resource for engineers at all career stages, providing access to industry-standard principles, technical interview preparation materials, and strategic coaching frameworks.
This repository provides a structured, comprehensive curriculum and knowledge base that covers domains relevant to SRE, such as system architecture, infrastructure management, and cloud operations, serving as a central learning path for mastering these practices.
This project is a comprehensive educational curriculum designed to build proficiency across modern infrastructure, cloud-native technologies, and systems administration. It functions as a reference library and interview preparation resource, offering a structured collection of conceptual questions, practical coding challenges, and hands-on scenarios that cover the full spectrum of software delivery and operational workflows. The repository distinguishes itself through a modular, domain-specific structure that links instructional problem statements with verified implementation examples. By employing a standardized documentation schema, it provides a predictable learning path for mastering complex technical concepts, ranging from infrastructure-as-code patterns and container orchestration to cloud platform administration and security best practices. The content spans a wide array of technical domains, including automated configuration management, distributed system monitoring, database operations, and version control. It provides deep dives into specific tooling for cloud provisioning, container networking, and service deployment, ensuring that learners can validate their technical skills through isolated, practical exercises. All instructional materials are organized into a unified taxonomy of markdown-based documents, allowing users to navigate and study specific technical topics at their own pace.
This repository provides a structured, exercise-based curriculum that covers essential SRE domains like infrastructure as code, observability, and automation, making it a highly relevant resource for mastering operational practices.
This project is an interactive programming curriculum and educational system designed to teach computer science and software engineering. It provides a structured set of courses and professional roadmaps focused on backend engineering, DevOps, and systems fundamentals. The platform is distinguished by an AI-powered coding tutor that provides Socratic guidance and contextual hints to help students find solutions independently. It features a browser-based code sandbox using WebAssembly to eliminate local environment setup, alongside automated test-based grading and spaced-repetition logic to reinforce difficult concepts. The curriculum covers a broad range of technical domains, including programming languages such as Go, Python, and TypeScript, as well as relational database design, container orchestration with Kubernetes, and cloud operations. It also includes professional development resources for technical interview preparation and portfolio construction. Learning engagement is managed through gamified incentives like experience points and leaderboards, while progress is tracked via sequenced learning paths and AI-generated coding challenges.
This is a comprehensive educational platform that offers structured learning paths for DevOps and systems fundamentals, providing a solid foundation for SRE practices even though it is not exclusively dedicated to the SRE domain.
This project provides educational materials and courseware focused on the theoretical and practical foundations of distributed systems design. It serves as a comprehensive curriculum covering the disciplines of consensus, data consistency, reliability engineering, and scalability. The instructional content focuses on achieving cluster agreement through consensus algorithms and managing system-wide state via coordination frameworks. It includes a dedicated guide to data theory, exploring replication strategies, consistency models, and data convergence. The courseware covers a broad capability surface including fault tolerance engineering, scalable data partitioning, and network behavior modeling. It also addresses operational strategies such as chaos engineering, traffic flow control through backpressure, and the implementation of gossip protocols for cluster communication.
This repository provides a rigorous, curriculum-based approach to the theoretical and operational foundations of distributed systems, covering essential SRE topics like fault tolerance, chaos engineering, and backpressure-based flow control.
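Traffic flow control through backpressure, one of the operational strategies listed in this entry, can be sketched with a bounded buffer that rejects work when full so the producer has to slow down or shed load. This is a minimal single-process illustration using the standard `queue` module; the `offer` and `submit_all` helpers are hypothetical and not taken from the courseware.

```python
import queue


def offer(buffer: queue.Queue, item) -> bool:
    """Try to enqueue without blocking; False signals backpressure to the caller."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


def submit_all(buffer: queue.Queue, items) -> tuple[list, list]:
    """Submit items in order and report which were accepted and which were rejected."""
    accepted, rejected = [], []
    for item in items:
        (accepted if offer(buffer, item) else rejected).append(item)
    return accepted, rejected


if __name__ == "__main__":
    buffer = queue.Queue(maxsize=2)  # small bound chosen for illustration
    print(submit_all(buffer, ["req-1", "req-2", "req-3"]))
```

A producer that receives `False` could retry later, drop the request, or propagate the signal upstream; which policy is appropriate depends on the system.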
A checklist for anyone practicing Site Reliability Engineering
This repository provides a structured checklist of essential SRE practices, covering key areas like observability, incident management, and infrastructure as code to help practitioners track their progress.