Open-source tools for managing system alerts, on-call scheduling, and coordinating team responses to service incidents.
Grafana is an observability data platform designed to aggregate metrics, logs, and traces from diverse sources into a unified environment. It functions as a centralized interface for visualizing complex telemetry data, transforming raw streams into interactive dashboards that support real-time system health tracking and performance monitoring. The platform distinguishes itself through a plugin-based modular architecture that integrates disparate databases, cloud services, and monitoring tools via a standardized data abstraction layer. This framework allows for the dynamic loading of external
Traefik is a cloud-native edge router and API gateway designed to manage service communication and traffic flow across distributed infrastructure. It functions as a dynamic service proxy that automatically discovers backend services and configures routing rules in real time, eliminating the need for manual restarts or complex configuration updates. By integrating directly with container orchestrators and service registries, it maintains a consistent state for network traffic, load balancing, and security policy enforcement. The project distinguishes itself through its deep integration with di
Alertmanager is a monitoring notification gateway and routing service that deduplicates, groups, and directs alerts to the correct receivers. It functions as a central manager for Prometheus alerts, using a hierarchical routing tree and label-based matchers to dispatch notifications to external services. The system employs a peer-to-peer mesh network to coordinate multiple instances in a high availability cluster, ensuring continuous alert processing. It features a dedicated inhibition engine and grouping mechanisms to reduce notification noise by suppressing redundant alerts when related iss
SigNoz is a full-stack observability platform designed to collect, store, and visualize metrics, logs, and distributed traces in a unified environment. It leverages OpenTelemetry-based data collection to ingest telemetry from diverse sources using vendor-neutral protocols, ensuring interoperability across complex microservices architectures. The platform utilizes a high-performance columnar storage engine to enable rapid aggregation and filtering, providing a centralized backend for monitoring application health and performance. What distinguishes the platform is its focus on automated instru
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Prometheus is a comprehensive monitoring and alerting platform designed to track infrastructure health and application performance. It functions as a time series database that ingests, indexes, and queries high-frequency numerical data points. By utilizing a pull-based model, the system periodically collects multi-dimensional metrics from monitored targets, storing them in an optimized block storage format that supports high-throughput ingestion and efficient historical analysis. The platform distinguishes itself through a specialized query engine that enables real-time analysis of performanc
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
Uptime Kuma is a self-hosted monitoring platform designed to track the availability and performance of network services and websites. It functions as a centralized dashboard that executes asynchronous health checks on a scheduled interval, providing real-time visibility into infrastructure health and service uptime. The platform distinguishes itself through a dedicated notification engine that dispatches alerts across multiple third-party messaging services, alongside a public status page generator that allows users to communicate service health and historical metrics via custom domains. Its
CrowdSec is a collaborative, distributed security engine designed for threat detection and infrastructure protection. It functions as an intrusion detection system that parses logs and network traffic to identify malicious patterns, utilizing a bucket-based threshold detection model to aggregate events and trigger alerts. The platform is built on a modular architecture that includes a centralized local API server for managing security signals and a relational database for persistent storage of remediation decisions. What distinguishes the project is its decoupled enforcement model, which offl
Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across complex, multi-cloud environments. The platform distinguishes itself through edge-based intelligence, utilizing local machine learning models to automatically detect performance anomalies without requiring manual configuration or external query engines. Its architecture prioritizes
Prowler is an automated cloud infrastructure security scanner and posture management tool. It evaluates cloud environments and infrastructure-as-code templates against security benchmarks to identify misconfigurations, vulnerabilities, and compliance gaps that could compromise system integrity. The platform distinguishes itself through graph-based attack path analysis, which identifies chains of misconfigurations that create exploitable routes for unauthorized access. It utilizes a plugin-based execution model to perform state-based assessments of live environments and static analysis of conf
This project is a comprehensive software observability suite and application performance monitoring platform designed to track runtime errors, performance bottlenecks, and system health. It functions as a centralized diagnostic service that aggregates and categorizes exceptions, providing the infrastructure necessary to visualize complex execution paths across distributed systems and microservices. The platform distinguishes itself through a high-throughput distributed event ingestion pipeline and a columnar storage analytics engine that enables rapid aggregation of large-scale performance me
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
TheHive is a security incident response platform and multi-tenant case management system. It functions as a Security Orchestration, Automation, and Response (SOAR) tool and a threat intelligence platform designed to coordinate security investigations by managing alerts, cases, and observables. The platform is distinguished by its multi-tenant architecture, which isolates data across different organizations while supporting selective cross-tenant sharing. It features a SOAR automation engine capable of executing sandboxed JavaScript logic to automate workflows and trigger response actions thro
Glances is a cross-platform system monitoring tool designed to track real-time resource usage and hardware health metrics across diverse computing environments. It functions as a command-line utility that provides a unified view of system performance, identifying bottlenecks and maintaining infrastructure stability through a consistent abstraction layer that translates kernel calls into actionable data. The project distinguishes itself through its distributed capabilities, offering a web-based interface that enables remote access to live performance metrics from any device without requiring d
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Loki is a horizontally scalable, highly available log aggregation engine designed to store and query massive volumes of unstructured log data. It functions as a distributed observability platform that correlates logs, metrics, and traces to provide comprehensive visibility into the health and performance of complex infrastructure. The system distinguishes itself through a distributed query execution model that processes large datasets in parallel across cluster nodes. It utilizes label-based stream indexing and a distributed index to map log data to specific chunks, enabling rapid retrieval w
Kener is a self-hosted status page platform and uptime monitoring tool designed to track the health and availability of websites, APIs, and infrastructure components. It functions as an incident management system that automates the detection of service disruptions and provides a public-facing dashboard to communicate real-time system status and maintenance schedules to end users. The platform distinguishes itself through its multi-tenant architecture, which allows for the operation of multiple independent, branded status pages from a single installation. It supports deep customization of the
Watchtower is a container-based solution designed to automate the lifecycle management of Docker applications. It functions as a background service that monitors running containers, detects when new base image versions are available in registries, and automatically redeploys the containers to ensure they remain synchronized with the latest builds. The project distinguishes itself through its ability to orchestrate complex deployment workflows and maintain service availability during updates. It interacts directly with the container runtime to manage service dependencies and restart sequences,
PostHog is a comprehensive product analytics and feature management platform designed to capture, process, and visualize user behavior data. It provides a unified suite for tracking application events, managing feature rollouts, and monitoring system health through session recordings and error tracking. By leveraging a columnar-storage-optimized architecture, the platform enables high-performance aggregation and filtering across massive event datasets. What distinguishes PostHog is its integrated approach to data pipelines and application control. It features a robust event ingestion system t
Vercel is a cloud platform for building, deploying, and scaling web applications. It provides a unified infrastructure that automates the build process by detecting project frameworks and distributing static and dynamic content through a global content delivery network. The platform executes application logic using serverless functions that scale automatically based on real-time traffic demand. The platform distinguishes itself through a centralized AI gateway that proxies requests to multiple model providers, enabling standardized authentication, observability, and cost tracking. It supports
Changedetection.io is a self-hosted monitoring service designed to track web pages for content updates and notify users of changes. It functions as a centralized platform where users can manage tracking tasks, observe specific website elements, and receive automated alerts through various communication channels whenever modifications are detected. The service distinguishes itself through an integrated headless browser engine that executes interaction sequences, such as logins or form submissions, to access dynamic or restricted content. It maintains a historical record of page snapshots, util
Mizu is a suite of tools for capturing, indexing, and visualizing cloud-native network traffic and decrypted payloads for cluster-wide diagnostics. It provides Kubernetes network observability by using eBPF to index and visualize layer 4 and layer 7 traffic with full cluster context, allowing for the mapping of workload dependencies and the diagnosis of network failures. The project distinguishes itself by using kernel-level hooks to decrypt TLS traffic in plain text without requiring private keys. It further integrates a standardized context protocol to expose indexed network telemetry to AI
Kubernetes is a distributed container orchestration platform that automates the deployment, scaling, and management of containerized applications across clusters of computing nodes. It functions as a declarative infrastructure controller, utilizing a control loop architecture that continuously monitors the current system state against user-defined configurations to ensure desired operational outcomes. The system relies on a centralized API-driven interface and a replicated key-value store to maintain a consistent source of truth for all cluster objects. The platform distinguishes itself throu
This repository is a community-driven collection of security content for Azure Sentinel, the cloud-native security information and event management (SIEM) platform from Microsoft. It serves as a central hub for sharing detection rules, threat hunting queries, interactive workbooks, and automated response playbooks, all contributed and maintained through a standard GitHub-based pull request workflow. The content is designed to help security operations teams quickly deploy proven detections, proactively hunt for threats, visualize security data, and orchestrate incident response actions without
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis
InfluxDB is a specialized time series database platform engineered for the high-speed ingestion, compression, and retrieval of timestamped data at scale. It functions as a distributed metrics platform, providing the infrastructure necessary to organize and analyze massive volumes of time-stamped information to identify trends, patterns, and anomalies within complex data streams. The platform distinguishes itself through a functional dataflow engine that utilizes a specialized programming language for complex analytical transformations and automated tasks. This architecture is supported by a p
Zap is a high-performance structured logging library designed for production environments. It provides a framework for generating machine-readable logs that minimize memory overhead and CPU usage, allowing for efficient event analysis and system monitoring. The library distinguishes itself through a focus on zero-allocation logging, utilizing buffer pooling to reduce garbage collection pressure during high-frequency operations. It enforces strict data typing through compile-time checks and structured field encoding, which ensures consistent output without the performance cost of reflection-ba
Kubeshark is a network observability platform designed for Kubernetes environments, functioning as an eBPF-powered engine for cluster-wide traffic analysis. It captures, indexes, and visualizes network activity and API calls directly from the kernel, providing deep visibility into service-to-service communication without requiring sidecar proxies or manual code instrumentation. The platform distinguishes itself through its ability to perform protocol-aware traffic dissection and user-space cryptographic hooking, which allows for the inspection of encrypted traffic and the reconstruction of ap
ntfy is a self-hosted messaging infrastructure that provides a lightweight platform for sending and receiving real-time notifications. It functions as a topic-based pub-sub server, allowing users to publish and subscribe to message channels using standard HTTP requests. By bridging server-side events with native mobile and desktop clients, it enables the delivery of alerts across various environments through a unified communication layer. The project distinguishes itself by offering a complete, private notification ecosystem that includes persistent message caching and robust access control.
Cachet is a self-hosted, open-source status page system designed to communicate service uptime, incident history, and infrastructure performance to end users. It provides a centralized dashboard for managing the operational lifecycle of system components, tracking service disruptions, and scheduling maintenance windows. The platform distinguishes itself through a comprehensive RESTful API that enables programmatic status page management and automated incident reporting. It supports deep integration with external monitoring tools, allowing for the synchronization of performance metrics and the
Nginx is a high-performance HTTP server and reverse proxy designed to handle high-concurrency traffic through an efficient, event-driven architecture. It functions as a versatile traffic management gateway and content delivery accelerator, providing the infrastructure necessary to route client requests, balance loads across backend servers, and serve static assets with minimal resource consumption. The project distinguishes itself through a master-worker process model that separates configuration management from request processing, ensuring stable operations under heavy load. Its modular requ
Keep is an open-source AIOps alert management platform that aggregates, deduplicates, and orchestrates the lifecycle of alerts from multiple monitoring tools. It functions as a multi-provider integration hub to centralize the flow of data between observability, ticketing, and communication tools. The platform distinguishes itself through incident workflow automation and AI-powered enrichment. It uses a declarative workflow engine to execute multi-step operational sequences and integrates large language models to summarize event data and correlate technical logs for faster incident resolution.
Kong is a high-performance API gateway and service connectivity platform designed to manage, secure, and monitor traffic across distributed microservices and hybrid cloud environments. It functions as a centralized control plane for service governance, providing essential traffic routing, load balancing, and request transformation capabilities to ensure consistent policy enforcement across all service endpoints. The platform distinguishes itself through a modular plugin architecture and a declarative configuration engine that allows infrastructure behavior to be defined via version-controlled
Security-101 is a vendor-agnostic, foundational cybersecurity learning curriculum organized into modular, framework-aligned modules. It is designed to build core knowledge across multiple security domains without tying content to specific products or platforms, making it suitable for both beginners and professionals seeking a structured introduction to the field. The curriculum is built around established security frameworks, including the MITRE ATT&CK framework for standardized threat analysis and the NIST Cybersecurity Framework for incident response workflows. It covers a broad range of do
etcd is a distributed, strongly consistent key-value store designed to provide reliable storage for critical system metadata and coordination primitives. It functions as a distributed consensus engine, utilizing a replicated log and leader-based state machine to ensure that all nodes in a cluster maintain a synchronized view of data. By providing atomic operations and linearizable reads and writes, it serves as a foundational component for distributed systems requiring high availability and fault tolerance. The system distinguishes itself through its multi-version concurrency control, which e
Velociraptor is a digital forensics and incident response platform, endpoint detection and response system, and visibility tool. It provides a query engine and remote forensic collector used to hunt for indicators of compromise and perform triage across a fleet of hosts. The system is distinguished by its specialized query language for interrogating host state and parsing binary files. It features a notebook environment that combines markdown documentation with executable query cells to standardize investigative workflows and enable collaborative reporting. The platform covers a wide range o
NSQ is a distributed, brokerless messaging platform designed for high-throughput, fault-tolerant communication. By utilizing a decentralized topology, it eliminates single points of failure and allows for horizontal scaling across clusters. The system organizes message streams into topics and channels, effectively decoupling producers from consumers to support both streaming and job-oriented workloads. The platform distinguishes itself through a lookup-service-based discovery mechanism that enables clients to dynamically locate producers at runtime without requiring centralized coordination.
Dispatch is an incident response orchestration platform that automates the coordination of detection, participant assembly, and task tracking across existing communication and project management tools. It provides a web-configurable state machine to manage incident lifecycle transitions, with template-driven incident models that define types, priorities, and severity levels. The platform enforces role-based access control to map user roles to specific actions and data access, while maintaining a database-backed audit trail of all incident events and system changes for compliance and post-incid
The Serverless Framework is a declarative infrastructure-as-code tool designed to automate the deployment, scaling, and lifecycle management of cloud-native applications. It provides a unified command-line interface that translates high-level configuration files into provider-specific resource templates, enabling developers to orchestrate complex architectures, event-driven functions, and cloud resources within a single project structure. What distinguishes this framework is its focus on developer experience and multi-environment parity. It supports local function invocation and event proxyin
Caldera is an adversary emulation platform and command and control framework designed to simulate cyber attack patterns. It functions as an automated red team tool and threat framework orchestrator, executing attack sequences based on standardized cybersecurity threat frameworks to validate security defenses and detection capabilities. The platform distinguishes itself through the dynamic compilation of customized executable payloads and the use of framework-mapped adversary modeling to structure attack techniques. It manages asynchronous agents on targeted endpoints via a central server acce
Celery is an asynchronous job processor and distributed task queue designed to offload time-consuming operations to background worker nodes. By utilizing a message-passing architecture, it decouples task producers from consumers, allowing applications to maintain responsiveness while scaling workloads across multiple isolated environments. The system functions as a distributed workload orchestrator that manages the lifecycle of deferred operations through persistent queues. It distinguishes itself by providing a pluggable transport abstraction, which allows the core task logic to remain indep
Wazuh is an integrated security platform that combines endpoint detection and response, security information and event management, and cloud workload protection. It functions as a centralized system for collecting telemetry, aggregating logs, and correlating events across distributed infrastructure to maintain security and integrity. The platform distinguishes itself through its active response orchestration, which allows for the automated execution of scripts on remote endpoints to neutralize threats in real time. It provides deep visibility into system activity through file integrity monito
gRPC is a language-agnostic remote procedure call framework designed for high-performance communication between distributed services. It utilizes a structured interface definition language to generate consistent client stubs and server skeletons, enabling applications to invoke methods on remote servers as if they were local objects. By leveraging the HTTP/2 transport layer, the framework supports efficient binary serialization and multiplexed data exchange across diverse programming environments. The framework distinguishes itself through its support for flexible communication patterns, incl
CS-Base is a comprehensive educational platform and technical repository designed to support software engineers in mastering backend architecture, artificial intelligence engineering, and career development. It functions as a centralized knowledge hub that combines illustrated theoretical tutorials with practical, project-based learning to bridge the gap between foundational computer science concepts and professional industry requirements. The project distinguishes itself by integrating a robust career mentorship framework with advanced AI engineering resources. It provides users with tools f
Stats is a system performance monitor that tracks real-time hardware metrics and resource usage directly from the operating system menu bar. It functions as a hardware control interface, allowing users to adjust fan speeds and thermal settings to maintain optimal performance levels for computing hardware. The application distinguishes itself through kernel-level sensor polling, which retrieves telemetry by interfacing directly with low-level system drivers and power management APIs. It provides remote infrastructure oversight via a web-based telemetry dashboard, enabling users to view live pe
OpenFaaS is a serverless function platform that provides a container-native framework for deploying and managing event-driven code. It functions as an abstraction layer over container orchestrators, allowing developers to package code into scalable functions that run across Kubernetes clusters or edge computing environments. The platform distinguishes itself through a developer-centric runtime that utilizes standardized language templates and automated build pipelines to simplify the creation of container images. It features a central API gateway that manages request routing, authentication,
SmsForwarder is an Android application designed to capture incoming text messages and automatically transmit them to external services, messaging platforms, or email accounts. It functions as a bridge for mobile alerts, enabling the centralized monitoring of SMS traffic and system notifications across various digital channels. The application distinguishes itself through a modular forwarding architecture that supports diverse communication protocols via a plugin system. It utilizes a background service and system-level listeners to ensure that message interception and relay operations continu