Self-hosted tools for managing on-call schedules, incident response workflows, and automated system alert notifications.
Uptime Kuma is a self-hosted monitoring platform designed to track the availability and performance of network services and websites. It functions as a centralized dashboard that executes asynchronous health checks on a scheduled interval, providing real-time visibility into infrastructure health and service uptime. The platform distinguishes itself through a dedicated notification engine that dispatches alerts across multiple third-party messaging services, alongside a public status page generator that allows users to communicate service health and historical metrics via custom domains. Its architecture utilizes a reactive, single-page interface that maintains persistent bidirectional connections with the server to push live status updates without requiring manual page refreshes. The system is built for flexible deployment, supporting containerized environments, native package installations, and bare-metal execution. It manages monitoring configurations and historical data using a local, file-based relational database, while a decoupled abstraction layer ensures that alert delivery logic remains independent of the core monitoring engine.
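The scheduled, asynchronous health checks described for Uptime Kuma can be illustrated with a small sketch. This is not Uptime Kuma's implementation; `Monitor`, `Heartbeat`, `run_monitor`, and the two stand-in probes are hypothetical names chosen for the example, and the probes return canned results instead of touching the network.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List


@dataclass
class Heartbeat:
    monitor: str
    up: bool
    latency_ms: float


@dataclass
class Monitor:
    name: str
    probe: Callable[[], Awaitable[bool]]  # hypothetical async check
    interval_s: float
    history: List[Heartbeat] = field(default_factory=list)


async def run_monitor(monitor: Monitor, beats: int) -> None:
    """Run `beats` checks, sleeping `interval_s` between consecutive checks."""
    for i in range(beats):
        started = time.perf_counter()
        try:
            up = await monitor.probe()
        except Exception:
            up = False  # any probe failure is recorded as "down"
        latency_ms = (time.perf_counter() - started) * 1000
        monitor.history.append(Heartbeat(monitor.name, up, latency_ms))
        if i < beats - 1:
            await asyncio.sleep(monitor.interval_s)


async def check_all(monitors: List[Monitor], beats: int = 3) -> None:
    # Each monitor runs on its own schedule; gather lets them overlap.
    await asyncio.gather(*(run_monitor(m, beats) for m in monitors))


async def always_up() -> bool:
    return True


async def always_fails() -> bool:
    raise ConnectionError("connection refused")


monitors = [
    Monitor("web", always_up, interval_s=0.01),
    Monitor("db", always_fails, interval_s=0.01),
]
asyncio.run(check_all(monitors))
```

After the run, each monitor's `history` holds one heartbeat per check, which is the kind of record a dashboard or notification step could consume.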
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven tool access, which enforce security boundaries by restricting agent operations based on defined functional roles. It utilizes context-aware task routing to match incoming requests with specific agent capabilities and model performance profiles, while implementing deterministic fallback mechanisms to maintain operational continuity when agents encounter errors or context limits. This architecture allows for modular capability expansion and reproducible environment configurations through version-controlled templates. The system covers a broad capability surface, including automated technical documentation, cloud infrastructure management, and security auditing. It supports diverse domains such as API design, database optimization, and system reliability engineering, providing tools for incident response, performance monitoring, and compliance enforcement. These capabilities are integrated into a command-line interface that enables developers to search, fetch, and deploy specialized subagents directly from the repository.
SigNoz is a full-stack observability platform designed to collect, store, and visualize metrics, logs, and distributed traces in a unified environment. It leverages OpenTelemetry-based data collection to ingest telemetry from diverse sources using vendor-neutral protocols, ensuring interoperability across complex microservices architectures. The platform utilizes a high-performance columnar storage engine to enable rapid aggregation and filtering, providing a centralized backend for monitoring application health and performance. What distinguishes the platform is its focus on automated instrumentation and semantic correlation. It allows users to capture telemetry data across various programming languages and frameworks without manual code changes, often requiring only simple environment variable updates. Once ingested, the system automatically links logs, metrics, and traces through shared identifiers, enabling seamless navigation between different telemetry types during root cause analysis. The frontend further supports this by using virtualized rendering to efficiently display complex distributed traces containing millions of spans. The platform provides a comprehensive suite of tools for infrastructure monitoring, application performance tracking, and log management. Users can define complex alert conditions and manage monitoring configurations as version-controlled resources, ensuring consistency across deployment environments. Additionally, the system includes specialized support for monitoring large language model applications and provides visual query pipelines that translate user-defined filters into optimized database queries for real-time dashboard generation. The entire observability stack can be deployed using container orchestration tools, with built-in utilities for verifying service status and managing data retention.
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the study of system design, resource estimation, and the elimination of single points of failure. The material extends into broad operational capabilities, including container orchestration, continuous integration and delivery pipelines, layered observability, and network routing. It also provides detailed instruction on Linux system administration, database management, security auditing, and the implementation of service level indicators and objectives.
Alertmanager is a monitoring notification gateway and routing service that deduplicates, groups, and directs alerts to the correct receivers. It functions as a central manager for Prometheus alerts, using a hierarchical routing tree and label-based matchers to dispatch notifications to external services. The system employs a peer-to-peer mesh network to coordinate multiple instances in a high availability cluster, ensuring continuous alert processing. It features a dedicated inhibition engine and grouping mechanisms to reduce notification noise by suppressing redundant alerts when related issues are already active. Capability areas include incident notification management via webhooks and third-party integrations, temporal alert silencing, and active alert limiting to prevent receiver flooding. The service also provides system event recording and event log export for auditing notification deliveries. Administrative tasks can be performed through a command-line interface for managing silences and routing configurations.
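The hierarchical routing tree with label-based matchers described for Alertmanager can be sketched roughly as follows. This is a simplified illustration, not Alertmanager's code: `Route`, `resolve`, and `continue_matching` are hypothetical names, and only exact-equality matchers are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    matchers: Dict[str, str] = field(default_factory=dict)  # label -> required value
    children: List["Route"] = field(default_factory=list)
    continue_matching: bool = False  # keep checking siblings after a match

    def matches(self, labels: Dict[str, str]) -> bool:
        # An empty matcher set matches every alert (useful for the root).
        return all(labels.get(k) == v for k, v in self.matchers.items())


def resolve(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers for an alert; the deepest matching route wins."""
    if not route.matches(labels):
        return []
    receivers: List[str] = []
    for child in route.children:
        matched = resolve(child, labels)
        receivers.extend(matched)
        if matched and not child.continue_matching:
            break
    # Fall back to this route's own receiver when no child matched.
    return receivers or [route.receiver]


root = Route(
    "default",
    children=[
        Route("db-team", {"team": "db"},
              children=[Route("db-pager", {"severity": "critical"})]),
        Route("web-team", {"team": "web"}),
    ],
)

critical_db = resolve(root, {"team": "db", "severity": "critical"})
warning_db = resolve(root, {"team": "db", "severity": "warning"})
unrouted = resolve(root, {"team": "ops"})
```

Here a critical database alert reaches `db-pager`, a warning stays with `db-team`, and an alert with no matching child falls back to the root receiver.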
Changedetection.io is a self-hosted monitoring service designed to track web pages for content updates and notify users of changes. It functions as a centralized platform where users can manage tracking tasks, observe specific website elements, and receive automated alerts through various communication channels whenever modifications are detected. The service distinguishes itself through an integrated headless browser engine that executes interaction sequences, such as logins or form submissions, to access dynamic or restricted content. It maintains a historical record of page snapshots, utilizing a diffing engine to perform visual or textual comparisons that identify exactly how information has evolved over time. Users can isolate relevant page regions using specific query rules to filter out noise and focus on data points like price fluctuations or inventory status. The platform supports a modular notification pipeline that dispatches alerts to external services via webhooks. It also features a plugin-based architecture that allows for the integration of custom logic to transform raw page data before evaluation. Monitoring tasks can be organized using descriptive tags and imported from external files to streamline the management of large collections of tracked targets.
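The textual comparison of page snapshots described for Changedetection.io can be approximated with Python's standard `difflib`. This is only a sketch of the general technique; `changed_lines` is a hypothetical helper and the snapshots are made-up strings, not output from the project.

```python
import difflib
from typing import List


def changed_lines(previous: str, current: str) -> List[str]:
    """Return removed ('-') and added ('+') lines between two text snapshots."""
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    # The first two lines are the file headers; hunk markers start with "@@".
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


old_snapshot = "Price: $10\nIn stock"
new_snapshot = "Price: $12\nIn stock"
changes = changed_lines(old_snapshot, new_snapshot)
```

An empty result means the two snapshots are identical, so no notification would be needed.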
Prowler is an automated cloud infrastructure security scanner and posture management tool. It evaluates cloud environments and infrastructure-as-code templates against security benchmarks to identify misconfigurations, vulnerabilities, and compliance gaps that could compromise system integrity. The platform distinguishes itself through graph-based attack path analysis, which identifies chains of misconfigurations that create exploitable routes for unauthorized access. It utilizes a plugin-based execution model to perform state-based assessments of live environments and static analysis of configuration files, ensuring security coverage across the entire development lifecycle. The tool provides comprehensive capabilities for continuous security integration, allowing teams to automate compliance reporting by mapping findings to regulatory frameworks. It supports risk prioritization and provides actionable remediation guidance, while enabling the integration of security data into external incident management and monitoring systems through automated reporting pipelines.
This project is a comprehensive software observability suite and application performance monitoring platform designed to track runtime errors, performance bottlenecks, and system health. It functions as a centralized diagnostic service that aggregates and categorizes exceptions, providing the infrastructure necessary to visualize complex execution paths across distributed systems and microservices. The platform distinguishes itself through a high-throughput distributed event ingestion pipeline and a columnar storage analytics engine that enables rapid aggregation of large-scale performance metrics. It utilizes runtime-level instrumentation hooks to capture execution data directly from the host environment and employs symbolication-based stack trace resolution to map minified code or raw memory addresses back to original source files. Furthermore, the system includes specialized capabilities for monitoring the operational performance of AI agents and ensuring sensitive data compliance through schema-driven scrubbing of incoming event payloads. Beyond core error tracking and tracing, the platform supports a wide range of programming languages and frameworks, allowing for consistent visibility across diverse software architectures. It integrates with external services to automate incident response workflows and provides a command-line interface for managing releases, debug symbols, and project configurations. The system also features a modular, plugin-based architecture that facilitates connectivity with third-party tools for issue tracking and alerting.
Checkmate is an open-source, self-hosted tool designed to track and monitor server hardware, uptime, response times, and incidents in real time, with visualizations of the collected data.
Glances is a cross-platform system monitoring tool designed to track real-time resource usage and hardware health metrics across diverse computing environments. It functions as a command-line utility that provides a unified view of system performance, identifying bottlenecks and maintaining infrastructure stability through a consistent abstraction layer that translates kernel calls into actionable data. The project distinguishes itself through its distributed capabilities, offering a web-based interface that enables remote access to live performance metrics from any device without requiring direct terminal access. It also operates as a telemetry data exporter, utilizing an export-driven pipeline to stream collected statistics to external databases and monitoring tools for long-term historical analysis. The system supports a modular architecture that allows for extensible data collection through independent scripts. It facilitates remote monitoring by maintaining persistent network connections between lightweight data providers and centralized management interfaces.
TheHive is a security incident response platform and multi-tenant case management system. It functions as a Security Orchestration, Automation, and Response (SOAR) tool and a threat intelligence platform designed to coordinate security investigations by managing alerts, cases, and observables. The platform is distinguished by its multi-tenant architecture, which isolates data across different organizations while supporting selective cross-tenant sharing. It features a SOAR automation engine capable of executing sandboxed JavaScript logic to automate workflows and trigger response actions through external connectors. The system covers a broad range of capabilities, including incident lifecycle management, threat intelligence synchronization with frameworks like MITRE ATT&CK and MISP, and automated data ingestion. It provides extensive identity and access management through role-based access control and integration with various identity providers. The software can be installed on Linux, via Docker containers, or deployed to Kubernetes clusters using Helm charts.
Prometheus is a comprehensive monitoring and alerting platform designed to track infrastructure health and application performance. It functions as a time series database that ingests, indexes, and queries high-frequency numerical data points. By utilizing a pull-based model, the system periodically collects multi-dimensional metrics from monitored targets, storing them in an optimized block storage format that supports high-throughput ingestion and efficient historical analysis. The platform distinguishes itself through a specialized query engine that enables real-time analysis of performance data using a dedicated functional language. It maintains operational visibility in dynamic environments by integrating with infrastructure APIs for service discovery, allowing it to adapt automatically to changing topologies. To support diverse architectures, it includes mechanisms for buffering metrics from short-lived batch jobs and streaming data to external long-term storage systems via standardized protocols. Beyond core data collection, the system provides integrated alerting capabilities that continuously evaluate logical expressions against incoming data streams. It manages the full lifecycle of incident notifications by applying grouping, inhibition, and silence rules to reduce operational noise. The ecosystem also supports broad observability through service availability probing, legacy metric translation, and the instrumentation of application-level performance data. The software is available as pre-compiled binaries or container images, and it can be managed through standard infrastructure automation tools.
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It employs a language-agnostic intermediate representation to synthesize these definitions into platform-specific configurations, while supporting aspect-oriented policy injection to apply security and compliance rules across infrastructure definitions during the synthesis phase. Beyond core provisioning, the project provides a modular component registry for distributing and reusing pre-configured infrastructure building blocks. It supports multi-account orchestration, allowing for the deployment of consistent resource sets across different regions and accounts from a single template, and includes capabilities for detecting infrastructure drift to ensure deployed environments remain aligned with their defined state. The project is distributed as a software development kit, providing programmatic interfaces to manage the full lifecycle of cloud resources and integrate infrastructure definitions directly into application codebases.
ntfy is a self-hosted messaging infrastructure that provides a lightweight platform for sending and receiving real-time notifications. It functions as a topic-based pub-sub server, allowing users to publish and subscribe to message channels using standard HTTP requests. By bridging server-side events with native mobile and desktop clients, it enables the delivery of alerts across various environments through a unified communication layer. The project distinguishes itself by offering a complete, private notification ecosystem that includes persistent message caching and robust access control. It supports the UnifiedPush protocol, acting as a gateway to native mobile operating system push services, which allows for decentralized notification delivery without reliance on proprietary cloud providers. Users can interact with the system through a command-line interface, webhooks, or persistent streaming connections like Server-Sent Events and WebSockets. The platform covers a broad range of operational capabilities, including automated system monitoring, workflow integration, and cross-platform event broadcasting. It supports advanced message features such as content templating, file attachments, interactive buttons, and priority-based delivery. The system is designed for flexible deployment: it is distributed as a single static binary for Linux, macOS, and Windows, and containerized installation is also available.
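Publishing to an ntfy topic over plain HTTP, as described above, might look like the sketch below. The server URL `https://ntfy.example.com` and the topic `backups` are placeholders, and `build_publish_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import urllib.request
from typing import Optional


def build_publish_request(
    base_url: str,
    topic: str,
    message: str,
    title: Optional[str] = None,
    priority: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST whose body is the message and whose path is the topic."""
    headers = {}
    if title:
        headers["Title"] = title
    if priority:
        headers["Priority"] = priority
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_publish_request(
    "https://ntfy.example.com",  # placeholder server address
    "backups",                   # placeholder topic name
    "Nightly backup finished",
    title="Backup",
    priority="high",
)
# To actually send it: urllib.request.urlopen(req)
```

Keeping construction separate from sending makes the request easy to inspect or test before it reaches a real server.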
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autonomous agents in verified enterprise context. It provides specialized capabilities to inject provenance-aware lineage, business definitions, and quality signals into AI prompts, ensuring that generated insights are accurate and trustworthy. Through a policy-as-code governance engine, it enforces access controls and compliance rules directly within the metadata graph, allowing for programmatic oversight of data assets across hybrid environments. Beyond its core identity, the project offers a comprehensive suite of tools for data discovery, observability, and lifecycle management. It includes features for automated lineage extraction, impact analysis, and semantic search, enabling users to navigate data dependencies and resolve quality issues efficiently. The platform also supports collaborative workflows, allowing teams to manage business glossaries, certify data assets, and automate access requests through integrated communication channels. DataHub is built to scale, utilizing a distributed architecture that allows storage, search, and graph processing layers to operate independently. It provides standardized interfaces and a bridge-based connector framework to facilitate integration with heterogeneous data sources and external AI agent frameworks.
SmsForwarder is an Android application designed to capture incoming text messages and automatically transmit them to external services, messaging platforms, or email accounts. It functions as a bridge for mobile alerts, enabling the centralized monitoring of SMS traffic and system notifications across various digital channels. The application distinguishes itself through a modular forwarding architecture that supports diverse communication protocols via a plugin system. It utilizes a background service and system-level listeners to ensure that message interception and relay operations continue independently of the user interface. To maintain security and reliability, the software employs encrypted storage for sensitive configuration data and a local database for persistent message logging. Users can manage message flow through a dynamic rule engine that evaluates incoming content against specific criteria to determine routing behavior. This capability facilitates the aggregation of automated verification codes and remote device monitoring, allowing for the consolidation of alerts into a unified communication stream.
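A rule engine of the kind described for SmsForwarder, which evaluates incoming content against criteria to decide routing, could be sketched as below. The rule fields, match modes, and channel names are all assumptions made for illustration and do not reflect the application's actual configuration format.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    field: str    # part of the message to inspect: "sender" or "body"
    mode: str     # "contains", "startswith", or "regex"
    pattern: str
    target: str   # hypothetical name of the forwarding channel


def rule_matches(rule: Rule, message: Dict[str, str]) -> bool:
    value = message.get(rule.field, "")
    if rule.mode == "contains":
        return rule.pattern in value
    if rule.mode == "startswith":
        return value.startswith(rule.pattern)
    if rule.mode == "regex":
        return re.search(rule.pattern, value) is not None
    raise ValueError(f"unknown match mode: {rule.mode}")


def route(rules: List[Rule], message: Dict[str, str]) -> List[str]:
    """Return the distinct targets of every rule the message satisfies."""
    targets: List[str] = []
    for rule in rules:
        if rule_matches(rule, message) and rule.target not in targets:
            targets.append(rule.target)
    return targets


rules = [
    Rule("body", "regex", r"\b\d{6}\b", "webhook-codes"),   # six-digit codes
    Rule("sender", "startswith", "+1555", "email-ops"),
]

code_sms = {"sender": "+15550100", "body": "Your code is 482913"}
other_sms = {"sender": "+4420", "body": "hello"}
```

With these rules, `route(rules, code_sms)` selects both channels, while `route(rules, other_sms)` selects none.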
CrowdSec is a collaborative, distributed security engine designed for threat detection and infrastructure protection. It functions as an intrusion detection system that parses logs and network traffic to identify malicious patterns, utilizing a bucket-based threshold detection model to aggregate events and trigger alerts. The platform is built on a modular architecture that includes a centralized local API server for managing security signals and a relational database for persistent storage of remediation decisions. What distinguishes the project is its decoupled enforcement model, which offloads active blocking to lightweight external components known as bouncers. These bouncers query the central API to synchronize threat intelligence and apply real-time remediation across distributed environments. The system also features a hub-based configuration management framework, allowing users to download and deploy community-curated security scenarios, parsers, and collections to ensure consistent protection against evolving threats. The platform provides a comprehensive suite of tools for security operations, including automated log parsing pipelines, event-driven plugin systems for notification workflows, and extensive command-line utilities for infrastructure management. It supports flexible deployment patterns across standalone, containerized, and cloud-native environments, enabling centralized orchestration of security agents and fleet-wide monitoring of threat activity. The project includes documentation and a command-line interface that support the lifecycle management of security components, from initial service discovery and configuration to the validation of detection logic and the auditing of active security policies.
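The bucket-based threshold detection mentioned for CrowdSec is commonly modelled as a leaky bucket: each event fills the bucket, the bucket drains over time, and an overflow raises an alert. The sketch below illustrates that general idea under assumed parameters. `LeakyBucket`, `detect`, `capacity`, and `leak_interval` are hypothetical names and values, not CrowdSec's scenario syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LeakyBucket:
    capacity: int          # events the bucket can hold before overflowing
    leak_interval: float   # seconds needed to drain one event
    level: float = 0.0
    last_ts: float = 0.0

    def pour(self, ts: float) -> bool:
        """Add one event at time `ts`; return True if the bucket overflows."""
        leaked = (ts - self.last_ts) / self.leak_interval
        self.level = max(0.0, self.level - leaked)
        self.last_ts = ts
        self.level += 1
        if self.level > self.capacity:
            self.level = 0.0  # reset after raising an alert
            return True
        return False


def detect(
    events: List[Tuple[float, str]],
    capacity: int = 3,
    leak_interval: float = 10.0,
) -> List[Tuple[float, str]]:
    """Keep one bucket per source and collect (timestamp, source) overflows."""
    buckets: Dict[str, LeakyBucket] = {}
    alerts: List[Tuple[float, str]] = []
    for ts, source in events:
        bucket = buckets.setdefault(source, LeakyBucket(capacity, leak_interval))
        if bucket.pour(ts):
            alerts.append((ts, source))
    return alerts


events = [
    (0, "192.0.2.1"), (1, "192.0.2.1"), (2, "192.0.2.1"), (3, "192.0.2.1"),
    (20, "192.0.2.2"), (40, "192.0.2.2"), (60, "192.0.2.2"), (80, "192.0.2.2"),
]
alerts = detect(events)
```

The first source sends four events within three seconds and overflows its bucket, while the second source's events are spaced far enough apart that its bucket drains between them.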
This project is a centralized notification infrastructure platform designed to manage multi-channel messaging workflows, delivery routing, and user preference settings through a unified integration layer. It provides a code-first workflow engine that allows engineers to define complex messaging sequences and notification logic as version-controlled code, ensuring consistency across development and deployment pipelines. The platform distinguishes itself by decoupling notification content from application logic, enabling non-technical teams to design and update templates through a visual interface without requiring developer intervention. It also features provider-agnostic message routing that abstracts multiple third-party delivery services, alongside intelligent delivery optimization tools such as event-driven digest aggregation and timezone-aware scheduling to reduce user fatigue. Beyond core orchestration, the platform includes a suite of embeddable, framework-agnostic user interface components for in-app notification centers and preference management. It enforces strict data integrity through schema-based type validation and provides comprehensive delivery monitoring to track and debug message status across email, SMS, push, and chat channels. The platform supports both managed cloud services and self-hosted environments, with built-in data encryption and regional residency configuration to meet security and compliance requirements.
This project is a detection-as-code framework providing a library of security monitoring rules and predefined detection content for Elasticsearch data indices. It serves as a threat detection rule library designed to identify malicious activity and attack patterns across diverse data streams in cloud and on-premises environments. The framework implements a detection engineering workflow where rules are defined in YAML and managed as versioned code. It includes a set of command-line utilities for automated rule deployment, metadata searching, and template generation, supported by a Python-based testing framework to validate rule syntax and accuracy before deployment. The system covers a broad range of security operations, including threat intelligence integration, cloud posture auditing, and security event correlation. It also provides capabilities for anomaly detection, entity risk analysis, and the coordination of security incidents through case management and alert noise suppression.
Watchtower is a container-based solution designed to automate the lifecycle management of Docker applications. It functions as a background service that monitors running containers, detects when new base image versions are available in registries, and automatically redeploys the containers to ensure they remain synchronized with the latest builds. The project distinguishes itself through its ability to orchestrate complex deployment workflows and maintain service availability during updates. It interacts directly with the container runtime to manage service dependencies and restart sequences, ensuring that dependent containers are handled in the correct order. Users can further customize the update process by defining lifecycle hooks that execute shell commands before or after a container is replaced, allowing for tailored initialization and cleanup tasks. Beyond automated updates, the tool provides extensive infrastructure observability and flexible management options. It supports event-driven updates via HTTP webhooks, declarative filtering to target specific containers, and secure remote management through encrypted communication and private registry authentication. Operational statistics can be exported to external monitoring systems, and the service can be configured to run in a passive observation mode to track image changes without performing automated redeployments.
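Handling dependent containers in the correct order, as described for Watchtower, amounts to a topological sort of the dependency graph. The sketch below uses Python's standard `graphlib` (Python 3.9+) to show the idea; `restart_plan` and the container names are hypothetical, and this is not Watchtower's own code.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def restart_plan(depends_on: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Return (stop_order, start_order) for a container dependency graph.

    `depends_on` maps each container to the containers it needs running.
    Dependencies start first, and dependents stop first.
    """
    start_order = list(TopologicalSorter(depends_on).static_order())
    stop_order = list(reversed(start_order))
    return stop_order, start_order


# Hypothetical stack: web needs api, api needs db.
stack = {"web": {"api"}, "api": {"db"}, "db": set()}
stop_order, start_order = restart_plan(stack)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which gives a natural place to abort an update instead of restarting containers in an undefined order.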