# Kubernetes Custom Metric Autoscalers

> Search results for `automatically scale Kubernetes pods based on custom metrics` on awesome-repositories.com. 113 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/automatically-scale-kubernetes-pods-based-on-custom-metrics

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/automatically-scale-kubernetes-pods-based-on-custom-metrics).**

## Results

- [kubernetes/kubernetes](https://awesome-repositories.com/repository/kubernetes-kubernetes.md) (123,197 ⭐) — Kubernetes is a distributed container orchestration platform that automates the deployment, scaling, and management of containerized applications across clusters of computing nodes. It functions as a declarative infrastructure controller, utilizing a control loop architecture that continuously monitors the current system state against user-defined configurations to ensure desired operational outcomes. The system relies on a centralized API-driven interface and a replicated key-value store to maintain a consistent source of truth for all cluster objects.

The platform distinguishes itself through a highly extensible design that allows users to define domain-specific objects using the same native API and control loop infrastructure. It employs a standardized abstraction layer for container runtimes, enabling modular execution engines, and utilizes a pluggable controller pattern that supports third-party integrations without requiring modifications to the core codebase. An algorithmic bin-packing engine further optimizes hardware utilization by dynamically matching workload requirements with available cluster capacity.

Beyond core orchestration, the system provides comprehensive operational support for distributed environments, including automated lifecycle management, horizontal and vertical scaling, and self-healing mechanisms that maintain service availability. It encompasses integrated solutions for networking, persistent storage orchestration, and secure secret management. Diagnostic utilities for monitoring performance metrics, aggregating logs, and troubleshooting infrastructure-level issues are also included to support cluster health and reliability.
- [kubernetes-sigs/metrics-server](https://awesome-repositories.com/repository/kubernetes-sigs-metrics-server.md) (6,651 ⭐) — Metrics Server is a lightweight, single-purpose daemon that collects CPU and memory usage data from every node and pod in a Kubernetes cluster and exposes those metrics through a standard Kubernetes API endpoint. It registers as an aggregated extension API server behind the Kubernetes apiserver, making resource utilization data available to the Horizontal Pod Autoscaler and Vertical Pod Autoscaler for automatic replica count and resource request adjustments.

The project distinguishes itself by operating as a focused, in-cluster resource metrics collector that polls kubelet summary endpoints across all nodes, with resource consumption scaling linearly to support clusters of up to 5,000 nodes. It secures all API-to-node traffic through HTTPS, client certificate validation, and tunnel-based proxy connections, while delegating authorization decisions to the Kubernetes apiserver via access reviews. Multiple replicas can run side by side to maintain availability during failures.

Beyond its core metrics collection and autoscaling pipeline, Metrics Server exposes node and pod resource usage through the standard metrics API for command-line querying with tools like `kubectl top`. It serves only the resource metrics API (CPU and memory); custom metrics require a separate adapter registered as its own extension API. The project provides an autoscaling data source that feeds container-level resource usage to both horizontal and vertical autoscalers, while scaling behaviors and stabilization windows are configured on the autoscalers themselves.
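
The replica adjustment that the Horizontal Pod Autoscaler derives from such metrics can be sketched as follows. This is a minimal illustration of the ratio formula described in the Kubernetes documentation, assuming a 10% tolerance (the documented default). The function name `desired_replicas` and its parameters are hypothetical and are not part of Metrics Server.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Return the replica count suggested by the HPA ratio formula.

    desired = ceil(current_replicas * current_metric / target_metric)
    The count is left unchanged when the ratio is within `tolerance` of 1.0.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the observed value is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target
    print(desired_replicas(4, 90.0, 60.0))
```

The same ratio applies to custom metrics; only the source of `current_metric` changes.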
- [openfaas/faas](https://awesome-repositories.com/repository/openfaas-faas.md) (26,092 ⭐) — OpenFaaS is a serverless function platform that provides a container-native framework for deploying and managing event-driven code. It functions as an abstraction layer over container orchestrators, allowing developers to package code into scalable functions that run across Kubernetes clusters or edge computing environments.

The platform distinguishes itself through a developer-centric runtime that utilizes standardized language templates and automated build pipelines to simplify the creation of container images. It features a central API gateway that manages request routing, authentication, and metrics, while a sidecar-based watchdog process handles the translation of HTTP requests into standard input and output for function code. To support complex workflows, the system includes an asynchronous queue-based execution layer that buffers requests for long-running tasks and provides reliable retries.

The project covers a broad capability surface, including event-driven integration through connectors for various message queues and external sources, as well as comprehensive tooling for CLI-based management, secret handling, and CI/CD pipeline integration. It also supports advanced operational requirements such as autoscaling, fine-grained monitoring, and identity management through various single sign-on providers.

The platform is designed for deployment on Kubernetes, including managed services and local environments, and provides extensive documentation and tutorials to guide users through the installation and development lifecycle.
- [prefecthq/prefect](https://awesome-repositories.com/repository/prefecthq-prefect.md) (21,640 ⭐) — Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing.

The platform distinguishes itself through a decoupled worker-API architecture, which separates task scheduling from execution by allowing remote workers to poll a central API for pending work units. This design enables distributed task concurrency, allowing parallel workloads to scale horizontally across clusters or remote nodes. Furthermore, the system supports event-driven workflow triggering, enabling pipelines to initiate or resume automatically in response to system state changes or external signals.

The project provides a comprehensive capability surface for managing the entire lifecycle of data operations. This includes modular block-based configuration for injecting credentials and infrastructure settings, result persistence caching for optimizing redundant computations, and extensive integration support for cloud services, databases, and version control systems. Users can also leverage built-in tools for infrastructure automation, data lineage tracking, and automated notification management.

The software is distributed as a Python-based framework, with documentation and installation guides available to assist in configuring self-hosted deployments or connecting to managed orchestration services.
- [kubernetes/kops](https://awesome-repositories.com/repository/kubernetes-kops.md) (16,631 ⭐) — kops is a Kubernetes cluster provisioner and lifecycle manager designed to automate the creation, maintenance, and destruction of production-grade clusters on cloud infrastructure. It functions as a declarative infrastructure manager, synchronizing the live state of a cluster with versioned manifests stored in remote object storage to ensure idempotent operations.

The project distinguishes itself by offering comprehensive automation for the entire cluster lifecycle, including high-availability control plane deployment, incremental rolling updates, and automated version upgrades. It also serves as an infrastructure-as-code exporter, capable of generating Terraform configurations from the current state of a deployed cluster.

Beyond provisioning, it covers a broad operational surface including automated node and pod scaling, etcd data store management, and complex networking configurations such as dual-stack IPv6 and CNI integration. It also manages identity and security through OIDC authentication integration, cloud IAM role mapping, and x509 certificate lifecycle management.

The tool provides a command-line interface with support for shell autocompletion.
- [agones-dev/agones](https://awesome-repositories.com/repository/agones-dev-agones.md) (6,888 ⭐) — Agones is a Kubernetes game server orchestrator designed for hosting, scaling, and managing dedicated multiplayer game servers. It extends the Kubernetes control plane using custom resource definitions to define game server and fleet objects, utilizing a dedicated fleet manager to maintain pools of warm server instances.

The system provides a game server SDK and language-specific client libraries that allow server processes to signal readiness, health, and shutdown states directly to the controller. It distinguishes itself through specialized scaling logic, including the use of WebAssembly modules and external webhooks to calculate replica counts and maintain ready server buffers.

The platform covers a broad range of operational capabilities, including automated fleet scaling, session-aware deployment strategies, and precise port mapping for UDP traffic. It manages the full infrastructure lifecycle across multi-cloud environments, offering tools for regional allocation, latency-based routing, and integrated health monitoring via sidecar containers.

The project supports deployment via infrastructure-as-code tools like Terraform and provides local development environments for simulating server lifecycles and debugging binaries.
- [fission/fission](https://awesome-repositories.com/repository/fission-fission.md) (8,863 ⭐) — Fission is a function-as-a-service platform and serverless framework for Kubernetes. It manages the lifecycle and execution of code snippets as serverless functions, providing an orchestrator that triggers these functions based on HTTP requests, message queues, or scheduled events.

The platform features a cold-start optimized runtime that utilizes warm container pools and dynamic loaders to achieve millisecond execution. It includes a native autoscaler to adjust the number of function instances based on real-time traffic demand and supports canary release testing to split incoming traffic between different function versions.

The system covers event-driven orchestration, automatic workload scaling, and runtime environment management. It also provides capabilities for monitoring system performance and provisioning local development clusters.
- [pjs7678/kpod-metrics](https://awesome-repositories.com/repository/pjs7678-kpod-metrics.md) (13 ⭐) — eBPF-based pod-level kernel metrics collector for Kubernetes
- [signoz/signoz](https://awesome-repositories.com/repository/signoz-signoz.md) (27,355 ⭐) — SigNoz is a full-stack observability platform designed to collect, store, and visualize metrics, logs, and distributed traces in a unified environment. It leverages OpenTelemetry-based data collection to ingest telemetry from diverse sources using vendor-neutral protocols, ensuring interoperability across complex microservices architectures. The platform utilizes a high-performance columnar storage engine to enable rapid aggregation and filtering, providing a centralized backend for monitoring application health and performance.

What distinguishes the platform is its focus on automated instrumentation and semantic correlation. It allows users to capture telemetry data across various programming languages and frameworks without manual code changes, often requiring only simple environment variable updates. Once ingested, the system automatically links logs, metrics, and traces through shared identifiers, enabling seamless navigation between different telemetry types during root cause analysis. The frontend further supports this by using virtualized rendering to efficiently display complex distributed traces containing millions of spans.

The platform provides a comprehensive suite of tools for infrastructure monitoring, application performance tracking, and log management. Users can define complex alert conditions and manage monitoring configurations as version-controlled resources, ensuring consistency across deployment environments. Additionally, the system includes specialized support for monitoring large language model applications and provides visual query pipelines that translate user-defined filters into optimized database queries for real-time dashboard generation.

The entire observability stack can be deployed using container orchestration tools, with built-in utilities for verifying service status and managing data retention.
- [ai-dynamo/dynamo](https://awesome-repositories.com/repository/ai-dynamo-dynamo.md) (6,112 ⭐) — Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients.

The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory. It employs a key-value cache-aware request router that directs queries to workers holding relevant cache entries to reduce recomputation. High-speed data transfer mechanisms move cache blocks and weights directly between GPU VRAMs over RDMA or NVLink to minimize latency.

The platform includes comprehensive capabilities for distributed fault tolerance, allowing in-flight requests to migrate and resume from failure points via token-state continuation. It features SLA-based autoscaling and performance profiling to right-size GPU pools and a Kubernetes-native operator for topology-aware scheduling. Additional support covers multimodal inference for images, video, and audio, alongside dynamic swapping of LoRA adapters.

Installation is available via wheels, container images, charts, and crates, with support for major Linux distributions and NVIDIA GPU architectures from Ampere through Blackwell.
- [luxas/kubernetes-on-arm](https://awesome-repositories.com/repository/luxas-kubernetes-on-arm.md) (602 ⭐) — Kubernetes ported to ARM boards like Raspberry Pi.
- [andrewfarley/serverless-cloudwatch-rds-custom-metrics](https://awesome-repositories.com/repository/andrewfarley-serverless-cloudwatch-rds-custom-metrics.md) (0 ⭐) — Found at: https://github.com/AndrewFarley/serverless-cloudwatch-rds-custom-metrics
- [helixdb/helix-db](https://awesome-repositories.com/repository/helixdb-helix-db.md) (3,830 ⭐) — Helix DB is a distributed graph database and knowledge graph platform that persists nodes and edges on object storage for durable and unlimited scaling. It operates as an ACID-compliant system, ensuring data consistency through serializable snapshot isolation during concurrent operations.

The project distinguishes itself by combining a vector search engine and a property graph, utilizing hybrid vector and full-text search to locate entry points for graph traversals. It enables dynamic graph querying through a domain-specific language, allowing complex logic and recursive queries to be executed via an API without redeploying application code.

The system provides high availability through a distributed cluster of gateways and reader nodes that scale automatically based on load. Its broader capabilities include graph data mutation, multi-hop relationship traversal, and query output shaping with filtering and pagination.

A command-line interface is provided for cluster management and project bootstrapping.
- [seleniumhq/selenium](https://awesome-repositories.com/repository/seleniumhq-selenium.md) (34,203 ⭐) — Selenium is a comprehensive browser automation framework that provides a standardized interface for controlling web browsers to perform automated tasks, user interactions, and data extraction. It functions as a cross-browser testing tool, enabling developers to execute identical automation scripts across various browser engines and operating systems to ensure consistent application behavior. By implementing the WebDriver protocol, it maps high-level automation commands to browser-specific drivers using a standardized HTTP-based wire protocol.

The project distinguishes itself through its distributed grid infrastructure, which allows for the parallel execution of test suites across multiple machines or containers. This architecture uses capability-based slot matching to dynamically allocate browser instances within a cluster, effectively scaling automated testing to reduce total execution time. Additionally, Selenium offers advanced bidirectional debugging capabilities that leverage native browser interfaces for real-time event streaming, script injection, and low-level network traffic interception.

Beyond its core automation and distribution features, the framework includes a robust suite of utilities for element interaction, synchronization, and browser configuration. It supports complex input simulation, including mouse, keyboard, and stylus actions, alongside sophisticated session management that handles browser lifecycle, authentication, and file operations. The project also provides automated driver management to ensure environment readiness across diverse platforms.

Selenium is designed to be integrated into various testing methodologies, including functional, regression, and performance testing. It offers extensive documentation and language-specific bindings to facilitate the creation of maintainable test suites, supporting patterns like page objects and domain-specific languages to improve readability and reduce code duplication.
- [prometheus/prometheus](https://awesome-repositories.com/repository/prometheus-prometheus.md) (64,569 ⭐) — Prometheus is a comprehensive monitoring and alerting platform designed to track infrastructure health and application performance. It functions as a time series database that ingests, indexes, and queries high-frequency numerical data points. By utilizing a pull-based model, the system periodically collects multi-dimensional metrics from monitored targets, storing them in an optimized block storage format that supports high-throughput ingestion and efficient historical analysis.

The platform distinguishes itself through a specialized query engine that enables real-time analysis of performance data using a dedicated functional language. It maintains operational visibility in dynamic environments by integrating with infrastructure APIs for service discovery, allowing it to adapt automatically to changing topologies. To support diverse architectures, it includes mechanisms for buffering metrics from short-lived batch jobs and streaming data to external long-term storage systems via standardized protocols.

Beyond core data collection, the system provides integrated alerting capabilities that continuously evaluate logical expressions against incoming data streams. It manages the full lifecycle of incident notifications by applying grouping, inhibition, and silence rules to reduce operational noise. The ecosystem also supports broad observability through service availability probing, legacy metric translation, and the instrumentation of application-level performance data.

The software is available as pre-compiled binaries or container images, and it can be managed through standard infrastructure automation tools.
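
In the pull-based model described above, each monitored target serves its current samples as plain text for the server to scrape. The sketch below renders one sample line in the Prometheus text exposition format. The helper names `render_sample` and `_escape_label_value` and the metric name `orders_queue_depth` are hypothetical, and a real application would normally use an official client library rather than formatting lines by hand.

```python
from typing import Dict, Optional


def _escape_label_value(value: str) -> str:
    # Backslash, double quote, and newline must be escaped in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, value: float,
                  labels: Optional[Dict[str, str]] = None) -> str:
    """Render one sample as a line of the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable between scrapes.
    pairs = ",".join(
        f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


if __name__ == "__main__":
    print(render_sample("orders_queue_depth", 42, {"queue": "orders"}))
```

A gauge such as this hypothetical queue depth is the kind of application-level signal that a custom metrics adapter can hand to an autoscaler.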
- [metric-learn/metric-learn](https://awesome-repositories.com/repository/metric-learn-metric-learn.md) (1,436 ⭐) — Metric learning algorithms in Python
- [qt-pods/qt-pods](https://awesome-repositories.com/repository/qt-pods-qt-pods.md) (0 ⭐) — See the video that demonstrates how easy it is to embed foreign code and write a complete code with auxiliary functions in less than five minutes.
- [baserow/baserow](https://awesome-repositories.com/repository/baserow-baserow.md) (4,188 ⭐) — Baserow is a self-hosted, no-code relational database platform built on PostgreSQL. It provides a spreadsheet-like interface for structuring and managing data without writing code, while exposing all database resources via a REST API to support headless architectures.

The platform distinguishes itself by integrating large language models and embedding servers to power AI assistants and automated data generation. It further extends its utility as a no-code application builder, allowing users to create custom internal portals, dashboards, and business tools using visual logic and managed data.

The system covers a broad range of capabilities, including business process automation with visual triggers, collaborative workspace management, and flexible data visualization through kanban boards, calendars, and timelines. It also supports advanced extensibility via a plugin system for custom field types and view filters, and executes user-defined scripts within a secure webassembly sandbox.

Deployment is supported across various environments using Docker Compose, Helm charts for Kubernetes, and cloud infrastructure templates.
- [pstadler/metrics.sh](https://awesome-repositories.com/repository/pstadler-metrics-sh.md) (0 ⭐) — metrics.sh is a lightweight metrics collection and forwarding daemon implemented in portable POSIX compliant shell scripts. A transparent interface based on hooks enables writing custom collectors and reporters in an elegant way.
- [kedacore/keda](https://awesome-repositories.com/repository/kedacore-keda.md) (10,314 ⭐) — KEDA is a Kubernetes event-driven autoscaler and cloud event scaling engine. It functions as a custom metrics provider that monitors external event sources—including message brokers, databases, and cloud metrics—to dynamically adjust the replica counts of containerized workloads.

The project is distinguished by its scale-to-zero workflow, which reduces workloads to zero replicas during inactivity and automatically restarts them when new events are detected. It operates as a multi-cloud event trigger system, using a pluggable scaler interface to integrate with a wide array of third-party services and cloud identity providers.

The system manages the scaling of various resource types, including deployments and discrete Kubernetes jobs. It provides comprehensive identity and authentication support via integration with cloud secret managers, IAM roles, and vault services. Additionally, it includes observability features for exporting telemetry via OpenTelemetry and tools for calculating complex scaling logic using multi-source metric aggregation.
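
The scale-to-zero behaviour described above can be illustrated with a simplified sketch. It is not KEDA's implementation: the function `event_driven_replicas` and its parameters are hypothetical, and the sketch assumes a single event source with a fixed number of events handled per replica.

```python
import math


def event_driven_replicas(pending_events: int,
                          events_per_replica: int,
                          max_replicas: int) -> int:
    """Suggest a replica count from the backlog of an event source.

    Returns 0 when there is no pending work (scale to zero); otherwise
    returns enough replicas to cover the backlog, capped at max_replicas.
    """
    if events_per_replica <= 0:
        raise ValueError("events_per_replica must be positive")
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    if pending_events <= 0:
        return 0  # idle: no replicas are kept running
    return min(max_replicas, math.ceil(pending_events / events_per_replica))


if __name__ == "__main__":
    for backlog in (0, 1, 23, 500):
        print(backlog, event_driven_replicas(backlog, 5, 10))
```

The first event arriving on an idle workload moves the count from 0 to 1, which is the activation step; beyond that the count grows with the backlog.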
- [aws/aws-cdk](https://awesome-repositories.com/repository/aws-aws-cdk.md) (12,817 ⭐) — The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane.

The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It employs a language-agnostic intermediate representation to synthesize these definitions into platform-specific configurations, while supporting aspect-oriented policy injection to apply security and compliance rules across infrastructure definitions during the synthesis phase.

Beyond core provisioning, the project provides a modular component registry for distributing and reusing pre-configured infrastructure building blocks. It supports multi-account orchestration, allowing for the deployment of consistent resource sets across different regions and accounts from a single template, and includes capabilities for detecting infrastructure drift to ensure deployed environments remain aligned with their defined state.

The project is distributed as a software development kit, providing programmatic interfaces to manage the full lifecycle of cloud resources and integrate infrastructure definitions directly into application codebases.
- [kubernetes/autoscaler](https://awesome-repositories.com/repository/kubernetes-autoscaler.md) (8,771 ⭐) — The Kubernetes Cluster Autoscaler is a mechanism that automatically adjusts the number of nodes in a cluster to match the resource demands of pending pods. It functions as a cloud infrastructure scaler that manages the desired capacity of scaling groups to ensure sufficient compute resources for workloads.

The system manages cloud infrastructure automation by adjusting node counts when resources are insufficient or nodes are underutilized. It includes a manager for scaling groups using mixed instance policies to balance on-demand and spot instances for cost and availability.

The project also includes a resource optimizer that analyzes pod usage to update CPU and memory requests. Supporting capabilities include automatic node group discovery via metadata tags and internal state capturing for diagnosing scaling logic.

Installation and configuration across different environments are supported via a Helm chart.
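
To give a feel for how a node count can be derived from pending pods, the sketch below estimates the number of new nodes with a first-fit-decreasing heuristic over CPU requests. This is an assumption-laden simplification, not the Cluster Autoscaler's algorithm: it considers CPU only, assumes identical empty nodes, and the function `nodes_needed` is hypothetical.

```python
from typing import List


def nodes_needed(pending_cpu_millicores: List[int], node_capacity: int) -> int:
    """Estimate how many new nodes are required to fit the pending pods.

    Uses first-fit decreasing: place the largest requests first, each on the
    first node that still has room, and open a new node when none does.
    """
    free: List[int] = []  # remaining capacity on each hypothetical new node
    for request in sorted(pending_cpu_millicores, reverse=True):
        if request > node_capacity:
            raise ValueError("a pod requests more CPU than one node provides")
        for index, remaining in enumerate(free):
            if request <= remaining:
                free[index] -= request
                break
        else:
            # No existing node has room, so add another one.
            free.append(node_capacity - request)
    return len(free)


if __name__ == "__main__":
    print(nodes_needed([1500, 1000, 500, 500, 2000], node_capacity=2000))
```

An estimate like this is only a starting point; scheduling constraints such as memory, affinity rules, and taints also affect how many nodes are actually required.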
- [kubero-dev/kubero](https://awesome-repositories.com/repository/kubero-dev-kubero.md) (4,150 ⭐) — Kubero is a self-hosted Platform as a Service (PaaS) that simplifies the deployment, scaling, and management of containerized applications on Kubernetes. It functions as an application manager, CI/CD orchestrator, and multi-tenant manager, allowing users to run workloads without writing manual configuration files.

The platform distinguishes itself through automated image synthesis, transforming source code from Git repositories into deployable containers via buildpacks, Dockerfiles, or nixpacks. It implements a GitOps delivery model with automated pipelines that trigger builds on push events and provision ephemeral review environments for pull requests.

Beyond deployment, it provides integrated infrastructure management for provisioning databases and caches through a graphical interface. The system includes multi-tenant isolation using namespaces, role-based access control with OAuth2 authentication, and automated SSL certificate management. Additional capabilities cover resource scaling, application health monitoring, and the attachment of persistent storage volumes.

The platform can be installed on local Kubernetes clusters or provisioned on supported cloud providers using a dedicated CLI and web-based management console.
- [cockroachdb/cockroach](https://awesome-repositories.com/repository/cockroachdb-cockroach.md) (32,207 ⭐) — Cockroach is a distributed SQL database designed to scale horizontally across multiple nodes while maintaining strict ACID compliance and global data consistency. It functions as a relational database engine that automatically partitions data into ranges, rebalancing them across a cluster to accommodate growing storage and throughput requirements. By utilizing a distributed consensus protocol, the system ensures that all nodes agree on the order of operations, providing fault tolerance and continuous availability even in the event of hardware failures.

The system distinguishes itself through a layered architecture that separates the relational SQL abstraction from a distributed key-value store. It achieves global consistency without requiring perfectly synchronized hardware clocks by employing a hybrid logical clock synchronization mechanism. To support high-concurrency environments, it utilizes multi-version concurrency control and lock-free transaction execution, which allow for consistent snapshots and efficient conflict resolution. Furthermore, the engine is built for compatibility, implementing the standard wire protocol to support existing relational database drivers and tools.

Beyond its core transactional capabilities, the platform includes comprehensive tooling for cluster orchestration, security, and performance diagnostics. It supports a variety of deployment models, ranging from self-hosted on-premises configurations to fully managed cloud services. The system provides a command-line interface for session management and query execution, ensuring that administrators can monitor cluster health and manage workloads through standard relational interfaces.
- [n8n-io/n8n](https://awesome-repositories.com/repository/n8n-io-n8n.md) (192,772 ⭐) — n8n is a workflow automation platform that combines a visual interface with code-based extensibility to design, orchestrate, and manage automated processes. It provides a comprehensive suite of tools for data transformation, filtering, and storage, allowing users to build complex logic through conditional branching, looping, and sub-workflow execution. The platform supports both pre-built integration nodes and custom code execution in JavaScript or Python, enabling connectivity with a wide range of external services and APIs.

The platform includes a suite of generative AI capabilities, such as an AI-powered workflow builder, a centralized chat interface for custom agents, and retrieval-augmented generation tools that ground responses in domain-specific data. To support development and production lifecycles, n8n offers version control integration with Git, workflow publishing mechanisms, and administrative tools for managing user roles, security policies, and environment configurations.

For monitoring and maintenance, the system provides observability tools that include performance metrics, execution insights, and real-time log streaming. It also features error-handling capabilities, such as automated recovery workflows and manual failure triggering, to ensure system reliability. Users can interact with the platform programmatically via a public REST API or manage administrative tasks through a command-line interface.
- [automq/automq-for-kafka](https://awesome-repositories.com/repository/automq-automq-for-kafka.md) (10,026 ⭐) — AutoMQ is a cloud-native streaming platform and Kafka-compatible message broker. It implements the Kafka protocol to provide integration with existing clients and ecosystems while functioning as a message queue that persists data directly to cloud object storage.

The system decouples compute from storage, allowing processing power and storage capacity to scale independently. It utilizes a shared-log architecture and object-storage-based persistence to remove dependencies on local disks, which reduces operational costs and eliminates manual disk management.

The platform includes mechanisms for automated partition balancing and dynamic resource autoscaling to handle workload demands without downtime. It supports multi-zone high availability and utilizes availability zone-aware request routing to minimize inter-zone data transfer fees.

The system also provides capabilities for cluster data migration, unified stream and table integration, and the export of system metrics for real-time monitoring.
- [ztellman/automat](https://awesome-repositories.com/repository/ztellman-automat.md) (0 ⭐) — Automat is a library for defining and using finite-state automata, inspired by Ragel. However, instead of defining a DSL, it allows them to be built using simple composition of functions.
- [istio/istio](https://awesome-repositories.com/repository/istio-istio.md) (38,226 ⭐) — Istio is a service mesh infrastructure that provides a centralized control plane to manage, secure, and observe communication between distributed microservices. It functions as a policy-driven network traffic controller, enabling developers to route, balance, and secure service-to-service traffic without requiring modifications to application code. The system enforces zero-trust security by utilizing mutual transport layer authentication to verify cryptographic identities for every network request.

The project distinguishes itself through a sidecar-less proxy architecture, which offloads networking tasks to shared infrastructure proxies rather than requiring individual proxies for every container. This approach is complemented by waypoint proxies, which perform deep packet inspection and enforce granular access policies at the application layer. Furthermore, the platform provides a unified connectivity fabric that synchronizes service registry data across multiple clusters, allowing for consistent traffic management and security policy enforcement across disparate network boundaries.

The system operates on a declarative model where a centralized management component continuously reconciles the desired state with the underlying network infrastructure. It supports both transport-layer and application-layer authorization, allowing for precise control over service access based on service accounts and specific request methods. The architecture is designed to simplify operational management and reduce resource overhead while maintaining consistent network behavior across complex, multi-cluster environments.
- [benchr267/pod-search-alfred](https://awesome-repositories.com/repository/benchr267-pod-search-alfred.md) (6 ⭐) — Use this workflow to find pods in Alfred and open their GitHub page.
- [meshery/meshery](https://awesome-repositories.com/repository/meshery-meshery.md) (9,966 ⭐) — Meshery is a service mesh management plane and cloud native infrastructure orchestrator. It provides a visual design-as-code environment for modeling microservices and infrastructure components through declarative blueprints, functioning as a centralized platform for designing, deploying, and managing service mesh infrastructure.

The platform is distinguished by its ability to translate visual designs into active deployments and its use of gRPC-based adapters to integrate with diverse infrastructure providers. It features a multi-tenant architecture that manages shared workspaces and role-based access control, allowing teams to collaboratively share, publish, and merge infrastructure designs.

Its capabilities extend to infrastructure lifecycle management, resource discovery via composite fingerprints, and performance analysis through synthetic traffic generation. It also covers comprehensive configuration management, including the ability to package infrastructure models into OCI-compatible images for portable distribution.

The management plane can be installed on Kubernetes clusters using command-line tools or Helm charts.
- [netdata/netdata](https://awesome-repositories.com/repository/netdata-netdata.md) (79,176 ⭐) — Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across complex, multi-cloud environments.

The platform distinguishes itself through edge-based intelligence, utilizing local machine learning models to automatically detect performance anomalies without requiring manual configuration or external query engines. Its architecture prioritizes local-first data persistence and secure metadata-only synchronization, ensuring that granular observability data remains on the host while essential system information is routed to a cloud-connected management plane. This hierarchical approach allows for horizontal scaling through parent-child node relationships, enabling unified monitoring and alerting across distributed infrastructure.

Beyond core collection and analysis, the system supports automated troubleshooting through natural language querying and intelligent metric correlation. It features a modular data acquisition engine that employs thread-per-core execution for low-latency performance, alongside isolated external processes for heterogeneous application support. The platform includes automated service discovery, diverse deployment options, and built-in diagnostic utilities to maintain visibility and connectivity across large-scale clusters.

Installation is supported through various methods including package managers, automated scripts, source compilation, and containerized orchestration.
- [paritytech/scale-decode](https://awesome-repositories.com/repository/paritytech-scale-decode.md) (0 ⭐) — This crate makes it easy to decode SCALE-encoded bytes into a custom data structure with the help of a TypeResolver (one of which is a scale_info::PortableRegistry). By using this type information to guide decoding (instead of just trying to decode bytes based on the shape of the target type),…
- [neondatabase/neon](https://awesome-repositories.com/repository/neondatabase-neon.md) (22,251 ⭐) — Neon is a serverless PostgreSQL database platform designed with a decoupled storage and compute architecture. It functions as a multi-tenant system that isolates data and compute resources for independent users on shared cloud infrastructure, utilizing a specialized PostgreSQL storage engine.

The platform features a database branching system that allows for the creation of isolated, instant copies of a database for testing and development. It further distinguishes itself with an HTTP-based SQL gateway, enabling the execution of queries via HTTP requests and JSON responses without the need for native drivers.

The system supports automatic resource scaling, including the ability to scale compute nodes down to zero during periods of inactivity. Its capability surface covers point-in-time recovery, copy-on-write snapshots, and a variety of connectivity options including standard TCP and WebSocket tunneling.

Infrastructure management includes multi-tenant resource isolation, database traffic routing, and storage metadata auditing to ensure data consistency and the purging of obsolete objects.
- [langchain-ai/deepagents](https://awesome-repositories.com/repository/langchain-ai-deepagents.md) (25,006 ⭐) — Deepagents is an LLM agent orchestration platform and stateful application server designed for deploying and managing AI agents built with computational graphs. It provides a containerized runtime environment that handles agent execution, state persistence, and the versioning of AI assistants.

The platform distinguishes itself through deep integration with the Model Context Protocol, allowing agents to function as servers that expose tools and capabilities to external clients. It features a sophisticated observability suite for capturing execution traces, performing LLM-based evaluations against datasets, and conducting side-by-side model output comparisons.

The system covers a broad range of operational capabilities, including cron-based task scheduling, multi-tenant workspace isolation, and human-in-the-loop review workflows. It also manages long-term memory through semantic search and provides automated scaling of compute resources across cloud environments.

A command-line interface is provided for local agent validation, graph packaging, and rapid testing via a local development server.
- [beberlei/metrics](https://awesome-repositories.com/repository/beberlei-metrics.md) (321 ⭐) — Simple library that abstracts different metrics collectors. I find this necessary to have a consistent and simple metrics (functional) API that doesn't cause vendor lock-in.
- [lensapp/lens](https://awesome-repositories.com/repository/lensapp-lens.md) (23,180 ⭐) — Lens is a multi-cluster management platform and desktop application for administering Kubernetes environments. It provides a graphical interface for deploying Helm charts, editing YAML manifests, and managing the lifecycle of pods and deployments.

The project features an AI-powered cluster assistant that enables users to query cluster state, perform autonomous troubleshooting, and translate natural language requests into system commands. It also supports collaborative team access through shared spaces, utilizing encrypted cluster sharing and role-based access control to manage credentials and permissions across organizations.

Broad capabilities cover native integration with cloud providers such as AWS EKS, Azure AKS, and Google GKE, alongside real-time observability tools for streaming container logs and visualizing Prometheus metrics. The platform also includes enterprise identity management via SSO and SCIM, and security analysis tools for scanning clusters for vulnerabilities.

The application supports silent installation via command-line parameters for non-interactive setup.
- [marhkb/pods](https://awesome-repositories.com/repository/marhkb-pods.md) (0 ⭐)
- [cube-js/cube](https://awesome-repositories.com/repository/cube-js-cube.md) (20,251 ⭐) — Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools.

The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orchestrates these interactions by mapping questions to the underlying semantic model, ensuring that AI-generated insights remain accurate and context-aware. Furthermore, Cube is designed for multi-tenant environments, offering robust infrastructure isolation, row-level security, and dynamic context injection to ensure that data access is strictly governed and personalized for every user or tenant.

Beyond its core modeling and AI features, the platform includes a comprehensive suite of tools for performance optimization, including automated pre-aggregation caching and asynchronous query queuing. It supports a wide range of data sources and deployment models, from self-hosted containers to managed cloud environments. The system also provides extensive programmatic control over report management, dashboard publishing, and user identity synchronization, making it suitable for embedding interactive analytics directly into custom software applications.
- [benhamner/metrics](https://awesome-repositories.com/repository/benhamner-metrics.md) (1,650 ⭐) — Machine learning evaluation metrics, implemented in Python, R, Haskell, and MATLAB / Octave
- [kubernetes-sigs/external-dns](https://awesome-repositories.com/repository/kubernetes-sigs-external-dns.md) (8,999 ⭐) — ExternalDNS is a controller that automatically synchronizes Kubernetes resource states with external DNS providers. It monitors cluster resources such as services, ingresses, and gateway APIs to dynamically create and update DNS records, enabling automated service discovery and external traffic management.

The project features a provider-agnostic interface that supports a wide array of cloud-managed vendors and on-premises providers, as well as an extension system for custom providers via webhooks and sidecars. It implements a reconciliation loop that uses resource annotations and custom resource definitions for declarative DNS management, ensuring that records are synchronized based on the desired state of the cluster.

To maintain stability and security, the controller utilizes leader election for high availability and tracks record ownership through TXT records or external databases like DynamoDB. It optimizes provider API usage through in-memory caching and batching of record changes. The system also supports advanced traffic management, including split-horizon DNS and routing policies, while exposing operational metrics via Prometheus.
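The reconciliation and ownership-tracking idea described above can be illustrated with a small sketch. This is not ExternalDNS code: `Record`, `Plan`, `plan_changes`, and `owned_names` are hypothetical names chosen for illustration, and the ownership set simply stands in for whatever registry (for example TXT records) a controller would consult. It shows one way to diff desired against actual records and batch the result into a single plan, while leaving records the controller does not own untouched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    name: str    # DNS name, e.g. "app.example.com"
    target: str  # value the name should resolve to


@dataclass
class Plan:
    # One batch of changes to send to the provider in a single pass.
    create: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)


def plan_changes(desired, actual, owned_names):
    """Diff desired vs. actual records, modifying only names in owned_names."""
    desired_by_name = {r.name: r for r in desired}
    actual_by_name = {r.name: r for r in actual}
    plan = Plan()
    for name, rec in desired_by_name.items():
        current = actual_by_name.get(name)
        if current is None:
            plan.create.append(rec)  # missing at the provider
        elif current.target != rec.target and name in owned_names:
            plan.update.append(rec)  # drifted, and ours to fix
    for name, rec in actual_by_name.items():
        if name not in desired_by_name and name in owned_names:
            plan.delete.append(rec)  # stale, and ours to remove
    return plan


# Hypothetical example data.
desired = [Record("a.example.com", "10.0.0.1"), Record("b.example.com", "10.0.0.2")]
actual = [
    Record("b.example.com", "10.0.0.9"),
    Record("c.example.com", "10.0.0.3"),
    Record("d.example.com", "10.0.0.4"),  # not owned, so never touched
]
plan = plan_changes(desired, actual, owned_names={"b.example.com", "c.example.com"})
```

In this sketch the ownership check is what keeps the controller from deleting or overwriting records created by other tools, which is the role the description attributes to TXT-based ownership tracking.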
- [getmoto/moto](https://awesome-repositories.com/repository/getmoto-moto.md) (8,550 ⭐) — Moto is a cloud service mockery framework and API mock server that simulates AWS infrastructure locally. It allows developers to test cloud-dependent code and verify infrastructure-as-code templates without deploying real resources or incurring costs.

The project functions as an SDK interceptor that can patch existing service clients to redirect requests to a local mock environment. It can also be run as a standalone HTTP server, enabling any programming language to interact with the simulated endpoints.

The framework covers a vast array of simulated capabilities, including data storage, compute and hosting, identity and access management, AI and machine learning, and networking. It further supports the simulation of complex environments through account-based resource isolation and simulated access control to mimic multi-tenant cloud logic.
- [spotify/postgresql-metrics](https://awesome-repositories.com/repository/spotify-postgresql-metrics.md) (598 ⭐) — Tool that extracts and provides metrics on your PostgreSQL database
- [noodle-run/noodle](https://awesome-repositories.com/repository/noodle-run-noodle.md) (12,328 ⭐) — Noodle is a containerized application orchestrator designed to automate the deployment and lifecycle management of services across distributed production environments. It functions as an infrastructure automation platform that maintains a consistent global state for containerized workloads.

The platform provides a multi-cloud abstraction layer that normalizes disparate cloud provider APIs into a unified interface, enabling workload portability across different infrastructure vendors. It utilizes a declarative state reconciliation model to continuously compare desired configurations against the actual cluster state, automatically applying corrective actions to eliminate configuration drift.

The system manages distributed services through a control plane that employs a replicated consensus algorithm to ensure high availability. It supports immutable infrastructure deployments by replacing existing container instances with fresh versions, and it handles service discovery and traffic routing through sidecar proxy networking.
- [linkerd/linkerd2](https://awesome-repositories.com/repository/linkerd-linkerd2.md) (11,424 ⭐) — This project is a service mesh platform designed to manage, secure, and observe service-to-service communication within Kubernetes clusters. It functions as a control plane that orchestrates transparent sidecar proxies, which intercept and manage network traffic to provide reliable connectivity for microservices. By automating the injection of these proxies, the platform ensures that infrastructure-level policies are applied consistently across all workloads without requiring manual configuration changes.

The platform distinguishes itself through its focus on zero-trust security and cross-cluster connectivity. It enforces mutual TLS for all inter-service communication by automatically issuing and rotating short-lived cryptographic certificates, ensuring that traffic is encrypted and identities are verified. Furthermore, it provides robust multicluster capabilities, enabling unified service discovery, traffic routing, and load balancing across distinct network environments, effectively bridging distributed workloads into a single logical communication fabric.

Beyond its core security and connectivity features, the project offers a comprehensive suite for traffic management and observability. It supports advanced routing strategies, including header-based and protocol-aware traffic shifting, alongside resilience patterns like circuit breaking, retries, and fault injection to maintain system stability. The observability framework collects real-time telemetry, request metrics, and distributed traces, providing deep visibility into service health, performance, and dependencies through integrated dashboards and diagnostic tools.

The project is managed via a command-line interface that supports automated installation, upgrades, and cluster diagnostics to ensure operational readiness. It allows for extensive customization of proxy behavior and resource allocation through standard Kubernetes manifests and annotations, facilitating integration into diverse infrastructure environments.
- [azure/aad-pod-identity](https://awesome-repositories.com/repository/azure-aad-pod-identity.md) (0 ⭐) — ❗ IMPORTANT: As of Monday 10/24/2022, AAD Pod Identity is deprecated. As mentioned in the announcement, AAD Pod Identity has been replaced with Azure Workload Identity. Going forward, we will no longer add new features or bug fixes to this project in favor of Azure Workload Identity, which…
- [mattermost/mattermost](https://awesome-repositories.com/repository/mattermost-mattermost.md) (38,139 ⭐) — Mattermost is a self-hosted, enterprise-grade communication platform designed for organizations that require strict control over their internal data and messaging infrastructure. It functions as a centralized hub for real-time team interaction, offering persistent messaging, voice and video conferencing, and integrated project management tools within a single, private workspace. The platform is built to support high-security environments, including air-gapped deployments where public internet access is restricted or unavailable.

The platform distinguishes itself through a focus on regulatory compliance and administrative sovereignty. It provides granular role-based access control, comprehensive audit logging, and data retention policies to meet legal and security standards. Organizations can extend the core functionality through a plugin-based framework, allowing for the injection of custom server-side logic and UI components without modifying the underlying source code. Furthermore, the system acts as a secure workflow orchestrator, enabling teams to integrate automated tasks and external services directly into their communication channels.

The architecture is designed for scalability and reliability, supporting large-scale deployments through Kubernetes-based orchestration and microservices-ready infrastructure. Administrators can manage complex environments using centralized identity federation, external search indexing for high-performance data retrieval, and robust disaster recovery planning. The platform also includes tools for mobile device management and custom branding to ensure a consistent and secure experience across organizational hardware.

Comprehensive documentation is available to guide administrators through installation, configuration, and maintenance, including specific procedures for Kubernetes deployments and air-gapped environment setups.
- [ubicloud/ubicloud](https://awesome-repositories.com/repository/ubicloud-ubicloud.md) (12,098 ⭐) — Ubicloud is an open-source cloud infrastructure platform that provides a unified control plane for provisioning and managing virtual machines, container clusters, and managed databases. It functions as an infrastructure-as-code provider, utilizing declarative configuration files to automate the deployment and scaling of compute, networking, and storage resources across cloud environments.

The platform distinguishes itself by integrating a dedicated managed PostgreSQL database service that automates backups, read replicas, and high-availability configurations. It also features a container orchestration engine designed to manage and scale workloads across node pools, ensuring consistent performance and availability for applications.

Beyond its core orchestration capabilities, the system includes tools for managing virtual machine images, configuring network security through firewall rules and subnets, and distributing traffic across instances via integrated load balancing. These features are managed through a centralized control loop that reconciles desired infrastructure states with the actual environment.

The platform is accessible via a command-line interface that enables programmatic control over infrastructure lifecycle management.
- [junfenggo/scale-up](https://awesome-repositories.com/repository/junfenggo-scale-up.md) (0 ⭐) — This is the official implementation of our paper 'SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency', accepted at ICLR 2023. This research project is developed based on Python 3 and PyTorch, created by Junfeng Guo and Yiming Li.
- [mattermost/mattermost-mobile](https://awesome-repositories.com/repository/mattermost-mattermost-mobile.md) (2,593 ⭐) — This project is an enterprise messaging mobile application and cross-platform team chat client. It serves as a self-hosted messaging interface for team communication, direct messaging, and voice calls within corporate environments.

The application integrates artificial intelligence agents to automate repetitive tasks and retrieve information. It also functions as a Kanban task management tool, providing project and task coordination through planning boards to track operational work.

The platform covers secure mobile messaging with local data sanitization and mobile workflow automation. It includes user preference management for adjusting notification settings, visual themes, and profile details.
- [linkedin/school-of-sre](https://awesome-repositories.com/repository/linkedin-school-of-sre.md) (8,093 ⭐) — This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments.

The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the study of system design, resource estimation, and the elimination of single points of failure.

The material extends into broad operational capabilities, including container orchestration, continuous integration and delivery pipelines, layered observability, and network routing. It also provides detailed instruction on Linux system administration, database management, security auditing, and the implementation of service level indicators and objectives.
