Thanos

Thanos is a distributed metrics query engine and monitoring scalability suite that provides a unified interface for querying data from multiple Prometheus servers and clusters. It supports highly available monitoring setups: Prometheus can run as redundant replicas to avoid a single point of failure, and Thanos deduplicates the overlapping data from those replicas at query time.
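As a rough illustration of the replica deduplication described above, the sketch below merges series that differ only in a replica label. The label name `replica`, the data layout, and the function `deduplicate` are assumptions made for this example; it is a simplified model, not Thanos's actual algorithm.

```python
from collections import defaultdict

def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical except for the replica label.

    Each series is a dict: {"labels": {...}, "samples": [(timestamp, value), ...]}.
    For each timestamp, the first replica that reported a value wins.
    """
    merged = defaultdict(dict)
    for series in series_list:
        # Identity of a series once the replica label is dropped.
        key = tuple(sorted(
            (k, v) for k, v in series["labels"].items() if k != replica_label
        ))
        for ts, value in series["samples"]:
            merged[key].setdefault(ts, value)
    return [
        {"labels": dict(key), "samples": sorted(samples.items())}
        for key, samples in merged.items()
    ]

# Hypothetical data: replica "a" missed the scrape at t=30.
replica_a = {"labels": {"job": "api", "replica": "a"},
             "samples": [(0, 1.0), (15, 2.0)]}
replica_b = {"labels": {"job": "api", "replica": "b"},
             "samples": [(0, 1.0), (15, 2.0), (30, 3.0)]}
result = deduplicate([replica_a, replica_b])
```

Here the two replicas collapse into a single `job="api"` series, and the sample that one replica missed is filled in from the other.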

The system enables long-term retention by persisting time-series data to cloud object storage, so that historical data is no longer bounded by the capacity of local disks. A downsampling and retention manager then stores older data at coarser resolutions and expires data that is no longer needed, which speeds up long-range queries and keeps storage costs under control.
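The downsampling idea mentioned above can be pictured with a small sketch: raw samples are grouped into fixed windows, and each window keeps a few aggregates instead of every point. The 300-second window, the aggregate names, and the function `downsample` are assumptions chosen for illustration, not a description of the on-disk format.

```python
def downsample(samples, window_seconds=300):
    """Aggregate (timestamp_seconds, value) pairs into fixed windows.

    Returns one entry per window with count, sum, min and max, so that
    averages and extremes can still be answered from the coarser data.
    """
    windows = {}
    for ts, value in samples:
        start = ts - ts % window_seconds  # align to the window boundary
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))

# Hypothetical raw samples spanning two 5-minute windows.
raw = [(0, 2.0), (60, 4.0), (299, 6.0), (300, 1.0), (450, 3.0)]
coarse = downsample(raw)
```

A query over a long time range then reads one aggregate row per window rather than every raw sample, which is where the speed-up comes from.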

The project covers broad capability areas including cross-cluster metric federation, stateless query execution, and automated data compaction. It also includes mechanisms for alert and recording rule evaluation and fault-tolerant query routing across distributed nodes.
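To make the fan-out and fault-tolerant routing ideas above more concrete, here is a minimal sketch: one query is sent to several data sources in parallel, and a source that fails is skipped so the caller still gets a partial result. The store names, the callable-based store interface, and the function `fan_out` are all hypothetical and stand in for real network calls.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(query, stores):
    """Send the same query to every store in parallel and merge the results.

    `stores` maps a store name to a callable that takes the query string and
    returns a list of (series_name, value) pairs. A store that raises is
    skipped, so the caller gets a partial result instead of an error.
    """
    def call(item):
        _name, fetch = item
        try:
            return fetch(query)
        except Exception:
            return []  # tolerate a failed store

    with ThreadPoolExecutor(max_workers=max(1, len(stores))) as pool:
        partials = list(pool.map(call, stores.items()))
    merged = [row for partial in partials for row in partial]
    return sorted(merged)

def broken_store(query):
    raise ConnectionError("store unreachable")

# Hypothetical stores: two healthy clusters and one that is down.
stores = {
    "cluster-eu": lambda q: [("up{cluster='eu'}", 1.0)],
    "cluster-us": lambda q: [("up{cluster='us'}", 1.0)],
    "cluster-ap": broken_store,
}
rows = fan_out("up", stores)
```

Because the query layer in this sketch holds no data of its own, any number of identical instances could run the same function against the same set of stores.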

Features

Global Prometheus Querying - Provides a unified interface for aggregating and querying metrics across multiple distributed Prometheus clusters.

Unified Live and Historical Metric Queries - Provides a query interface that seamlessly combines real-time data with archived object-storage blocks.

Distributed Query Engines - Acts as a distributed query engine that aggregates metric data from multiple Prometheus servers into a single interface.

Distributed Query Processing - Implements parallel execution of data queries across multiple distributed nodes to retrieve a unified result set.

High-Availability Metric Deduplications - Deduplicates metric samples from redundant Prometheus replicas to ensure query results are accurate.

Object-Storage Persistence - Persists immutable blocks of time-series data to cloud object storage for durable and unlimited scaling.

Object Storage Persistence - Persists time-series data to cloud-native object stores to provide an unlimited historical metric archive.

Long-Term Metric Retention - Implements policies and mechanisms for long-term metric retention in object storage without scaling storage nodes.

Cross-Cluster Federation - Connects disparate monitoring clusters to enable unified querying and analysis across different environment boundaries.

High Availability Observability - Ensures continuous telemetry collection and visibility by running redundant monitoring instances and deduplicating data.

Cross-Cluster Metric Federation - Provides the ability to connect disparate clusters for unified querying and data visibility across environment boundaries.

Global Metric Aggregation - Aggregates metric data from multiple distributed sources through a unified service to create a global view of system performance.

High-Availability Metric Deduplication - Makes it practical to run redundant Prometheus replicas, which removes the single point of failure, by identifying and dropping the duplicate samples those replicas produce.

Global Query Engines - Provides a global query engine that aggregates and unifies results from distributed Prometheus instances.

Block Consolidation - Merges small time-series data blocks into larger ones within object storage to reduce storage footprint and improve query performance.

Data Downsampling Strategies - Stores old metrics at coarser resolutions so that long-range queries read less data, and allows raw data to be expired earlier to reduce storage cost.

Distributed Block Compaction - Merges and deduplicates metric blocks in object storage through parallel compaction across multiple instances.

Data Retention Policies - Applies down-sampling and retention policies to stored metrics to control storage growth.

Query Fan-out - Splits a single global request into multiple parallel queries across distributed data sources to aggregate a unified result.

Stateless Query Execution - Runs queries across stateless instances that discover available data sources to minimize request fanout.

Object Storage Retrieval - Implements optimized data retrieval from object stores using block metadata and index caching to accelerate queries.

Metric Store Discovery - Locates available data sources using address lists or DNS lookups to build a dynamic cluster for querying.

TSDB Block Index Caches - Caches TSDB block indexes in memory to translate data requests into optimized lookups within object storage.

Metric Aggregation & Downsampling - Reduces the resolution of historical time-series data to lower storage costs and accelerate long-term trend queries.

Fault Tolerance Mechanisms - Implements fault-tolerant query routing to distribute requests across available components, ensuring resilience against node failures.

High Availability Systems - Distributes metric data and query processing across multiple nodes to ensure fault tolerance and continuous system availability.

Sidecar Data Uploaders - Runs as a companion process to Prometheus to ship local time-series blocks to remote object storage.

Cluster Discovery Services - Provides automated services for identifying and registering nodes in a distributed monitoring system.

Gossip-Based Discovery - Early releases used gossip protocols to exchange network addresses and discover available nodes for query routing; this mechanism was later removed in favor of address lists and DNS lookups.

Stateless Serving Layers - Decouples stateless request handling for queries from stateful data storage to enable independent scaling.

Retention Management - Compacts and downsamples historical metric data and applies retention policies to control storage costs and improve query speeds.

PromQL Rule Evaluation Engines - Evaluates PromQL-based alerting and recording rules on a schedule to trigger notifications and pre-compute metrics.

Prometheus Remote Write Ingestion - Receives, validates, and batches Prometheus remote-write samples for long-term persistence in cloud storage.
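The block consolidation entries in the list above can be illustrated with a small planning sketch: blocks whose time ranges fall into the same aligned window are grouped and replaced by one larger block. The tuple representation, the hour-based units, and the function `plan_compaction` are assumptions for this example, and the sketch assumes no block spans a window boundary; a real compactor also has to merge indexes and chunk data.

```python
def plan_compaction(blocks, target_range):
    """Group blocks whose time ranges fall in the same aligned window.

    `blocks` is a list of (min_time, max_time) tuples, with max_time exclusive.
    Returns a list of merged (min_time, max_time) blocks, one per window.
    """
    groups = {}
    for min_t, max_t in blocks:
        window = min_t - min_t % target_range  # start of the aligned window
        groups.setdefault(window, []).append((min_t, max_t))
    merged = []
    for window in sorted(groups):
        members = groups[window]
        merged.append((min(b[0] for b in members), max(b[1] for b in members)))
    return merged

# Hypothetical input: five 2-hour blocks, compacted toward 8-hour blocks.
two_hour_blocks = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
compacted = plan_compaction(two_hour_blocks, target_range=8)
```

In this example the first four blocks become a single block covering hours 0 to 8, while the last block stays on its own until more data arrives for its window.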

Database Tools - A highly available Prometheus setup.

Monitoring and Logging - A system that provides high availability and distributed storage for Prometheus.

thanos-io/thanos
