Distributed data stores optimized for handling massive write volumes and large-scale analytical or transactional workloads.
Cassandra is a distributed NoSQL database and wide-column store designed for high availability and linear scalability. It is a fault-tolerant distributed system that uses an LSM-tree storage engine to optimize write throughput and manage massive datasets. It exposes the Cassandra Query Language (CQL), a SQL-like language for managing and retrieving tabular data stored across multiple nodes. Data is organized into rows and columns according to a flexible schema and primary keys. The project provides horizontal database scaling, distributed data partitioning, and high-volume tabular querying. It also supports data expiration policies and user-defined functions for data transformations.
Apache Cassandra is a flagship distributed wide-column store that natively supports high-volume write throughput, horizontal scalability, and CQL, making it a definitive match for your requirements.
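The write-optimized behavior of an LSM-tree storage engine can be illustrated with a minimal sketch. This is a toy model, not Cassandra's implementation: the class name ToyLSM, the memtable_limit threshold, and the in-memory "runs" standing in for on-disk files are all assumptions made for illustration, and compaction, commit logs, and tombstones are left out.

```python
from bisect import bisect_left


class ToyLSM:
    """Toy LSM-tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # recent writes, held in memory
        self.runs = []                        # flushed sorted runs, oldest first

    def put(self, key, value):
        # A write only touches the memtable, which keeps it cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted run that is never modified again.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Writes never rewrite existing runs; an updated key simply appears again in a newer run, and reads resolve the conflict by checking the newest data first.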
OceanBase is a distributed SQL database designed for high availability and strong consistency across multiple nodes and regions. It functions as a hybrid transactional and analytical processing engine, allowing real-time analytics and transactions to execute on a single data copy. The system also serves as a vector database engine for indexing and querying vector data to power semantic search and recommendation systems. The platform features native compatibility layers for MySQL and Oracle, enabling the migration of legacy workloads without rewriting SQL code. It utilizes a Paxos-based distributed store for synchronous replication and implements a multi-tenant architecture that isolates CPU, memory, and I/O resources for different tenants within a single cluster. The system covers a broad range of capabilities, including horizontal storage scaling, distributed transaction management, and hybrid row-columnar storage. It provides tools for cluster orchestration, automated load balancing via log-stream migration, and disaster resilience through multi-zone replication and automated failover. Deployment and management are supported through a Kubernetes operator and a web monitoring dashboard.
OceanBase is a distributed HTAP database that supports hybrid row-columnar storage and horizontal scalability, making it a capable engine for high-volume data workloads even though it prioritizes SQL compatibility over CQL or Thrift interfaces.
Scylla is a distributed wide-column NoSQL database designed as a high-performance data store. It functions as a Cassandra-compatible database and a DynamoDB-compatible store, implementing a shared-nothing architecture built on an asynchronous, event-driven framework. The system implements the Cassandra Query Language for high-throughput workloads and emulates cloud-based APIs to support applications built for proprietary cloud protocols. This allows cloud workloads to be migrated to self-hosted environments while maintaining API compatibility. The project covers distributed data storage and NoSQL database management, using a SQL-like syntax for data retrieval and manipulation across multiple nodes to ensure high availability and fault tolerance.
Scylla is a high-performance, distributed wide-column store that natively supports CQL and horizontal scalability, making it a direct and robust solution for high-volume write workloads.
TDengine is a distributed time-series database designed for the high-speed ingestion, compression, and retrieval of timestamped metrics and sensor data. It functions as a SQL-compatible analytics engine, allowing users to perform complex operations on massive volumes of time-ordered information using standard relational syntax. The platform is built to serve as a backend foundation for industrial IoT environments, managing real-time data streams and device metadata through a cluster-based architecture. The system distinguishes itself through a distributed sharding architecture that uses consistent hashing to ensure horizontal scalability and high-throughput ingestion. It employs a log-structured write path to minimize disk seek latency and utilizes super-table virtualization to provide a unified logical view across multiple physical tables. To maintain performance and cost-efficiency, the database features automated multi-tiered lifecycle management, which migrates data between high-performance memory and low-cost storage based on age and access frequency. Beyond its core storage capabilities, the platform provides robust tools for edge-to-cloud synchronization, ensuring consistent data states across geographically distributed infrastructure. It includes built-in support for real-time stream processing, allowing for the analysis of live data without requiring external message queues. The system also incorporates comprehensive security frameworks, including user access control, audit logging, and encrypted transport protocols to protect sensitive operational data. Developers can interact with the database through native client libraries that support connection pooling and query parameter binding. The system is documented with comprehensive error code diagnostics and provides command-line utilities for cluster administration, health monitoring, and configuration management.
TDengine is a specialized time-series database optimized for IoT metrics rather than a general-purpose wide-column store. It belongs to a neighbouring category and does not support CQL or Thrift.
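The consistent hashing mentioned in the TDengine description can be sketched in general terms as follows. This is a generic hash ring, not TDengine's actual sharding code: the HashRing class, the vnodes parameter, the node names, and the choice of MD5 as the hash function are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable hash so that placement is the same on every run.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Generic consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The property that matters for horizontal scaling is that adding a node only moves the keys that the new node takes over; every other key stays where it was.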
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed to run natively on Kubernetes, with a decoupled compute and storage architecture that enables horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without requiring a predefined schema. It provides a unified observability data model that processes all three signal types as timestamped wide events, allowing JOIN queries across signals. The system includes a continuous aggregation pipeline with an optional Flownode component for streaming and materialized view computations, plus configurable log pipeline processing that parses and transforms raw log lines during ingestion. The database offers a broad capability surface including automatic schema inference, columnar storage on an LSM-tree, distributed query execution with pushdown, and support for inverted, fulltext, and skipping indexes. It provides multiple query APIs (MySQL, PostgreSQL, HTTP, gRPC, Elasticsearch, Jaeger), BI tool connectivity, and integration with AI assistants through the Model Context Protocol. Deployment options range from standalone binaries to distributed clusters on Kubernetes, with metadata stored in etcd, MySQL, or PostgreSQL.
GreptimeDB is a distributed, columnar database designed for high-volume time-series data that offers horizontal scalability and a decoupled architecture, though it focuses on observability signals rather than general-purpose wide-column storage like Cassandra.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based distribution and schema-based sharding, which allows for the isolation of tenant data and the migration of high-volume workloads to dedicated nodes. To accelerate analytical performance, the system integrates columnar storage with data compression and supports pre-aggregated rollups, ensuring that large-scale datasets remain performant as the cluster grows. Beyond its core distribution capabilities, the project provides comprehensive tools for cluster administration and data lifecycle management. It automates shard rebalancing, schema propagation via a two-phase commit protocol, and the maintenance of time-based partitions. The system also includes diagnostic utilities for monitoring query performance, detecting resource contention, and analyzing index usage across the distributed environment.
Citus is a distributed SQL engine built as a PostgreSQL extension, which provides horizontal scaling and columnar storage but operates as a relational database rather than a wide-column store.
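The coordinator-worker routing and tenant isolation described for Citus can be pictured with a small sketch. It is a simplified model, not the Citus implementation: the Coordinator class, the shard count, the round-robin placement, and the CRC32-modulo hashing are assumptions made for illustration, and real metadata handling and query execution are omitted.

```python
import zlib


class Coordinator:
    """Toy coordinator that maps tenants to shards and shards to workers."""

    def __init__(self, workers, shard_count=8):
        self.shard_count = shard_count
        # Hypothetical initial placement: spread shards round-robin over workers.
        self.placement = {s: workers[s % len(workers)] for s in range(shard_count)}

    def shard_for(self, tenant_id) -> int:
        # Hash the distribution column value to pick a shard deterministically.
        return zlib.crc32(str(tenant_id).encode("utf-8")) % self.shard_count

    def route(self, tenant_id) -> str:
        # A query for one tenant is sent only to the worker holding its shard.
        return self.placement[self.shard_for(tenant_id)]

    def move_shard(self, shard: int, worker: str) -> None:
        # Rebalancing or isolating a tenant only updates placement metadata here.
        self.placement[shard] = worker
```

In this model, moving a busy tenant to a dedicated node amounts to calling move_shard for that tenant's shard; later calls to route for the same tenant then return the new worker.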