30 open-source projects similar to apache/hbase, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best HBase alternative.
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
3FS is a distributed file system and RDMA storage cluster designed for high-performance AI training and inference workloads. It functions as a strongly consistent storage layer that utilizes a disaggregated architecture to pool SSDs and memory resources across multiple nodes. The system provides specialized storage implementations including an AI training checkpoint store for parallel state preservation and a distributed key-value cache store for decoder layer vectors to optimize inference processing. It ensures data integrity through chain replication and apportioned query distribution. The
GlusterFS is a software-defined distributed file system and scale-out storage cluster that aggregates disk resources from multiple servers into a single global namespace. It functions as a unified storage platform, allowing the same underlying data to be exposed through file, block, and object storage interfaces. The system distinguishes itself through a decentralized architecture that uses consistent hashing to distribute files across network nodes without a central metadata server. It ensures data integrity and availability using self-healing replication, quorum-based consistency to prevent
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
This project is a comprehensive Java backend engineering guide and technical reference focused on high-concurrency design, distributed systems, and microservices architecture. It provides detailed strategies for decomposing monolithic applications, managing service discovery, and implementing the architectural patterns required for scalable backend environments. The repository distinguishes itself through an extensive collection of big data algorithmic references and database scaling strategies. It covers memory-efficient techniques for analyzing massive datasets, such as Top-K element extrac
Kvrocks is a disk-based NoSQL database and distributed key-value store that leverages the RocksDB storage engine to persist large datasets to physical disk. It is designed to be a Redis-compatible database, utilizing the standard Redis communication protocol to ensure interoperability with existing client libraries and tools. The project distinguishes itself by combining a disk-persistent storage model with advanced retrieval capabilities, including vector search for k-nearest neighbor queries, full-text search indexing, and geospatial query execution. It supports distributed clustering with
Cassandra is a distributed NoSQL database and wide-column store designed for high availability and linear scalability. It functions as a fault-tolerant distributed system that utilizes an LSM-tree storage engine to optimize write throughput and manage massive datasets. The system is a CQL-compliant database, using a structured query language to manage and retrieve tabular data stored across multiple nodes. It organizes information into rows and columns based on a flexible schema and primary keys. The project provides capabilities for horizontal database scaling, distributed data partitioning
RustFS is a distributed object storage system designed for high availability and horizontal scalability. It functions as a cluster-based platform that manages data across multiple nodes, providing a self-hosted infrastructure for large-scale storage requirements. The system is built to be container-native, utilizing an operator to automate deployment and management within orchestrated environments. It provides compatibility with standard object storage protocols, allowing existing applications and tools to interact with the storage layer through a translation interface. To ensure long-term re
SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study of big data engine analysis, focusing on how the system manages cluster execution and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of data shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation co
Scylla is a distributed wide-column NoSQL database designed as a high-performance data store. It functions as a Cassandra-compatible database and a DynamoDB-compatible store, implementing a shared-nothing architecture built on an asynchronous event-driven framework. The system emulates cloud-based APIs to support applications built for proprietary cloud protocols and implements the Cassandra Query Language for high-throughput workloads. This allows for the migration of cloud workloads to self-hosted environments while maintaining API compatibility. The project covers distributed data storage
This project is an open source relational database management system and SQL database designed for storing and managing structured data. It functions as a relational database for ensuring consistency and reliability, while also operating as a vector database for storing and querying high-dimensional vector embeddings. The system incorporates a columnar storage engine to optimize analytical query processing and large-scale data aggregation. It further enables vector similarity search, allowing users to find similar items by querying vector embeddings. The software covers a broad capability su
RocksDB is a high-performance, embeddable persistent key-value library and storage engine based on Log-Structured Merge-trees. It is designed to provide durable storage for large-scale datasets, integrating directly into applications to manage data on flash and RAM-based hardware. The engine is distinguished by its focus on minimizing read and write amplification through multi-threaded compaction and custom memory allocators. It features specialized optimizations for flash storage, including support for zoned block devices, and provides the ability to extend store behavior via external plugin
VictoriaMetrics is a high-performance, scalable time series database and observability platform designed for long-term storage and analysis of metric, log, and trace data. It functions as a unified backend for monitoring ecosystems, offering full compatibility with industry-standard protocols and query languages. The system is built to handle massive data volumes through a distributed architecture that supports horizontal scaling and efficient data lifecycle management. The platform distinguishes itself through a storage engine that utilizes consistent hashing for data sharding and log-struct
rpcx is a high-performance remote procedure call framework for building scalable microservices in Go. It functions as a binary protocol RPC system and a service mesh, providing the necessary infrastructure for low-latency inter-service communication and distributed cloud environments. The project features a cross-language service gateway that provides an HTTP entry point, allowing clients written in any programming language to invoke Go remote services via protocol translation. It also includes a specialized RPC traffic analyzer for capturing and analyzing binary packets to debug network comm
TiKV is a distributed transactional key-value store designed for horizontal scalability and high availability. It functions as a storage engine that maintains massive datasets across a cluster of physical nodes, ensuring that information remains accessible and consistent even when individual hardware components fail. The system utilizes a consensus-based replication model to synchronize data across nodes, ensuring that all replicas agree on the order of operations. It manages data distribution through a sharding mechanism that partitions large datasets into smaller groups, each governed by in
This project is a comprehensive computer networking textbook and instructional resource. It serves as a technical guide for the design and implementation of network layers, protocols, and hardware architecture, covering the spectrum from physical links to application-layer protocols. The content provides a detailed study of standards for congestion control, reliable data delivery, and internetwork routing. It includes specialized technical material on network security, public-key infrastructure, and the operation of modern cloud infrastructure and data centers. The material covers a broad ra
Webmin is a web-based administration interface for Unix systems. It provides a centralized console for managing the full range of server administration tasks — users and groups, software packages, storage, network configuration, system services, and security — all through a browser. Its modular architecture allows separate modules to handle databases (MySQL, MariaDB, PostgreSQL), web servers (Apache), DNS (BIND), email (Sendmail, Dovecot), file sharing (Samba, NFS), and more, with a unified access control system that restricts what each administrator can see and do. What sets Webmin apart is
Garage is a distributed object storage system that provides an S3-compatible API gateway. It is designed to synchronize metadata across distributed nodes using conflict-free replicated data types and Merkle-tree state alignment to maintain cluster-wide consistency. The system ensures data resilience through zone-aware replication, distributing data copies across multiple physical locations. It employs quorum-based request routing and versioned layout management to validate and commit cluster configuration changes. The project covers a broad range of operational capabilities, including automa
Redis is a high-performance in-memory key-value store that functions as a distributed cache, message broker, and NoSQL database. It provides sub-millisecond read and write access to data stored in RAM and can operate as a vector database for indexing high-dimensional embeddings. The system supports a wide range of data storage and synchronization primitives, including the management of strings, hashes, lists, sets, and JSON documents. It enables real-time data operations through atomic transactions, hybrid persistence using snapshots and append-only logs, and high-availability configurations
Weaver is a distributed application framework and remote procedure call system that allows developers to organize logic into independent components. It provides a multi-process execution environment where these components communicate via automated serialization, enabling applications to run as a single unit locally or as a distributed system across multiple cloud machines. The framework distinguishes itself through a configuration-driven topology mapping that allows the same logic to execute as a local function call or a remote network request without altering the business logic. It includes
This project is a comprehensive performance programming guide and reference for the Go language, focusing on runtime efficiency and memory optimization. It provides a collection of patterns and techniques designed to increase execution speed by reducing garbage collection overhead and optimizing memory usage. The resource distinguishes itself through detailed reference implementations for memory optimization, such as escape analysis, object pooling, and structure memory alignment. It offers specific strategies for reducing binary size and improving CPU cache efficiency through structure memor
FastStream is an asynchronous Python framework designed for building event-driven microservices. It provides a unified abstraction layer for interacting with various message brokers, enabling developers to manage event production and consumption through a consistent interface while maintaining access to native provider-specific features. The framework centers on a decorator-based routing model that binds application logic directly to broker topics, supported by a built-in dependency injection container that resolves resources at runtime. The framework distinguishes itself through its deep int
FastDFS is a distributed file system and object store designed as a high-capacity file server. It functions as a cluster storage manager that saves, syncs, and accesses large volumes of unstructured data across a network of distributed servers. The system uses unique identifiers for file retrieval and indexing instead of traditional hierarchical naming to avoid metadata bottlenecks. It manages file attributes through key-value metadata mapping and employs a distributed replication model to ensure high availability and data redundancy across storage groups. The project provides capabilities f
Buildbot is a Python-based continuous integration framework and distributed build orchestrator. It functions as a build automation engine that coordinates the retrieval of source code, the execution of build steps, and the reporting of results through a central controller and a network of remote worker agents. The system is distinguished by a plugin-based extensibility architecture and a master-worker distribution model. It allows for dynamic build modification at runtime and supports a pluggable database backend for persisting system state and historical build data. The project covers a bro
go-fastdfs is a distributed file system and object storage server designed for building private cloud storage. It provides a FastDFS-compatible storage implementation that manages clusters of storage nodes to handle large-scale file uploads and downloads. The system focuses on high availability through a decentralized architecture that automatically synchronizes data and repairs failures across multiple machines without a central coordinator. It specifically supports resumable file storage via HTTP, allowing large transfers to be paused and resumed from the last successful byte to handle netw
Mini-LSM is an educational storage engine and key-value database library designed to demonstrate the implementation of log-structured merge-tree architecture. It serves as a pedagogical resource for understanding how to build high-performance storage systems from the ground up, focusing on the mechanics of persistent data structures and disk-based storage. The project provides a functional framework for managing data through memory-to-disk flushing and multi-version concurrency control. It distinguishes itself by implementing snapshot-based isolation, which allows for consistent views of the
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
Ceph is a unified, software-defined storage platform designed to provide object, block, and file storage services from a single distributed cluster. By decoupling data management from physical hardware, it enables elastic scaling across commodity hardware, allowing organizations to build large-scale storage infrastructure without reliance on proprietary vendor equipment. The system distinguishes itself through a shared-nothing, distributed architecture that utilizes deterministic hashing for data placement. This approach eliminates centralized metadata bottlenecks, allowing the cluster to sca
MinIO is a software-defined, cloud-native object storage server designed to manage large volumes of unstructured data. It functions as a distributed storage cluster that aggregates multiple independent nodes into a unified, scalable pool, providing a high-performance infrastructure compatible with standard cloud storage protocols and application programming interfaces. The system utilizes a shared-nothing architecture that eliminates central metadata servers, relying instead on a decentralized hash table to map objects across the cluster. Data availability and resilience are maintained throug