30 open-source projects similar to apache/iceberg, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Iceberg alternative.
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large-scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Apache Hudi is an open-source table format that brings ACID transactions, incremental processing, and multi-modal indexing to data lakes. It provides atomic commits with snapshot isolation, rollback, and optimistic concurrency control for reliable data lake operations, while supporting upserts, record-level updates, and deletions in large analytical datasets. The project distinguishes itself through a timeline-based architecture that coordinates all write operations, enabling features like time-travel querying, incremental change streaming, and multi-modal query views that include snapshot, i
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Gravitino is a federated metadata lake and unified data catalog designed to manage tables, files, and AI models across diverse data sources and cloud storage. It serves as a centralized interface for governing schemas, access controls, and tagging across relational databases, messaging queues, and object stores. The project distinguishes itself by unifying the management of AI assets, such as machine learning models and their version lineages, alongside traditional tabular data. It also implements the Iceberg REST specification to provide a standardized metadata server and proxy for lakehouse
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Arrow is a cross-language development platform for in-memory data. It provides a standardized, language-independent columnar memory format designed to accelerate analytical operations and improve memory efficiency on modern computing hardware. By utilizing a schema-driven approach, the framework enables the efficient organization of both flat and nested data structures. The project functions as an analytical data processing engine that facilitates high-performance computation directly on memory-resident datasets. It distinguishes itself through a zero-copy architecture, which allows multiple
SQLiteStudio is an open-source graphical tool for browsing, editing, and managing SQLite database files. It combines a full-featured SQL editor with syntax highlighting, a visual database schema designer for creating entity-relationship diagrams, and a plugin-based extensibility platform that allows adding custom functionality through C/C++, JavaScript, Tcl, or Python. The application distinguishes itself through its multi-language scripting engine, which embeds JavaScript, Tcl, and Python interpreters to enable user-defined functions and scripts within SQL queries. It supports encrypted data
Titan is a distributed graph database and computing engine designed for storing and querying massive datasets of interconnected nodes and edges across multi-machine clusters. It functions as a scalable graph storage layer and transactional store, providing a framework for executing large-scale graph processing jobs and deep traversals. The system is distinguished by its pluggable storage backend, which decouples the graph engine from the physical persistence layer. It utilizes vertex-cut data partitioning to balance processing loads and a set-cardinality property model that allows single prop
Sequel is a relational database toolkit for Ruby that provides object-relational mapping, a fluent SQL query builder, and schema migration capabilities. It maps database tables to Ruby classes with support for associations, validations, lifecycle hooks, and eager loading, offering a comprehensive ORM layer for building data-centric applications. Sequel distinguishes itself through a plugin-based extension architecture that allows composable customization of models, databases, and datasets without relying on deep inheritance hierarchies. It includes a thread-safe connection pool with support f
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Lance is a columnar data format and storage layer designed for high-performance random access and the persistence of multimodal data. It functions as a vector database storage system, a multimodal data store, and a versioned dataset manager. The project distinguishes itself as a hybrid search engine that combines vector similarity search and full-text indexing on a single dataset. It provides unified storage for diverse data types including images, audio, and video, utilizing a system that lazy-loads large binary objects only when requested. The system manages dataset evolution through schem
Kaminari is a Ruby pagination library and ActiveRecord tool designed to divide large datasets into smaller pages using limit and offset logic. It functions as a data paging utility that manages record offsets and total count calculations for Ruby web applications. The project distinguishes itself by generating SEO-friendly navigation links and standardized HTML tags to improve search engine indexing. It supports localized navigation labels and translation files for multilingual interface design, and allows for customizable pagination themes via template overrides of view partials. The librar
This project is a relational database cheat sheet and SQL reference guide. It provides a collection of syntax examples and query documentation for managing relational databases using structured query language. The tool is implemented as a static site with client-side searchable documentation, allowing for immediate filtering of technical content through a browser-based index. The reference covers relational database management, including data retrieval, database schema management, and record maintenance. It also includes guidance on relational data manipulation through table joins and the g
RocksDB is a high-performance, embeddable persistent key-value library and storage engine based on Log-Structured Merge-trees. It is designed to provide durable storage for large-scale datasets, integrating directly into applications to manage data on flash and RAM-based hardware. The engine is distinguished by its focus on minimizing read and write amplification through multi-threaded compaction and custom memory allocators. It features specialized optimizations for flash storage, including support for zoned block devices, and provides the ability to extend store behavior via external plugin
This project is a reference library of architectural blueprints, study materials, and design patterns for building scalable, high-availability distributed systems. It serves as a technical guide for scalability engineering, providing structural solutions for common engineering challenges. The repository focuses on distributed systems design, covering essential patterns for data replication, consensus algorithms, and transaction management. It distinguishes itself by offering detailed blueprints for specialized domains, including real-time data streaming, large-scale data storage, and high-ava
ToyDB is a distributed SQL database that provides a system for storing and querying data across multiple nodes. It focuses on maintaining strong consistency and fault tolerance through the implementation of a distributed consensus algorithm. The project distinguishes itself by supporting historical data versioning, enabling time-travel queries to retrieve the state of the database from a specific point in the past. It utilizes multi-version concurrency control to manage ACID transactions and ensure data integrity during concurrent operations. The system covers relational data modeling with t
This project is an automated machine learning framework and toolkit designed for training and tuning custom models for classification, regression, and recommendations. It functions as a multimodal machine learning toolkit capable of processing and training models using a combination of text, image, audio, and sensor data. The framework distinguishes itself as a multimodal data processor that can handle and visualize large datasets on a single machine using column-oriented disk storage. It includes a core machine learning model generator that converts trained models into formats compatible wit
Stellar Core is the primary software implementation of the Stellar blockchain network, serving as a distributed ledger and a Federated Byzantine Agreement system. It functions as a core node that maintains the shared state of the network and provides a runtime environment for executing WebAssembly smart contracts. The project enables the creation and management of digital assets, including the implementation of decentralized exchanges through distributed orderbooks and automated liquidity pools. It facilitates cross-border payment settlement by routing assets via path payments and bridging di
A Scala API for Apache Beam and Google Cloud Dataflow.
Useful scripts, UDFs, views, and other utilities for migration and data warehouse operations in BigQuery.
Google-provided Cloud Dataflow templates for solving in-cloud data tasks.
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Anko is an Android Kotlin library designed to simplify application development through a set of domain-specific languages and extensions. It functions as a programmatic UI DSL, an SQLite wrapper, an SDK utility, and an asynchronous framework. The project provides a declarative layout system that allows developers to build user interfaces through code instead of static XML markup. It distinguishes itself by offering a fluent database layer that eliminates manual cursor management and a concurrency system that uses weak references to prevent memory leaks in activities. The library covers broad