30 open-source projects similar to zarr-developers/zarr-python, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Zarr Python alternative.
Alluxio is a virtual distributed file system and data orchestration layer that serves as a high-performance distributed cache between cloud storage and compute clusters, accelerating data access for large-scale analytics and machine learning workloads. It presents multiple heterogeneous storage backends as a single coherent namespace, so computation engines can access data from different providers without changing application code. The project c
Arrow is a cross-language development platform for in-memory data. It provides a standardized, language-independent columnar memory format designed to accelerate analytical operations and improve memory efficiency on modern computing hardware. By utilizing a schema-driven approach, the framework enables the efficient organization of both flat and nested data structures. The project functions as an analytical data processing engine that facilitates high-performance computation directly on memory-resident datasets. It distinguishes itself through a zero-copy architecture, which allows multiple
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
Apache Hudi is an open-source table format that brings ACID transactions, incremental processing, and multi-modal indexing to data lakes. It provides atomic commits with snapshot isolation, rollback, and optimistic concurrency control for reliable data lake operations, while supporting upserts, record-level updates, and deletions in large analytical datasets. The project distinguishes itself through a timeline-based architecture that coordinates all write operations, enabling features like time-travel querying, incremental change streaming, and multi-modal query views that include snapshot, i
Iceberg is an open table format and big data table manager designed for huge analytic datasets in cloud storage. It provides a specification for tracking large-scale datasets to maintain transactional consistency and structural integrity. The project utilizes a standardized REST catalog interface to manage table metadata, ensuring interoperability between different compute engines. This allows diverse query engines to connect to a single table interface and maintain consistency across different processing frameworks. Its core capabilities include managing large-scale analytic tables, coordin
Ignite is a distributed in-memory data grid and compute platform. It functions as a distributed SQL database and storage engine designed to store and process large datasets in RAM to minimize latency and increase calculation speed. The system is distinguished by a multi-tier storage engine that manages data placement across memory and disk to balance high-speed access with large capacity. It features a distributed compute grid that executes custom logic directly on the nodes where data resides to reduce network traffic. The platform provides a broad set of capabilities including ACID transac
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Casibase is an open-source platform that orchestrates multi-turn conversations with large language models and manages retrieval-augmented knowledge bases from a single interface. It provides a unified system for connecting to over 30 AI model providers, ingesting documents into vector embeddings for semantic search, and running autonomous agent loops that can drive a browser, search the web, execute commands, and integrate with external tools. The platform distinguishes itself by combining AI conversation management with infrastructure and application orchestration capabilities. It includes a
Chroma is a specialized vector database designed to index and retrieve high-dimensional data representations for semantic similarity search. It functions as a comprehensive platform for information retrieval, enabling the storage and management of unstructured documents alongside structured metadata. By mapping data into numerical representations, the system facilitates rapid similarity lookups across large datasets. The platform distinguishes itself through a hybrid search infrastructure that combines dense vector embeddings with sparse keyword and regular expression matching to balance sema
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through ad
CuPy is a CUDA array computing library that implements a NumPy-compatible interface for executing array operations and numerical computing on NVIDIA GPUs. It serves as a GPU-accelerated numerical library and a CUDA-based SciPy implementation, offloading heavy calculations to graphics hardware to increase processing speed for scientific and engineering workloads. The library enables multi-framework tensor exchange, allowing data buffers to be shared between different deep learning frameworks using standardized memory layouts to avoid memory copies. It also supports custom GPU kernel integratio
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. Its scheduler orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Represent, send, store and search multimodal data
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Gel is an object-relational database system that models data as a graph of interconnected objects. By utilizing a strongly typed schema, it enables complex relational queries and polymorphic data structures without the need for traditional join tables. The system integrates native vector storage and similarity search operators, allowing it to function as both a relational and a vector database for semantic data retrieval. The platform distinguishes itself through a comprehensive suite of developer-centric automation tools. It features a declarative migration system that tracks and versions sc
Library for reading and writing large multi-dimensional arrays.
A Python package for manipulating 2-dimensional tabular data structures
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through AutoML capabilities that select algorithms and tune hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Simple, safe way to store and distribute tensors
InfluxDB is a specialized time series database platform engineered for the high-speed ingestion, compression, and retrieval of timestamped data at scale. It functions as a distributed metrics platform, providing the infrastructure necessary to organize and analyze massive volumes of time-stamped information to identify trends, patterns, and anomalies within complex data streams. The platform distinguishes itself through a functional dataflow engine that utilizes a specialized programming language for complex analytical transformations and automated tasks. This architecture is supported by a p
Fast NumPy array functions written in C
Marqo is an ecommerce product discovery platform, multimodal vector database, and AI search merchandising tool. It provides infrastructure for implementing semantic search and recommendations, allowing shoppers to find products using natural language and images. The platform distinguishes itself through a hybrid ranking pipeline that combines neural semantic scores with business-defined boosting and pinning rules. It features a conversational commerce engine that uses large language models to process user intent and provides a search performance analytics suite for measuring conversion uplift
Milvus is a specialized vector database engine designed for the indexing, management, and high-speed similarity retrieval of high-dimensional vector embeddings. It functions as a similarity search engine capable of identifying nearest neighbors within large-scale vector spaces, supporting the storage and retrieval of billions of data points while maintaining consistent performance. The system utilizes a distributed architecture that decouples storage, query, and coordination into independent services, allowing for horizontal scaling across clusters. It employs a global indexing mechanism that
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
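Several entries in this list, Dask in particular, describe representing tasks and their dependencies as a directed acyclic graph and deferring work until a result is explicitly requested. The sketch below illustrates that idea in plain Python only. It does not use or reproduce Dask's API: the `evaluate` function, the graph layout, and the task names (`load`, `total`, `largest`) are all hypothetical and chosen for illustration.

```python
from typing import Any, Dict, List, Optional

# Records which tasks actually ran, so the laziness is observable.
calls: List[str] = []

def load(n: int) -> List[int]:
    calls.append("load")
    return list(range(n))

def total(xs: List[int]) -> int:
    calls.append("total")
    return sum(xs)

def largest(xs: List[int]) -> int:
    calls.append("largest")
    return max(xs)

def evaluate(graph: Dict[str, Any], key: str,
             cache: Optional[Dict[str, Any]] = None) -> Any:
    """Compute `key` on demand, resolving its dependencies first.

    A node is either a literal value or a tuple (function, *args),
    where any string argument that names another node is a dependency.
    """
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        resolved = [
            evaluate(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = node
    cache[key] = value  # each node runs at most once per evaluation
    return value

# Building the graph runs nothing; it only describes the work.
graph = {
    "data": (load, 5),
    "sum": (total, "data"),
    "max": (largest, "data"),
}

result = evaluate(graph, "sum")
```

Only the tasks that `"sum"` depends on are executed here; the `"max"` node is described in the graph but never runs because nothing requested it. A real scheduler would additionally decide where and in what order independent tasks run, which this sketch does not attempt.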