30 open-source projects similar to apache/arrow, ranked by the number of features they share with it. Compare stars, activity, and what each one does to find the best Arrow alternative.
Velox is a high-performance C++ query execution engine and columnar data processing library. It serves as a composable framework for implementing analytical query engines, providing a vectorized expression evaluator and a toolkit for data management systems. The project is distinguished by its use of vectorized columnar execution and arena-based memory allocation to process large-scale datasets. It features specialized optimizations such as broadcast join table caching, dynamic filter push-down, and dictionary encoding to reduce memory overhead and accelerate analytical reads. The engine cov
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
DuckDB is an embedded, in-process OLAP database management system with a SQL interface. It queries Parquet and CSV files directly from disk, so users can run complex SQL over large datasets without a separate server process and without first loading the data into a permanent database. The system is designed for local analytical processing and embedded data science workflows. The engine provides high-performance analytical SQL execution, including support for window functions and
Fory is a cross-language serialization framework and binary data serializer designed to convert complex object graphs into a compact binary format for high-performance data exchange. It includes an IDL-based schema compiler to transform interface definition language files into type-safe native data models and a schema evolution manager to maintain forward and backward compatibility. The project features a zero-copy data access layer that allows reading specific fields from binary rows without deserializing the entire object. It supports dual-mode serialization, enabling a toggle between a por
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
GraalVM is a polyglot virtual machine and high-performance runtime designed to execute multiple programming languages within a single environment. It functions as a JVM language toolkit for building language implementations, a native image compiler for transforming bytecode into standalone binaries, and an execution environment for LLVM bitcode and WebAssembly modules. The project is distinguished by its polyglot interoperability framework, which allows different languages to share data and execution state with low overhead. It utilizes self-modifying abstract syntax trees to optimize languag
Graal is a compiler and runtime architecture designed for high-performance execution and polyglot interoperability. It utilizes a graph-based representation of program logic to perform global optimizations and JIT compilation. The project features a meta-circular interpretation framework and a specialized partial evaluation mechanism, which allow for the creation of new programming languages and the automatic optimization of their semantics into machine code. It enables multiple diverse programming languages to share memory and communicate through a standardized cross-language protocol within
DuckDB is an in-process analytical database engine designed to run directly within an application process. As a zero-dependency, embedded system, it provides enterprise-grade SQL data processing capabilities without the overhead of managing a dedicated database server. It is built to handle complex analytical and aggregation tasks by storing and retrieving information in columns, allowing for high-performance relational data manipulation. The engine distinguishes itself through a columnar vectorized execution model that maximizes CPU cache efficiency during query operations. It employs adapti
This project is a curated collection of programming exercises designed to build proficiency in numerical computing and data manipulation. It provides a structured learning path for mastering multidimensional array operations, vectorized arithmetic, and statistical analysis. The repository focuses on developing practical expertise in array-based workflows, emphasizing techniques such as memory management, efficient data processing, and the replacement of explicit loops with vectorized operations. Users engage with hands-on challenges that cover the full lifecycle of numerical data, from initia
Toon is a data serialization library and toolkit designed to convert complex objects into compact, human-readable formats optimized for large language models. By focusing on token efficiency, the library minimizes the context window footprint of structured data through techniques like key folding and tabular layout optimization. It provides a streaming-capable processor that handles the encoding and decoding of hierarchical data while maintaining structural integrity. The project distinguishes itself through its path-aware transformation pipeline and configurable serialization logic, which al
Thrift is a cross-language remote procedure call framework and data serialization protocol. It provides an interface definition language to specify data types and service interfaces in a neutral format, enabling the automated generation of client and server code across multiple programming languages. The project functions as a polyglot service communicator using a layered software stack to ensure interoperable communication. It focuses on implementing cross-language remote procedure calls and transforming complex data structures into standardized formats for efficient network transport. The
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on automotive-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Trino is a distributed SQL query engine designed for large-scale data analytics. It functions as a data federation platform, providing a unified interface that allows users to execute complex analytical queries across multiple heterogeneous data sources simultaneously without requiring data movement or transformation. The engine utilizes a massively parallel processing architecture to scale compute resources across clusters for high-speed data retrieval. It distinguishes itself through a cost-based query optimizer that analyzes metadata to determine efficient execution plans, alongside dynami
CuPy is a CUDA array computing library that implements a NumPy-compatible interface for executing array operations and numerical computing on NVIDIA GPUs. It serves as a GPU-accelerated numerical library and a CUDA-based SciPy implementation, offloading heavy calculations to graphics hardware to increase processing speed for scientific and engineering workloads. The library enables multi-framework tensor exchange, allowing data buffers to be shared between different deep learning frameworks using standardized memory layouts to avoid memory copies. It also supports custom GPU kernel integratio
This project is a comprehensive educational curriculum designed to teach Python programming through the lens of data science and financial analysis. It provides a structured guide for learning how to process complex numerical information, build data models, and perform scientific computing tasks using standard industry libraries. The materials focus on practical applications, enabling users to develop skills in financial data analysis and interactive exploration. By working through these resources, learners gain experience in executing high-performance mathematical operations, transforming ra
iii is a distributed service orchestrator and event-driven workflow engine designed to compose and manage cross-language functions and workers through a central execution engine. It functions as a multi-language service mesh and WebSocket service gateway, providing a persistent communication layer for remote service workers. The platform enables dynamic runtime extensions, allowing new workers and capabilities to be deployed and registered into a live environment without requiring system restarts. It distinguishes itself by offering machine-readable skill exposure and agent capability integra
MessagePack-CSharp is a high-performance binary serializer for .NET that converts C# objects to and from the compact MessagePack format. It uses compile-time source generation to produce AOT-safe formatters and resolvers, eliminating runtime reflection and enabling ahead-of-time compilation scenarios. The serializer encodes object fields as integer indices instead of string keys, producing compact binary output with deterministic field ordering, and provides stack-allocated reader and writer structs for direct encoding and decoding of MessagePack primitives without heap allocations. The libra
FlatBuffers is a cross-platform serialization library designed for performance-critical applications that require efficient, zero-copy data access. By organizing data in a structured binary format, it allows applications to read and write complex data structures directly from memory-mapped buffers without the need for intermediate parsing or temporary object allocation. The project distinguishes itself through a schema-driven approach that balances high-performance access with long-term data evolution. It utilizes a unique memory layout featuring relative offsets and inline fixed-size structu
This project is a collection of educational notes and tutorials focused on Python programming, scientific computing, and data analysis. It serves as a reference for learning language basics, advanced techniques, and object-oriented design. The materials include implementation guides for building linear, logistic, and convolutional neural networks using symbolic graph frameworks. It also provides instruction on manipulating and visualizing structured data frames and performing complex mathematical operations through numerical libraries. The repository includes a system for converting interact
protoactor-go is a framework for building concurrent and distributed systems in Go using the actor model. It provides a distributed actor system that enables isolated entities to communicate via asynchronous messaging and share state across a cluster. The framework implements a multi-language actor protocol, allowing interoperability between actors written in Go, C#, and Java. It further supports a virtual actor implementation, where actors are automatically instantiated across a network based on a unique identity. The system includes a supervision model for managing actor lifecycles and fau
This project is a Protocol Buffers implementation for Go, providing a binary serialization framework to convert native data structures into a compact binary format for efficient network transmission and storage. It functions as a language bindings generator, utilizing a compiler plugin to create Go source code from platform-neutral protocol buffer definitions. The implementation includes a JSON data mapper that transforms structured binary messages into JSON format to facilitate compatibility with web services and external APIs. It also enables cross-language data exchange by using a common s
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Hyper is a low-level networking library designed for building high-performance HTTP clients and servers. It provides a foundational toolkit for creating network services that leverage asynchronous execution and memory-safe data handling, supporting both HTTP/1 and HTTP/2 protocols. The library distinguishes itself through a protocol-agnostic architecture that separates transport logic from HTTP semantics. It utilizes a service-trait abstraction to decouple network logic from the underlying transport, enabling developers to inject custom middleware for request interception and response transfo
This project is a machine learning array framework and tensor computation library designed for high-performance numerical computing. It provides a comprehensive suite of tools for constructing and training neural networks, featuring an automatic differentiation engine that facilitates gradient-based optimization and complex mathematical modeling. The library distinguishes itself through a unified memory architecture that allows data to be shared across CPU and GPU devices without explicit copies, significantly reducing data movement overhead. Its execution model relies on a lazy evaluation en
Albumentations is a computer vision image augmentation library designed to increase training data diversity for deep learning models. It provides a toolset for applying geometric and color transformations to images and annotations, including a specialized collection of 3D operations for volumetric data used in medical and scientific imaging. The library functions as an image mask and bounding box transformer, automatically updating masks, bounding boxes, and keypoints when images undergo geometric changes. This ensures that spatial alterations remain synchronized across images and their assoc
TigerBeetle is a distributed financial accounting database designed for high-volume transaction processing. It functions as a specialized transaction engine that enforces strict double-entry bookkeeping invariants, so that every debit is balanced by a matching credit. Using a consensus-based replication model, the system provides high availability and data durability across geographically distributed clusters, making it suitable for mission-critical financial infrastructure. The system distinguishes itself through a performance-oriented architect
Iceberg is an open table format and big data table manager designed for huge analytic datasets in cloud storage. It provides a specification for tracking large-scale datasets to maintain transactional consistency and structural integrity. The project utilizes a standardized REST catalog interface to manage table metadata, ensuring interoperability between different compute engines. This allows diverse query engines to connect to a single table interface and maintain consistency across different processing frameworks. Its core capabilities include managing large-scale analytic tables, coordin