9 repositories
High-performance analytical queries using columnar storage for aggregations, buckets, and facets.
Explore 9 awesome GitHub repositories matching data & databases · Columnar Analytics.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Stores strongly typed data in columns to enable high-performance aggregation and filtering of large datasets in memory.
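To illustrate what typed columns buy for aggregation and filtering, here is a minimal sketch in plain Python. It is not Perspective's API; the table, the column names, and the `sum_by` helper are all hypothetical, and the standard-library `array` module stands in for a real typed column buffer.

```python
from array import array
from collections import defaultdict

# Hypothetical table stored column by column: each numeric column is one
# typed, contiguous array instead of a field repeated inside row objects.
columns = {
    "region": ["east", "west", "east", "west", "east"],
    "units": array("q", [10, 4, 7, 9, 3]),            # signed 64-bit integers
    "price": array("d", [2.5, 1.0, 4.0, 2.0, 1.5]),   # 64-bit floats
}

def sum_by(columns, key, value, predicate=None):
    """Group-by sum that reads only the columns the query needs."""
    totals = defaultdict(float)
    keys, values = columns[key], columns[value]
    for i in range(len(keys)):
        if predicate is None or predicate(columns, i):
            totals[keys[i]] += values[i]
    return dict(totals)

# Aggregate over every row, then over rows whose price is at least 2.0.
units_by_region = sum_by(columns, "region", "units")
filtered = sum_by(columns, "region", "units", lambda c, i: c["price"][i] >= 2.0)
```

The unfiltered query never touches the `price` column at all, which is the main reason a columnar layout suits aggregations over wide tables.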
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Processes data in Arrow columnar batches through a streaming pipeline without materializing intermediate results.
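The streaming idea can be sketched with Python generators. This is an illustration only, not DataFusion's API: `scan`, `filter_batches`, and `total` are hypothetical operators, and dictionaries of lists stand in for Arrow record batches.

```python
def scan(rows, batch_size):
    """Yield column batches (dict of lists) holding at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {"id": [r[0] for r in chunk], "amount": [r[1] for r in chunk]}

def filter_batches(batches, min_amount):
    """Pass each batch downstream as soon as it has been filtered."""
    for batch in batches:
        keep = [i for i, a in enumerate(batch["amount"]) if a >= min_amount]
        if keep:
            yield {name: [col[i] for i in keep] for name, col in batch.items()}

def total(batches, column):
    """Consume batches one at a time, keeping only a running sum."""
    running = 0
    for batch in batches:
        running += sum(batch[column])
    return running

# The input list stands in for a data source; no operator builds a full
# intermediate table, only one small batch is in flight at any moment.
rows = [(1, 5), (2, 50), (3, 20), (4, 1), (5, 30)]
result = total(filter_batches(scan(rows, batch_size=2), min_amount=10), "amount")
```

Because each operator pulls from the one before it, memory use is bounded by the batch size rather than by the size of the filtered result.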
ParadeDB is a database extension that integrates full-text search, vector database capabilities, and real-time analytics directly into a relational engine. It functions as a plugin that adds new storage and query execution capabilities to an existing database architecture. The project distinguishes itself by supporting hybrid search workflows that combine lexical keyword matching with dense and sparse vector similarity in a single query. It utilizes reciprocal rank fusion to merge these ranked result sets and employs logical replication to synchronize data from external instances, removing th
Calculates high-performance aggregates, buckets, and facets using specialized columnar storage.
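Reciprocal rank fusion, mentioned in the description above, can be sketched in a few lines. This is the generic technique, not ParadeDB's implementation or query syntax; the function name, the sample rankings, and the constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a keyword search and a vector search.
lexical = ["a", "b", "c"]
vector = ["c", "a", "d"]
fused = reciprocal_rank_fusion([lexical, vector])
```

Only ranks enter the formula, so the lexical and vector scores never need to be normalized against each other.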
This project is an open source relational database management system and SQL database designed for storing and managing structured data. It functions as a relational database for ensuring consistency and reliability, while also operating as a vector database for storing and querying high-dimensional vector embeddings. The system incorporates a columnar storage engine to optimize analytical query processing and large-scale data aggregation. It further enables vector similarity search, allowing users to find similar items by querying vector embeddings. The software covers a broad capability su
Executes high-performance analytical queries using columnar storage for efficient aggregations.
AliSQL is a fork of MySQL by Alibaba that extends the relational database management system with enhancements for high performance, scalability, and enterprise-grade availability. It retains the core MySQL identity as a SQL-based database for storing, organizing, and retrieving structured data, while adding optimizations for large-scale transactional and analytical workloads. The project differentiates itself through a set of Alibaba-specific improvements, including a columnar engine for accelerating analytical queries directly on MySQL tables, and a distributed, shared-nothing NDB Cluster en
Embeds a DuckDB columnar engine to execute analytical SQL queries directly on MySQL tables.
Perfetto is a platform for system-level performance tracing and analysis on Linux and Android. It combines a high-throughput trace recorder, a SQL-based query engine, and a browser-based visualizer into a single toolchain. The platform covers CPU scheduling and call-stack profiling, native and Java heap memory allocation tracking, GPU and graphics events, and system-wide counters such as CPU frequency and power consumption. The architecture decouples trace recording from offline analysis, using a compact protobuf format for event encoding and columnar storage for efficient SQL queries. The we
Stores parsed trace events in a column-oriented database for fast analytic queries.
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
Utilizes vectorized columnar processing on contiguous memory blocks to maximize hardware utilization.
Grafana Tempo is a high-scale distributed tracing backend and columnar trace database. It serves as an observability data store that persists and queries spans and traces using OpenTelemetry standards, allowing for the analysis of request flows across microservices. The system distinguishes itself by using an object-store based backend with columnar Parquet storage. This architecture enables efficient attribute searching and large-scale data retrieval through dedicated attribute columnization and block-based data partitioning. It includes a specialized TraceQL query engine for filtering trace
Organizes trace data into columnar Parquet files to enable efficient attribute filtering and high-performance retrieval.
RoaringBitmap is a Java-based library designed for the memory-efficient storage and high-speed querying of large sets of integers. It functions as an in-memory analytics tool that maintains compact data representations while supporting rapid set calculations, such as intersections, unions, and differences. The library distinguishes itself through a hybrid compression strategy that automatically selects between bitsets, sorted arrays, or run-length encoding based on the density of the data. It utilizes a two-level hierarchical index to provide constant-time random access lookups, ensuring perf
Functions as an in-memory analytics tool for performing rapid set calculations on compressed data without requiring full decompression.
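The density-based choice between containers can be sketched as follows. This is a simplified model, not the library's Java code: the byte-cost formulas, the function names, and the sample data are assumptions, with integers split into a high 16-bit chunk key and a low 16-bit value to mimic the two-level index.

```python
from collections import defaultdict

BITMAP_BYTES = 8192  # 2**16 bits, the fixed cost of a bitset for one chunk

def count_runs(sorted_lows):
    """Count maximal runs of consecutive values in a sorted sequence."""
    runs = 0
    prev = None
    for v in sorted_lows:
        if prev is None or v != prev + 1:
            runs += 1
        prev = v
    return runs

def choose_container(sorted_lows):
    """Pick the representation with the smallest estimated size in bytes."""
    sizes = {
        "array": 2 * len(sorted_lows),            # one 16-bit value per entry
        "bitmap": BITMAP_BYTES,
        "run": 2 + 4 * count_runs(sorted_lows),   # (start, length) per run
    }
    return min(sizes, key=sizes.get)

def plan(values):
    """Group values by their high 16 bits and choose a container per chunk."""
    chunks = defaultdict(set)
    for v in values:
        chunks[v >> 16].add(v & 0xFFFF)
    return {high: choose_container(sorted(lows)) for high, lows in chunks.items()}

layout = plan(
    [1, 5, 9]                                    # sparse chunk
    + list(range(65536, 65536 + 10000))          # one long consecutive run
    + list(range(131072, 131072 + 65536, 2))     # dense but alternating
)
```

Under this cost model the three chunks come out as an array, a run container, and a bitmap respectively, which mirrors how the choice follows data density within each chunk.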