5 repositories
Filtering data at the storage layer during ingestion to reduce the volume of data transferred to memory.
Distinct from Predicate-Based Filtering: the term refers specifically to the architectural pattern of pushing filter predicates down into the file reader.
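As a rough illustration of the pattern described above, the sketch below shows a reader that consults per-chunk min/max statistics and skips chunks that cannot contain a match, so their rows are never loaded into memory. The `RowGroup` class and `scan_with_pushdown` function are hypothetical names invented for this example; they are not taken from any of the listed projects, and real file formats store richer statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stored chunk: rows plus min/max statistics for one column."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_with_pushdown(row_groups: List[RowGroup], lower_bound: int) -> Tuple[List[int], int]:
    """Return values > lower_bound and the number of chunks actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # The predicate is evaluated against the statistics first:
        # if even the largest value fails, the chunk is skipped unread.
        if group.max_value <= lower_bound:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > lower_bound)
    return matches, groups_read


row_groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_with_pushdown(row_groups, 18)
print(matches, groups_read)
```

Here only two of the three chunks are read, because the first chunk's maximum value already rules it out.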
5 GitHub repositories matching Data & Databases · Predicate Pushdown.
cuDF is a GPU-accelerated dataframe library and data processing engine designed for manipulating and analyzing large tabular datasets. It provides a high-level API for executing filtering, joining, and aggregating operations directly on GPU hardware. The project integrates the Apache Arrow memory format to enable zero-copy data transfers and includes a just-in-time compiler for executing custom user-defined functions on the GPU. The library features specialized acceleration for existing workflows by redirecting standard Pandas dataframe calls and Polars query plans to a GPU backend. It also p
Filters data at the file level during Parquet or ORC ingestion to minimize GPU memory transfers.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Executes SQL-like filters directly at the storage layer to reduce data transfer during queries.
ParadeDB is a database extension that integrates full-text search, vector database capabilities, and real-time analytics directly into a relational engine. It functions as a plugin that adds new storage and query execution capabilities to an existing database architecture. The project distinguishes itself by supporting hybrid search workflows that combine lexical keyword matching with dense and sparse vector similarity in a single query. It utilizes reciprocal rank fusion to merge these ranked result sets and employs logical replication to synchronize data from external instances, removing th
Filters data at the storage layer during index scans to reduce data movement and processing overhead.
OctoSQL is a federated SQL query engine, data transformer, and streaming SQL processor. It lets users run a single SQL statement across multiple disparate data sources, including different database types and file formats, to merge and transform the results into a unified set. The system distinguishes itself by treating CSV, JSONLines, and Parquet files as virtual tables and by using a plugin-based architecture to extend connectivity to external storage engines. It operates as a stream processor for unbounded data streams, using watermarks, retractions, and sliding windows to maintain consistency for out-of-order events. In addition, it serves as a SQL data generator capable of producing synthetic datasets and record streams through table functions. The engine includes cross-source join and multi-source analytics capabilities, optimized by source-side predicate push-down to reduce data transfer. It handles complex data through a static type system with union types and offers observability through visualization of query execution plans.
Optimizes performance by pushing filters directly to the data source to reduce record transfer volume.
MiniOB is an open-source educational relational database kernel designed for learning the internals of database systems. It implements a dual-engine storage architecture combining B+ Tree and LSM-Tree, supports SQL parsing and query execution, and provides transactional processing with multi-version concurrency control. The system communicates with clients using the MySQL wire protocol and includes a vector database extension for storing and querying high-dimensional vectors. The project distinguishes itself through its comprehensive coverage of core database concepts in a single, learnable c
Moves filter conditions from the WHERE clause closer to the table scan so that fewer rows reach later stages of the query plan.
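The plan rewrite described in the last entry can be sketched as a toy optimizer rule. This is a minimal illustration under stated assumptions, not MiniOB's implementation: the `Scan`, `Filter`, and `push_down` names are hypothetical, and the plan is limited to a single filter sitting directly above a single scan.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


@dataclass
class Scan:
    """Hypothetical table scan that can optionally apply a predicate itself."""
    table: List[Row]
    predicate: Optional[Predicate] = None
    rows_emitted: int = 0

    def execute(self) -> List[Row]:
        out = [r for r in self.table if self.predicate is None or self.predicate(r)]
        self.rows_emitted = len(out)  # how many rows leave the scan
        return out


@dataclass
class Filter:
    """Hypothetical filter operator representing a WHERE clause above a scan."""
    child: Scan
    predicate: Predicate

    def execute(self) -> List[Row]:
        return [r for r in self.child.execute() if self.predicate(r)]


def push_down(plan: Union[Filter, Scan]) -> Union[Filter, Scan]:
    """Rewrite Filter(Scan) into a Scan that carries the predicate."""
    if isinstance(plan, Filter) and plan.child.predicate is None:
        return Scan(plan.child.table, plan.predicate)
    return plan  # anything else is left unchanged in this sketch


table = [{"id": i} for i in range(100)]
plan = Filter(Scan(table), lambda r: r["id"] < 3)
optimized = push_down(plan)

naive_result = plan.execute()        # scan emits every row, filter drops most
pushed_result = optimized.execute()  # scan emits only matching rows
print(plan.child.rows_emitted, optimized.rows_emitted)
```

Both plans return the same rows; the difference is how many rows leave the scan operator before the predicate is applied.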