17 repositories
Utilities for improving database query performance and data retrieval.
Distinguishing note: Focuses on performance tuning for data lists rather than general database management.
Explore 17 GitHub repositories matching Data & Databases · Query Optimizers.
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Implements a cost-based and rule-based optimizer to transform SQL expressions into efficient physical execution plans.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Optimizes query execution by filtering rows and selecting columns as close to the source as possible.
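The idea of filtering rows and selecting columns close to the source can be illustrated with a small, library-free sketch. This is not Polars code: `ROWS`, `scan`, `naive`, and `pushed_down` are hypothetical names, and the "cells" counter is only a rough stand-in for the amount of data that leaves the source.

```python
# Hypothetical in-memory "source"; a real engine would read files or tables.
ROWS = [
    {"id": 1, "city": "Oslo", "temp": 3, "note": "a"},
    {"id": 2, "city": "Lima", "temp": 19, "note": "b"},
    {"id": 3, "city": "Oslo", "temp": 5, "note": "c"},
]


def scan(rows, columns=None, predicate=None):
    """Yield rows, applying the filter and column selection inside the scan."""
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # the row is dropped before it leaves the source
        if columns is None:
            yield dict(row)
        else:
            yield {name: row[name] for name in columns}


def naive(rows):
    """Load every row and column, then filter and project afterwards."""
    loaded = list(scan(rows))
    result = [{"id": r["id"], "temp": r["temp"]} for r in loaded if r["city"] == "Oslo"]
    cells = sum(len(r) for r in loaded)  # values that left the source
    return result, cells


def pushed_down(rows):
    """Hand the filter and the column list to the scan itself."""
    loaded = list(
        scan(rows, columns=["id", "temp"], predicate=lambda r: r["city"] == "Oslo")
    )
    cells = sum(len(r) for r in loaded)
    return loaded, cells


naive_result, naive_cells = naive(ROWS)
pushed_result, pushed_cells = pushed_down(ROWS)
print(naive_result == pushed_result, naive_cells, pushed_cells)
```

Both variants return the same rows, but the pushed-down version materializes fewer values, which is the effect a lazy query engine aims for when it rewrites a plan.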
DuckDB is an in-process analytical database engine designed to run directly within an application process. As a zero-dependency, embedded system, it provides enterprise-grade SQL data processing capabilities without the overhead of managing a dedicated database server. It is built to handle complex analytical and aggregation tasks by storing and retrieving information in columns, allowing for high-performance relational data manipulation. The engine distinguishes itself through a columnar vectorized execution model that maximizes CPU cache efficiency during query operations. It employs adapti
Dynamically selects efficient execution plans for analytical workloads at runtime.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Provides cost-based query optimization to rewrite execution paths based on table statistics and historical data.
Trino is a distributed SQL query engine designed for large-scale data analytics. It functions as a data federation platform, providing a unified interface that allows users to execute complex analytical queries across multiple heterogeneous data sources simultaneously without requiring data movement or transformation. The engine utilizes a massively parallel processing architecture to scale compute resources across clusters for high-speed data retrieval. It distinguishes itself through a cost-based query optimizer that analyzes metadata to determine efficient execution plans, alongside dynami
Utilizes cost-based optimization to analyze metadata and statistics for generating efficient query execution plans.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Pushes operations to worker nodes based on distribution columns to minimize data movement and maximize parallel computation.
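A rough sketch of why grouping on the distribution column keeps work on the workers: when rows are sharded by the same key they are grouped by, each shard can compute its part independently and the coordinator only merges small partial results. This is a toy model, not the Citus implementation; `shard_for`, `distribute`, and the CRC32-based hashing are assumptions made for illustration.

```python
import zlib
from collections import Counter


def shard_for(tenant_id, shard_count):
    """Map a distribution-column value to a shard (hypothetical hash scheme)."""
    return zlib.crc32(tenant_id.encode("utf-8")) % shard_count


def distribute(rows, shard_count):
    """Place each row on the shard chosen by its tenant_id."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row["tenant_id"], shard_count)].append(row)
    return shards


def worker_count_by_tenant(shard_rows):
    """Runs on a worker: aggregate only the rows stored locally."""
    return Counter(row["tenant_id"] for row in shard_rows)


def coordinator_count_by_tenant(shards):
    """Runs on the coordinator: merge the partial results from every worker."""
    total = Counter()
    for shard_rows in shards:
        total.update(worker_count_by_tenant(shard_rows))
    return total


events = [
    {"tenant_id": "acme", "event": "login"},
    {"tenant_id": "globex", "event": "login"},
    {"tenant_id": "acme", "event": "purchase"},
    {"tenant_id": "initech", "event": "login"},
    {"tenant_id": "acme", "event": "logout"},
]
shards = distribute(events, shard_count=4)
totals = coordinator_count_by_tenant(shards)
print(dict(totals))
```

Because every row for a given tenant lands on one shard, no raw rows have to cross the network to answer the per-tenant count.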
MySQL Server is a relational database management system designed to organize and store structured information. It functions as a comprehensive SQL server platform that provides reliable transactional integrity and high-performance query execution for enterprise data management. The system distinguishes itself through a pluggable storage engine architecture that decouples logical query processing from physical data storage, allowing for specialized handling of diverse workloads. It maintains data consistency and high concurrency through multi-version concurrency control and write-ahead logging
Analyzes table statistics and index availability to select the most efficient execution plan for retrieving data from complex relational structures.
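As a minimal sketch of how table statistics and index availability can drive plan selection, the toy model below compares a full scan with an index lookup. The cost formulas, the `random_io_penalty` value, and the `TableStats` fields are illustrative assumptions, not the cost model of MySQL or of any other engine listed here.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    row_count: int
    distinct_values: int  # distinct values in the filtered column
    has_index: bool       # whether that column is indexed


def estimate_costs(stats, random_io_penalty=4.0):
    """Return a hypothetical cost for each access path that is available."""
    costs = {"full_scan": float(stats.row_count)}
    if stats.has_index:
        # Assume a uniform distribution: an equality filter matches
        # row_count / distinct_values rows on average.
        matching_rows = stats.row_count / max(stats.distinct_values, 1)
        tree_descent = math.log2(max(stats.row_count, 2))
        costs["index_lookup"] = tree_descent + matching_rows * random_io_penalty
    return costs


def choose_plan(stats):
    """Pick the access path with the lowest estimated cost."""
    costs = estimate_costs(stats)
    return min(costs, key=costs.get)


selective = TableStats(row_count=1_000_000, distinct_values=500_000, has_index=True)
unselective = TableStats(row_count=1_000_000, distinct_values=2, has_index=True)
unindexed = TableStats(row_count=1_000, distinct_values=1_000, has_index=False)

print(choose_plan(selective), choose_plan(unselective), choose_plan(unindexed))
```

The same index wins for a highly selective column and loses for a two-valued column, which is why an optimizer needs statistics rather than a fixed "always use the index" rule.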
Matrix is a suite of mobile application performance management and analysis tools. It provides a plugin-based monitoring system for capturing crashes, lags, and memory leaks, alongside a static binary auditor for reducing installation package size and a bytecode instrumentation tool for performance tracking. The project distinguishes itself through native memory debugging and a SQLite query linter that identifies inefficient database patterns. It employs native interception techniques to detect memory leaks and heap corruption without requiring source code recompilation, and uses a custom run
Detects full table scans and missing prepared statements in database queries to improve retrieval speed.
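One way to spot a full table scan in SQLite is to ask the engine for its query plan. The sketch below uses Python's standard `sqlite3` module and the `EXPLAIN QUERY PLAN` statement; it is not how Matrix's linter is implemented, and the `users` table, the index name, and the `full_scan_steps` helper are hypothetical. The exact wording of plan rows varies between SQLite versions, so the check only looks at the leading `SCAN` keyword.

```python
import sqlite3


def full_scan_steps(conn, sql, params=()):
    """Return the plan steps that scan a whole table or index."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    details = [row[3] for row in plan_rows]  # the fourth column holds the description
    return [detail for detail in details if detail.startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# A parameterized query: the value is bound rather than pasted into the SQL text.
query = "SELECT id FROM users WHERE email = ?"
args = ("a@example.com",)

before_index = full_scan_steps(conn, query, args)

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after_index = full_scan_steps(conn, query, args)

print(len(before_index), len(after_index))
```

Before the index exists the plan contains a scan step; afterwards the planner can search the index and the list of scan steps is empty.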
Manticoresearch is a high-performance search engine and database designed for indexing and retrieving large datasets. It functions as a full-text search engine, a vector search database, and a SQL-based search database, providing a distributed search cluster architecture. The system provides an alternative to the Elasticsearch stack, offering a compatible API for indexing and searching structured and unstructured data. It distinguishes itself by supporting multiple retrieval methods, including vector matching for similarity search, geospatial queries, and traditional full-text ranking. The p
Implements a cost-based optimizer that uses data statistics and secondary indexes to determine the most efficient execution plan.
StarRocks is a distributed SQL OLAP database engine designed for real-time analytics and high-performance multi-dimensional analysis. It functions as a data lakehouse query engine that enables SQL execution across large datasets and external open table formats without requiring local data imports. The system employs a shared-nothing distributed architecture and utilizes the MySQL protocol to integrate with business intelligence tools. It maintains real-time data consistency through a primary key upsert model and accelerates query response times using vectorized execution and cost-based optimi
Implements a cost-based optimizer that determines the most efficient execution plan using table statistics.
This project is a curated collection of academic papers, books, and technical resources designed for studying the architecture and implementation of database management systems. It serves as a comprehensive educational guide for engineers and researchers looking to understand the fundamental principles behind modern data storage and retrieval. The repository distinguishes itself by providing structured learning paths across critical database domains, including the design of persistent storage engines, the mechanics of query optimization, and the complexities of distributed transaction managem
Offers technical resources on cost-based query optimization strategies using statistical data to determine efficient execution paths.
YugabyteDB is a distributed SQL database and relational data store designed for horizontal scalability and high availability across multiple nodes or regions. It functions as a cloud-native system that ensures continuous availability and supports PostgreSQL compatible query languages and drivers. The system includes specialized capabilities as a vector database for AI, utilizing high-dimensional indexing to perform similarity searches. It is engineered as a multi-region cloud database that synchronizes data across different geographic locations to maintain global availability. The project co
Provides a cost-based optimizer that analyzes data statistics to select the most efficient query execution plans.
This project is a collection of educational resources and curricula designed for mastering AI pair programming and prompt engineering. It provides a structured training course and instructional materials for integrating AI assistants into the software development lifecycle. The materials cover the use of large language models to modernize legacy code and translate applications between programming languages. It includes a specific guide for crafting natural language queries to generate code and automate development workflows. The content addresses a broad range of capabilities, including AI-a
Offers techniques for using advanced AI prompting to refine and optimize complex database queries.
Apache Hudi is an open-source table format that brings ACID transactions, incremental processing, and multi-modal indexing to data lakes. It provides atomic commits with snapshot isolation, rollback, and optimistic concurrency control for reliable data lake operations, while supporting upserts, record-level updates, and deletions in large analytical datasets. The project distinguishes itself through a timeline-based architecture that coordinates all write operations, enabling features like time-travel querying, incremental change streaming, and multi-modal query views that include snapshot, i
Serves snapshot queries using only columnar storage for high performance on analytical workloads.
Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveServer2 Thrift service for remote query execution, the Hive Metastore Service for central metadata management, the Hive ACID Transaction Engine for concurrent read-write operations, and the Hive LLAP Interactive Engine for low-latency analytical processing. The WebHCat REST API offers an HTTP interfac
Uses a cost-based optimizer with table statistics and materialized views for query planning.
Calcite is a framework for parsing, optimizing, and translating SQL queries into relational algebra for execution across diverse data sources. It functions as a multi-source query engine, a SQL parsing library, and a relational algebra optimizer. The project provides a cost-based optimization engine that transforms logical query plans into efficient physical execution plans using pluggable rules. It uses translation adapters to convert standard SQL queries into the native formats of external databases and messaging systems, enabling data federation across heterogeneous storage systems. The system covers the full query lifecycle, including SQL parsing and validation against schemas, translation of expressions into algebraic operators, and selection of efficient execution plans. It also includes a command-line interface for running queries and managing data source connections.
Implements a cost-based optimizer that estimates resource costs to select the most efficient physical execution plans.
H2 is a JDBC-compliant relational database management system written in Java. It functions as an embeddable SQL database that can run directly within an application process to remove network latency, or as an in-memory database for high-performance volatile storage. It also includes a web-based console for executing SQL commands and administering schemas. The system is characterized by its flexible deployment modes, including a standalone server mode for remote TCP/IP access and a mixed mode for simultaneous local and remote connectivity. It features a dialect emulation layer and compatibilit
Uses table statistics to determine the most efficient physical execution path for SQL statements.