High-performance tools that enable direct SQL querying of CSV and Parquet data formats without ingestion.
DuckDB is an embedded, in-process analytical (OLAP) SQL database. It acts as a query engine for Parquet and CSV files, letting users run complex SQL queries on large datasets without a separate server process. The system is designed for local analytical processing and embedded data science workflows, and it can query Parquet and CSV files directly from disk without first loading them into a persistent database. The engine supports analytical SQL features such as window functions and nested subqueries, and it combines a columnar storage layout with vectorized query execution to handle large-scale data manipulation and exploration. The database is accessible through a standalone command-line interface and language bindings for Python, R, Java, and WebAssembly (Wasm).
DuckDB is a high-performance, in-process analytical engine that allows you to execute complex SQL queries directly against Parquet and CSV files without needing to import data or manage a separate server.
DuckDB is an in-process analytical database engine designed to run directly within an application process. As an embedded system with no external dependencies, it provides full SQL data processing without the overhead of managing a dedicated database server. It handles complex analytical and aggregation queries by storing and retrieving data in columns, which suits scan-heavy relational workloads. The engine distinguishes itself through a columnar vectorized execution model that improves CPU cache efficiency during query operations. It employs adaptive query optimization to select execution plans at runtime and uses zero-copy ingestion to map external data formats directly into memory. To integrate with analytical programming environments, the system supports fast data exchange through standardized memory formats and provides connectors for Python, R, and Java. The project covers a broad capability surface, including advanced relational join operations, incremental result streaming for large datasets, and flexible data ingestion from various file formats. It supports complex data types and provides a command-line interface for interactive sessions and batch processing. The codebase is designed for portability, offering a single-file amalgamation to simplify integration into external projects and build systems.
DuckDB is a high-performance, in-process analytical engine that natively supports querying CSV and Parquet files via SQL without requiring data ingestion, which matches the zero-ETL requirement.
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without requiring a database server, treating them as database tables for flexible analysis. It also offers a format-agnostic serialization bridge for converting between CSV, JSON, Excel, and fixed-width formats, along with an in-memory aggregation engine for computing summary statistics and an interactive Python shell that pre-loads CSV data as lists for ad-hoc analysis. Beyond its core identity, csvkit covers a broad range of CSV data operations including inspection of file structure and schema, cleaning and validation to remove duplicates and fix malformed rows, filtering and sorting by column values, joining multiple files on common columns, and splitting data based on column values. It also supports database integration for importing CSV data into PostgreSQL and exporting query results back to CSV, as well as formatted terminal display of tabular data as aligned tables.
csvkit provides a SQL-on-files interface that allows you to execute queries directly against CSV data without an import process, though it lacks native support for Parquet files.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract execution plans. By deferring data operations until collection, the engine performs predicate and projection pushdown to minimize memory overhead and data passes. It further optimizes performance through a multi-threaded parallel execution model and a streaming batch processor, which allows for the analysis of datasets that exceed available system memory by processing them in manageable chunks. The library provides a comprehensive expression framework for complex data engineering, supporting aggregation, arithmetic, and logical transformations across various data types, including nested structures and categorical data. It integrates with external systems through native connectivity for cloud storage, relational databases, and remote repositories, while offering diagnostic tools to visualize query plans and monitor performance. Polars is available as a native library with language bindings for Python and R, allowing users to integrate high-performance data manipulation into existing analytical pipelines without complex build steps.
Polars is a high-performance data processing engine that supports SQL-like operations and direct querying of CSV and Parquet files without requiring data import, fitting the requirements for a zero-ETL analytical tool.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing model that coordinates tasks across worker nodes. It incorporates cost-based query optimization to rewrite execution paths based on table statistics and historical data, ensuring efficient resource utilization. To maintain stability during large-scale operations, the system features a memory-spilling execution engine that offloads intermediate results to disk when memory thresholds are exceeded. The platform provides extensive capabilities for multi-tenant resource management, allowing administrators to enforce concurrency, memory, and CPU limits through hierarchical resource grouping. It supports a wide range of analytical operations, including advanced windowing, geospatial processing, and probabilistic data structures for approximate statistics. Security is integrated through granular access control policies, role-based authentication, and encrypted communication across the cluster. Presto is implemented in Java and supports deployment via containerized instances or distributed cluster orchestration in Kubernetes environments.
Presto is a distributed SQL engine that can query CSV and Parquet files through its connector architecture (for example, the Hive connector), so you can run analytical queries without importing the data into a database.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, enabling real-time updates to table structures, and provides elastic resource scaling by decoupling compute and storage layers to accommodate fluctuating workload demands. Beyond standard analytical processing, the platform incorporates vector database functionality to support artificial intelligence and semantic search applications. It enables hybrid search by combining structured SQL analytics with full-text filtering and vector similarity, facilitating complex retrieval-augmented generation workflows within a single environment. The engine is built to handle high-concurrency requirements, supporting thousands of simultaneous queries per second for enterprise-scale operations.
Doris is a distributed SQL data warehouse and lakehouse engine that supports querying external data formats like Parquet and CSV directly, fitting the requirement for a SQL-on-files query engine despite its broader focus on full-scale data warehousing.
StarRocks is a distributed SQL OLAP database engine designed for real-time analytics and high-performance multi-dimensional analysis. It functions as a data lakehouse query engine that enables SQL execution across large datasets and external open table formats without requiring local data imports. The system employs a shared-nothing distributed architecture and utilizes the MySQL protocol to integrate with business intelligence tools. It maintains real-time data consistency through a primary key upsert model and accelerates query response times using vectorized execution and cost-based optimization. Broad capabilities include the use of automated materialized views to reduce scan volumes and multi-tenant resource isolation to manage CPU and memory quotas across concurrent workloads. The engine also supports automatic resource balancing and data recovery during cluster scaling.
StarRocks is a high-performance distributed SQL engine that supports querying external data lake formats and files directly without requiring data imports, fitting the requirements for a zero-ETL analytical tool.
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through advanced storage and execution techniques, including vectorized query processing and a merge tree storage engine that maintains performance during massive insertions. It features adaptive subcolumn mapping for semi-structured data and supports native vector search for machine learning and generative AI applications. To facilitate efficient data movement, the engine utilizes zero-copy shared memory buffers, minimizing overhead when interacting with external analytical tools or processing diverse file formats like Parquet, JSON, and Arrow. Beyond its core storage and processing capabilities, the project provides a comprehensive suite of tools for observability, security, and data integration. It includes built-in support for natural language querying, automated workflow orchestration for AI agents, and extensive diagnostic features for query plan inspection. The platform also offers robust cloud infrastructure management, including support for private networking, compliant deployment strategies, and integrated billing consolidation.
ClickHouse is a high-performance analytical database that includes a powerful engine for querying Parquet and CSV files directly without requiring a prior data import, fitting the requirements for a SQL-on-files tool.
q is a command-line utility for the processing, filtering, and aggregation of tabular text and database files using standard SQL syntax. It functions as a query engine that treats CSV and TSV files, as well as standard input, as relational database tables. The tool distinguishes itself by providing a persistent cache layer that stores processed tabular data in a binary format to accelerate repeated queries on large datasets. It also maps individual filenames or stream identifiers to relational table names, enabling SQL joins across disparate text files. The project covers a broad range of data analysis capabilities, including automated schema detection for column types, tabular output formatting, and the ability to export processed in-memory datasets into physical SQLite database files. It integrates directly into Unix pipelines by accepting tabular data via standard input.
q allows you to execute SQL queries directly against CSV and TSV files without an import process, though it lacks native support for Parquet files.
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions without processing overhead. Its capabilities extend across cloud data lakehouse connectivity, supporting open table formats like Iceberg, Delta Lake, and Hudi. The engine employs lazy-evaluated execution plans and sampling-based schema inference to manage datasets that exceed single-node memory, scaling workloads from local cores to distributed Kubernetes clusters. The system further includes a comprehensive suite for data transformation, covering columnar aggregation, window functions, and geospatial manipulation, as well as specialized tools for audio transcription and video frame extraction.
Daft is a high-performance distributed dataframe engine that supports SQL-like operations and direct querying of Parquet and CSV files in data lakes without requiring a database import.
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and graph computation. It combines a SQL analytics engine, a real-time stream processor, a distributed machine learning framework, and a graph processing system, so SQL queries, stream analytics, and graph analysis can all run across a cluster of machines. It also provides a scalable environment for implementing machine learning algorithms and developing predictive models on massive datasets. Capabilities include distributed job execution, interactive query shells, and user-defined functions. The project also provides cluster security with network traffic encryption and supports metadata management through Hive metastore integration.
Apache Spark is a distributed data processing engine whose SQL interface can query CSV and Parquet files directly without requiring a database import.
TextQL is a command line SQL query engine designed to execute relational queries directly against structured text files, such as CSV and TSV, without requiring a database import. It functions as a relational text file analyzer and a CSV processor that treats plain text files as virtual tables for filtering, joining, and aggregating data. The tool is built as a pipe-compatible data transformation utility, allowing it to process data from standard input and output formatted datasets. It enables relational joins across multiple files or directories within a single query to analyze relationships between different datasets. The engine includes automatic data type detection for numeric, date, and time values to ensure accurate calculations and sorting. It also supports the loading of external shared libraries to extend the query language with custom mathematical, string, and aggregate functions. Results can be exported to files using configurable delimiters.
TextQL is a command-line SQL engine that allows you to run relational queries directly against CSV and TSV files without an import process, though it lacks native support for Parquet files.