DataFusion

Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application, with no separate server process. It processes data in columnar batches held in the Apache Arrow memory format for memory-efficient analytics and, through the companion Ballista project, can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that accepts custom operators, data sources, functions, and optimizer rules.
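The streaming, batch-at-a-time model described above can be illustrated with a small sketch. This is a toy model in plain Python, not DataFusion's API: the `Batch` type, `scan`, `filter_even`, and `sum_column` are hypothetical names, and a dictionary of lists stands in for an Arrow record batch.

```python
from typing import Dict, Iterator, List

# A "batch" is a toy stand-in for an Arrow record batch:
# a mapping from column name to a list of column values.
Batch = Dict[str, List[int]]


def scan(rows: int, batch_size: int) -> Iterator[Batch]:
    """Yield the integers 0..rows-1 as column 'x', one batch at a time."""
    for start in range(0, rows, batch_size):
        stop = min(start + batch_size, rows)
        yield {"x": list(range(start, stop))}


def filter_even(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Keep rows whose 'x' value is even, working on one batch at a time."""
    for batch in batches:
        kept = [v for v in batch["x"] if v % 2 == 0]
        if kept:
            yield {"x": kept}


def sum_column(batches: Iterator[Batch], name: str) -> int:
    """Aggregate incrementally, so only one batch is held in memory at a time."""
    total = 0
    for batch in batches:
        total += sum(batch[name])
    return total


# Operators are chained as generators; no intermediate result is materialized.
total = sum_column(filter_even(scan(rows=10, batch_size=4)), "x")
print(total)  # prints 20
```

Because each operator pulls batches from its input on demand, memory use in this sketch depends on the batch size rather than on the total number of rows.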

The engine distinguishes itself through its modular extension framework, which allows custom query engines to be built by customizing any of its extension points, including data sources, query languages, and operators. It provides a lazy DataFrame API that defines query pipelines as deferred transformations, which are optimized and executed only when results are collected. DataFusion also supports Substrait interchange for passing query plans across language and system boundaries, and has language bindings for Python, C, Ruby, and Java.
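A minimal sketch of the deferred-execution idea behind a lazy DataFrame follows. It is a toy model in plain Python rather than DataFusion's DataFrame API: the `LazyFrame` class and its methods are hypothetical, and the optimization step that a real engine runs before execution is omitted.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class LazyFrame:
    """Hypothetical toy frame: records transformations, runs them on collect()."""

    def __init__(self, rows: List[Row], steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Return a new frame with one more deferred step; no data is touched.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyFrame":
        # Also deferred: only the column names are recorded.
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    def collect(self) -> List[Row]:
        # Only here is the recorded pipeline actually executed.
        rows = self._rows
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


calls: List[int] = []


def is_large(row: Row) -> bool:
    calls.append(row["id"])  # record that the predicate really ran
    return row["value"] > 10


data = [{"id": 1, "value": 5}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
pipeline = LazyFrame(data).filter(is_large).select("id")

assert calls == []            # building the pipeline executed nothing
result = pipeline.collect()   # execution happens here
print(result)                 # prints [{'id': 2}, {'id': 3}]
```

Keeping the recorded steps as data, instead of running them eagerly, is what gives an engine the opportunity to rewrite the whole pipeline before any rows are read.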

The system ingests data from multiple file formats, including Parquet, CSV, JSON, and Avro, as well as from in-memory data sources. It supports DDL statements for creating and modifying tables, views, and schemas, and DML statements such as INSERT and COPY. DataFusion includes a rule-based query optimizer that automatically applies filter pushdown, join reordering, and expression simplification, and it exposes query plans for analysis through the EXPLAIN and EXPLAIN ANALYZE commands. Through the companion Comet project, the engine can also take the place of Apache Spark's native execution engine to improve query performance on Arrow data.
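To make the idea of a rewrite rule such as filter pushdown concrete, here is a small sketch over a toy plan tree. It is not DataFusion's optimizer: the `Scan`, `Project`, and `Filter` node types and the `push_down_filters` function are hypothetical, predicates are kept as opaque strings, and projections are assumed to select columns only, so a filter can safely move beneath them.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str
    filters: Tuple[str, ...] = ()   # predicates evaluated by the source itself


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    input: "Plan"


@dataclass(frozen=True)
class Filter:
    predicate: str                  # kept as opaque text in this sketch
    input: "Plan"


Plan = Union[Scan, Project, Filter]


def push_down_filters(plan: Plan) -> Plan:
    """Rewrite the plan so filters sit as close to the scan as possible."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_filters(plan.input))
    # plan is a Filter: optimize its input first, then try to move it down.
    child = push_down_filters(plan.input)
    if isinstance(child, Scan):
        # Filter directly over a scan: fold the predicate into the scan.
        return Scan(child.table, child.filters + (plan.predicate,))
    if isinstance(child, Project):
        # Column-only projection: filtering before or after is equivalent.
        pushed = push_down_filters(Filter(plan.predicate, child.input))
        return Project(child.columns, pushed)
    # Defensive fallback; not reached with the three node types defined here.
    return Filter(plan.predicate, child)


plan = Filter("value > 10", Project(("id", "value"), Scan("events")))
optimized = push_down_filters(plan)
print(optimized)
```

In the example, the filter that started at the top of the plan ends up inside the scan of the hypothetical `events` table, so rows that fail the predicate would never reach the projection.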

Under the project's API deprecation policy, public functions scheduled for removal are first marked with deprecation notices and remain available for six major versions or six months, whichever is longer, before they are removed.

Features

SQL Query Execution - Runs SQL queries against data with a full query planner and columnar streaming engine.

Embedded SQL Query Engines - Runs as an embedded SQL engine within a host application without requiring a separate server process.

Query Engine Extensions - Supports custom operators, data sources, functions, and optimizer rules for domain-specific needs.

Apache Arrow Processing - Stores and processes data in Apache Arrow's columnar format for zero-copy sharing and vectorized operations.

Streaming Columnar Executions - Processes data in Arrow columnar batches through a streaming pipeline without materializing intermediate results.

Columnar Engines - Processes data in columnar batches using Apache Arrow for memory-efficient analytics.

Extensible Query Execution Frameworks - Provides a modular extension framework for building custom query engines with custom operators, data sources, and functions.

Dataframe Engines - Provides a lazy DataFrame API for building and executing analytic queries programmatically.

Distributed Execution Coordinators - Scales analytic workloads across a cluster by splitting and coordinating query fragments on multiple nodes.

Distributed Query Engines - Splits and coordinates analytic workloads across multiple nodes for parallel execution.

Distributed Query Processing - Scales analytic workloads across a cluster by splitting and coordinating query fragments.

Columnar File Format Loading - Loads data from Parquet, Avro, and compressed formats directly into Arrow columnar memory for analysis.

Lazy Query Pipelines - Defines query pipelines as deferred transformations that are optimized and executed only when results are collected.

Query Execution Pipelines - Constructs query pipelines as deferred transformations that are optimized and executed only when results are collected.

Relational Query Optimizers - Applies advanced optimizations like filter pushdown, join reordering, and expression simplification automatically.

Rule-Based Plan Optimizations - Applies a configurable chain of rewrite rules for filter pushdown, join reordering, and expression simplification.

SQL Query Execution Engines - Executes SQL and DataFrame queries on an extensible, columnar engine with a modular, streaming architecture.

Multi-Format Data Loading - Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.

Tabular DataFrames - Constructs and manipulates tabular data through a lazy DataFrame API with filtering, aggregation, and joins.

Modular Extensibility Frameworks - Provides trait-based extension points for custom data sources, operators, optimizer rules, and functions.

Native Spark Accelerators - Replaces Apache Spark's native execution engine to improve query performance on Arrow data.

Data Transformation Functions - Applies aggregate, window, and scalar functions to compute statistics, rankings, and transformations on data.

Execution Plan Analysis - Displays the physical plan and execution metrics of a query using EXPLAIN and EXPLAIN ANALYZE.

Distributed Query Fragmenters - Splits physical plans into parallel fragments that can be scheduled and executed across multiple nodes.

File Data Ingestion - Reads structured data from file formats such as CSV into a DataFrame for further querying and analysis.

Cloud Object Storage - Reads data asynchronously from AWS S3, Azure Blob Storage, and Google Cloud Storage.

In-Memory Data Loading - Creates a DataFrame from programmatically defined rows or Arrow record batches without external storage.

Data Manipulation Operations - Supports INSERT and COPY commands for modifying data in tables.

Object Store Streaming Readers - Reads data asynchronously from cloud storage services using range requests and connection pooling.

Substrait Plan Interchange - Passes query plans across language and system boundaries using the Substrait interchange format.

Substrait Plan Interchanges - Serializes and deserializes query plans using the Substrait binary format for cross-language portability.

Expression Functions - Applies built-in functions for nested types, cryptography, date/time, encoding, regular expressions, and Unicode operations.

DDL Executions - Executes CREATE, ALTER, and DROP operations on database objects through a type-safe API.

Language Bindings - Provides official language bindings for Python, C, Ruby, and Java to call the query engine directly.

Substrait Plan Interchanges - Implements the Substrait interchange format to pass query plans across language and system boundaries.
