Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules.
The engine distinguishes itself through its modular extension framework, which enables building custom query engines by modifying all extension points including data sources, query languages, and custom operators. It provides a lazy DataFrame API that defines query pipelines as deferred transformations, optimized and executed only when results are collected. DataFusion also supports Substrait interchange for passing query plans across language and system boundaries, and includes language bindings for Python, C, Ruby, and Java.
The system handles data ingestion from multiple file formats including Parquet, CSV, JSON, and Avro, as well as in-memory data sources. It supports full DDL and DML operations for creating and modifying tables, views, and schemas. DataFusion includes a rule-based query optimizer that applies filter pushdown, join reordering, and expression simplification automatically, and provides query plan analysis through EXPLAIN commands. The engine can also replace Apache Spark's native execution engine to improve query performance on Arrow data.
Documentation and API governance ensure that public functions are marked with deprecation notices and remain available for six major versions or six months before removal.