Daft is a distributed dataframe library and multimodal data processor for large-scale structured and unstructured data. It is a vectorized execution engine that processes tables alongside images, audio, and video, using a single schema to describe all of these data types.
The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline that batches model prompts and generates high-dimensional text embeddings, and it uses zero-copy memory sharing to reduce the data-transfer overhead of running custom Python functions.
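To make the batching idea concrete, here is a minimal sketch in plain Python. It is not Daft's API: `batched`, `embed_column`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding model. The point it illustrates is only that grouping rows before each model call cuts the number of calls.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

calls: List[int] = []  # records the size of every batch sent to the "model"


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Placeholder model (assumption): maps each text to [char count, word count]."""
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" ") + 1)] for t in texts]


def embed_column(
    texts: Iterable[str],
    model: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 3,
) -> List[List[float]]:
    """Embed a column of texts by calling the model once per batch."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(model(batch))
    return vectors


docs = ["a b", "c", "d e f", "g", "h i", "j", "k l m n"]
vectors = embed_column(docs, fake_embed, batch_size=3)
print(len(vectors), calls)
```

With seven rows and a batch size of three, the placeholder model is called three times rather than seven. A real engine would also choose batch sizes and spread batches across workers, which this sketch does not attempt.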
It connects to cloud data lakehouses and supports open table formats such as Iceberg, Delta Lake, and Hudi. The engine uses lazily evaluated execution plans and sampling-based schema inference to handle datasets that exceed single-node memory, scaling workloads from local cores to distributed Kubernetes clusters.
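The two techniques named above, lazy execution plans and sampling-based schema inference, can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Daft's implementation or API: `LazyPlan`, `infer_schema`, and `source` are hypothetical names, and the `reads` counter exists only to show when rows are actually pulled from the source.

```python
import itertools
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]
reads = {"rows": 0}  # counts rows pulled from the pretend data source


def source() -> Iterator[Row]:
    """Pretend data source (assumption): yields 1,000 rows one at a time."""
    for i in range(1_000):
        reads["rows"] += 1
        yield {"id": i, "score": i / 10, "label": f"item-{i}"}


def infer_schema(rows: Iterable[Row], sample_size: int = 10) -> Dict[str, str]:
    """Infer column types from only the first sample_size rows."""
    schema: Dict[str, str] = {}
    for row in itertools.islice(rows, sample_size):
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema


def _project(rows: Iterable[Row], columns: Tuple[str, ...]) -> Iterator[Row]:
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {c: row[c] for c in columns}


class LazyPlan:
    """Records operations and runs them only when collect() is called."""

    def __init__(self, src: Callable[[], Iterable[Row]], steps: List[tuple] = None):
        self._src = src
        self._steps = list(steps or [])

    def where(self, predicate: Callable[[Row], bool]) -> "LazyPlan":
        return LazyPlan(self._src, self._steps + [("where", predicate)])

    def select(self, *columns: str) -> "LazyPlan":
        return LazyPlan(self._src, self._steps + [("select", columns)])

    def collect(self) -> List[Row]:
        rows: Iterable[Row] = self._src()
        for kind, arg in self._steps:
            rows = filter(arg, rows) if kind == "where" else _project(rows, arg)
        return list(rows)


schema = infer_schema(source(), sample_size=10)
rows_read_for_schema = reads["rows"]  # only the sampled rows were read

plan = LazyPlan(source).where(lambda r: r["id"] >= 996).select("id", "label")
rows_read_after_planning = reads["rows"]  # building the plan reads nothing

result = plan.collect()  # the source is scanned only here
print(schema, rows_read_for_schema, rows_read_after_planning, len(result))
```

Building the plan touches no data, and schema inference reads only the sampled prefix. A production engine would additionally optimize the recorded plan and stream partitions rather than materialize a list, which is what lets it work on data larger than memory.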
The system also includes data transformation tools covering columnar aggregation, window functions, and geospatial manipulation, as well as specialized tools for audio transcription and video frame extraction.