Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines.
The project distinguishes itself through a YAML-based data recipe system for composing reproducible, version-controlled data workflows that can be shared and reused across environments. It includes a configurable quality gate system, lazy dependency injection for operator-specific packages, and a multimodal operator registry that provides a unified interface for text, image, audio, and video operators within a single pipeline. The operator-fusion pipeline compiler automatically merges adjacent data operators into fused execution units to reduce I/O and scheduling overhead, while sample-level lineage tracing records the origin and transformation history of each sample for auditability.
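The idea behind operator fusion can be illustrated with a minimal, framework-independent sketch. It does not use Data-Juicer's API or reproduce its compiler: the names `min_words`, `max_chars`, `fuse_filters`, `run_unfused`, and `run_fused` are hypothetical, and the samples are made up for illustration. The sketch assumes each filter is a pure predicate over one sample, so adjacent filters can be merged into a single pass that gives the same result as running them one after another.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]
Filter = Callable[[Sample], bool]


def min_words(n: int) -> Filter:
    """Hypothetical filter: keep samples with at least n whitespace-separated words."""
    return lambda sample: len(sample["text"].split()) >= n


def max_chars(n: int) -> Filter:
    """Hypothetical filter: keep samples whose text is at most n characters long."""
    return lambda sample: len(sample["text"]) <= n


def fuse_filters(filters: Sequence[Filter]) -> Filter:
    """Merge adjacent filters into one predicate evaluated once per sample."""
    def fused(sample: Sample) -> bool:
        # all() short-circuits, so later filters are skipped once one rejects.
        return all(f(sample) for f in filters)
    return fused


def run_unfused(samples: List[Sample], filters: Sequence[Filter]) -> List[Sample]:
    """Baseline: one full pass and one intermediate list per filter."""
    for f in filters:
        samples = [s for s in samples if f(s)]
    return samples


def run_fused(samples: List[Sample], filters: Sequence[Filter]) -> List[Sample]:
    """Fused: a single pass over the data with no intermediate lists."""
    fused = fuse_filters(filters)
    return [s for s in samples if fused(s)]


samples = [
    {"text": "short"},
    {"text": "a reasonably sized training sample"},
    {"text": "x " * 200},
]
filters = [min_words(3), max_chars(100)]

# Both execution strategies keep exactly the same samples.
assert run_fused(samples, filters) == run_unfused(samples, filters)
```

In a distributed setting the saving would come from avoiding the materialization and scheduling of intermediate datasets between operators. Here that is approximated by the intermediate lists that `run_unfused` builds and `run_fused` avoids.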
The framework covers data cleaning and deduplication, with methods for distributed (cluster-wide), image, line-level, record-level, text, and video deduplication. It provides data filtering and selection based on audio, image, LLM, multimodal, quality, and text criteria, as well as sample selection. Data processing and transformation capabilities span agent data preparation, audio processing, batch aggregation, dataset enhancement, mixing, repartitioning, domain-specific processing, field transformation, foundation model curation, image processing, language splitting, LLM operators, multimodal processing, question-answer calibration, synthetic data generation, text processing, and video data processing for embodied AI. The project also includes data quality and analysis tools for dataset profiling, visualization, and model evaluation, and it supports building RAG indexes by extracting, normalizing, chunking, deduplicating, and profiling content for retrieval-augmented generation systems.
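As a rough illustration of what record-level deduplication involves, the following sketch removes exact duplicates by hashing normalized text. It is a generic technique and makes no claim about how Data-Juicer's own deduplication operators work. The function names `normalize` and `dedup_exact`, the `text` field, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def dedup_exact(samples: List[Dict[str, str]], text_key: str = "text") -> List[Dict[str, str]]:
    """Keep the first sample for each distinct normalized text, preserving order."""
    seen = set()
    kept = []
    for sample in samples:
        # Hash the normalized text so only fixed-size digests are held in memory.
        digest = hashlib.sha256(normalize(sample[text_key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


records = [
    {"text": "Hello  World"},
    {"text": "hello world"},   # duplicate of the first after normalization
    {"text": "Something else"},
]
unique_records = dedup_exact(records)
```

Exact hashing only catches records that are identical after normalization. Near-duplicate detection would need a similarity-based method, which this sketch does not attempt.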
Support is available through a Q&A copilot integrated into the project's documentation and chat platforms.