Data-Juicer

Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines.

The project distinguishes itself through a YAML-based data recipe system for composing reproducible, version-controlled data workflows that can be shared and reused across environments. It includes a configurable quality gate system, lazy dependency injection for operator-specific packages, and a multimodal operator registry that provides a unified interface for text, image, audio, and video operators within a single pipeline. The operator-fusion pipeline compiler automatically merges adjacent data operators into fused execution units to reduce I/O and scheduling overhead, while sample-level lineage tracing records the origin and transformation history of each sample for auditability.
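The recipe and operator-fusion ideas above can be illustrated with a minimal, self-contained sketch. It does not use Data-Juicer's own API: the registry, the operator names (strip_whitespace_mapper, min_length_filter), and the recipe keys are hypothetical, and the recipe is written as a Python dict that only loosely mirrors the shape a YAML recipe might have. The point is the mechanism: operators are looked up by name from a declarative recipe, and adjacent per-sample operators are merged into one unit so each sample is visited once.

```python
from typing import Callable, Dict, Iterable, List, Optional

Sample = Dict[str, str]
# A per-sample step returns the (possibly modified) sample, or None to drop it.
Step = Callable[[Sample], Optional[Sample]]

# Hypothetical registry: recipe-visible operator name -> factory building a step.
REGISTRY: Dict[str, Callable[..., Step]] = {}


def register(name: str):
    """Register an operator factory under the name used in recipes."""
    def wrap(factory):
        REGISTRY[name] = factory
        return factory
    return wrap


@register("strip_whitespace_mapper")  # hypothetical operator name
def strip_whitespace_mapper() -> Step:
    return lambda s: {**s, "text": s["text"].strip()}


@register("min_length_filter")  # hypothetical operator name
def min_length_filter(min_len: int = 1) -> Step:
    return lambda s: s if len(s["text"]) >= min_len else None


def fuse(steps: List[Step]) -> Step:
    """Merge adjacent per-sample steps into one unit with early exit."""
    def fused(sample: Sample) -> Optional[Sample]:
        current: Optional[Sample] = sample
        for step in steps:
            current = step(current)
            if current is None:  # a filter dropped the sample; skip the rest
                return None
        return current
    return fused


def run_recipe(recipe: dict, samples: Iterable[Sample]) -> List[Sample]:
    """Build steps from the recipe, fuse them, and make one pass over the data."""
    steps = [
        REGISTRY[name](**(params or {}))
        for entry in recipe["process"]
        for name, params in entry.items()
    ]
    unit = fuse(steps)
    output = []
    for sample in samples:
        processed = unit(sample)
        if processed is not None:
            output.append(processed)
    return output


# Declarative description of the workflow (assumed shape, not the real schema).
RECIPE = {
    "process": [
        {"strip_whitespace_mapper": None},
        {"min_length_filter": {"min_len": 5}},
    ]
}

result = run_recipe(RECIPE, [{"text": "  hello world  "}, {"text": " hi "}])
print(result)
```

Because the recipe is plain data, it can be stored in version control and replayed unchanged in another environment; only the registry needs to be available there.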

The framework covers data cleaning and deduplication, including distributed deduplication across clusters as well as image, line-level, record-level, text, and video deduplication methods. It provides data filtering based on audio, image, LLM, multimodal, quality, and text criteria, along with sample selection. Data processing and transformation capabilities span agent data preparation, audio processing, batch aggregation, dataset enhancement, mixing, repartitioning, domain-specific processing, field transformation, foundation model curation, image processing, language splitting, LLM operators, multimodal processing, question-answer calibration, synthetic data generation, text processing, and video data processing for embodied AI. The project also includes data quality and analysis tools for dataset profiling, visualization, and model evaluation, as well as operators for building RAG indexes that extract, normalize, chunk, deduplicate, and profile content for retrieval-augmented generation systems.
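Two of the deduplication styles named above, exact document-level matching and line-level removal based on global frequency, can be sketched in a few lines. This is an illustration of the general techniques under simple assumptions (lowercased, stripped text as the matching key; a document-frequency threshold chosen arbitrarily), not the framework's implementation. The function names and the max_doc_fraction parameter are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def exact_dedup(docs: List[str]) -> List[str]:
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen = set()
    kept = []
    for doc in docs:
        # Assumed normalization: trim and lowercase before hashing.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def drop_frequent_lines(docs: List[str], max_doc_fraction: float = 0.5) -> List[str]:
    """Remove lines that occur in more than max_doc_fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each distinct non-empty line once per document.
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})
    limit = max_doc_fraction * len(docs)
    cleaned = []
    for doc in docs:
        lines = [line for line in doc.splitlines() if doc_freq[line.strip()] <= limit]
        cleaned.append("\n".join(lines))
    return cleaned


docs = [
    "Intro A\n(c) Example Corp",
    "Intro B\n(c) Example Corp",
    "Intro C\n(c) Example Corp",
    "intro a\n(c) Example Corp",  # same as the first document after normalization
]

unique = exact_dedup(docs)
cleaned = drop_frequent_lines(unique)
print(cleaned)
```

Exact matching catches only documents that are identical after normalization; near-duplicates need fuzzy methods such as MinHash, which this sketch leaves out.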

Support is available through a Q&A copilot integrated into the documentation and chat platforms.

Features

Multimodal Data Processing - Applies over 200 built-in operators to clean, deduplicate, and transform text, image, audio, and video data for AI training.

Data Curation Pipelines - Prepares datasets for pre-training, fine-tuning, and evaluation of large language and multimodal models.

Declarative Data Recipes - Defines reproducible, version-controlled data workflows as declarative YAML files without imperative code.

Dataset Curation - Applies configurable filters and enhancement operators to upgrade the quality of existing datasets.

Text Dataset Curators - Applies filtering, enhancement, and mixing strategies to upgrade pre-training and post-tuning datasets.

Ray-Based Data Processing - Distributes data processing across Ray clusters with automatic parallelism and fault-tolerant checkpointing.

LLM and VLM Inference Pipelines - Executes large language and vision model inference pipelines using Ray and vLLM for distributed processing.

Dataset Batch Inference - Runs large-scale inference jobs that process entire datasets through a language model in a single execution.

Foundation Models - Filters, deduplicates, and structures datasets for pre-training, fine-tuning, and evaluation of foundation models.

Information Extraction - Uses language models to extract structured fields from unstructured text within data pipelines.

LLM-Based Data Transformations - Applies semantic operators that use large language models to extract, filter, and structure data.

Multimodal AI Toolkits - Ships a toolkit with built-in operators for text, image, audio, and video data curation.

LLM-Based Classifiers - Runs a quality classifier on web-crawled text to filter low-quality samples.

Training Data Curators - Prepares datasets for pre-training, fine-tuning, reinforcement learning, and evaluation of large models.

LLM Frameworks and Libraries - Provides a library of operators that leverage large language models for semantic extraction and filtering.

Data Preprocessing for Modeling - Prepares and filters datasets for pre-training, fine-tuning, reinforcement learning, and evaluation of large AI models.

Multimodal Data Curators - Provides a dedicated framework for curating multimodal datasets for training large language and vision models.

Multimodal Data Preprocessing - Cleans, filters, deduplicates, and transforms multimodal datasets to prepare them for training large language and vision models.

Cross-Block Sample Deduplications - Detects and removes duplicate samples using exact matching or fuzzy methods across modalities.

Distributed Sample Deduplications - Removes duplicate samples across a Ray cluster using exact matching or MinHash LSH.

Multimodal Operator Registry - Provides a unified registry for text, image, audio, and video operators within a single pipeline.

Duplicate Sample Removals - Removes duplicate samples using exact matching or fuzzy hashing methods like MinHash and SimHash.

Data Pipelines - Assembles modular operators into reproducible YAML pipelines for versioning and hot-reloading.

Data Processing Pipelines - Orchestrates configurable sequences of operators to clean, filter, and transform multimodal datasets.

Distributed Computing - Executes data workflows across multiple machines to handle large-scale datasets efficiently.

Distributed Data Processing Engines - Runs data processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism.

Data Processing Recipes - Defines reproducible data workflows as YAML recipes that can be versioned and shared.

Distributed Pipeline Executors - Applies operations like LLM inference and repartitioning across entire datasets using distributed engines.

Distributed Data Processing - Scales data processing across multiple machines to handle large datasets efficiently.

Intra-Dataset Deduplication - Removes duplicate or near-duplicate samples from datasets using fuzzy matching techniques.

Large-Scale Deduplications - Removes duplicate entries from massive corpora using distributed computing on thousands of cores.

LLM-Based Data Extractors and Filters - Applies language model operators to map, extract, and conditionally filter data within processing pipelines.

Text Quality Filtering - Removes samples that fail quality checks based on configurable metrics.

Multimodal Quality Filters - Removes samples that fail configurable quality checks on text, image, audio, or video attributes.

Metadata-Based Analysis Filters - Keeps or removes text samples based on analysis from a large language model.

Unified Cluster Scaling - Runs data pipelines on a single machine or across thousands of nodes without custom glue code.

Data Pipeline Scaling - Processes billions of samples across hundreds of compute nodes with automatic operator fusion and adaptive parallelism.

Distributed Data Workload Scaling - Distributes data processing across thousands of nodes to handle billions of samples in hours.

Data Operation Pipelines - Chains 200+ built-in operators into reproducible YAML pipelines for multimodal data.

Fused Operation Pipelines - Automatically merges adjacent data operators into fused execution units to reduce I/O and scheduling overhead.

Execution Pipelines - Provides a configurable pipeline execution engine that runs sequences of data processing operators.

Document-Level Deduplications - Removes duplicate text samples by comparing documents using exact matching or fingerprinting.

Modular Data Pipelines - Assembles reproducible YAML pipelines from 200+ composable operators with hot-reloading.

ML Pipeline Reproducibility - Defines data processing workflows as version-controlled YAML recipes for sharing and reuse.

Data Pipeline Lineage Inspectors - Tracks the origin and transformation history of each sample through the processing pipeline.

Multimodal Quality Gates - Provides configurable quality gates that filter low-quality multimodal samples using computed metrics.

Agentic Interaction Training - Cleans, structures, and quality-gates tool traces and conversation logs for training agent systems.

Agent Data Synthesizers and Analyzers - Provides operators for cleaning and analyzing agent interaction datasets for training.

Agent Interaction Data Cleaners - Provides operators for cleaning, structuring, and quality-gating agent interaction traces for training.

Agent Data Cleaners - Provides operators for cleaning and quality-gating agent interaction logs for training.

Agent Data Preparations - Provides operators for cleaning and structuring agent interaction datasets for training.

Task-Specific Synthetic Data - Generates synthetic datasets for specific tasks using large models.

Dataset Quality Analyzers - Computes statistics and profiles on datasets to assess quality, diversity, and distribution before model training.

Dataset Statistics Analyzers - Computes aggregate metrics and visual summaries to inform processing decisions.

Document Chunking & Embedding - Provides document chunking and embedding operators for preparing RAG pipeline inputs.

Alignment-Based Filters - Keeps or removes samples based on similarity between image and text content.

Domain-Specific Processing Pipelines - Applies tailored pipelines to scientific literature, code, or instruction data for model training.

Dataset Samplers and Mixers - Provides operators to select and combine samples from multiple datasets using configurable mixing strategies.

Model Feedback Loops - Iterates on data processing and model training together using model performance feedback.

Reproducible Dataset Builders - Recreates published training datasets by applying documented steps to raw sources.

RAG Data Pipelines - Ships operators for extracting, normalizing, chunking, and deduplicating content for RAG indexes.

Agent Interaction Data Processors - Provides operators for cleaning and quality-checking agent interaction data for training.

Synthetic Data Generation - Generates or augments datasets using configurable recipes within an isolated environment.

Ranking-Based Selections - Chooses a subset of samples from a dataset based on field frequency, random selection, or sorted field values.

Text Cleaning Utilities - Edits text samples by cleaning HTML, emails, links, or converting character sets.

Language Confidence Filters - Keeps or removes text samples based on detected language and confidence score.

Semantic Content Deduplication - Removes boilerplate lines, templates, and copyright notices using global frequency analysis.

Text Transformation Functions - Edits individual samples by cleaning text, converting formats, adding noise, or extracting image attributes.

Dataset Exports - Outputs cleaned and transformed data in various formats, including sharded, parallel, and S3 exports.

Field Transformations - Applies user-defined mapping functions to modify, enrich, or clean individual dataset fields.

Line-Level Deduplications - Removes duplicate lines across documents using global frequency analysis to eliminate boilerplate.

Large-Scale Line Deduplications - Removes duplicate content across documents and lines using global frequency analysis to reduce redundancy.

Out-of-Core Processing - Processes tens of billions of samples or terabytes of data in hours using distributed computing and automatic operator fusion.

Data Pipeline Optimizations - Accelerates data processing with automatic operator fusion and adaptive parallelism for faster execution.

RAG Index Builders - Extracts, normalizes, chunks, and deduplicates content to build high-quality indexes for retrieval-augmented generation.

Property-Based Filters - Keeps or removes audio samples based on duration, file size, or signal-to-noise ratio.

Exact Image Deduplications - Removes duplicate image samples by comparing images using exact matching.

Custom Task Operator Extensions - Allows creating new data processing operators by inheriting base classes and registering them for automatic discovery.

Property-Based Filters - Keeps or removes image samples based on aesthetics, aspect ratio, face count, or NSFW score.

Text Line Deduplication - Removes boilerplate lines such as templates and copyright notices across documents using global frequency analysis.

Lazy Dependency Injection - Ships a lazy dependency injection system that installs operator-specific packages on first use, keeping the base installation small.

Pipeline Execution Optimizations - Optimizes pipeline execution by fusing operators and adapting parallelism to achieve speedups of 2-10x.

Agent Trajectory Cleaners - Provides operators for sanitizing and structuring agent trajectory logs for training.

Exact Video Deduplications - Removes duplicate video samples by comparing videos using exact matching.

Agent Trace Cleaners - Provides operators for structuring, de-identifying, and quality-gating agent interaction traces.

Agent Trace Processors - Provides operators for cleaning and quality-gating agent tool traces for downstream use.

Sample-Level Lineage Tracing - Records the origin and transformation history of each sample through processing steps for auditability.

Model Evaluation Benchmarks - Automatically runs processed datasets through evaluation frameworks to measure model performance.
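The sample-level lineage tracing listed among the features can be pictured with a small sketch: each sample carries its origin and an append-only history of the operators applied to it. This is only an assumed data layout for illustration. The TracedSample class, the helper functions, the operator names, and the source path are all hypothetical and do not describe how the framework stores lineage.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TracedSample:
    text: str
    source: str                      # where the sample came from
    history: Tuple[str, ...] = ()    # one entry per operator applied


def apply_mapper(sample: TracedSample, name: str,
                 fn: Callable[[str], str]) -> TracedSample:
    """Transform the text and record whether the operator changed it."""
    new_text = fn(sample.text)
    status = "changed" if new_text != sample.text else "unchanged"
    return TracedSample(new_text, sample.source,
                        sample.history + (f"{name}: {status}",))


def apply_filter(sample: TracedSample, name: str,
                 keep: Callable[[str], bool]) -> Tuple[TracedSample, bool]:
    """Record the filter decision; the caller decides what to do with drops."""
    kept = keep(sample.text)
    status = "kept" if kept else "dropped"
    traced = TracedSample(sample.text, sample.source,
                          sample.history + (f"{name}: {status}",))
    return traced, kept


raw = TracedSample("<b>Hello</b>", source="crawl/part-0.jsonl#17")  # made-up origin
mapped = apply_mapper(raw, "strip_tags", lambda t: re.sub(r"<[^>]+>", "", t))
traced, kept = apply_filter(mapped, "min_length", lambda t: len(t) >= 3)
print(traced.source, traced.history, kept)
```

Keeping the history on dropped samples as well as kept ones is what makes the record useful for audits: it shows which operator removed a sample and what had happened to it before that point.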
