2 repositories
Documenting the specific steps of data transformation through scripts or pseudocode to ensure computational reproducibility.
Distinct from Step Data Mappers: that category covers UI steps or AI step mappers, not the documentation of data cleaning recipes.
Explore 2 awesome GitHub repositories matching data & databases · Data Processing Recipes.
This project is a research data sharing framework and provenance protocol designed to ensure computational reproducibility. It provides a standardized set of guidelines for transforming raw source data into tidy formats through documented processing scripts and cleaning workflows. The framework distinguishes itself by emphasizing a strict provenance-based packaging system. It requires the organization of raw data, processing recipes, and code books into a single package, ensuring that original unmodified sources are preserved to allow for independent verification of all transformation steps.
Creates script or pseudocode recipes that convert raw data into tidy datasets to ensure computational reproducibility.
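As a rough illustration of what a documented raw-to-tidy recipe might look like, here is a minimal Python sketch. The table layout, the column names, the `make_tidy` function, and the code book entries are all hypothetical and are not taken from the project itself. The sketch only assumes the general idea described above: the raw data stays unmodified, the recipe is an explicit script, and a code book describes each tidy variable.

```python
import copy

# Hypothetical raw table: one row per subject, one column per week (wide format).
RAW_ROWS = [
    {"subject": "A", "week1": "5.1", "week2": "4.8"},
    {"subject": "B", "week1": "", "week2": "6.0"},
]

# Hypothetical code book: one entry per variable in the tidy output.
CODE_BOOK = {
    "subject": "Subject identifier, copied unchanged from the raw table",
    "week": "Measurement week as an integer, parsed from the raw column name",
    "score": "Measured score as a float; None when the raw cell was empty",
}


def make_tidy(raw_rows):
    """Recipe: reshape wide raw rows into one observation per row."""
    tidy = []
    for row in raw_rows:
        for column, cell in row.items():
            if not column.startswith("week"):
                continue  # only the weekN columns hold measurements
            tidy.append({
                "subject": row["subject"],
                "week": int(column[len("week"):]),
                "score": float(cell) if cell != "" else None,
            })
    return tidy


# Keep a snapshot so we can check that the recipe never mutates the raw data.
raw_snapshot = copy.deepcopy(RAW_ROWS)
TIDY_ROWS = make_tidy(RAW_ROWS)
assert RAW_ROWS == raw_snapshot
```

Because the raw rows, the recipe, and the code book live side by side, a reader can rerun `make_tidy` on the untouched source and compare the result against the published tidy data.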
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Defines reproducible data workflows as YAML recipes that can be versioned and shared.
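To make the idea of a declarative, shareable recipe concrete, the following Python sketch runs an ordered list of processing steps described as plain data. It is not Data-Juicer's actual recipe schema or API: the recipe keys (`name`, `process`, `op`), the operator names, and `run_recipe` are hypothetical, and the recipe is written as the kind of dict a YAML file could be parsed into so the example needs no third-party packages.

```python
# Hypothetical recipe, shown as the dict a YAML file might parse into.
RECIPE = {
    "name": "demo-clean",
    "process": [
        {"op": "strip_whitespace"},
        {"op": "min_length_filter", "min_len": 5},
        {"op": "exact_dedup"},
    ],
}


def strip_whitespace(samples):
    """Trim leading and trailing whitespace from every sample."""
    return [s.strip() for s in samples]


def min_length_filter(samples, min_len):
    """Keep only samples with at least min_len characters."""
    return [s for s in samples if len(s) >= min_len]


def exact_dedup(samples):
    """Drop exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(samples))


# Registry mapping the operator names used in a recipe to functions.
OPERATORS = {
    "strip_whitespace": strip_whitespace,
    "min_length_filter": min_length_filter,
    "exact_dedup": exact_dedup,
}


def run_recipe(recipe, samples):
    """Apply each step of the recipe in order and return the result."""
    for step in recipe["process"]:
        params = {k: v for k, v in step.items() if k != "op"}
        samples = OPERATORS[step["op"]](samples, **params)
    return samples


RESULT = run_recipe(
    RECIPE, ["  hello world ", "hi", "hello world", "data recipe"]
)
```

Since the recipe is plain data, it can be committed to version control and diffed like any other file, and the same input run through the same recipe yields the same output.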