The visitor is looking for tools to manage, clean, and prepare raw data for machine learning model training pipelines.

doccano/doccano is the closest match: it is a collaborative platform specifically designed for data labeling and annotation, a critical component of the data preparation pipeline for machine learning. Other strong matches: apache/spark, heartexlabs/label-studio, apache/beam, openrefine/openrefine.

Why does doccano/doccano match “a toolkit for dataset cleaning and curation”?

This is a collaborative platform specifically designed for data labeling and annotation, which serves as a critical component of the data preparation pipeline for machine learning.

Why does apache/spark match “a toolkit for dataset cleaning and curation”?

Apache Spark is a powerful distributed processing engine that provides the large-scale data computation and transformation capabilities required to prepare massive datasets for machine learning pipelines.

Why does heartexlabs/label-studio match “a toolkit for dataset cleaning and curation”?

Label Studio is a specialized data annotation and labeling platform that serves as a critical component for preparing training datasets, though it focuses on the human-in-the-loop labeling stage rather than general-purpose data cleaning or pipeline orchestration.

Why does apache/beam match “a toolkit for dataset cleaning and curation”?

Apache Beam is a powerful distributed processing framework that provides the core orchestration and transformation capabilities needed to build scalable data preparation and ETL pipelines for machine learning.

Why does openrefine/openrefine match “a toolkit for dataset cleaning and curation”?

OpenRefine is a powerful platform for cleaning, transforming, and standardizing messy datasets, though it focuses more on interactive data wrangling than on full-scale pipeline orchestration or automated labeling.

Dataset Cleaning and Preparation Tools

These open-source libraries and frameworks automate data validation, transformation, and cleaning tasks for machine learning.


doccano/doccano
10,674
Doccano is a collaborative data labeling platform and machine learning dataset management system. It provides a web-based interface for teams to import raw text, mark datasets, and export structured annotations for model training. The project specifically supports text annotation for classification and named entity recognition tasks. It enables teams to coordinate multiple users on a single project to maintain consistent labeling guidelines and increase the speed of dataset creation. The system includes tools for data management and team coordination, providing the ability to import raw data
This is a collaborative platform specifically designed for data labeling and annotation, which serves as a critical component of the data preparation pipeline for machine learning.
Python · Data Labeling Platforms · Data Labeling Tools · Data Labeling Interfaces
apache/spark
43,467
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Apache Spark is a powerful distributed processing engine that provides the large-scale data computation and transformation capabilities required to prepare massive datasets for machine learning pipelines.
Scala · Distributed Data Processing Frameworks · Large-Scale Data Computation · Distributed Datasets
heartexlabs/label-studio
27,626
Label Studio is a multi-type data labeling tool and data annotation workspace designed to prepare datasets for machine learning training. It functions as a cloud-integrated data pipeline that imports raw data from storage, manages the annotation process, and exports labels into standardized formats. The platform features a machine learning model integration framework that connects to external model servers. This enables model-assisted annotation and active learning, allowing the system to perform pre-labeling and refine predictions based on human feedback. The software provides project manag
Label Studio is a specialized data annotation and labeling platform that serves as a critical component for preparing training datasets, though it focuses on the human-in-the-loop labeling stage rather than general-purpose data cleaning or pipeline orchestration.
TypeScript · Data Labeling Platforms · Data Annotation Workflows
apache/beam
8,612
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Apache Beam is a powerful distributed processing framework that provides the core orchestration and transformation capabilities needed to build scalable data preparation and ETL pipelines for machine learning.
Java · Distributed Computing · Distributed Data Processing Frameworks
OpenRefine/OpenRefine
11,866
OpenRefine is a data cleaning tool and wrangling platform used to transform raw, messy datasets into consistent and structured formats. It operates as a Java-based data processor that runs a local server and provides a web browser interface for managing and manipulating data. The platform includes a data reconciliation engine for matching local entries against external knowledge bases to standardize entities. It also functions as a web data augmentation tool, allowing users to fetch and integrate information from external web sources to enrich their datasets. The system provides a transforma
OpenRefine is a powerful platform for cleaning, transforming, and standardizing messy datasets, though it focuses more on interactive data wrangling than on full-scale pipeline orchestration or automated labeling.
Java · Data Cleaning Utilities
ydataai/ydata-profiling
13,388
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
This tool provides automated data quality assessment, profiling, and drift detection, serving as a critical diagnostic component for cleaning and preparing datasets for machine learning pipelines.
Python · Distributed Computing · Data Quality Frameworks
featuretools/featuretools
7,655
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Featuretools is a specialized library for automated feature engineering and relational data transformation, which serves as a powerful component for preparing data for machine learning pipelines even though it lacks built-in data labeling or versioning features.
Python · Distributed Data Processing Frameworks · Large-Scale Data Computation
alteryx/featuretools
7,658
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
This library focuses on automated feature engineering and transformation for machine learning, serving as a specialized tool for preparing data features rather than a general-purpose data cleaning or pipeline orchestration platform.
Python · Distributed Computing · Large-Scale Data Computation
dagster-io/dagster
14,974
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Dagster is a robust data orchestration platform that manages the lifecycle and quality of data assets, providing the necessary pipeline infrastructure to support complex data preparation and cleaning workflows for machine learning.
Python · Distributed Computing · Data Quality Frameworks · Data Lineage
HumanSignal/label-studio
27,619
Label Studio is a multi-modal data annotation platform designed to create and manage high-quality training datasets for machine learning. It functions as a self-hosted, containerized environment that supports secure, private deployments, including air-gapped configurations. The platform provides a centralized workspace for labeling diverse media types, such as images, text, audio, and time-series data, to support supervised and reinforcement learning workflows. The platform distinguishes itself through deep integration with machine learning backends, enabling active learning loops, automated
Label Studio is a specialized platform for data labeling and annotation that integrates into machine learning pipelines, making it a highly relevant tool for the data preparation phase of model training.
TypeScript · Data Labeling Tools · Data Annotation Workflows
modin-project/modin
10,389
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
This is a distributed dataframe library designed for parallel data manipulation and processing, serving as a building block for data pipelines rather than a comprehensive tool for data validation, versioning, or labeling.
Python · Distributed Compute Frameworks · Distributed Data Processing Frameworks · Large-Scale Data Computation
cleanlab/cleanlab
11,513
Cleanlab is a data-centric AI library and toolkit designed to improve machine learning model performance by detecting label errors and increasing overall dataset quality. It implements a confident learning framework that iteratively refines label noise estimates by comparing model predictions with estimated label probabilities to identify mislabeled examples. The project provides specialized utilities for active learning optimization, allowing for the selection of the most impactful examples for labeling or re-labeling. It also includes an outlier detection tool to identify atypical data poin
Cleanlab is a specialized library for identifying label errors, detecting outliers, and improving dataset quality, which directly addresses the data cleaning and validation aspects of your machine learning pipeline.
Python · Data Cleaning Procedures
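The idea of comparing model predictions against the given labels can be illustrated with a small standard-library sketch. This is not Cleanlab's API or its full iterative procedure: the function name `find_label_issues`, the single-pass per-class threshold rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def find_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels: list of int class ids (the possibly noisy given labels).
    pred_probs: per-example lists of class probabilities from any classifier.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class k
    # over the examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = [
        sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)
    ]

    issues = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so no evidence against the label
        likely = max(confident, key=lambda k: probs[k])
        if likely != y:
            issues.append(i)
    return issues


# Toy data (hypothetical): example 2 is labeled 0 but predicted as class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspect = find_label_issues(labels, pred_probs)
print(suspect)
```

Flagged indices are candidates for human review or re-labeling rather than confirmed errors.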
dathere/qsv
3,687
qsv is a high-performance command line toolkit for querying, transforming, and analyzing comma-separated value files. It functions as a data wrangling interface and a tabular data profiler, featuring a query engine capable of executing SQL statements and joins directly on flat files without requiring a database. The project is distinguished by its ability to process massive datasets that exceed available system memory. This is achieved through disk-based external memory processing, including multithreaded merge sorting, on-disk hash tables for deduplication, and lightweight file indexing for
This toolkit provides robust command-line utilities for cleaning, profiling, and transforming tabular data, making it a highly effective tool for the data wrangling phase of a machine learning pipeline.
Rust · Data Cleaning Procedures · Tabular Data Wrangling
apache/hadoop
15,567
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Hadoop is a foundational distributed storage and processing infrastructure that provides the underlying compute engine for big data, but it lacks the built-in data validation, labeling, and pipeline orchestration features required for a dedicated data preparation and cleaning tool.
Java · Distributed Computing · Distributed Data Processing Frameworks · Large-Scale Data Computation
cvat-ai/cvat
15,317
CVAT is an open-source, web-based platform designed for annotating images, videos, and 3D point clouds to create high-quality training datasets for machine learning. It functions as a containerized server that orchestrates the entire lifecycle of computer vision data, from initial task creation and manual labeling to quality assurance and final dataset export. The platform distinguishes itself through deep integration with machine learning models, allowing users to deploy custom AI models as serverless functions for automated object detection, tracking, and skeleton annotation. It supports co
This is a specialized platform for data labeling and annotation, which is a critical component of the data preparation pipeline for computer vision models, though it does not provide general-purpose data cleaning or transformation features.
Python · Data Annotation Workflows · Spatial Data Labeling
dask/dask
13,746
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
Dask is a powerful distributed computing framework that provides the large-scale data processing and pipeline orchestration necessary for preparing massive datasets, though it requires integration with other libraries for specific tasks like data labeling or versioning.
Python · Distributed Computing · Distributed Datasets
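Representing tasks and their dependencies as a directed acyclic graph, and deferring work until a result is requested, can be sketched in plain Python. This is a toy illustration rather than Dask's scheduler or API: the `compute` helper, the `calls` log, and the graph contents are hypothetical.

```python
from operator import add, mul

# A task graph: each key maps either to a literal value or to a
# (function, *dependency_keys) tuple describing deferred work.
graph = {
    "x": 2,
    "y": 3,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", "y"),
}

calls = []  # records the order in which tasks actually execute


def compute(graph, key, cache=None):
    """Evaluate `key` by recursively evaluating its dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        args = [compute(graph, dep, cache) for dep in deps]
        calls.append(key)
        result = func(*args)
    else:
        result = task  # a literal input value
    cache[key] = result
    return result


# Building the graph above ran nothing; work happens only on request.
result = compute(graph, "product")  # (2 + 3) * 3
print(result, calls)
```

Because dependencies are explicit in the graph, a real scheduler can run independent tasks in parallel; this sketch evaluates them sequentially.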
pandas-dev/pandas
49,039
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Pandas is a foundational library for data manipulation and cleaning that provides the essential programmatic tools to transform and normalize structured datasets for machine learning pipelines.
Python · Data Cleaning Utilities
wireservice/csvkit
6,390
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
This toolkit provides a robust set of command-line utilities for cleaning, filtering, and transforming tabular data, making it a practical choice for the initial stages of a data preparation pipeline.
Python · Data Cleaning Procedures
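Column-type inference of the kind mentioned above can be approximated with the standard library alone. The sketch below is not csvkit's implementation: the helper names, the three type labels, and the sample data are assumptions, and it tries only integer, then number, then falls back to text.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "text"
    for name, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue  # this type does not fit; try the next, wider one
    return "text"


def infer_column_types(csv_text):
    """Map each header name to an inferred type for its column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = list(zip(*reader))  # transpose data rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Hypothetical sample: the empty price cell is treated as missing.
sample = "id,price,city\n1,9.50,Oslo\n2,,Lima\n3,12,Rome\n"
types = infer_column_types(sample)
print(types)
```

A production tool would also consider dates, booleans, and ragged rows, which this sketch ignores.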
feast-dev/feast
6,727
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast is a feature store designed to manage and serve features for machine learning pipelines, which serves as a critical component for data preparation and consistency in training workflows.
Python · Distributed Computing
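A point-in-time correct join gives each training event the most recent feature value known at or before the event's timestamp, so values from the future never leak into training data. The sketch below illustrates that rule with the standard library only. It is not Feast's API: the function name, the tuple layout, and the sample rows are hypothetical.

```python
from bisect import bisect_right


def point_in_time_join(events, feature_rows):
    """Attach to each (entity, event_ts) the latest value with ts <= event_ts.

    events: list of (entity, event_ts) tuples.
    feature_rows: list of (entity, ts, value) tuples, in any order.
    """
    # Group feature history per entity, sorted by timestamp.
    history = {}
    for entity, ts, value in sorted(feature_rows):
        timestamps, values = history.setdefault(entity, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for entity, event_ts in events:
        timestamps, values = history.get(entity, ([], []))
        # Number of feature rows with ts <= event_ts.
        idx = bisect_right(timestamps, event_ts)
        joined.append((entity, event_ts, values[idx - 1] if idx else None))
    return joined


feature_rows = [
    ("driver_1", 10, 0.5),
    ("driver_1", 20, 0.7),
    ("driver_2", 15, 0.9),
]
events = [("driver_1", 12), ("driver_1", 25), ("driver_2", 5)]
training_rows = point_in_time_join(events, feature_rows)
print(training_rows)
```

The event for `driver_2` at time 5 receives `None` because its only feature value was recorded later, at time 15.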
