# Dataset Cleaning and Preparation Tools

> Search results for `curate and clean datasets before training` on awesome-repositories.com. 116 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/curate-and-clean-datasets-before-training

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/curate-and-clean-datasets-before-training).**

## Results

- [huggingface/pytorch-image-models](https://awesome-repositories.com/repository/huggingface-pytorch-image-models.md) (36,893 ⭐) — This project is a comprehensive library of state-of-the-art neural network architectures designed for image classification and feature extraction. It provides a complete deep learning training framework that supports distributed execution, allowing users to build, train, and fine-tune vision models using optimized schedulers and pre-configured training recipes.

The library distinguishes itself through a modular backbone architecture that treats neural networks as decoupled feature extractors, enabling the retrieval of multi-scale outputs for downstream tasks like object detection and segmenta
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [huggingface/datasets](https://awesome-repositories.com/repository/huggingface-datasets.md) (21,643 ⭐) — Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams.

The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
- [alibaba/mnn](https://awesome-repositories.com/repository/alibaba-mnn.md) (14,242 ⭐) — MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices.

The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
- [baato/before-after](https://awesome-repositories.com/repository/baato-before-after.md) (28 ⭐) — Technical stack for generating before-after map (with vector tiles), which allows users to understand how map data in OSM has changed over time.
- [huggingface/smolagents](https://awesome-repositories.com/repository/huggingface-smolagents.md) (27,885 ⭐) — This framework provides a development toolkit for building autonomous agents that utilize language models to solve complex, non-deterministic tasks. Its core design centers on a code-executing architecture where agents generate and run Python code snippets to perform logic, data manipulation, and tool interactions. By moving beyond structured data formats, the system enables agents to manage program flow and object state through iterative reasoning cycles.

The project distinguishes itself through its focus on code-based agent implementation and secure execution environments. Developers can ch
- [openmanus/openmanus-rl](https://awesome-repositories.com/repository/openmanus-openmanus-rl.md) (3,916 ⭐) — OpenManus-RL is a reinforcement learning framework and distributed training pipeline designed to train large language models as agents. It serves as an agentic reasoning optimizer and reward model trainer, providing the infrastructure to improve model decision-making through reward-based policy optimization.

The project distinguishes itself through a distributed architecture that supports parameter sharding across multiple compute nodes and a coordinated rollout system for collecting interaction trajectories. It incorporates advanced reasoning strategies, such as Tree-of-Thoughts and Monte Ca
- [nvidia/nemo-curator](https://awesome-repositories.com/repository/nvidia-nemo-curator-2.md) (1,620 ⭐) — Scalable data preprocessing and curation toolkit for LLMs
- [datajuicer/data-juicer](https://awesome-repositories.com/repository/datajuicer-data-juicer.md) (6,574 ⭐) — Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines.

The project distinguishes itself through a YAML-based data recipe sys
- [chiphuyen/aie-book](https://awesome-repositories.com/repository/chiphuyen-aie-book.md) (13,779 ⭐) — This project serves as a comprehensive educational resource and technical handbook for engineers building applications powered by large language models. It provides a structured framework for mastering the principles of artificial intelligence engineering, covering the full lifecycle of model development from initial design to production deployment.

The repository distinguishes itself by offering a deep dive into the practical implementation of advanced design patterns, including retrieval-augmented generation, agentic tool orchestration, and parameter-efficient model adaptation. It emphasize
- [nvidia-nemo/curator](https://awesome-repositories.com/repository/nvidia-nemo-curator.md) (1,619 ⭐) — Scalable data preprocessing and curation toolkit for LLMs
- [istio/istio](https://awesome-repositories.com/repository/istio-istio.md) (38,226 ⭐) — Istio is a service mesh infrastructure that provides a centralized control plane to manage, secure, and observe communication between distributed microservices. It functions as a policy-driven network traffic controller, enabling developers to route, balance, and secure service-to-service traffic without requiring modifications to application code. The system enforces zero-trust security by utilizing mutual transport layer authentication to verify cryptographic identities for every network request.

The project distinguishes itself through a sidecar-less proxy architecture, which offloads netw
- [rednaga/training](https://awesome-repositories.com/repository/rednaga-training.md) (431 ⭐) — Training materials crafted and publicly provided by Red Naga members
- [huggingface/transformers.js](https://awesome-repositories.com/repository/huggingface-transformers-js.md) (15,420 ⭐) — This library is a web-native engine designed to execute pretrained machine learning models directly within the browser. It functions as a client-side inference framework, enabling developers to run complex neural networks for natural language processing, computer vision, and audio tasks without requiring a backend server or external API calls.

The framework distinguishes itself by providing a unified pipeline-based abstraction that handles the entire lifecycle of model execution. It manages the dynamic retrieval of model weights and configurations from remote registries, while simultaneously
- [arize-ai/phoenix](https://awesome-repositories.com/repository/arize-ai-phoenix.md) (8,605 ⭐) — Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments.

The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and
- [netflix/curator](https://awesome-repositories.com/repository/netflix-curator.md) (2,135 ⭐) — ZooKeeper client wrapper and rich ZooKeeper framework
- [huggingface/open-r1](https://awesome-repositories.com/repository/huggingface-open-r1.md) (26,326 ⭐) — Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning.

The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
- [conardli/easy-dataset](https://awesome-repositories.com/repository/conardli-easy-dataset.md) (13,394 ⭐) — Easy-dataset is a comprehensive platform designed for the end-to-end management of machine learning datasets, specifically tailored for language and vision model fine-tuning. It functions as a centralized environment for the entire data lifecycle, encompassing the automated generation of synthetic training data, the structural organization of document collections, and the systematic annotation of individual data points.

The platform distinguishes itself through its integrated evaluation and orchestration capabilities. It provides a dedicated suite for benchmarking models, featuring blind side
- [paulescu/hands-on-train-and-deploy-ml](https://awesome-repositories.com/repository/paulescu-hands-on-train-and-deploy-ml.md) (885 ⭐) — Train and Deploy an ML REST API to predict crypto prices, in 10 steps
- [camel-ai/camel](https://awesome-repositories.com/repository/camel-ai-camel.md) (17,253 ⭐) — This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer.

The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
- [d2l-ai/d2l-zh](https://awesome-repositories.com/repository/d2l-ai-d2l-zh.md) (78,493 ⭐) — This project is an open-source, interactive educational platform designed to teach deep learning through a comprehensive, code-first curriculum. It provides a structured learning path that covers foundational mathematics, modern neural network architectures, and practical optimization techniques, enabling practitioners to master complex artificial intelligence concepts through hands-on experimentation.

The platform distinguishes itself by integrating technical explanations with executable Jupyter notebooks. This design allows readers to modify code and hyperparameters in real-time, facilitati
- [morvanzhou/tutorials](https://awesome-repositories.com/repository/morvanzhou-tutorials.md) (12,952 ⭐) — This repository is a comprehensive collection of instructional guides and practical examples for Python development, focusing on machine learning, data science, and web scraping. It provides implementations for neural networks, reinforcement learning algorithms, and deep learning architectures using PyTorch, alongside detailed manuals for scientific computing and data visualization.

The project distinguishes itself by offering specialized tutorials on concurrent programming to optimize CPU performance and guides for setting up Linux development environments. It covers the implementation of ad
- [gokumohandas/made-with-ml](https://awesome-repositories.com/repository/gokumohandas-made-with-ml.md) (48,343 ⭐) — Made-With-ML is an automated documentation generator and developer experience platform designed to transform source code into structured, searchable reference websites. It functions as a codebase intelligence tool that parses implementation details to provide clear explanations of logic and data requirements.

The system distinguishes itself by leveraging language-level type annotations and structured code comments to generate interface specifications. By utilizing static analysis to extract metadata, it automates the transformation of docstrings into web-ready documentation, ensuring that tec
- [appsecco/breaking-and-pwning-apps-and-servers-aws-azure-training](https://awesome-repositories.com/repository/appsecco-breaking-and-pwning-apps-and-servers-aws-azure-training.md) (952 ⭐) — Course content, lab setup instructions, and documentation for our very popular hands-on training, Breaking and Pwning Apps and Servers on AWS and Azure.
- [dathere/qsv](https://awesome-repositories.com/repository/dathere-qsv.md) (3,687 ⭐) — qsv is a high-performance command line toolkit for querying, transforming, and analyzing comma-separated value files. It functions as a data wrangling interface and a tabular data profiler, featuring a query engine capable of executing SQL statements and joins directly on flat files without requiring a database.

The project is distinguished by its ability to process massive datasets that exceed available system memory. This is achieved through disk-based external memory processing, including multithreaded merge sorting, on-disk hash tables for deduplication, and lightweight file indexing for
- [nglgzz/awesome-clean-tech](https://awesome-repositories.com/repository/nglgzz-awesome-clean-tech.md) (464 ⭐) — A community curated list of awesome clean tech companies
- [google-gemini/cookbook](https://awesome-repositories.com/repository/google-gemini-cookbook.md) (17,418 ⭐) — The Gemini Cookbook is a comprehensive collection of implementation patterns, code samples, and development guides designed for building applications with Google Gemini models. It serves as a central resource for developers to integrate multimodal generative artificial intelligence into their software, providing the necessary frameworks to manage model interactions, stateful workflows, and structured data extraction.

The repository distinguishes itself by offering specialized toolkits for autonomous agent orchestration, enabling the construction of agents that can execute code, browse the web
- [raaminz/training](https://awesome-repositories.com/repository/raaminz-training.md) (28 ⭐) — This repository is all about my training classes
- [wireservice/csvkit](https://awesome-repositories.com/repository/wireservice-csvkit.md) (6,390 ⭐) — csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data.

The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
- [microsoft/vscode-copilot-chat](https://awesome-repositories.com/repository/microsoft-vscode-copilot-chat.md) (9,493 ⭐) — This project is an AI-powered IDE extension and LLM coding assistant that provides a conversational interface for generating, refactoring, and debugging code. It functions as an AI agent framework and a Model Context Protocol client, connecting AI models to external data sources and tools to automate complex development tasks.

The system is distinguished by its use of autonomous AI agents capable of multi-step task execution, including the ability to read files, modify code, and run terminal commands iteratively. It supports recursive agent orchestration through subagent delegation and employ
- [google-gemini/gemini-fullstack-langgraph-quickstart](https://awesome-repositories.com/repository/google-gemini-gemini-fullstack-langgraph-quickstart.md) (18,217 ⭐) — This project is an agentic workflow orchestrator designed for building and deploying autonomous systems that perform multi-step reasoning. It functions as a tool-augmented engine, enabling developers to chain model calls with external function execution to complete complex, user-defined tasks. By integrating large language models with persistent memory and stateful logic, the framework supports the creation of intelligent applications capable of independent operation.

The platform distinguishes itself through graph-based state orchestration, which allows developers to define logic steps and t
- [unsplash/datasets](https://awesome-repositories.com/repository/unsplash-datasets.md) (2,671 ⭐) — This project is an open-source visual dataset and machine learning image library. It provides large-scale collections of high-quality photos and metadata designed for training computer vision models and conducting research into image categorization and retrieval.

The repository specifically offers semantic search datasets that pair images with AI and human-generated keywords to analyze search intent and visual metaphors. It also serves as an image metadata archive, providing structured EXIF data and camera specifications for technical analysis.

The available data covers broad capability area
- [crewaiinc/crewai](https://awesome-repositories.com/repository/crewaiinc-crewai.md) (53,687 ⭐) — CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations.

The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
- [microsoft/data-science-for-beginners](https://awesome-repositories.com/repository/microsoft-data-science-for-beginners.md) (35,657 ⭐) — This project is a comprehensive educational curriculum designed to teach the fundamental concepts, workflows, and tools of data science. It provides a structured learning path that covers the end-to-end data science lifecycle, including data acquisition, maintenance, processing, and pattern discovery, while grounding theoretical knowledge in practical, real-world applications.

The curriculum distinguishes itself through a data-driven pedagogical design that utilizes interactive, notebook-based lessons. By combining narrative text with live code blocks, the platform allows learners to experime
- [bespokelabsai/curator](https://awesome-repositories.com/repository/bespokelabsai-curator.md) (1,637 ⭐)
- [autogluon/autogluon](https://awesome-repositories.com/repository/autogluon-autogluon.md) (9,997 ⭐) — AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning.

The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
- [leomaurodesenv/game-datasets](https://awesome-repositories.com/repository/leomaurodesenv-game-datasets.md) (1,072 ⭐) — A curated list of awesome game datasets and tools for artificial intelligence in games
- [tensorflow/datasets](https://awesome-repositories.com/repository/tensorflow-datasets.md) (4,575 ⭐) — TensorFlow Datasets provides many public datasets as tf.data.Datasets.
- [facebookresearch/audiocraft](https://awesome-repositories.com/repository/facebookresearch-audiocraft.md) (23,379 ⭐) — Audiocraft is a deep learning audio library and machine learning framework designed for training, fine-tuning, and evaluating generative models for music and sound effects. It functions as a text-to-music generative model and a neural audio codec, providing the tools necessary to compress audio signals into discrete representations and synthesize high-fidelity waveforms from textual descriptions.

The framework is distinguished by its ability to combine multiple conditioning signals, allowing for the generation of audio based on text prompts, melodic excerpts, or style-based audio clips. It al
- [cleanlab/cleanlab](https://awesome-repositories.com/repository/cleanlab-cleanlab.md) (11,513 ⭐) — Cleanlab is a data-centric AI library and toolkit designed to improve machine learning model performance by detecting label errors and increasing overall dataset quality. It implements a confident learning framework that iteratively refines label noise estimates by comparing model predictions with estimated label probabilities to identify mislabeled examples.

The project provides specialized utilities for active learning optimization, allowing for the selection of the most impactful examples for labeling or re-labeling. It also includes an outlier detection tool to identify atypical data poin
- [p-e-w/heretic](https://awesome-repositories.com/repository/p-e-w-heretic.md) (8,509 ⭐) — Heretic is a specialized toolkit for removing safety alignment and refusal constraints from transformer-based language models. It utilizes directional ablation to suppress model refusals and restore unrestricted output capabilities.

The project provides a framework for quantifying the effectiveness of these modifications by measuring refusal rates and evaluating divergence from the original model behavior. It also includes a suite for residual vector analysis, allowing for the calculation of geometric relationships between prompts and the visualization of hidden states across model layers.

- [facebookresearch/map-anything](https://awesome-repositories.com/repository/facebookresearch-map-anything.md) (2,915 ⭐) — Map-anything is a 3D scene reconstruction framework and neural geometry estimator designed to transform two-dimensional images into metric three-dimensional spatial representations using feed-forward neural networks. It provides a specialized toolkit for predicting camera intrinsics and ray directions from single images without requiring external geometric metadata.

The project includes a 3D model benchmarking suite that utilizes a unified model wrapper to standardize outputs from diverse reconstruction models. This allows for consistent evaluation and accuracy measurement across various spat
- [subeeshvasu/awsome-gan-training](https://awesome-repositories.com/repository/subeeshvasu-awsome-gan-training.md) (30 ⭐) — A curated list of resources related to training of GANs
- [corentinj/real-time-voice-cloning](https://awesome-repositories.com/repository/corentinj-real-time-voice-cloning.md) (59,918 ⭐) — This project is a neural text-to-speech engine and voice cloning toolkit designed to generate synthetic speech that mimics the vocal characteristics of a target speaker. It functions as a real-time audio synthesizer, utilizing a deep learning pipeline to convert written text into high-fidelity speech output with minimal latency.

The system employs a transfer learning framework that leverages pre-trained speaker verification models to adapt synthesis to new, unseen vocal identities. By using an encoder-based speaker embedding process, the toolkit maps variable-length audio samples into a laten
- [muffinista/before-dawn](https://awesome-repositories.com/repository/muffinista-before-dawn.md) (213 ⭐) — A desktop screensaver app using web technologies
- [datawhalechina/self-llm](https://awesome-repositories.com/repository/datawhalechina-self-llm.md) (30,941 ⭐) — This project is an open-source educational resource providing structured, step-by-step guides for fine-tuning large language models. It focuses on adapting pre-trained transformer-based causal models to custom datasets, enabling users to transfer specific writing styles or domain knowledge into generative AI models.

The repository distinguishes itself by emphasizing parameter-efficient training techniques, specifically low-rank adaptation. By providing practical implementations for updating only a small subset of model weights, it allows for the customization of massive neural networks on con
- [aws/aws-cdk](https://awesome-repositories.com/repository/aws-aws-cdk.md) (12,817 ⭐) — The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane.

The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
- [nielsrogge/transformers-tutorials](https://awesome-repositories.com/repository/nielsrogge-transformers-tutorials.md) (11,641 ⭐) — This is a collection of tutorials and practical demonstrations for implementing machine learning tasks using the HuggingFace Transformers library. It serves as a guide for applying transformer architectures across computer vision, natural language processing, and audio analysis.

The repository provides implementation examples for multimodal model deployment, including the combination of text, image, and audio inputs. It includes resources for optimizing pre-trained models through fine-tuning on custom datasets and provides examples for preparing PyTorch datasets by converting raw files into t
- [xsahil03x/before_after](https://awesome-repositories.com/repository/xsahil03x-before-after.md) (1,026 ⭐)
- [michael0x2a/curated-programming-resources](https://awesome-repositories.com/repository/michael0x2a-curated-programming-resources.md) (3,238 ⭐) — A curated and annotated list of resources for learning programming and computer science.