30 open-source projects similar to cleanlab/cleanlab, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Cleanlab alternative.
A general purpose recommender metrics library for fair evaluation.
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)
A PyTorch and TorchDrug based deep learning library for drug pair scoring. (KDD 2022)
sktime is a machine learning framework designed for time series analysis. It provides a unified interface for performing time series forecasting, classification, and anomaly detection, integrating these capabilities into a standardized toolkit compatible with the scikit-learn API. The framework allows for the construction of complex analysis workflows through model pipelining and ensemble-based aggregation. It uses adapter-based integration to wrap external time series libraries, providing a single entry point for diverse algorithmic implementations. Its capabilities cover temporal data tran
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to store full training datasets in memory. The library distinguishes itself through a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a broad
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
The official implementation of "The Shapley Value of Classifiers in Ensemble Games" (CIKM 2021).
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
Examples of Machine Learning code using Comet.ml
Albumentations is a computer vision image augmentation library designed to increase training data diversity for deep learning models. It provides a toolset for applying geometric and color transformations to images and annotations, including a specialized collection of 3D operations for volumetric data used in medical and scientific imaging. The library functions as an image mask and bounding box transformer, automatically updating masks, bounding boxes, and keypoints when images undergo geometric changes. This ensures that spatial alterations remain synchronized across images and their assoc
This repository is a comprehensive collection of instructional guides and practical examples for Python development, focusing on machine learning, data science, and web scraping. It provides implementations for neural networks, reinforcement learning algorithms, and deep learning architectures using PyTorch, alongside detailed manuals for scientific computing and data visualization. The project distinguishes itself by offering specialized tutorials on concurrent programming to optimize CPU performance and guides for setting up Linux development environments. It covers the implementation of ad
Lightly is a self-supervised learning framework and computer vision data curation tool designed to manage large image datasets and train models on unlabeled data. It functions as a PyTorch vision library and dataset management SDK, providing tools to convert raw images into high-dimensional vectors for similarity search, visualization, and feature extraction. The project implements a variety of self-supervised architectures, including MoCo, SimCLR, VICReg, Barlow Twins, and masked image modeling. It distinguishes itself by combining these learning frameworks with active learning capabilities,
Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation. The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
PyOD is a Python anomaly detection library used to identify outliers in tabular, time series, graph, text, and image data. It provides a collection of algorithms for detecting anomalous data points and includes a unified detector interface that standardizes input and output signatures across its available detection algorithms. The project features a multi-modal outlier detector for identifying anomalies across diverse formats including unstructured text and images, as well as a specialized toolkit for graph-based and time-series anomaly detection. It includes an ensemble framework for combini
This project serves as an educational and practical resource for mastering machine learning workflows using Python. It provides a comprehensive collection of code examples and exercises designed to guide users through the implementation of predictive systems, ranging from fundamental algorithms to deep learning architectures. The repository distinguishes itself by offering a structured approach to both classical machine learning and neural network training. It covers the full lifecycle of model development, including the orchestration of reusable data transformation pipelines, advanced ensemb
CVAT is an open-source computer vision annotation tool and visual dataset management platform. It provides a self-hosted interface for labeling images, videos, and 3D data to create datasets for vision AI models. The platform features AI-assisted data labeling to automate the creation of masks and bounding boxes, utilizing a plug-in system to connect external machine learning models. It includes a consensus-based quality assurance system that verifies label accuracy by comparing independent annotations. The system covers collaborative team management, project organization through task decomp
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Deepchecks is a machine learning model validation framework and MLOps testing library. It serves as an AI data quality suite and performance evaluator designed to verify the integrity and performance of models and datasets from research through production. The project functions as a model monitoring tool for tracking data drift and performance degradation in production environments. It allows for the creation of custom validation suites and utilizes a pluggable check architecture to automate quality checks within continuous integration pipelines. The framework covers a broad range of capabil
Apache Mahout - an environment for quickly creating scalable, performant machine learning applications.
CML is a pipeline automation tool for training and evaluating machine learning models, functioning as a CI/CD system for machine learning. It serves as a cloud compute orchestrator and Git-based workflow manager that automates model training cycles through branch management, automated commits, and integrated reporting. The project distinguishes itself by provisioning ephemeral cloud instances or Kubernetes nodes to provide specialized hardware for compute-heavy tasks. It also manages remote compute runners, allowing the connection of self-hosted GPU clusters or on-premise machines to execute
Comet LLM is an observability platform and evaluation framework designed for large language model applications and agentic workflows. It functions as a system for tracing, monitoring, and debugging execution flows while providing tools for prompt optimization and the enforcement of AI safety guardrails. The platform distinguishes itself through a combination of model-based scoring and heuristic metrics to quantify output quality and detect hallucinations. It includes a dedicated prompt and agent optimizer with an interactive playground for refining templates and tool configurations. For retri
CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu
Source code from the book Genetic Algorithms with Python by Clinton Sheppard
Evidently is an AI observability platform and evaluation framework designed to quantify the performance of machine learning models and large language models. It functions as a monitoring tool for detecting data drift and quality degradation in tabular datasets, while providing a specialized analyzer for the faithfulness and correctness of retrieval augmented generation systems. The project distinguishes itself through an evaluation framework that utilizes judge models and custom rubrics to score language model outputs. It includes tools for iterative prompt optimization and the generation of
JAX is a hardware-accelerated array library and automatic differentiation system for numerical computing. It provides a framework compatible with NumPy that extends array operations with a just-in-time compiler to transform Python functions into optimized kernels for execution on GPU and TPU accelerators. The system differentiates itself through the use of an XLA-based compiler and a single program multiple data sharding model. These capabilities allow the library to distribute large-scale computations across multiple hardware accelerators using both automatic parallelization and manual shard