30 open-source projects similar to Azure/MachineLearningNotebooks, ranked by the number of features they share with it. Compare stars, activity, and what each project does to find the best MachineLearningNotebooks alternative.
ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving. The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
This repository is a collection of Jupyter notebooks providing reference implementations and templates for building, training, and deploying machine learning models using Amazon SageMaker. It serves as an example library for implementing model architectures and automating the machine learning lifecycle. The library provides practical patterns for machine learning training, data engineering, and model deployment. It includes implementation guides for MLOps, including workflows for model monitoring, lineage tracking, and hyperparameter tuning. The examples cover a broad range of capabilities i
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
TransformerLab is an MLOps orchestration platform and research environment designed for the training, fine-tuning, and evaluation of large language models. It serves as a centralized control plane for managing machine learning jobs and coordinating distributed GPU compute across hybrid cloud and on-premise providers. The platform distinguishes itself through agent-driven model optimization, using AI assistants to analyze metrics and automatically propose and queue hyperparameter experiments. It provides a remote development environment that allows users to launch interactive notebooks, code e
The Hugging Face Hub Python client is a library that provides programmatic access to the Hugging Face Hub, a centralized platform for hosting and collaborating on machine learning models, datasets, and demo applications. It serves as the primary SDK for interacting with the Hub's API, enabling users to download and upload models and datasets, manage repositories, authenticate via tokens or OAuth, and run inference on hosted models through a unified interface. The client distinguishes itself through a comprehensive set of capabilities that go beyond basic file transfer. It includes a CLI exten
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation. The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
This is a PyTorch-based training pipeline designed for reproducible image classification benchmarking on the CIFAR-10 dataset. It integrates GPU-accelerated computation, data augmentation, learning rate scheduling, and checkpointing to produce consistent accuracy measurements across multiple ResNet architectures. The project distinguishes itself by providing a fixed-architecture benchmark suite that trains a predefined set of ResNet variants, from ResNet18 through ResNet152, on CIFAR-10. It implements a step-based learning rate decay schedule at predetermined epochs to stabilize convergence,
Kubeflow is a Kubernetes machine learning platform and containerized toolkit designed to orchestrate the entire machine learning lifecycle. It functions as an MLOps workflow orchestrator and infrastructure layer for building, training, and deploying models within containerized environments. The project provides specialized infrastructure for scaling compute resources and managing GPU workloads for large-scale distributed training. It automates the transition of models from experimental development to production through workflow orchestration and model deployment services. The platform covers
This project is a machine learning experiment tracker and event file generator that enables the recording of scalars, images, and histograms to monitor model performance. It functions as an integration bridge that allows training metrics from PyTorch to be logged into files compatible with the TensorBoard dashboard. The system includes a remote log synchronizer designed to stream experiment data to cloud services. This allows for the remote management and analysis of training results and the comparison of datasets across different training runs. The utility covers a broad range of monitoring
pysheeet is a technical reference library providing a curated collection of code snippets and implementation patterns for advanced Python development, system integration, and high-performance computing. It serves as a comprehensive guide for implementing low-level network programming, native C extensions, and asynchronous and concurrent programming. The project provides specialized frameworks for the development and deployment of large language models, including tools for distributed GPU inference and high-performance serving. It also includes detailed patterns for high-performance computing
DeepCTR-Torch is a deep learning library for building click-through rate prediction models. It provides a modular framework for assembling custom prediction architectures from pre-built core, interaction, and sequence layers, enabling the construction of deep neural networks that estimate click probability from user behavior data. The library specializes in feature interaction modeling, offering components for learning low-order, high-order, and adaptive-order feature crosses. It supports multi-task learning for predicting multiple objectives simultaneously, such as click and conversion rates
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
PyOD is a Python anomaly detection library used to identify outliers in tabular, time series, graph, text, and image data. It provides a collection of algorithms for detecting anomalous data points and includes a unified detector interface that standardizes input and output signatures across its available detection algorithms. The project features a multi-modal outlier detector for identifying anomalies across diverse formats including unstructured text and images, as well as a specialized toolkit for graph-based and time-series anomaly detection. It includes an ensemble framework for combini
Ludwig is a declarative machine learning framework designed for training neural networks and large language models using configuration files instead of manual coding. It functions as a multimodal model builder and a low-code tool for supervised fine-tuning, allowing users to build models that process mixed inputs of text, images, audio, and tabular data. The project distinguishes itself through an automated hyperparameter optimizer and a system for large language model fine-tuning using parameter-efficient adapters. It features a multimodal data pipeline and the ability to automatically gener
This is a cross-platform framework for building, training, and deploying custom machine learning models within the .NET ecosystem. It provides a predictive modeling engine for classification, regression, and forecasting tasks, alongside an inference runtime to generate predictions across different hardware architectures. The framework includes a gradient boosting library and supports interoperability with external models via a standardized open format. It features tools for prediction explainability, allowing the analysis of feature importance to debug model behavior and identify bias. The p
This project is an educational resource and software architecture framework focused on the technical foundations of large language model engineering. It provides a collection of guides and design patterns for building and maintaining professional, scalable systems using large language models. The resource outlines practical implementation patterns for orchestrating workflows that combine prompt engineering, model calls, and vector databases. It focuses on transforming prompt development into a structured engineering process to ensure reliable model outputs in production environments. The cov
This is a reference guide for designing, deploying, and maintaining production-ready machine learning systems, grounded in MLOps best practices. It covers the complete machine learning lifecycle, from system design and workflow planning through to deployment and ongoing maintenance, with a focus on reliability, scalability, and maintainability as business requirements evolve. The guide provides an architecture reference for establishing shared ML infrastructure, including model registries and feature stores that standardize asset reuse across teams. It details pipeline automation through conf
SLIME is a distributed reinforcement learning framework for large language model post-training that bridges Megatron training with SGLang inference servers. It orchestrates scalable RL loops across GPU clusters, decoupling training and inference into independent processes that communicate over HTTP and NCCL for independent scaling and fault tolerance. The system supports multi-agent reinforcement learning workflows with parallel agent instances, customizable rollout strategies, and personalized agent serving that improves models from prior conversations without disrupting API serving. The fra
Gorse is a personalized recommendation engine server and machine learning pipeline designed to suggest items to users based on their behavior and preferences. It operates as a distributed system that separates training, candidate generation, and serving nodes to support high-throughput workloads. The system utilizes a multi-stage recommendation pipeline to refine results through retrieval, scoring, and reranking. It generates personalized suggestions using collaborative filtering, matrix factorization, and item-to-item similarity models, while also providing non-personalized and fallback reco
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
This project is a Transformer machine translation model and attention-based neural network implemented using the PyTorch deep learning framework. It functions as a text-to-text translation tool designed to convert source sequences into target language text. The implementation focuses on neural machine translation, covering the development of sequence-to-sequence architectures. It includes the full pipeline for translation, from text sequence preprocessing and vocabulary creation to model training and text generation inference. The system incorporates standard transformer components such as a
Cube Studio is a cloud-native MLOps platform and Kubernetes-based AI orchestrator designed for the entire machine learning lifecycle. It provides a distributed training framework for large-scale model fine-tuning, a GPU resource manager for hardware virtualization, and an ML pipeline orchestrator that uses visual directed acyclic graphs to manage end-to-end workflows. The platform distinguishes itself through its specialized LLM inference server, which supports retrieval-augmented generation and the construction of private knowledge bases. It features a dedicated system for supervised fine-tu