ML frameworks and MLOps

Explore libraries for building machine learning models and tools for managing their entire production lifecycle.


keras-team/keras
64,094
Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management. The project distinguishes itself as a multi-backend machine learning engine that decouples high-level model definitions from low-level execution logic. This backend-agnostic architecture enables users to author model code once and deploy it across diverse hardware accelerators and tensor processing frameworks without rewriting core logic. Users can dynamically switch between different computational engines to optimize performance, while native utilities support large-scale distributed training by separating model topology from hardware-specific sharding and parallelism requirements. Beyond its core modeling capabilities, the framework includes an extensive ecosystem for specialized tasks such as hyperparameter optimization, recommendation system development, and the integration of pre-trained generative models for text and image synthesis. It supports both functional composition and object-oriented subclassing, allowing for the creation of custom layers and models that maintain compatibility with standard training loops, data streaming, and callback management. The framework is distributed as a Python package and provides a unified interface for managing the entire training lifecycle, from data pipeline preparation to model serialization and export.
Python, Frameworks, Model Definition, Architectures
eugeneyan/applied-ml
29,783
This project is a comprehensive, curated knowledge base designed to support the development and maintenance of production-grade machine learning systems. It serves as a centralized repository of industry-standard technical literature, engineering case studies, and research papers, providing a structured reference for practitioners navigating the complexities of modern data science and machine learning engineering. The resource distinguishes itself through a cross-domain approach that bridges the gap between academic research and practical implementation. By synthesizing proven industry architectures and operational strategies, it offers a unified framework for managing the entire machine learning lifecycle, from initial data infrastructure and pipeline development to model deployment, versioning, and continuous monitoring. The collection covers a broad spectrum of technical domains, including data quality management, feature engineering, and the application of various machine learning tasks such as natural language processing, computer vision, and reinforcement learning. It also addresses critical operational concerns like system efficiency, privacy-preserving techniques, and the ethical considerations inherent in automated decision-making systems. The repository is maintained through a community-driven model, ensuring that the documentation remains aligned with evolving industry standards. All content is delivered via static markdown files, providing a highly accessible and version-controlled format for long-form technical research.
Lifecycle Management, Data Pipelines, Machine Learning Operations Platforms
ageron/handson-ml2
29,938
This project provides a collection of practical machine learning code examples, including implementations for supervised, unsupervised, and reinforcement learning algorithms. It features deep learning model implementations for convolutional, recurrent, and generative architectures, alongside specific examples of reinforcement learning agents that maximize rewards in simulated environments. The repository includes dedicated data preprocessing pipelines for sanitization, feature scaling, and dimensionality reduction. It also provides implementations for a wide range of specific models, such as random forests, support vector machines, autoencoders, and generative adversarial networks. Broad capability areas cover the entire machine learning lifecycle, including data engineering, model evaluation through cross-validation, hyperparameter tuning, and MLOps deployment workflows. It also incorporates mathematical foundations like linear algebra and differential calculus. The project is delivered as a set of Jupyter Notebooks and includes configurations for containerized environments to ensure consistent execution of the examples.
Jupyter Notebook, Machine Learning Implementations, Convolutional Neural Networks, Data Preparation
Lightning-AI/pytorch-lightning
31,201
PyTorch Lightning is a deep learning research framework that provides a structured environment for organizing machine learning code. It functions as a unified trainer orchestrator, centralizing the execution flow by managing the interaction between hardware resources, data loaders, and model components. By decoupling model architecture from training logic, the framework enables researchers to maintain clean, modular codebases that remain portable across different environments. The framework distinguishes itself through a hardware-agnostic abstraction layer that scales deep learning workloads across multiple accelerators without requiring manual management of parallelization or synchronization logic. It utilizes a hook-based execution lifecycle and a plugin system to inject custom behaviors, such as logging, checkpointing, and early stopping, directly into the training loop. This modular approach allows developers to extend training functionality without modifying the underlying core application code. Beyond its core orchestration capabilities, the project enforces a standardized structure for training pipelines to simplify collaboration and improve experiment reproducibility. It includes state-based serialization to capture the full training state, ensuring that sessions can be consistently resumed after interruptions. The framework is distributed as a Python package and provides a consistent class-based interface for managing complex machine learning workflows.
Python, Deep Learning Frameworks, Modular Training Orchestrators, Training Orchestrators
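The hook-based execution lifecycle mentioned in the PyTorch Lightning entry can be illustrated with a small, framework-free sketch. Everything here (`Callback`, `EarlyStopping`, `HistoryLogger`, `Trainer`, and the hook names) is a hypothetical stand-in written for this example, not the library's actual API; it only shows how callbacks might inject behavior into a training loop without changing the loop itself.

```python
class Callback:
    """Base class: every hook is a no-op unless a subclass overrides it."""

    def on_train_start(self, trainer):
        pass

    def on_epoch_end(self, trainer, epoch, val_loss):
        pass

    def on_train_end(self, trainer):
        pass


class EarlyStopping(Callback):
    """Request a stop once the loss has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def on_epoch_end(self, trainer, epoch, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                trainer.should_stop = True


class HistoryLogger(Callback):
    """Record (epoch, val_loss) pairs as a stand-in for a real logger."""

    def __init__(self):
        self.history = []

    def on_epoch_end(self, trainer, epoch, val_loss):
        self.history.append((epoch, val_loss))


class Trainer:
    """Hypothetical orchestrator: owns the loop and dispatches hooks."""

    def __init__(self, callbacks=(), max_epochs=10):
        self.callbacks = list(callbacks)
        self.max_epochs = max_epochs
        self.should_stop = False

    def _call(self, hook, *args):
        # Invoke the named hook on every registered callback, in order.
        for cb in self.callbacks:
            getattr(cb, hook)(self, *args)

    def fit(self, run_epoch):
        self._call("on_train_start")
        for epoch in range(self.max_epochs):
            val_loss = run_epoch(epoch)  # user-supplied training step
            self._call("on_epoch_end", epoch, val_loss)
            if self.should_stop:
                break
        self._call("on_train_end")


# Simulated validation losses; a real run_epoch would train a model.
losses = [1.0, 0.8, 0.9, 0.85, 0.7]
logger = HistoryLogger()
trainer = Trainer(callbacks=[EarlyStopping(patience=2), logger],
                  max_epochs=len(losses))
trainer.fit(lambda epoch: losses[epoch])
```

In this sketch the loop never mentions logging or early stopping; both behaviors arrive through the callback list, which is the design idea the description attributes to the framework.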
VoltAgent/awesome-claude-code-subagents
21,906
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven tool access, which enforce security boundaries by restricting agent operations based on defined functional roles. It utilizes context-aware task routing to match incoming requests with specific agent capabilities and model performance profiles, while implementing deterministic fallback mechanisms to maintain operational continuity when agents encounter errors or context limits. This architecture allows for modular capability expansion and reproducible environment configurations through version-controlled templates. The system covers a broad capability surface, including automated technical documentation, cloud infrastructure management, and security auditing. It supports diverse domains such as API design, database optimization, and system reliability engineering, providing tools for incident response, performance monitoring, and compliance enforcement. These capabilities are integrated into a command-line interface that enables developers to search, fetch, and deploy specialized subagents directly from the repository.
Shell, Agent Discovery Interfaces, Agentic Task Automation, Agentic Task Orchestrators
hiyouga/LlamaFactory
72,213
LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The project distinguishes itself by offering a low-code visual dashboard that enables users to configure experiments and monitor performance metrics in real time without writing extensive custom scripts. It also features a configuration-driven orchestration system that decouples experiment logic from the underlying execution engine, alongside an OpenAPI-compliant server that exposes trained models as standard network endpoints for integration with external software. Beyond its core training capabilities, the platform supports real-time experiment tracking by streaming performance data to external monitoring services. This allows for the evaluation of model progress and the optimization of parameters throughout the development lifecycle. The software is designed to be installed and configured as a standalone environment for managing the end-to-end lifecycle of language model adaptation.
Python, Experiment Tracking, Language Model Fine-Tuning, Large Language Model Fine-Tuning Frameworks
pycaret/pycaret
9,811
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endpoints. Its broader capabilities cover the end-to-end machine learning lifecycle, including automated model selection, hyperparameter tuning, and time-series forecasting. The system includes tools for MLOps observability, such as data drift detection, performance monitoring, and the ability to roll back deployments. The software can be deployed via containers or Kubernetes charts, with support for airgapped environments and integrated GPU compute worker pools.
Python, Automated Machine Learning, Machine Learning Workflow Libraries, AI Agent Integrations
tensorflow/models
77,663
This repository serves as a centralized collection of state-of-the-art deep learning architectures and reference implementations designed for research and application development. It provides a comprehensive toolkit for computer vision and natural language processing, offering pre-built models and training pipelines for tasks ranging from image classification and object detection to complex sequence modeling. The project distinguishes itself by providing a flexible execution harness that manages the entire training lifecycle, including data ingestion and backpropagation. It supports scalable training across distributed hardware environments through collective communication primitives and utilizes configuration-driven experimentation to decouple hyperparameters from source code. By structuring neural architectures through hierarchical class compositions and employing checkpoint-based state persistence, the repository ensures that research workflows remain modular, reproducible, and fault-tolerant. These implementations demonstrate industry-standard patterns for constructing and deploying neural networks, including optimized graph-based execution for hardware acceleration. The repository functions as a reference for best practices in deep learning, providing documented examples for vision, language, and training loop management.
Python, Computer Vision Models, Development and Orchestration Tools, Distributed Parameter Synchronisation
chiphuyen/dmls-book
4,395
This is a reference guide for designing, deploying, and maintaining production-ready machine learning systems, grounded in MLOps best practices. It covers the complete machine learning lifecycle, from system design and workflow planning through to deployment and ongoing maintenance, with a focus on reliability, scalability, and maintainability as business requirements evolve. The guide provides an architecture reference for establishing shared ML infrastructure, including model registries and feature stores that standardize asset reuse across teams. It details pipeline automation through configurable directed acyclic graphs with automated triggers and retry logic, and describes a production monitoring framework for detecting performance degradation, data drift, and algorithmic bias in real time. Responsible AI implementation is addressed through built-in fairness checks and bias detection mechanisms that validate model outputs against ethical guidelines. The material is organized around key architectural patterns such as DAG-based pipeline orchestration, infrastructure-as-code provisioning, and a pipeline-defined ML lifecycle with clear handoff points from data collection to production monitoring. It serves as a practical manual for planning end-to-end ML workflows and designing systems that stay reliable and maintainable over time.
Production Machine Learning Guides, DAG-Based Orchestration, Feature Stores
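The DAG-based pipeline orchestration with retry logic described in the dmls-book entry can be sketched with the Python standard library. The `run_pipeline` helper, the task names, and the simulated transient failure are all assumptions made for illustration; the guide itself does not prescribe this code.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps, max_retries=2):
    """Run tasks in dependency order, retrying each failure up to max_retries times.

    tasks: dict mapping task name -> callable taking the results dict so far.
    deps:  dict mapping task name -> set of task names it depends on.
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                results[name] = tasks[name](results)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: surface the error
    return results


attempts = {"train": 0}  # counts calls so one failure can be simulated


def collect(_results):
    return [1.0, 2.0, 3.0, 4.0]


def featurize(results):
    data = results["collect"]
    mean = sum(data) / len(data)
    return [x - mean for x in data]  # center the data


def train(results):
    attempts["train"] += 1
    if attempts["train"] == 1:
        raise RuntimeError("simulated transient failure")
    return sum(x * x for x in results["featurize"])


def monitor(results):
    return {"train_metric": results["train"]}


tasks = {"collect": collect, "featurize": featurize,
         "train": train, "monitor": monitor}
deps = {"featurize": {"collect"}, "train": {"featurize"}, "monitor": {"train"}}
results = run_pipeline(tasks, deps)
```

Declaring dependencies as data (`deps`) rather than as call order keeps the hand-off points between stages explicit, which is the property the description emphasizes.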
hpcaitech/ColossalAI
41,395
ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput. The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale generative models in production, it provides a distributed inference runtime that utilizes dynamic request batching and optimized communication primitives to manage high volumes of concurrent traffic and minimize latency. The framework incorporates a large model optimization suite that enables the execution of complex models on limited hardware. This includes heterogeneous memory offloading, which moves parameters between GPU memory and system storage, and kernel-level computation optimizations that replace standard operations to reduce memory overhead. These capabilities facilitate both the training of massive models and the deployment of generative applications in production environments.
Python, Distributed Deep Learning Frameworks, Distributed Training Orchestrators, Large-Scale Model Training
karpathy/nanochat
55,103
Nanochat is a lightweight execution environment designed for training and running language models on standard consumer hardware. It functions as both a neural network training framework and an inference engine, enabling users to perform backpropagation-based training and model execution directly on general-purpose processors without the need for dedicated graphics hardware. The project distinguishes itself through a suite of optimization tools that prioritize efficiency on local machines. By utilizing memory-mapped weight loading and CPU-optimized vector math, it maximizes throughput for interactive sessions. Furthermore, the framework includes a quantization toolkit that allows users to adjust the numerical precision of weights and activations, effectively balancing memory consumption against computational speed. The platform supports a range of capabilities for transformer architecture experimentation, including the configuration of training parameters and the management of local data pipelines. It employs a stateless generation loop to process tokens through self-contained execution cycles, facilitating the development and fine-tuning of custom models in a private, local environment.
Python, Local Inference Runtimes, Transformer Inference Engines, Training Frameworks
EthicalML/awesome-production-machine-learning
20,638
A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning
Awesome List, Applied Machine Learning, Curated Research Lists
tensorflow/tensorflow
195,697
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics. The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads across heterogeneous hardware accelerators and decentralized network nodes. It employs deferred-execution symbolic graphs to perform graph-level optimizations, fusion, and ahead-of-time kernel compilation for specific hardware architectures. To ensure consistent performance across production environments, it features a standardized serialization format for model graphs and specialized tools for model serving, quantization, and compression. Beyond core training capabilities, the platform includes a high-throughput data ingestion engine that supports asynchronous, multi-threaded pipelines to prevent bottlenecks. It also offers extensive support for hardware abstraction, allowing for pluggable device integration and containerized acceleration. The ecosystem is rounded out by utilities for data validation, federated learning, and specialized modeling tasks, providing a complete toolchain for moving models from research into high-availability production environments.
C++, Frameworks, Deferred-Execution Symbolic Graphs, Distributed Training Frameworks
RapidAI/RapidOCR
5,968
RapidOCR is an offline deep-learning OCR engine that detects and recognizes text in images using ONNX Runtime, operating entirely without an internet connection. It provides a unified inference pipeline that runs across multiple platforms including Windows, Linux, macOS, Android, and Raspberry Pi, with programming language bindings for Python, C++, Java, and C#. The engine separates text detection and recognition into independent modules that can be swapped or fine-tuned individually, and abstracts the inference backend behind a unified interface allowing seamless switching between ONNX Runtime, OpenVINO, PaddlePaddle, PyTorch, MNN, and TensorRT. It supports over 80 languages by combining language-specific recognition models with a unified text detection backbone, and offers both lightweight mobile-optimized and higher-accuracy server-grade model variants selected at runtime. The project includes a command-line tool for extracting text from images and URLs with bounding boxes and confidence scores, and provides structured programmatic output with separate fields for bounding boxes, recognized text, and confidence scores. It can classify text line orientation before recognition to improve accuracy, and visualize results by drawing detected text regions onto the original image. For deployment, the OCR engine can be packaged into a Docker container for consistent environments across platforms, or bundled into a standalone executable using PyInstaller that removes the Python runtime dependency. The project also includes utilities for converting PaddleOCR models to ONNX format and fine-tuning them on custom data for specialized text recognition scenarios.
Python, OCR Pipelines, Cross-Platform Offline OCR, Multi-Language Recognition Models
zsdonghao/tensorlayer
7,384
TensorLayer is a deep learning framework and cross-backend AI library used to construct and execute neural network models. It serves as a scientific neural network toolkit providing customizable layers and architectures designed for research applications in science and engineering. The library enables multi-backend model execution, allowing the same model code to run across different deep learning frameworks, GPUs, and specialized AI accelerators. It includes a reinforcement learning library that provides both low-level and high-level tools for developing intelligent agents.
Python, Multi-Backend Abstractions, Backend-Agnostic Engines, Cross-Framework API Wrappers
google-ai-edge/mediapipe
35,660
MediaPipe is a cross-platform machine learning framework designed for deploying vision, audio, and text processing models across mobile, desktop, and web environments. It functions as an on-device inference engine that executes complex models locally on edge hardware, ensuring low latency and privacy without requiring a constant internet connection. The framework utilizes a graph-based pipeline orchestration system where data flows through a directed network of modular calculators to ensure synchronized and deterministic processing. It distinguishes itself through a unified runtime that provides consistent hardware abstraction and high-performance data pipelines, which manage synchronized streams of audio, video, and sensor data. To maximize throughput, the system employs hardware-accelerated tensor execution and zero-copy memory management, offloading heavy mathematical computations to specialized GPU or NPU backends. Beyond local inference, the platform includes a generative AI integration layer that connects applications to remote language models. This interface supports real-time conversational interactions, streaming responses, and multi-turn prompts, with built-in capabilities for request structuring, response parsing, and authentication. These features allow developers to combine local media analysis with remote generative services within a single, modular architecture.
C++, Machine Learning Frameworks, Cross-Platform Inference Frameworks, Model Deployment Frameworks
huggingface/open-r1
26,326
Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning. The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test cases, the framework improves accuracy in mathematical and logical problem-solving. It further supports advanced reasoning capabilities through group relative policy optimization and automated synthetic data pipelines, which curate and filter high-quality reasoning traces for model updates. The system utilizes modular, configuration-driven recipes to streamline complex workflows, including data decontamination, dataset composition, and multi-node orchestration. It includes standardized benchmarking tools to measure performance across reasoning and coding domains, ensuring that training processes remain reproducible and data-centric. The framework is built to handle the full lifecycle of model improvement, from initial synthetic data generation to final performance evaluation on high-performance computing clusters.
Python, Code-Integrated Training Frameworks, Large Scale Training Suites, Reasoning Model Training Suites
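The open-r1 entry mentions group relative policy optimization together with rewards obtained by running code against test cases. A minimal sketch of the group-relative advantage step follows, assuming the commonly described formulation in which each sampled completion's reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the pass/fail rewards are assumptions for illustration, not taken from the repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions are scored relative to their siblings rather than
    against a separately learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a completion's code passed its tests, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Passing completions receive positive advantages and failing ones negative, and the advantages within a group sum to roughly zero. The `eps` term only guards against division by zero when every completion in a group earns the same reward.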
visenger/awesome-mlops
0
An awesome list of references for MLOps (Machine Learning Operations): ml-ops.org
Awesome List, Awesome Lists, Community Resources
aymericdamien/TensorFlow-Examples
43,749
This repository serves as a structured educational resource for machine learning and deep learning, providing a library of executable scripts and notebooks. It is designed to help users master the practical application of data processing, model evaluation, and neural network construction through annotated code samples and guided tutorials. The collection focuses on translating theoretical mathematical concepts into functional code, offering proven patterns for common tasks such as classification and regression. By providing curated examples of layer construction and training loops, the repository enables users to prototype experimental models and implement fundamental algorithms using standard industry frameworks. The materials cover the core mechanics of tensor-based data flow, automatic differentiation, and computational graph execution. These examples illustrate how to manage model state and optimize mathematical structures for hardware acceleration, providing a practical guide for those learning to build and train models within the framework.
Jupyter Notebook, Automatic Differentiation Engines, Deep Learning Code Libraries, Tensor Processing Libraries
NousResearch/hermes-agent
195,049
Hermes-agent is an autonomous AI agent framework and runtime designed to execute complex tasks and synthesize new skills from execution traces. It includes a provider-agnostic gateway for routing requests across multiple model backends and a serverless runtime that suspends idle agent instances and resumes them on demand across containers and virtual machines. The project provides a desktop automation toolset that controls native GUI workflows on Linux by querying accessibility APIs and injecting input events. It further distinguishes itself with the ability to generate procedural skills from previous execution trajectories and the use of natural language descriptions to schedule recurring automated tasks. The framework covers high-level capabilities including agentic memory management via semantic search and vector-based retrieval, dynamic plugin loading, and the orchestration of parallel subagents through remote procedure calls. It also supports the generation of compressed training trajectories to improve tool-calling accuracy.
Python, Autonomous Agent Frameworks, Autonomous Task Execution, Accessibility Tree Automation