LLM inference and serving

Explore open-source frameworks and engines designed for deploying and serving large language models efficiently.

mlabonne/llm-course
80,178 stars
This project is a comprehensive educational curriculum and engineering handbook focused on the lifecycle of large language models. It serves as a structured knowledge base for machine learning practitioners, covering the fundamental mathematical and architectural principles of transformer-based sequence modeling, as well as the practical implementation of supervised instruction fine-tuning and preference-based model alignment. The repository distinguishes itself by providing a deep dive into advanced model composition and optimization techniques. It details methodologies for weight-space mode
AI Research Repositories · Awesome List · Fine-Tuning Strategies

huggingface/transformers
161,630 stars
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
Python · API Frameworks · Byte Pair Encodings · Hybrid

tensorflow/tensorflow
195,697 stars
TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics. The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads acr
C++ · Frameworks · Deferred-Execution Symbolic Graphs · Distributed Training Frameworks

dusty-nv/jetson-inference
8,734 stars
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
C++ · Computer Vision Platforms · Deep Learning Inference Engines · Edge AI Model Deployment

meta-llama/llama
59,464 stars
Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on your own hardware. The system distinguishes itself through specialized memory and computation management techniques, including memory-mapped weight loading and quantization-aware inference, which allow for efficient execution on standard consumer hardware. It utilizes a stateles
Python · Inference Engines · Large Language Model Runtimes · Local Inference Engines

ModelTC/LightLLM
3,901 stars
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Python · Fused MoE GPU Kernels · Inference Execution · LLM Serving Architectures

ggml-org/llama.cpp
116,799 stars
Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory us
C++ · Hardware Abstraction Layers · Text-Only Inference Engines · Multimodal Inference Engines

NVIDIA/Isaac-GR00T
6,222 stars
Jupyter Notebook · GPU Application Development Environments · GPU-Accelerated Robot Simulators · 3D Asset Labelers

vllm-project/vllm
83,048 stars
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cach
Python · Continuous Batching Strategies · Custom Model Execution Engines · Distributed Model Servers

tensorflow/tfjs-examples
6,783 stars
This repository provides a collection of practical demonstrations and implementation guides for machine learning tasks using TensorFlow.js. It serves as a resource for developers to explore model architectures, training workflows, and data manipulation techniques across domains such as computer vision, natural language processing, and reinforcement learning. The project covers the full lifecycle of machine learning development, including tensor-based mathematical operations, model construction via high-level layer APIs or low-level tensor logic, and model serialization for various storage med
JavaScript · Manual Memory Management · Core Model APIs · Model Execution APIs

LMCache/LMCache
6,909 stars
LMCache is a distributed key-value cache manager and tiering system designed to accelerate large language model inference. It functions as a tiered storage layer that offloads tensors from GPU memory to CPU RAM, local disks, or remote object stores, enabling the reuse of cached prefixes across different inference sessions and serving engines. The system differentiates itself through a disaggregated prefill-decode model, which separates prompt processing from token generation by transferring caches between distributed compute nodes. It utilizes peer-to-peer orchestration to share and retrieve
Python · LLM KV Cache Stores · Prefill-Decode Disaggregation · Asynchronous Cache Offloading

hiyouga/LlamaFactory
72,213 stars
LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The project distinguishes itself by offering a low-code visual dashboard that enables users to configure experiments and monitor performance metrics in real time without writing extensive custom scripts. It also features a configuration-driven orchestration system that decouples experim
Python · Experiment Tracking · Language Model Fine-Tuning · Large Language Model Fine-Tuning Frameworks

ai-dynamo/dynamo
6,112 stars
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Rust · Disaggregated Inference Orchestration · Prefill-Decode Disaggregation · Activation and KV Cache Offloaders

nomic-ai/gpt4all
77,375 stars
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vect
C++ · C++ Inference Backends · Language Model Orchestration · Local AI Inference

zai-org/ChatGLM-6B
41,039 stars
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as w
Python · Autoregressive Inference Engines · Local Inference Engines · Model Runtimes

thu-pacman/chitu
3,265 stars
Chitu is a distributed serving platform and orchestrator for large language model inference. It functions as a compute manager designed to deploy and scale model workloads across diverse hardware architectures, including GPUs, CPUs, and heterogeneous hardware clusters. The platform enables model deployment across a wide range of targets, including NVIDIA GPUs, regional chipsets, and legacy hardware. It manages the execution of models across these varying environments to increase available computing capacity and optimize resource utilization. The system includes capabilities for distributed i
Python · LLM Serving Architectures · Cross-Hardware Model Inference · Distributed Inference Orchestrators

karpathy/nanochat
55,103 stars
Nanochat is a lightweight execution environment designed for training and running language models on standard consumer hardware. It functions as both a neural network training framework and an inference engine, enabling users to perform backpropagation-based training and model execution directly on general-purpose processors without the need for dedicated graphics hardware. The project distinguishes itself through a suite of optimization tools that prioritize efficiency on local machines. By utilizing memory-mapped weight loading and CPU-optimized vector math, it maximizes throughput for inte
Python · Local Inference Runtimes · Transformer Inference Engines · Training Frameworks

coleam00/local-ai-packaged
3,539 stars
This project is a containerized local AI infrastructure stack designed to deploy large language models and vector databases on private hardware. It functions as an orchestration platform that combines AI runners, knowledge graphs, and a visual workflow builder for creating agentic chatflows and automating tasks via tool integration. The platform distinguishes itself through a low-code approach to agent orchestration, utilizing a visual interface to design complex sequences and connect agents to external tools and search engines. It includes a dedicated local observability stack to track promp
Python · AI Service Orchestration · Local AI Deployment Platforms · Agent Workflow Orchestrations

hpcaitech/ColossalAI
41,395 stars
ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput. The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale genera
Python · Distributed Deep Learning Frameworks · Distributed Training Orchestrators · Large-Scale Model Training

Open-LLM-VTuber/Open-LLM-VTuber
5,946 stars
Python · AI Desktop Companions · AI Backend Abstractions · AI Backend Integrations

QwenLM/Qwen3
27,324 stars
Qwen3 is a transformer-based large language model designed as a generative AI foundation for understanding, reasoning, and generating human language. It functions as a comprehensive ecosystem for model training, fine-tuning, and production-ready inference, providing the underlying architecture and weights necessary to build diverse artificial intelligence applications. The project distinguishes itself through extensive support for model quantization and distributed inference, enabling efficient execution across a wide range of hardware from consumer-grade devices to scalable cloud infrastruct
Python · Generative AI Foundations · Large Language Models · Model Training Frameworks

Zackriya-Solutions/meeting-minutes
12,757 stars
This project is a self-hosted meeting transcription and summarization tool that converts audio recordings into text transcripts and structured notes using large language models. It functions as an enterprise meeting documentation manager, allowing for the organization and editing of timestamped records. The system prioritizes data privacy through local-first processing and the ability to deploy on private infrastructure. It supports a provider-agnostic architecture, enabling users to connect to local AI engines, self-hosted servers, or cloud-based API endpoints for both transcription and summ
Rust · Audio Transcription · Meeting Summarization · AI Model Configurations

deepseek-ai/DeepSeek-V3
103,753 stars
DeepSeek-V3 is a large language model that provides comprehensive resources for model utilization, including technical specifications, pre-trained weights, and evaluation benchmarks. The project details the core transformer architecture, including parameter counts and multi-token prediction modules, while supporting native 8-bit floating-point quantization. The repository offers extensive support for local and distributed inference through integration with multiple frameworks and engines. It includes documentation for deploying the model across various hardware configurations, such as GPUs an
Python · Model Weights · Inference Frameworks · Frontier Models

infiniflow/ragflow
82,922 stars
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations. The platform distinguishes itself through deep document understanding and sophisticated know
Python · Autonomous Agents · Chat Assistants · Grounded Answer Generation

microsoft/BitNet
39,327 stars
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
Python · Quantized Inference Runtimes · Efficient Inference Engines · Inference Runtimes

InternLM/lmdeploy
7,903 stars
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
Python · Large Language Model Deployments · LLM Deployment Frameworks · Continuous Batching Strategies

mudler/LocalAI
46,889 stars
LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services. The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and
Go · Inference Servers · Local Inference Engines · Local Model Serving

QwenLM/Qwen
21,294 stars
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-perfor
Python · Large Language Models · OpenAI-Compatible APIs · Sequence Learning Models

redis/go-redis
22,159 stars
This project is a feature-rich Go client library designed for interacting with Redis. It serves as a comprehensive interface for managing remote data stores, enabling developers to execute standard database commands, handle complex data structures, and perform asynchronous operations within Go applications. The library distinguishes itself through its support for advanced Redis capabilities, including connection pooling, pipelining, and transactional integrity. It provides specialized primitives for managing distributed clusters, including automated topology updates and request routing to sha
Go · Redis Clients · Application Caching · Database Command Interfaces

jingyaogong/minimind
51,834 stars
This project is a comprehensive framework for the entire lifecycle of transformer-based language models, supporting everything from foundational pretraining to specialized deployment. It provides a modular toolkit for defining neural network architectures, managing data preparation pipelines, and executing training routines across various scales. The framework is designed to handle the full model development process, including supervised fine-tuning, behavioral alignment, and the integration of agentic capabilities. What distinguishes this framework is its focus on efficient training and adva
Python · Model Training Toolkits · Agentic Frameworks · Agentic Training Frameworks

sgl-project/sglang
29,079 stars
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Python · Chat Completion Services · Disaggregated Inference · High-Throughput Model Serving

haotian-liu/LLaVA
24,465 stars
LLaVA is a multimodal large language model architecture designed to process and interpret both image and text inputs to generate natural language responses. It functions as a research-oriented platform for visual instruction tuning, providing a framework to align language models with human intent through training on diverse datasets of paired images and text queries. The system distinguishes itself through a specialized vision-language training pipeline that connects visual data to language models using projection layers and instruction-based fine-tuning. It supports distributed inference by
Python · Multimodal Large Language Models · Vision-Language Pipelines · Visual Instruction Tuning

NVIDIA/TensorRT-LLM
12,913 stars
TensorRT-LLM is a platform and toolkit designed for compiling, optimizing, and serving transformer-based models on accelerated hardware. It functions as a framework that transforms machine learning models into efficient execution graphs, providing an engine to refine these models for specific hardware to maximize throughput and minimize latency during text generation. The project distinguishes itself through advanced execution strategies that manage the entire inference pipeline. It utilizes kernel-level fusion and static graph execution to optimize mathematical operations and computational f
Python · GPU-Accelerated · Large Language Model Optimization · Model Compilation

karpathy/nanoGPT
59,730 stars
nanoGPT is a lightweight engine for training and fine-tuning transformer-based language models from scratch. It provides a minimalist codebase designed for educational exploration and rapid experimentation with neural network architectures, utilizing self-attention and feed-forward layers to process sequences and predict subsequent elements. The project distinguishes itself through a focus on high-speed data ingestion and hardware-accelerated performance. It includes a dedicated pipeline for transforming raw text into memory-mapped binary files, which enables efficient streaming during traini
Python · Transformer · Generative Text Inference · Large Language Model Training Frameworks

meta-llama/llama-cookbook
18,375 stars
This project is a collection of implementation guides, recipes, and developer resources for building applications with Llama models. It serves as a comprehensive kit for developing autonomous agents, establishing retrieval-augmented generation systems, and executing model fine-tuning. The resource provides specific patterns for multimodal workflows that process text, images, and audio. It includes specialized guidance on adapting pre-trained model weights for targeted tasks and implementing tool-calling orchestration to connect models with external APIs and functions. The codebase covers a b
Jupyter Notebook · Agentic LLM Frameworks · AI Agent Development · External Tool Integration

xtekky/gpt4free
66,335 stars
This project provides a unified interface for interacting with a wide range of artificial intelligence services, acting as a central orchestration layer for text and image generation. It standardizes access to diverse AI backends, allowing developers to integrate multiple language and vision models through a single, consistent programming interface. By abstracting provider-specific protocols and authentication requirements, the tool simplifies the development of applications that rely on external AI services. The platform distinguishes itself through a resilient request routing architecture d
Python · AI Request Routers · Conversation Management · Failover Strategies

OpenBMB/VoxCPM
29,985 stars
VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It functions as an AI voice cloning tool and a synthetic voice designer, capable of generating natural speech across global languages and regional dialects using a GPU-accelerated audio generator. The project features a speech model fine-tuning framework that supports both full parameter updates and low-rank adaptation for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and acoustic environment mimicry, as well as the crea
Python · Multilingual Speech Models · Speech Synthesis · Text-to-Speech

unslothai/unsloth
66,628 stars
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
Python · Language Model Training · Custom Kernel Accelerators · Efficient Training Pipelines

deepseek-ai/DeepSeek-Coder
22,804 stars
DeepSeek-Coder is a large language model and foundational neural network architecture designed specifically for software development tasks. It functions as an artificial intelligence assistant capable of interpreting complex programming instructions to generate, transpile, and structure source code. The system distinguishes itself through its ability to perform project-level code generation, analyzing broader context and patterns across entire software projects rather than isolated files. It supports multimodal input processing, allowing for the integration of text and visual data to inform i
Python · AI Coding Assistants · AI-Assisted Development · Generative Code Assistants

tatsu-lab/stanford_alpaca
30,266 stars
This project provides an end-to-end framework for adapting large language models to follow user instructions through supervised fine-tuning. It functions as a comprehensive training pipeline that enables the creation of specialized assistant models by minimizing the difference between predicted outputs and target responses within structured instruction datasets. The framework distinguishes itself by integrating synthetic data generation with memory-efficient training techniques. It utilizes powerful language models to iteratively expand small sets of human-written seeds into diverse, high-qua
Python · Instruction Fine-Tuning Frameworks · Instruction Tuning · Instruction Tuning Frameworks

GeeeekExplorer/nano-vllm
11,745 stars
Nano-vllm is a high-performance inference engine designed for executing large language models locally. It functions as a specialized runtime that prioritizes accelerated token generation and efficient hardware utilization for text generation tasks. The project distinguishes itself through a comprehensive suite of optimization techniques, including a graph compilation engine that transforms neural network operations into pre-compiled execution plans. It also incorporates a tensor parallelism framework to distribute model weights across multiple hardware accelerators, effectively reducing memor
Python · Local Inference Engines · Large Language Model Optimization · Local Model Execution

modular/modular
26,357 stars
Modular is a unified machine learning development platform designed for building, compiling, and deploying high-performance neural network models. It provides a comprehensive execution engine that supports both local and production-grade inference, enabling developers to manage the entire model lifecycle from initial architecture definition to scalable, containerized service deployment. The platform distinguishes itself through a hardware-agnostic runtime that abstracts diverse silicon architectures, allowing models to execute efficiently across varied compute environments. It includes a spec
Mojo · Generative AI Frameworks · Inference Runtimes · Local Model Servers

ludwig-ai/ludwig
11,717 stars
Ludwig is a multimodal machine learning platform and low-code framework designed for building, training, and deploying neural networks. It enables the construction of models that process text, images, audio, and tabular data through a unified interface using declarative configuration files rather than custom code. The system features a specialized low-code framework for large language models, supporting supervised fine-tuning, preference alignment, and a constrained decoding tool to force structured data output via logit extraction. It also includes an automated model architecture search to i
Python · Declarative Model Synthesis · Low-Code Machine Learning Tools · Multimodal Machine Learning

huggingface/open-r1
26,326 stars
Open-r1 is a framework designed for the large-scale training, distillation, and optimization of language models focused on complex reasoning and programming tasks. It provides a comprehensive suite of tools for managing distributed training jobs across multi-node clusters, enabling the development of high-performance models through reinforcement learning and supervised fine-tuning. The project distinguishes itself by integrating secure, containerized code execution environments directly into the training and evaluation lifecycle. By allowing models to run and verify code snippets against test
Python, Code-Integrated Training Frameworks, Large Scale Training Suites, Reasoning Model Training Suites
pytorch/examples
23,752 stars
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
Python, Machine Learning Implementations, Python Machine Learning Libraries, Deep Learning Frameworks
oobabooga/text-generation-webui
47,323 stars
This project is a comprehensive platform for hosting and interacting with large language models directly on local hardware. It provides a web-based graphical interface that allows users to manage model loading, configure generation parameters, and execute text or chat interactions entirely offline. By running models locally, the software ensures complete data privacy and eliminates reliance on external cloud services for generative tasks. Beyond basic inference, the platform functions as a versatile workbench for generative AI development. It includes an integrated pipeline for fine-tuning mo
Python, Local Inference Engines, Local Model Runtimes, Model Serving APIs
exo-explore/exo
45,380 stars
Exo is a distributed inference engine designed to run machine learning models across local hardware. It functions as a network orchestration layer that automatically discovers available devices to form a unified computing cluster, allowing users to scale artificial intelligence workloads by distributing computational tasks across multiple machines. The platform distinguishes itself through its ability to manage the entire lifecycle of local models while providing a standardized gateway for external applications. By translating local model outputs into industry-standard formats, it enables exi
Python, Distributed AI Systems, Distributed Inference Engines, Inference Engines
PaddlePaddle/PaddleOCR
82,412 stars
PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into independent, configurable stages. This architecture supports automated document digitization and multilingual text recognition, capable of identifying text in over one hundred languages across diverse environments ranging from scanned documents to industrial scenes. The framework disti
Python, Modular Vision Pipelines, Multilingual Text Recognition, Deep Learning
