Why is sgl-project/sglang a recommended Model Serving repository?

Serves large language models via high-performance APIs supporting both request-response and streaming token generation.

Why is fishaudio/fish-speech a recommended Model Serving repository?

Optimizes audio delivery using continuous batching and prefix caching for low-latency production inference.

Why is huggingface/peft a recommended Model Serving repository?

Combines several trained adapter modules using weighted averages to create unified adapter configurations.

Why is cocktailpeanut/dalai a recommended Model Serving repository?

Executes LLaMA models locally using a simple command-line interface.

Why is lyhue1991/eat_tensorflow2_in_30_days a recommended Model Serving repository?

Provides techniques for exporting trained models to standardized formats for production API serving.

Why is vikhyat/moondream a recommended Model Serving repository?

Manages production traffic through automatic batching, prefix caching, and streaming responses.

Why is karminski/one-small-step a recommended Model Serving repository?

Explains how GGUF format uses memory-mapped file access for near-instant model loading and startup.

Why is HVision-NKU/StoryDiffusion a recommended Model Serving repository?

Enables full generation pipelines to run on consumer GPUs by reducing batch size and model precision.

Why is haifengl/smile a recommended Model Serving repository?

Generates text responses from LLaMA-3 models with support for chat and streaming API serving.

Why is strands-agents/sdk-python a recommended Model Serving repository?

Connects to Meta-hosted Llama API endpoints to run inference without managing your own infrastructure.

29 repositories

Awesome GitHub Repositories · Model Serving

Infrastructure and techniques for deploying and optimizing machine learning models for production inference.

Distinguishing note: Focuses on production-grade serving optimizations like batching and caching, distinct from model training.

Explore 29 awesome GitHub repositories matching devops & infrastructure · Model Serving.


sgl-project/sglang
29,079 stars
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Serves large language models via high-performance APIs supporting both request-response and streaming token generation.
Python · attention · blackwell · cuda
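The streaming mode mentioned in this entry is commonly delivered by OpenAI-compatible servers as server-sent events. The following is a minimal sketch of how a client might reassemble the text, assuming the usual `data: {json}` lines terminated by `data: [DONE]`. The helper name `iter_stream_text` and the sample payload are hypothetical and are not taken from SGLang.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hypothetical sample of a streamed response body, one list item per line.
SAMPLE_STREAM = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

print("".join(iter_stream_text(SAMPLE_STREAM)))
```

In a request-response call the same text would arrive in a single JSON body instead; the generator above only matters when the client wants to show tokens as they are produced.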
fishaudio/fish-speech
24,928 stars
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to
Optimizes audio delivery using continuous batching and prefix caching for low-latency production inference.
Python · llama · transformer · tts
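Continuous batching, as named in this entry, means that waiting requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to finish. The toy simulation below illustrates only that scheduling idea. The function name, the request tuples, and the one-token-per-step model are assumptions made for illustration, not fish-speech's server code.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def continuous_batching(requests: List[Tuple[str, int]], max_batch: int) -> Dict[str, int]:
    """Simulate decode steps and return the step at which each request finishes.

    `requests` holds (request_id, tokens_to_generate) pairs, all queued at step 0.
    """
    waiting: Deque[Tuple[str, int]] = deque(requests)
    running: Dict[str, int] = {}  # request_id -> tokens still to generate
    finished: Dict[str, int] = {}
    step = 0
    while waiting or running:
        # Refill free slots before every step instead of draining the batch first.
        while waiting and len(running) < max_batch:
            request_id, remaining = waiting.popleft()
            running[request_id] = remaining
        step += 1
        for request_id in list(running):
            running[request_id] -= 1  # each running request emits one token per step
            if running[request_id] <= 0:
                del running[request_id]
                finished[request_id] = step
    return finished


print(continuous_batching([("a", 1), ("b", 4), ("c", 2)], max_batch=2))
```

In this run request "c" starts at the second step, right after "a" completes, and finishes at step 3. A static batch of two would have held it back until "b" was done at step 4.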
huggingface/peft
21,274 stars
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
Combines several trained adapter modules using weighted averages to create unified adapter configurations.
Python · adapter · diffusion · fine-tuning
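The weighted-average combination described in this entry can be illustrated with plain Python lists. This is a simplified sketch, not the library's implementation: the `merge_adapters` helper, the parameter names, and the numbers are hypothetical, and each stored tensor is averaged independently, which a real low-rank merge may handle differently.

```python
from typing import Dict, List, Sequence

Adapter = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_adapters(adapters: Sequence[Adapter], weights: Sequence[float]) -> Adapter:
    """Return the element-wise weighted average of several adapters."""
    if not adapters or len(adapters) != len(weights):
        raise ValueError("need at least one adapter and one weight per adapter")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    merged: Adapter = {}
    for name in adapters[0]:
        # Group the values that sit at the same position in every adapter.
        columns = zip(*(adapter[name] for adapter in adapters))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) / total
            for column in columns
        ]
    return merged


# Two hypothetical task adapters that share the same parameter names.
summarize = {"layer0.lora_A": [1.0, 2.0], "layer0.lora_B": [0.0, 4.0]}
translate = {"layer0.lora_A": [3.0, 6.0], "layer0.lora_B": [2.0, 0.0]}

merged = merge_adapters([summarize, translate], weights=[3.0, 1.0])
print(merged)
```

With weights 3 and 1, the merged adapter sits three quarters of the way toward the first adapter for every parameter.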
cocktailpeanut/dalai
12,920 stars
The simplest way to run LLaMA on your local machine
Executes LLaMA models locally using a simple command-line interface.
CSS · ai · llama · llm
lyhue1991/eat_tensorflow2_in_30_days
9,933 stars
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Provides techniques for exporting trained models to standardized formats for production API serving.
Python · tensorflow · tensorflow-examples · tensorflow-tutorial
vikhyat/moondream
9,769 stars
Moondream is a small-scale vision language model designed to reason across images to generate captions and answer natural language questions. It functions as an edge-optimized system capable of performing visual question answering, image captioning, and object detection. The project distinguishes itself through a lightweight architecture designed for local inference on embedded devices, workstations, and air-gapped hardware. It supports the execution of models on local GPUs and Apple Silicon to ensure data privacy and low latency. The system's capabilities include identifying precise object
Manages production traffic through automatic batching, prefix caching, and streaming responses.
Python
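Prefix caching, listed in this entry, avoids recomputing work for a prompt prefix that an earlier request already processed. The sketch below shows the bookkeeping only, assuming a toy cache keyed by token tuples. The `PrefixCache` class is hypothetical; production engines typically use block hashes or a radix tree and store attention state rather than a set of keys.

```python
from typing import List, Set, Tuple


class PrefixCache:
    """Toy prefix cache that remembers which token prefixes were already processed."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest prefix of `tokens` found in the cache."""
        for length in range(len(tokens), 0, -1):
            if tuple(tokens[:length]) in self._seen:
                return length
        return 0

    def process(self, tokens: List[int]) -> int:
        """Record the request and return how many tokens had to be computed."""
        reused = self.longest_cached_prefix(tokens)
        for length in range(reused + 1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:length]))  # stand-in for storing real state
        return len(tokens) - reused


cache = PrefixCache()
first = cache.process([1, 2, 3, 4])      # nothing cached yet
second = cache.process([1, 2, 3, 9, 9])  # shares the prefix [1, 2, 3]
print(first, second)
```

The second request only pays for the two tokens that follow the shared three-token prefix, which is where the latency saving comes from when many requests share a system prompt.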
karminski/one-small-step
6,699 stars
One Small Step is an educational resource that explains core AI and large language model concepts through short, accessible articles designed to be read in under five minutes. It covers the structure and function of key LLM components like attention mechanisms and tokenization, as well as foundational machine learning mathematics such as matrix rank and overfitting. The project also serves as a guide to the GGUF file format, which packages all model parameters and metadata into a single compact binary file for cross-platform deployment without external dependencies. It explains how this forma
Explains how GGUF format uses memory-mapped file access for near-instant model loading and startup.
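Memory-mapped access, as described in this entry, lets a program open a large file without reading it all up front, because the operating system typically loads pages only when they are touched. The sketch below demonstrates the idea with Python's `mmap` module on a toy file. The layout (4-byte magic, a count, then float32 values) is a hypothetical stand-in and is not the real GGUF layout.

```python
import mmap
import os
import struct
import tempfile

# Toy layout (hypothetical, NOT real GGUF): 4-byte magic, little-endian
# uint32 count, then `count` little-endian float32 values.
MAGIC = b"TOY0"
values = [0.5, -1.25, 3.0, 8.0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    with open(path, "wb") as f:
        f.write(MAGIC + struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}f", *values))

    # Map the file read-only instead of calling f.read() on the whole thing.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        magic = mapped[:4]
        (count,) = struct.unpack_from("<I", mapped, 4)
        # Read only the third value, located at header (8 bytes) + 2 * 4 bytes.
        (third,) = struct.unpack_from("<f", mapped, 8 + 2 * 4)

print(magic, count, third)
```

The same pattern scales to multi-gigabyte files: startup cost is dominated by parsing the header, and individual tensors are paged in when first used.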
HVision-NKU/StoryDiffusion
6,430 stars
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
Enables full generation pipelines to run on consumer GPUs by reducing batch size and model precision.
Jupyter Notebook
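The effect of lower model precision on GPU memory can be estimated with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope helper only. The parameter count is a hypothetical round number, not StoryDiffusion's, and activation memory (which is what a smaller batch size reduces) is not modeled.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate memory needed for the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NUM_PARAMS = 2**30  # hypothetical model with about 1.07 billion parameters

for dtype in ("float32", "float16", "int8"):
    print(dtype, weight_memory_gib(NUM_PARAMS, dtype), "GiB")
```

Halving the precision halves the weight footprint, which is often the difference between fitting on a consumer GPU and not fitting at all.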
haifengl/smile
6,387 stars
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Generates text responses from LLaMA-3 models with support for chat and streaming API serving.
Java
strands-agents/sdk-python
6,176 stars
This is an open-source Python SDK for building and orchestrating production-grade AI agents. It provides a unified framework for creating conversational agents that can use tools, maintain state, and coordinate across multiple language model providers including OpenAI, Anthropic, Google, Amazon Bedrock, and locally-hosted models. The SDK supports multi-agent orchestration through graphs, teams, and swarms, allowing several specialized agents to collaborate on complex tasks. Agents can be composed as callable tools that other agents invoke, and the framework includes policy handlers that inspe
Connects to Meta-hosted Llama API endpoints to run inference without managing your own infrastructure.
Python
ai-dynamo/dynamo
6,112 stars
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Dynamically loads and removes fine-tuned LoRA adapters from storage without restarting the inference engine.
Rust
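Loading and removing adapters without restarting the engine comes down to keeping a mutable registry next to one shared base model. The sketch below shows that pattern in isolation. The `AdapterRegistry` class, the `fetch` callback, and the in-memory `STORAGE` dictionary are all hypothetical and do not reflect Dynamo's actual interfaces.

```python
from typing import Callable, Dict, List

Weights = List[float]


class AdapterRegistry:
    """Keeps adapters in memory alongside a single shared base model."""

    def __init__(self, fetch: Callable[[str], Weights]) -> None:
        self._fetch = fetch  # hypothetical reader for adapter storage
        self._loaded: Dict[str, Weights] = {}

    def load(self, name: str) -> None:
        """Fetch an adapter from storage unless it is already resident."""
        if name not in self._loaded:
            self._loaded[name] = self._fetch(name)

    def unload(self, name: str) -> None:
        """Drop an adapter; unloading an unknown name is a no-op."""
        self._loaded.pop(name, None)

    def resolve(self, name: str) -> Weights:
        """Return the weights for a request, or fail if the adapter is not loaded."""
        if name not in self._loaded:
            raise LookupError(f"adapter {name!r} is not loaded")
        return self._loaded[name]

    def loaded(self) -> List[str]:
        return sorted(self._loaded)


# Hypothetical storage backend, stubbed with a dictionary.
STORAGE = {"sql-helper": [0.1, 0.2], "support-bot": [0.3, 0.4]}

registry = AdapterRegistry(fetch=STORAGE.__getitem__)
registry.load("sql-helper")
registry.load("support-bot")
registry.unload("sql-helper")
print(registry.loaded())
```

Because the registry is the only thing that changes, the process that holds the base model keeps running while adapters come and go.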
FederatedAI/FATE
6,048 stars
FATE is an open-source federated learning platform that enables multiple organizations to collaboratively train machine learning models without exposing raw data to any party. It provides a complete framework for private data collaboration, allowing participants to jointly compute on sensitive information while maintaining data privacy and security guarantees through secure multi-party computation protocols. The platform distinguishes itself through its comprehensive infrastructure management capabilities, supporting automated deployment of multi-party clusters using Ansible-driven provisioni
Deploys trained models into production for high-performance inference across participating parties.
Python · algorithm · fate · federated-learning
serge-chat/serge
5,725 stars
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
Uses llama.cpp as the core inference engine to run GGUF model files locally without external API dependencies.
Svelte · alpaca · docker · fastapi
nsarrazin/serge
5,725 stars
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy to use API.
Uses llama.cpp as the core inference runtime for running GGUF-format models locally with CPU-optimized performance.
Svelte
kserve/kserve
5,576 stars
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Deploys models from TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX for real-time scoring and batch prediction.
Go
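A canary rollout, one of the lifecycle features listed in this entry, sends a small fixed share of traffic to a new revision while the rest stays on the stable one. The sketch below shows one way such a split could be computed, assuming hash-based bucketing so that a given request ID always takes the same route. The `pick_revision` function and revision labels are hypothetical and are not KServe's routing code.

```python
import hashlib


def pick_revision(request_id: str, canary_percent: int) -> str:
    """Route a stable share of requests to the canary revision."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


routes = [pick_revision(f"req-{i}", canary_percent=10) for i in range(1000)]
print(routes.count("canary"), "of", len(routes), "requests went to the canary")
```

Raising the percentage step by step promotes the new revision, and setting it back to zero acts as a rollback.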
kubeflow/kfserving
5,576 stars
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Caches large model weights on local nodes to cut startup time from minutes to seconds.
Go
imoneoi/openchat
5,481 stars
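OpenChat is a framework for training, fine-tuning, and deploying large language models, optimized for conversational and mathematical reasoning tasks. It provides full lifecycle management, from the training pipeline and deployment stack to a web-based chat interface. The project focuses on high-performance model execution on consumer-grade hardware, without requiring enterprise-grade accelerators. It includes a production-ready inference server that implements the OpenAI chat completions protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization, model fine-tuning via padding-free training, and reinforcement learning. It also extends to API hosting with key-based authentication and provides a graphical user interface for real-time human-computer interaction.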
Optimizes model execution to enable high-performance LLM inference on non-enterprise GPUs.
Python
wenet-e2e/wenet
5,035 stars
WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices. The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accurac
Serves trained ASR models in both real-time streaming and batch processing modes for production use.
Python · asr · automatic-speech-recognition · conformer
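Chunk-based streaming, as described in this entry, processes audio in fixed-size segments while carrying some context forward. The sketch below shows only the chunking step, assuming context is approximated by re-including a few earlier samples. The `iter_chunks` helper is hypothetical; a real streaming model would more likely keep cached encoder state than re-feed raw samples.

```python
from typing import Iterator, List, Sequence


def iter_chunks(
    samples: Sequence[float], chunk_size: int, left_context: int = 0
) -> Iterator[List[float]]:
    """Yield fixed-size chunks, each prefixed with up to `left_context` earlier samples."""
    if chunk_size <= 0 or left_context < 0:
        raise ValueError("chunk_size must be positive and left_context non-negative")
    for start in range(0, len(samples), chunk_size):
        context_start = max(0, start - left_context)
        yield list(samples[context_start:start + chunk_size])


audio = [0, 1, 2, 3, 4, 5, 6]  # stand-in for audio samples or feature frames
for chunk in iter_chunks(audio, chunk_size=3, left_context=1):
    print(chunk)
```

Smaller chunks lower latency because the recognizer can emit output sooner, at the cost of less context per step. Running with a single chunk that spans the whole input corresponds to the non-streaming mode.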
vllm-project/aibrix
4,882 stars
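Aibrix is an inference orchestrator designed to scale, route, and manage large language model deployments across distributed vLLM clusters. It acts as a centralized gateway that load-balances and routes traffic to specific model replicas and versions. The system manages resource efficiency through a GPU cluster autoscaler that adjusts the number of compute instances based on real-time request volume. It further optimizes operations by mixing different accelerator types within a single cluster and by using a model adapter orchestrator to deploy lightweight parameter adapters on a shared base model. Broader capabilities include sharing token data between inference engines through a distributed key-value cache manager and hardware health monitoring to detect processing unit failures. The project also provides a unified metrics pipeline that standardizes performance data collection across different runtime environments.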
Manages the dynamic loading and serving of lightweight adapters to run multiple model variants on shared hardware.
Go
openmlsys/openmlsys
4,813 stars
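This project is a comprehensive educational resource and curriculum focused on the design and implementation of the complete machine learning software and hardware stack. It serves as a technical reference for architecting machine learning systems, covering everything from low-level programming interfaces to large-scale deployment infrastructure. The project offers instructional guidance across several specialized areas, including the development of AI compilers through intermediate representations and graph optimization. It covers the architectural patterns required for distributed training across GPU clusters, as well as hardware accelerator programming for optimizing workloads on specialized chips. The resource also details the implementation of model serving frameworks for production environments and the construction of reinforcement learning pipelines. Its scope extends to core components of ML systems, such as automatic differentiation, tensor abstractions, and the orchestration of GPU resources.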
Details infrastructure and techniques for deploying and optimizing machine learning models for production inference.
TeX · computer-systems · machine-learning · software-architecture

