29 个仓库
Infrastructure and techniques for deploying and optimizing machine learning models for production inference.
Distinguishing note: Focuses on production-grade serving optimizations like batching and caching, distinct from model training.
Explore 29 awesome GitHub repositories matching devops & infrastructure · Model Serving. Refine with filters or upvote what's useful.
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Serves large language models via high-performance APIs supporting both request-response and streaming token generation.
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to
Optimizes audio delivery using continuous batching and prefix caching for low-latency production inference.
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
Combines several trained adapter modules using weighted averages to create unified adapter configurations.
The simplest way to run LLaMA on your local machine
Executes LLaMA models locally using a simple command-line interface.
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Provides techniques for exporting trained models to standardized formats for production API serving.
Moondream is a small-scale vision language model designed to reason across images to generate captions and answer natural language questions. It functions as an edge-optimized system capable of performing visual question answering, image captioning, and object detection. The project distinguishes itself through a lightweight architecture designed for local inference on embedded devices, workstations, and air-gapped hardware. It supports the execution of models on local GPUs and Apple Silicon to ensure data privacy and low latency. The system's capabilities include identifying precise object
Manages production traffic through automatic batching, prefix caching, and streaming responses.
One Small Step is an educational resource that explains core AI and large language model concepts through short, accessible articles designed to be read in under five minutes. It covers the structure and function of key LLM components like attention mechanisms and tokenization, as well as foundational machine learning mathematics such as matrix rank and overfitting. The project also serves as a guide to the GGUF file format, which packages all model parameters and metadata into a single compact binary file for cross-platform deployment without external dependencies. It explains how this forma
Explains how GGUF format uses memory-mapped file access for near-instant model loading and startup.
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
Enables full generation pipelines to run on consumer GPUs by reducing batch size and model precision.
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Generates text responses from LLaMA-3 models with support for chat and streaming API serving.
This is an open-source Python SDK for building and orchestrating production-grade AI agents. It provides a unified framework for creating conversational agents that can use tools, maintain state, and coordinate across multiple language model providers including OpenAI, Anthropic, Google, Amazon Bedrock, and locally-hosted models. The SDK supports multi-agent orchestration through graphs, teams, and swarms, allowing several specialized agents to collaborate on complex tasks. Agents can be composed as callable tools that other agents invoke, and the framework includes policy handlers that inspe
Connects to Meta-hosted Llama API endpoints to run inference without managing your own infrastructure.
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Dynamically loads and removes fine-tuned LoRA adapters from storage without restarting the inference engine.
FATE is an open-source federated learning platform that enables multiple organizations to collaboratively train machine learning models without exposing raw data to any party. It provides a complete framework for private data collaboration, allowing participants to jointly compute on sensitive information while maintaining data privacy and security guarantees through secure multi-party computation protocols. The platform distinguishes itself through its comprehensive infrastructure management capabilities, supporting automated deployment of multi-party clusters using Ansible-driven provisioni
Deploys trained models into production for high-performance inference across participating parties.
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
Uses llama.cpp as the core inference engine to run GGUF model files locally without external API dependencies.
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy to use API.
Uses llama.cpp as the core inference runtime for running GGUF-format models locally with CPU-optimized performance.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Deploys models from TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX for real-time scoring and batch prediction.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Caches large model weights on local nodes to cut startup time from minutes to seconds.
OpenChat 是一个用于训练、微调和部署大语言模型的框架,针对对话和数学推理任务进行了优化。它提供了从训练流水线、部署栈到基于 Web 的聊天界面的全生命周期管理。 该项目专注于在消费级硬件上实现高性能模型执行,无需企业级加速器。它包含一个生产就绪的推理服务器,实现了 OpenAI 聊天补全协议,并利用动态请求批处理来优化硬件吞吐量。 该系统涵盖了整个操作工作流,包括数据集分词、通过无填充训练(padding-free training)进行模型微调以及强化学习。它还扩展到支持基于密钥认证的 API 托管,并提供用于实时人机交互的图形用户界面。
Optimizes model execution to enable high-performance LLM inference on non-enterprise GPUs.
WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices. The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accurac
Serves trained ASR models in both real-time streaming and batch processing modes for production use.
Aibrix 是一个推理编排器,专为跨分布式 vLLM 集群扩展、路由和管理大语言模型部署而设计。它作为一个集中式网关,用于负载均衡并将流量路由到特定的模型副本和版本。 该系统通过 GPU 集群自动缩放器管理资源效率,该缩放器根据实时请求量调整计算实例数量。它通过在单个集群内混合不同加速器类型,并利用模型适配器编排器在共享基础模型上部署轻量级参数适配器,进一步优化了操作。 广泛的功能包括使用分布式键值缓存管理器在推理引擎之间共享 Token 数据,以及实施硬件健康监控以检测处理单元故障。该项目还提供了一个统一的指标流水线,以标准化跨不同运行时环境的性能数据收集。
Manages the dynamic loading and serving of lightweight adapters to run multiple model variants on shared hardware.
该项目是一个全面的教育资源和课程,专注于完整机器学习软件和硬件栈的设计与实现。它作为架构机器学习系统的技术参考,涵盖从低级编程接口到大规模部署基础设施的各个方面。 该项目提供关于多个专业领域的教学指导,包括通过中间表示和图优化开发 AI 编译器。它涵盖了跨 GPU 集群进行分布式训练所需的架构模式,以及为优化专用芯片上的工作负载而进行的硬件加速器编程。 该资源还详细介绍了生产环境的模型服务框架实现以及强化学习流水线的构建。其范围扩展到 ML 系统的核心组件,例如自动微分、张量抽象和 GPU 资源的编排。
Details infrastructure and techniques for deploying and optimizing machine learning models for production inference.