29 Repos
Infrastructure and techniques for deploying and optimizing machine learning models for production inference.
Distinguishing note: Focuses on production-grade serving optimizations like batching and caching, distinct from model training.
29 GitHub repositories in DevOps & Infrastructure · Model Serving.
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Serves large language models via high-performance APIs supporting both request-response and streaming token generation.
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to
Optimizes audio delivery using continuous batching and prefix caching for low-latency production inference.
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
Combines several trained adapter modules using weighted averages to create unified adapter configurations.
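As a rough illustration of combining adapters by weighted average, the sketch below averages the raw parameter values of several adapters element by element. It is not this library's API: `merge_adapters` and the `Adapter` type are hypothetical names, and each adapter is assumed to be a plain mapping from parameter name to a flat list of floats with identical keys and shapes.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Adapter = Dict[str, List[float]]


def merge_adapters(adapters: List[Adapter], weights: List[float]) -> Adapter:
    """Return the element-wise weighted average of several adapters.

    Assumes every adapter has the same parameter names and lengths.
    """
    if not adapters or len(adapters) != len(weights):
        raise ValueError("need one weight per adapter and at least one adapter")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    merged: Adapter = {}
    for name in adapters[0]:
        # Group the i-th value of this parameter across all adapters.
        columns = zip(*(adapter[name] for adapter in adapters))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) / total
            for column in columns
        ]
    return merged
```

For example, `merge_adapters([{"a": [1.0, 2.0]}, {"a": [3.0, 4.0]}], [1.0, 1.0])` yields `{"a": [2.0, 3.0]}`. Averaging low-rank factors separately is not the same as averaging their product, so a real implementation may choose a different combination rule.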
The simplest way to run LLaMA on your local machine
Executes LLaMA models locally using a simple command-line interface.
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Provides techniques for exporting trained models to standardized formats for production API serving.
Moondream is a small-scale vision language model designed to reason across images to generate captions and answer natural language questions. It functions as an edge-optimized system capable of performing visual question answering, image captioning, and object detection. The project distinguishes itself through a lightweight architecture designed for local inference on embedded devices, workstations, and air-gapped hardware. It supports the execution of models on local GPUs and Apple Silicon to ensure data privacy and low latency. The system's capabilities include identifying precise object
Manages production traffic through automatic batching, prefix caching, and streaming responses.
One Small Step is an educational resource that explains core AI and large language model concepts through short, accessible articles designed to be read in under five minutes. It covers the structure and function of key LLM components like attention mechanisms and tokenization, as well as foundational machine learning mathematics such as matrix rank and overfitting. The project also serves as a guide to the GGUF file format, which packages all model parameters and metadata into a single compact binary file for cross-platform deployment without external dependencies. It explains how this forma
Explains how GGUF format uses memory-mapped file access for near-instant model loading and startup.
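The idea behind memory-mapped loading can be sketched with Python's standard `mmap` module. This is a general illustration of the technique, not a GGUF parser: `read_slice_mmap` is a hypothetical helper, and the file written in the demo is made-up data standing in for a model file.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map.

    The file is mapped rather than read up front; the operating system
    pages in only the regions that are actually touched.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    # Made-up file contents: a short header, padding, and a trailer.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"HEADER" + bytes(1_000_000) + b"TAIL")
        path = tmp.name
    try:
        print(read_slice_mmap(path, 0, 6))
    finally:
        os.remove(path)
```

Opening the map is cheap regardless of file size, which is why a format designed for mapping can start up quickly. The cost of reading data is deferred to first access.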
StoryDiffusion is a generative AI system designed for consistent character image and video generation. It utilizes a pluggable cross-attention module to inject shared character representations into pretrained diffusion models, allowing for visual identity stability across multiple images and scenes without retraining the base model. The project features a video generation pipeline that produces temporally coherent sequences from text prompts or condition images. It employs a latent space motion interpolator to predict intermediate frames and semantic motion, enabling long-range video generati
Enables full generation pipelines to run on consumer GPUs by reducing batch size and model precision.
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Generates text responses from LLaMA-3 models with support for chat and streaming API serving.
This is an open-source Python SDK for building and orchestrating production-grade AI agents. It provides a unified framework for creating conversational agents that can use tools, maintain state, and coordinate across multiple language model providers including OpenAI, Anthropic, Google, Amazon Bedrock, and locally-hosted models. The SDK supports multi-agent orchestration through graphs, teams, and swarms, allowing several specialized agents to collaborate on complex tasks. Agents can be composed as callable tools that other agents invoke, and the framework includes policy handlers that inspe
Connects to Meta-hosted Llama API endpoints to run inference without managing your own infrastructure.
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Dynamically loads and removes fine-tuned LoRA adapters from storage without restarting the inference engine.
FATE is an open-source federated learning platform that enables multiple organizations to collaboratively train machine learning models without exposing raw data to any party. It provides a complete framework for private data collaboration, allowing participants to jointly compute on sensitive information while maintaining data privacy and security guarantees through secure multi-party computation protocols. The platform distinguishes itself through its comprehensive infrastructure management capabilities, supporting automated deployment of multi-party clusters using Ansible-driven provisioni
Deploys trained models into production for high-performance inference across participating parties.
Serge is a self-hosted web chat interface for running large language models locally using the llama.cpp inference engine. It loads GGUF-format model files directly on your own machine, removing the need for internet connectivity or external API keys, and streams responses to the browser in real time via WebSocket connections. The project is packaged for containerized deployment using Docker and Docker Compose, with a Traefik reverse proxy that handles HTTP and WebSocket routing along with automatic TLS certificate management. Ready-made Kubernetes manifests are also provided, enabling deploym
Uses llama.cpp as the core inference engine to run GGUF model files locally without external API dependencies.
A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy-to-use API.
Uses llama.cpp as the core inference runtime for running GGUF-format models locally with CPU-optimized performance.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Deploys models from TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX for real-time scoring and batch prediction.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Caches large model weights on local nodes to cut startup time from minutes to seconds.
OpenChat is a framework for training, fine-tuning, and deploying large language models optimized for conversational and mathematical reasoning tasks. It covers the full lifecycle of these models, from training pipelines and deployment stacks to a web-based chat interface. The project focuses on enabling high-performance model execution on consumer hardware without the need for enterprise accelerators. It includes a production-ready inference server that implements the OpenAI chat completion protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization and model fine-tuning via padding-free training and reinforcement learning. It extends this with API hosting with key-based authentication and a graphical user interface for real-time human interaction.
Optimizes model execution to enable high-performance LLM inference on non-enterprise GPUs.
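Request batching of the kind this entry mentions can be sketched as grouping queued requests under a size cap and a token budget. The code below is a minimal sketch and not OpenChat's implementation: `form_batches`, `max_batch_size`, and `max_batch_tokens` are hypothetical names, and it works on a snapshot of the queue rather than on requests arriving over time.

```python
from typing import List


def form_batches(
    prompt_lengths: List[int],
    max_batch_size: int,
    max_batch_tokens: int,
) -> List[List[int]]:
    """Group request indices into batches, preserving arrival order.

    A batch is closed when adding the next request would exceed either
    the request-count cap or the token budget. A single request longer
    than the budget still gets a batch of its own.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0

    for index, length in enumerate(prompt_lengths):
        too_many = len(current) >= max_batch_size
        too_long = bool(current) and current_tokens + length > max_batch_tokens
        if too_many or too_long:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += length

    if current:
        batches.append(current)
    return batches
```

With `form_batches([5, 5, 5, 20, 1], max_batch_size=2, max_batch_tokens=12)` the result is `[[0, 1], [2], [3], [4]]`. A production server would typically also bound how long a request may wait before its batch is dispatched.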
WeNet is an end-to-end automatic speech recognition (ASR) toolkit designed for both Chinese and English, built around transformer-based models. It supports streaming and non-streaming inference out of the box, and is structured to be production-ready, with model export and deployment paths for servers and mobile devices. The toolkit distinguishes itself through a chunk-based streaming transformer architecture that processes audio in fixed-size segments for low latency while preserving context across chunks. It jointly trains models with both CTC and attention loss to combine alignment accurac
Serves trained ASR models in both real-time streaming and batch processing modes for production use.
Aibrix is an inference orchestrator designed for scaling, routing, and managing the deployment of large language models across distributed vLLM clusters. It serves as a central gateway for load balancing and for routing traffic to specific model replicas and versions. The system manages resource efficiency through a GPU cluster autoscaler that adjusts the number of compute instances based on real-time request volume. It further optimizes operations by mixing different accelerator types within a cluster and by using a model adapter orchestrator to deploy lightweight parameter adapters on shared base models. Its broader features include a distributed key-value cache manager for sharing token data across inference engines and hardware health monitoring to detect failures of processing units. The project also provides a unified metrics pipeline to standardize the collection of performance data across diverse runtime environments.
Manages the dynamic loading and serving of lightweight adapters to run multiple model variants on shared hardware.
This project is a comprehensive educational resource and curriculum focused on the design and implementation of the entire machine learning software and hardware stack. It serves as a technical reference for the architecture of machine learning systems, ranging from low-level programming interfaces to large-scale deployment infrastructure. The project offers instructional guidance on several specialized areas, including the development of AI compilers through intermediate representations and graph optimizations. It covers the architectural patterns required for distributed training across GPU clusters, as well as the programming of hardware accelerators to optimize workloads on specialized chips. The resource also describes the implementation of model serving frameworks for production environments and the design of reinforcement learning pipelines. Its scope extends to the core components of ML systems, such as automatic differentiation, tensor abstractions, and the orchestration of GPU resources.
Details infrastructure and techniques for deploying and optimizing machine learning models for production inference.