17 repositories
Grouping multiple model inference requests into a single hardware execution pass to maximize throughput.
Distinct from Request Batching: Focuses on GPU/NPU compute batching for model inference rather than general data operation or network request batching.
17 GitHub repositories matching data & databases · Inference Batching.
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numeric vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It incorporates server-side batching and GPU workload scaling to maximize hardware utilization and handle large request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for both text and images within a shared vector space. This enables cross-modal search, content relevance ranking, and result re-ranking based on semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket, with non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
This project is an MLOps architecture guide and framework for designing and deploying deep learning systems in production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and building production-grade machine learning architectures. The project is distinguished by its focus on distributed deep learning and edge AI. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets, and applies techniques such as quantization and distillation to reduce model size for embedded hardware. Its capabilities extend to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses data workflow orchestration, dataset versioning via object stores, and handling high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
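As a rough illustration of batching under a latency limit, the sketch below gathers requests until either a size cap or a wait budget is reached. The names collect_batch, run_model, max_batch_size and max_wait_s are hypothetical and not taken from the project. The sketch uses fixed limits, whereas an adaptive scheme would presumably tune them at runtime.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch_size: int,
                  max_wait_s: float) -> List[Any]:
    """Block for the first request, then gather more until the batch is
    full or the latency budget (max_wait_s) is spent."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget exhausted: ship what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


def run_model(batch: List[str]) -> List[int]:
    # Hypothetical stand-in for one hardware execution pass over the batch.
    return [len(text) for text in batch]
```

A serving loop would call collect_batch repeatedly and hand each result to the model. A larger max_wait_s tends to produce fuller batches at the cost of extra latency for the first request in each batch.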
exllamav2 is a high-performance inference library designed to run large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools so that models can run without relying on cloud computing services. The project includes a quantization utility that compresses models to mixed bitrates between two and eight bits to reduce VRAM requirements. It is distinguished by a batched text generator that handles grouped requests and deduplicates cache data to increase throughput. The library covers a broad set of capabilities, including asynchronous token streaming for real-time output, custom GPU kernel execution for linear algebra operations, and local memory mapping for low-latency access to model weights.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer-class GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project features a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer capable of converting models to mixed-precision formats of between 2 and 8 bits to balance memory usage and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These capabilities are backed by custom CUDA kernels and cache deduplication to optimize hardware usage and reduce latency during token generation.
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production through scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system is distinguished by its advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows via execution graphs and using dynamic batching to improve throughput and latency. It provides specialized support for generative AI and large language models (LLMs) through continuous batching and tensor parallelism. Capability areas include GPU resource management on diverse hardware such as NVIDIA, AMD, and Apple Silicon, as well as comprehensive model lifecycle management for registration, versioning, and worker scaling. It also integrates observability tools to track system health and model performance through Prometheus-compatible metrics. The server is managed through a command-line interface used for lifecycle control and runtime parameter configuration.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Groups multiple incoming requests into a single hardware execution pass to maximize throughput.
LitServe is a Python AI inference server framework and LLM serving framework designed for high-concurrency inference. It functions as a distributed AI model server and dynamic batching inference engine, providing the tools to build and host custom servers that run AI models. The framework distinguishes itself through a dynamic-batching request queue that groups individual inference requests into single tensors to maximize GPU throughput. It supports distributed GPU scaling, allowing model workloads to be spread across multiple hardware accelerators to balance compute loads and increase total
Groups multiple incoming AI requests into single batches to maximize GPU hardware utilization.
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Merges new requests into active inference batches by calculating estimated token usage against hardware capacity.
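One way to picture token-based admission is the sketch below, which merges waiting requests into the active batch only while a worst-case token estimate fits a fixed capacity. The Request fields, the admit function, the estimate (prompt plus maximum new tokens), and the first-in-first-out stopping rule are all assumptions for illustration, not LightLLM's actual scheduler.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    def estimated_tokens(self) -> int:
        # Assumed worst case: the prompt plus every token it may generate.
        return self.prompt_tokens + self.max_new_tokens


def admit(active: List[Request], waiting: List[Request],
          token_capacity: int) -> Tuple[List[Request], List[Request]]:
    """Merge waiting requests into the active batch, in arrival order,
    while the estimated token total stays within token_capacity."""
    used = sum(r.estimated_tokens() for r in active)
    batch = list(active)
    still_waiting: List[Request] = []
    for req in waiting:
        cost = req.estimated_tokens()
        # Stop admitting once one request has been deferred, so later
        # small requests cannot starve an earlier large one.
        if not still_waiting and used + cost <= token_capacity:
            batch.append(req)
            used += cost
        else:
            still_waiting.append(req)
    return batch, still_waiting
```

Calling admit before each generation step lets new requests join as soon as finished ones free up token budget, instead of waiting for the whole batch to drain.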
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
Groups multiple concurrent user audio segments into single GPU calls to maximize system throughput.
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Processes requests using different LoRA adapters in a single GPU forward pass to maximize throughput.
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
Provides a local batch processing engine to maximize hardware utilization for offline benchmarking.
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Manages large volumes of offline inference requests through queuing and flow control to maximize hardware utilization.