Why is hanxiao/bert-as-service a recommended inference batching repository?

Groups individual requests into optimized batches to maximize GPU throughput during inference.

Why is cumulo-autumn/StreamDiffusion a recommended inference batching repository?

Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.

Why is FMInference/FlexLLMGen a recommended inference batching repository?

Processes multiple generation requests together in large batches to maximize throughput on a single GPU.

Why is voicepaw/so-vits-svc-fork a recommended inference batching repository?

Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.

Why is kserve/kserve a recommended inference batching repository?

Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.

Why is kubeflow/kfserving a recommended inference batching repository?

Accumulates multiple prediction requests and processes them together to increase throughput.

Why is alirezadir/Production-Level-Deep-Learning a recommended inference batching repository?

Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.

Why is turboderp-org/exllamav2 a recommended inference batching repository?

Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.

17 repositories

Awesome GitHub Repositories: Inference Batching

Grouping multiple model inference requests into a single hardware execution pass to maximize throughput.

Distinct from Request Batching: Focuses on GPU/NPU compute batching for model inference rather than general data operation or network request batching.
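
As a rough illustration of the definition above, the sketch below splits a list of requests into fixed-size chunks and hands each chunk to the model in one call, so five requests cost two passes instead of five. `toy_model` and `run_batched` are hypothetical names invented for this example; a real server would stack tensors and run them on a GPU or NPU.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def toy_model(batch: Sequence[Vector]) -> List[float]:
    """Hypothetical stand-in for a model: one call handles the whole batch."""
    toy_model.calls += 1  # count "hardware execution passes"
    return [sum(x) for x in batch]


toy_model.calls = 0


def run_batched(
    requests: Sequence[Vector],
    batch_size: int,
    model: Callable[[Sequence[Vector]], List[float]],
) -> List[float]:
    """Split requests into chunks of at most batch_size; one model call per chunk."""
    outputs: List[float] = []
    for start in range(0, len(requests), batch_size):
        outputs.extend(model(requests[start:start + batch_size]))
    return outputs


requests = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
results = run_batched(requests, batch_size=4, model=toy_model)
print(results, "in", toy_model.calls, "passes")
```

The outputs come back in request order, which is what lets a server route each result to the caller that submitted it.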

Explore 17 awesome GitHub repositories matching data & databases · Inference Batching.


hanxiao/bert-as-service
12,831
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numeric vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It incorporates server-side batching and GPU workload scaling to maximize hardware utilization and handle large request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for both text and images in a shared vector space. This enables cross-modal search, content relevance ranking, and re-ranking of results based on semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket, with non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
Python
cumulo-autumn/StreamDiffusion
10,770
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
Python
FMInference/FlexLLMGen
9,362
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
Python · deep-learning · gpt-3 · high-throughput
voicepaw/so-vits-svc-fork
9,318
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.
Python · contentvec · deep-learning · gan
kserve/kserve
5,576
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
Go
kubeflow/kfserving
5,576
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
Go
alirezadir/Production-Level-Deep-Learning
4,647
This project is an MLOps architectural guide and framework for designing and deploying deep learning systems in production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and building production-grade machine learning architectures. The project is distinguished by its focus on distributed deep learning and edge AI. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets, and applies techniques such as quantization and distillation to reduce model size for embedded hardware. Its scope extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses data workflow orchestration, dataset versioning with object stores, and handling high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
ai · artificial-intelligence · deep-learning
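
The listing does not say how the guide's adaptive batching is meant to work, so the following is only one possible reading: a controller that grows the batch size by one while observed latency stays under a limit and halves it when the limit is exceeded (an additive-increase, multiplicative-decrease policy chosen here for illustration). The class `AdaptiveBatchSizer` and its parameters are hypothetical.

```python
from typing import List


class AdaptiveBatchSizer:
    """Hypothetical controller: grow the batch while latency is within the limit."""

    def __init__(self, latency_limit_ms: float, min_size: int = 1, max_size: int = 64):
        self.latency_limit_ms = latency_limit_ms
        self.min_size = min_size
        self.max_size = max_size
        self.size = min_size

    def update(self, observed_latency_ms: float) -> int:
        """Return the batch size to use next, given the latency of the last batch."""
        if observed_latency_ms > self.latency_limit_ms:
            # Over budget: back off quickly by halving.
            self.size = max(self.min_size, self.size // 2)
        else:
            # Within budget: probe for more throughput one request at a time.
            self.size = min(self.max_size, self.size + 1)
        return self.size


sizer = AdaptiveBatchSizer(latency_limit_ms=100.0, max_size=8)
observed: List[float] = [20.0, 40.0, 60.0, 80.0, 120.0, 50.0]
sizes = [sizer.update(latency) for latency in observed]
print(sizes)
```

The asymmetry is deliberate: a latency violation is corrected in one step, while throughput gains are explored gradually.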
turboderp/exllamav2
4,553
exllamav2 is a high-performance inference library designed to run large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools so that models can run without relying on cloud computing services. The project includes a quantization utility that compresses models to mixed bitrates between two and eight bits to reduce VRAM requirements. It is distinguished by a batched text generator that handles grouped requests and deduplicates cache data to increase throughput. The library covers a broad range of capabilities, including asynchronous token streaming for real-time output, custom GPU kernels for linear algebra operations, and local memory mapping for low-latency access to model weights.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Python
turboderp-org/exllamav2
4,552
exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer-class GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project includes a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer that can convert models to mixed-precision formats between 2 and 8 bits to balance memory use and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These capabilities are backed by custom CUDA kernels and cache deduplication to optimize hardware use and reduce latency during token generation.
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.
Python
pytorch/serve
4,354
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production through scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system is distinguished by its advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and using dynamic batching to improve throughput and latency. It provides specialized support for generative AI and large language models (LLMs) through continuous batching and tensor parallelism. Capability areas include GPU resource management across diverse hardware such as NVIDIA, AMD, and Apple Silicon, as well as comprehensive model lifecycle management for registration, versioning, and worker scaling. It also integrates observability tools to track system health and model performance through Prometheus-compatible metrics. The server is managed through a command-line interface used for lifecycle control and runtime parameter configuration.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Java
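
Dynamic batching, as mentioned in the description above, is commonly built around two limits: a maximum batch size and a maximum time to wait for the batch to fill. The sketch below shows that idea with the standard library only. It is not TorchServe's implementation or API; `collect_batch`, `max_batch_size`, and `max_delay_s` are names made up for this example.

```python
import queue
import time
from typing import Any, List


def collect_batch(q: "queue.Queue[Any]", max_batch_size: int, max_delay_s: float) -> List[Any]:
    """Block for the first request, then gather more until full or the delay expires."""
    batch = [q.get()]  # always wait for at least one request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget spent: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


pending: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    pending.put(f"req-{i}")

first = collect_batch(pending, max_batch_size=4, max_delay_s=0.05)
second = collect_batch(pending, max_batch_size=4, max_delay_s=0.05)
print(first, second)
```

The first call returns as soon as four requests are available; the second waits out the delay and ships the one remaining request on its own, which is the throughput-versus-latency trade the two limits control.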
skyzh/tiny-llm
4,304
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Groups multiple incoming requests into a single hardware execution pass to maximize throughput.
Python · course · large-language-model · llm
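
Continuous batching, named in the description above, differs from fixed batching in that a finished sequence frees its slot immediately and a waiting request takes it on the next decode step. The simulation below is a toy model of that scheduling idea only, not tiny-llm's code; `continuous_batching` and its `(request_id, tokens_to_generate)` tuples are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def continuous_batching(requests: List[Tuple[str, int]], max_slots: int) -> Dict[str, int]:
    """Simulate decode steps and return the step at which each request finished."""
    waiting: Deque[Tuple[str, int]] = deque(requests)
    active: Dict[str, int] = {}  # request_id -> tokens still to generate
    finished: Dict[str, int] = {}
    step = 0
    while waiting or active:
        # Refill any free slots before the next step.
        while waiting and len(active) < max_slots:
            request_id, tokens = waiting.popleft()
            active[request_id] = tokens
        step += 1  # one batched decode step produces one token per active request
        for request_id in list(active):
            active[request_id] -= 1
            if active[request_id] <= 0:
                del active[request_id]
                finished[request_id] = step
    return finished


finish_steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 2)], max_slots=2)
print(finish_steps)
```

With two slots, everything completes by step 5. Running the same requests as two fixed batches, (a, b) then (c, d), would take 5 + 2 = 7 steps, because the short request would hold its slot idle until the long one finished.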
Lightning-AI/LitServe
3,894
LitServe is a Python AI inference server framework and LLM serving framework designed for high-concurrency inference. It functions as a distributed AI model server and dynamic batching inference engine, providing the tools to build and host custom servers that run AI models. The framework distinguishes itself through a dynamic-batching request queue that groups individual inference requests into single tensors to maximize GPU throughput. It supports distributed GPU scaling, allowing model workloads to be spread across multiple hardware accelerators to balance compute loads and increase total
Groups multiple incoming AI requests into single batches to maximize GPU hardware utilization.
Python
ModelTC/LightLLM
3,901
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Merges new requests into active inference batches by calculating estimated token usage against hardware capacity.
Python · deep-learning · gpt · llama
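
The one-line summary above describes admitting new requests based on estimated token usage versus hardware capacity. A minimal sketch of that kind of check follows, assuming a worst-case estimate of prompt length plus maximum new tokens and a single token budget. `Request`, `admit`, and `token_capacity` are hypothetical and do not reflect LightLLM's actual scheduler.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

    @property
    def estimated_tokens(self) -> int:
        # Assumed worst case: the request generates all of its allowed tokens.
        return self.prompt_tokens + self.max_new_tokens


def admit(
    active: List[Request], waiting: List[Request], token_capacity: int
) -> Tuple[List[Request], List[Request]]:
    """Merge waiting requests into the active batch while the token budget allows."""
    used = sum(r.estimated_tokens for r in active)
    admitted = list(active)
    still_waiting: List[Request] = []
    for req in waiting:
        if used + req.estimated_tokens <= token_capacity:
            admitted.append(req)
            used += req.estimated_tokens
        else:
            still_waiting.append(req)  # too large for now; try again later
    return admitted, still_waiting


active = [Request("a", 100, 50)]
waiting = [Request("b", 200, 100), Request("c", 400, 200), Request("d", 50, 50)]
batch, deferred = admit(active, waiting, token_capacity=600)
print([r.request_id for r in batch], [r.request_id for r in deferred])
```

This version lets a small request jump ahead of a large one that does not fit, which favors utilization over strict arrival order; a scheduler that needs fairness would stop at the first request that does not fit.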
collabora/WhisperLive
3,819
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
Groups multiple concurrent user audio segments into single GPU calls to maximize system throughput.
Python · dictation · obs · openai
predibase/lorax
3,724
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Processes requests using different LoRA adapters in a single GPU forward pass to maximize throughput.
Python · fine-tuning · gpt · llama
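
To make multi-adapter batching concrete: every request in a batch shares the same base weights, and each row adds a low-rank correction selected by its own adapter ID, so requests for different adapters can sit in one batch. The sketch below shows the arithmetic with plain Python lists. It is illustrative only and is not Lorax's implementation; a GPU version would use gathered, batched matrix multiplies rather than a Python loop. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def batched_lora_forward(
    base_weight: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],  # adapter_id -> (A, B), delta = B @ A
    batch: List[Tuple[str, Vector]],  # (adapter_id, input) per request
) -> List[Vector]:
    """Compute W x + B_i (A_i x) for each row, using that row's own adapter."""
    outputs: List[Vector] = []
    for adapter_id, x in batch:
        base_out = matvec(base_weight, x)  # shared base model
        a, b = adapters[adapter_id]
        delta = matvec(b, matvec(a, x))  # low-rank, per-request correction
        outputs.append([y + d for y, d in zip(base_out, delta)])
    return outputs


identity: Matrix = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "fr": ([[1.0, 0.0]], [[1.0], [0.0]]),  # rank-1: adds x[0] to output[0]
    "de": ([[0.0, 1.0]], [[0.0], [2.0]]),  # rank-1: adds 2 * x[1] to output[1]
}
mixed_batch = [("fr", [1.0, 2.0]), ("de", [3.0, 4.0])]
lora_outputs = batched_lora_forward(identity, adapters, mixed_batch)
print(lora_outputs)
```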
sgl-project/mini-sglang
3,514
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
Provides a local batch processing engine to maximize hardware utilization for offline benchmarking.
Python
llm-d/llm-d
2,514
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Manages large volumes of offline inference requests through queuing and flow control to maximize hardware utilization.
Shell

Awesome Inference Batching GitHub Repositories

hanxiao/bert-as-service

cumulo-autumn/StreamDiffusion

FMInference/FlexLLMGen

voicepaw/so-vits-svc-fork

kserve/kserve

kubeflow/kfserving

alirezadir/Production-Level-Deep-Learning

turboderp/exllamav2

turboderp-org/exllamav2

pytorch/serve

skyzh/tiny-llm

Lightning-AI/LitServe

ModelTC/LightLLM

collabora/WhisperLive

predibase/lorax

sgl-project/mini-sglang

llm-d/llm-d
