Why is hanxiao/bert-as-service a recommended Inference Batching repository?

Groups individual requests into optimized batches to maximize GPU throughput during inference.

Why is cumulo-autumn/StreamDiffusion a recommended Inference Batching repository?

Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.

Why is FMInference/FlexLLMGen a recommended Inference Batching repository?

Processes multiple generation requests together in large batches to maximize throughput on a single GPU.

Why is voicepaw/so-vits-svc-fork a recommended Inference Batching repository?

Groups multiple audio segments into single GPU execution passes to speed up batch inference.

Why is kserve/kserve a recommended Inference Batching repository?

Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.

Why is kubeflow/kfserving a recommended Inference Batching repository?

Accumulates multiple prediction requests and processes them together to increase throughput.

Why is alirezadir/Production-Level-Deep-Learning a recommended Inference Batching repository?

Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.

Why is turboderp-org/exllamav2 a recommended Inference Batching repository?

Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.

17 repositories

Awesome GitHub Repositories: Inference Batching

Grouping multiple model inference requests into a single hardware execution pass to maximize throughput.

Distinct from Request Batching: Focuses on GPU/NPU compute batching for model inference rather than general data operation or network request batching.
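As a rough illustration of the definition above, the following minimal sketch pads several variable-length token sequences into one rectangular batch and calls the model once per batch instead of once per request. The names `pad_batch`, `run_batched`, `model_fn`, and `pad_id` are hypothetical and are not taken from any repository listed here.

```python
from typing import Callable, List, Sequence


def pad_batch(seqs: Sequence[Sequence[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the longest length so the batch is rectangular."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]


def run_batched(
    requests: Sequence[Sequence[int]],
    model_fn: Callable[[List[List[int]]], list],
    max_batch_size: int,
) -> list:
    """Run model_fn once per chunk of up to max_batch_size requests.

    model_fn is assumed to return one result per row, in row order.
    """
    results = []
    for start in range(0, len(requests), max_batch_size):
        chunk = requests[start:start + max_batch_size]
        results.extend(model_fn(pad_batch(chunk)))
    return results
```

With five requests and `max_batch_size=2`, `model_fn` is invoked three times rather than five; on real hardware each invocation would be one forward pass over the whole batch.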

Explore 17 awesome GitHub repositories matching data & databases · Inference Batching. Refine with filters or upvote what's useful.


hanxiao/bert-as-service
12,831 stars
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It integrates server-side batch processing and GPU workload scaling to increase hardware utilization and manage high request volumes. The platform supports semantic search infrastructure by generating multimodal embeddings for both text and images within a shared vector space. This enables cross-modal search, content relevance ranking, and result re-ranking based on semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible via gRPC, HTTP, or WebSocket protocols, and features non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
Python
cumulo-autumn/StreamDiffusion
10,770 stars
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
Python
FMInference/FlexLLMGen
9,362 stars
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
Python · deep-learning · gpt-3 · high-throughput
voicepaw/so-vits-svc-fork
9,318 stars
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Groups multiple audio segments into single GPU execution passes to speed up batch inference.
Python · contentvec · deep-learning · gan
kserve/kserve
5,576 stars
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
Go
kubeflow/kfserving
5,576 stars
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
Go
alirezadir/Production-Level-Deep-Learning
4,647 stars
This project is an MLOps architectural guide and framework for designing and deploying deep learning systems into production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and the creation of production-level machine learning architectures. The project distinguishes itself through a focus on distributed deep learning and edge AI optimization. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets and applies techniques like quantization and distillation to reduce model size for embedded
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
ai · artificial-intelligence · deep-learning
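One common way to trade throughput against a latency limit is to close a batch when it is full or when a wait deadline expires, whichever comes first. The sketch below shows that size-or-timeout policy with the standard library only; `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names, and this is not necessarily the policy the guide above describes.

```python
import queue
import time


def collect_batch(q: "queue.Queue", max_batch_size: int, max_wait_s: float) -> list:
    """Gather up to max_batch_size items, waiting at most max_wait_s after the first."""
    batch = [q.get()]  # block until at least one request is available
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch
```

A larger `max_wait_s` tends to produce fuller batches at the cost of added queueing delay for the first request in each batch.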
turboderp/exllamav2
4,553 stars
exllamav2 is a high-performance inference library designed to run large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools so that models can be executed without relying on cloud computing services. The project features a quantization tool that compresses models to mixed bitrates between two and eight bits to reduce video memory (VRAM) requirements. It includes a batched text generator that handles batched requests and deduplicates cache data to increase throughput. The library covers a broad capability surface, including asynchronous token streaming for real-time output, custom GPU kernels for linear algebra operations, and local memory mapping for low-latency access to model weights.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Python
turboderp-org/exllamav2
4,552 stars
exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer-grade GPUs. It provides a complete system for local model deployment, including a specialized inference engine and model quantization tools. The project is distinguished by a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer that converts models to mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These capabilities are backed by custom CUDA kernels and cache deduplication to improve hardware utilization and reduce latency during token generation.
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.
Python
pytorch/serve
4,354 stars
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production through scalable network endpoints. It functions as a high-performance, optimized inference server and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system features advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and using dynamic batching to improve throughput and latency. It provides specialized support for generative AI and large language models through continuous batching and tensor parallelism. Its broader capabilities cover GPU resource management across diverse hardware such as NVIDIA, AMD, and Apple Silicon, as well as model lifecycle management for registration, versioning, and worker scaling. It also integrates monitoring tools that track system health and model performance through Prometheus-compatible metrics. The server is managed through a command-line interface used for lifecycle control and runtime parameter configuration.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Java
skyzh/tiny-llm
4,304 stars
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Groups multiple incoming requests into a single hardware execution pass to maximize throughput.
Python · course · large-language-model · llm
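Continuous batching, mentioned in the description above, refills free batch slots at every decode step instead of waiting for the whole batch to finish. The toy scheduler below simulates only the slot bookkeeping, with no model involved; `continuous_batching` and its parameters are hypothetical and do not reflect tiny-llm's actual code.

```python
from collections import deque
from typing import Dict, Sequence


def continuous_batching(decode_steps: Sequence[int], max_batch_size: int) -> Dict[int, int]:
    """Simulate step-level scheduling.

    decode_steps[i] is how many decode steps request i needs.
    Returns a map from request index to the step at which it finished.
    """
    waiting = deque(enumerate(decode_steps))
    active: Dict[int, int] = {}       # request index -> steps still needed
    finished_at: Dict[int, int] = {}
    step = 0
    while waiting or active:
        # Fill any free slots before running the next step.
        while waiting and len(active) < max_batch_size:
            rid, need = waiting.popleft()
            active[rid] = need
        step += 1                     # one batched decode step for all active requests
        for rid in list(active):
            active[rid] -= 1
            if active[rid] <= 0:      # request is done; its slot frees up immediately
                del active[rid]
                finished_at[rid] = step
    return finished_at
```

For `decode_steps=[3, 1, 2]` and `max_batch_size=2`, the third request takes over the slot freed by the second after step 1, so all work completes in 3 steps; running the first two as a fixed batch and then the third alone would take 5.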
Lightning-AI/LitServe
3,894 stars
LitServe is a Python AI inference server framework and large language model (LLM) serving framework designed for high-concurrency inference. It functions as a distributed AI model server and dynamically batched inference engine, providing the tools to build and host custom servers that run AI models. The framework features a dynamically batched request queue that combines individual inference requests into single tensors to increase GPU throughput. It supports distributed GPU scaling, allowing model workloads to be spread across multiple hardware accelerators to balance compute load and increase overall capacity. The system provides a high-level wrapper interface that separates request pre-processing and post-processing from the core model execution logic. It also includes real-time streaming capabilities that deliver outputs incrementally, and it uses an asynchronous event loop to handle concurrent network requests.
Groups multiple incoming AI requests into single batches to maximize GPU hardware utilization.
Python
ModelTC/LightLLM
3,901 stars
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Merges new requests into active inference batches by calculating estimated token usage against hardware capacity.
Python · deep-learning · gpt · llama
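The idea of admitting new requests against a token budget can be sketched as follows. This is an illustrative approximation only: the `Request` class, the `admit` function, and the "prompt length plus maximum new tokens" estimate are assumptions, not LightLLM's actual scheduler or estimator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

    def estimated_tokens(self) -> int:
        # Assumed upper bound: the whole prompt plus every token it may generate.
        return self.prompt_tokens + self.max_new_tokens


def admit(
    running: List[Request], waiting: List[Request], token_capacity: int
) -> Tuple[List[Request], List[Request]]:
    """Move waiting requests into the running batch, in FIFO order, while they fit."""
    used = sum(r.estimated_tokens() for r in running)
    admitted = 0
    for req in waiting:
        need = req.estimated_tokens()
        if used + need > token_capacity:
            break  # keep FIFO order: later requests wait behind this one
        used += need
        admitted += 1
    return running + waiting[:admitted], waiting[admitted:]
```

Stopping at the first request that does not fit keeps ordering fair but can leave capacity unused; a scheduler could instead skip ahead to smaller requests, at the risk of starving large ones.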
collabora/WhisperLive
3,819 stars
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
Groups multiple concurrent user audio segments into single GPU calls to maximize system throughput.
Python · dictation · obs · openai
predibase/lorax
3,724 stars
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
Processes requests using different LoRA adapters in a single GPU forward pass to maximize throughput.
Python · fine-tuning · gpt · llama
sgl-project/mini-sglang
3,514 stars
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
Provides a local batch processing engine to maximize hardware utilization for offline benchmarking.
Python
llm-d/llm-d
2,514 stars
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
Manages large volumes of offline inference requests through queuing and flow control to maximize hardware utilization.
Shell

Awesome Inference Batching GitHub Repositories

hanxiao/bert-as-service

cumulo-autumn/StreamDiffusion

FMInference/FlexLLMGen

voicepaw/so-vits-svc-fork

kserve/kserve

kubeflow/kfserving

alirezadir/Production-Level-Deep-Learning

turboderp/exllamav2

turboderp-org/exllamav2

pytorch/serve

skyzh/tiny-llm

Lightning-AI/LitServe

ModelTC/LightLLM

collabora/WhisperLive

predibase/lorax

sgl-project/mini-sglang

llm-d/llm-d
