Why is netflix/hystrix a recommended request batching repository?

Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.

Why is apple/foundationdb a recommended request batching repository?

Groups multiple read requests into a single server call to reduce network overhead and improve throughput.

Why is hanxiao/bert-as-service a recommended request batching repository?

Groups individual requests into optimized batches to maximize GPU throughput during inference.

Why is cumulo-autumn/streamdiffusion a recommended request batching repository?

Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.

Why is fminference/flexllmgen a recommended request batching repository?

Processes multiple generation requests together in large batches to maximize throughput on a single GPU.

Why is voicepaw/so-vits-svc-fork a recommended request batching repository?

Groups multiple audio segments into single GPU execution passes to increase batch inference throughput.

Why is hashicorp/raft a recommended request batching repository?

Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.

Why is graphql-hive/graphql-yoga a recommended request batching repository?

Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.

Why is wang-xinyu/tensorrtx a recommended request batching repository?

Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.

Why is burnash/gspread a recommended request batching repository?

Implements request batching to group multiple data updates into single network calls for improved performance.

38 repositories

Awesome GitHub Repositories · Request Batching

Techniques for grouping multiple small data operations into a single larger request to increase throughput.
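
As a rough illustration of the idea, the sketch below buffers small operations and forwards them as one larger call once a size threshold is reached. The `SizeBatcher` class, its `send_batch` callback, and the `max_batch_size` parameter are hypothetical names chosen for this example; they are not taken from any repository listed here.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class SizeBatcher(Generic[T]):
    """Buffers items and passes them to `send_batch` in groups (hypothetical helper)."""

    def __init__(self, send_batch: Callable[[List[T]], None], max_batch_size: int) -> None:
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self._send_batch = send_batch
        self._max_batch_size = max_batch_size
        self._pending: List[T] = []

    def add(self, item: T) -> None:
        # Queue the item; send the whole group once the threshold is reached.
        self._pending.append(item)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered, even if the batch is not full.
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._send_batch(batch)


# Usage: seven small operations become three downstream calls.
sent_batches: List[List[int]] = []
batcher = SizeBatcher(sent_batches.append, max_batch_size=3)
for value in range(7):
    batcher.add(value)
batcher.flush()  # push the final partial batch
```

A size-only trigger keeps the example deterministic. Systems that also care about latency typically add a maximum wait time so that a partial batch is not held indefinitely.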

Distinct from Obsolete Entry Clearing: that tag covers log inspection or cleanup, whereas this one covers the performance optimization of processing multiple log entries together.

Explore 38 awesome GitHub repositories matching data & databases · Request Batching. Refine with filters or upvote what's useful.


Netflix/Hystrix
24,461 · View on GitHub
Hystrix is a latency and fault tolerance library designed to prevent cascading failures in distributed systems. It functions as a circuit breaker implementation that monitors failure thresholds and opens circuits to isolate remote calls when downstream services degrade. The project distinguishes itself by providing multiple isolation mechanisms, utilizing dedicated thread pools and semaphores to ensure that latency in one dependency does not saturate the entire system. It also features a request collapsing and batching engine that groups concurrent calls into single executions to reduce the t
Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.
Java
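
The request collapsing idea described for this entry can be sketched in a few lines. The example below is a minimal asyncio illustration under stated assumptions, not Hystrix's actual API (Hystrix is a Java library): the `Collapser` class, its `fetch_many` callback, and the `window_s` parameter are hypothetical names. Callers that arrive within the same short window share one downstream call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple


class Collapser:
    """Collapses concurrent get(key) calls into one fetch_many(keys) call (hypothetical)."""

    def __init__(
        self,
        fetch_many: Callable[[List[str]], Awaitable[Dict[str, str]]],
        window_s: float,
    ) -> None:
        self._fetch_many = fetch_many
        self._window_s = window_s
        self._pending: List[Tuple[str, asyncio.Future]] = []
        self._flush_task: Optional[asyncio.Task] = None

    async def get(self, key: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self._pending.append((key, future))
        if len(self._pending) == 1:
            # The first caller in a window schedules the shared flush.
            self._flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._window_s)
        batch, self._pending = self._pending, []
        try:
            results = await self._fetch_many([key for key, _ in batch])
        except Exception as exc:
            # Propagate a failed batch to every waiting caller.
            for _, future in batch:
                future.set_exception(exc)
            return
        for key, future in batch:
            future.set_result(results[key])


calls: List[List[str]] = []


async def fake_fetch_many(keys: List[str]) -> Dict[str, str]:
    # Stand-in for a downstream service that accepts many keys at once.
    calls.append(list(keys))
    return {key: key.upper() for key in keys}


async def main() -> List[str]:
    collapser = Collapser(fake_fetch_many, window_s=0.01)
    return await asyncio.gather(
        collapser.get("a"), collapser.get("b"), collapser.get("c")
    )


results = asyncio.run(main())
```

In this run the three concurrent `get` calls produce a single `fake_fetch_many` invocation. The window length trades a small amount of added latency for fewer downstream calls.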
apple/foundationdb
16,446 · View on GitHub
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Groups multiple read requests into a single server call to reduce network overhead and improve throughput.
C++ · acid · distributed-database · foundationdb
hanxiao/bert-as-service
12,831 · View on GitHub
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numeric vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses a ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It incorporates server-side batching and GPU workload scaling to maximize hardware utilization and handle high request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for text and images within a shared vector space. This enables cross-modal search, content relevance ranking, and reranking of results based on semantic alignment between visual content and textual descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket, with non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
Python
cumulo-autumn/StreamDiffusion
10,770 · View on GitHub
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
Python
FMInference/FlexLLMGen
9,362 · View on GitHub
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
Python · deep-learning · gpt-3 · high-throughput
voicepaw/so-vits-svc-fork
9,318 · View on GitHub
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Groups multiple audio segments into single GPU execution passes to increase batch inference throughput.
Python · contentvec · deep-learning · gan
hashicorp/raft
9,037 · View on GitHub
This is a Raft consensus library and distributed consensus engine implemented in Go. It provides the primitives necessary to build fault-tolerant distributed services by implementing a replicated state machine that ensures a group of servers agree on a shared system state through leader election and log replication. The project distinguishes itself through a pluggable architecture for storage backends and snapshot storage, decoupling the consensus logic from physical persistence. It includes specialized mechanisms for leadership transfer, protocol version management to support rolling upgrade
Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.
Go
graphql-hive/graphql-yoga
8,523 · View on GitHub
Yoga is a GraphQL server framework and runtime-agnostic HTTP handler used to build and deploy GraphQL APIs. It functions as a toolkit for managing schemas and resolvers, providing a spec-compliant environment for hosting APIs across diverse JavaScript runtimes, including Node.js, Deno, Bun, and serverless cloud environments. The project distinguishes itself through its ability to act as an Apollo Federation gateway, composing multiple subgraphs into a single unified supergraph. It also serves as a dedicated subscription server, delivering real-time data streaming via both WebSockets and Serve
Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.
TypeScript · bun · deno · fetch
wang-xinyu/tensorrtx
7,802 · View on GitHub
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
C++ · arcface · crnn · detr
burnash/gspread
7,479 · View on GitHub
gspread is a Python client library and API wrapper designed for programmatically interacting with Google Sheets. It serves as a spreadsheet automation library that enables the creation, organization, and management of cloud-based spreadsheets via Python scripts. The library provides a simplified interface for Google Sheets automation, allowing users to read, write, and update data without writing raw HTTP requests. It supports cloud spreadsheet integration, enabling external Python applications to use Google Sheets as a data storage layer. The project covers a broad range of capabilities inc
Implements request batching to group multiple data updates into single network calls for improved performance.
Python
Infrasys-AI/AIInfra
7,414 · View on GitHub
Combines short requests into batches and splits long sequences across GPUs for balanced throughput.
Jupyter Notebook · aiinfra · aisystem
NVIDIA/Isaac-GR00T
6,222 · View on GitHub
Combines dynamic batching and concurrent execution to maximize hardware utilization during model serving.
Jupyter Notebook
kserve/kserve
5,576 · View on GitHub
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
Go
kubeflow/kfserving
5,576 · View on GitHub
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
Go
imoneoi/openchat
5,481 · View on GitHub
OpenChat is a framework for training, fine-tuning, and deploying large language models optimized for conversational and mathematical reasoning tasks. It provides a complete lifecycle for these models, from training pipelines and deployment stacks to a web chat interface. The project focuses on enabling high-performance model execution on consumer hardware without the need for enterprise-class accelerators. It includes a production-ready inference server that implements the OpenAI chat completion protocol and uses dynamic request batching to optimize hardware throughput. The system covers the full operational workflow, including dataset tokenization and model fine-tuning via padding-free training and reinforcement learning. It also extends to API hosting with key-based authentication and a graphical interface for real-time human interaction.
Uses dynamic request batching to group multiple API requests into a single inference pass for higher throughput.
Python
middleapi/orpc
4,862 · View on GitHub
orpc is a contract-first API development framework for TypeScript that starts with a shared contract definition and generates type-safe clients and servers from that single source of truth. It guarantees end-to-end type safety, meaning inputs, outputs, errors, and streaming data are all checked at compile time across the client–server boundary. What distinguishes orpc from typical RPC frameworks is its ability to export contracts as OpenAPI specifications, to optimize server-side rendering by calling API handlers directly inside the server process, and to support real‑time bidirectional commu
Groups multiple API requests into a single call to reduce network overhead and improve efficiency.
TypeScript · api · bunjs · cloudflare-worker
ztxz16/fastllm
4,779 · View on GitHub
fastllm is a set of specialized software components for model weight conversion, Mixture-of-Experts runtimes, and tensor parallelism. It provides an OpenAI-compatible API server to expose large language model capabilities through a standardized request format. The project has a tensor parallelism framework that splits computational workloads across multiple GPUs to accelerate execution. It includes a dedicated runtime optimized for Mixture-of-Experts architectures and a quantization tool for converting model weights to lower-precision formats to reduce memory usage and increase throughput. The system covers high-level workflows for distributed inference, including device-mapped memory management, dynamic batching, and mixed-mode execution. It also provides a command-line interface and a terminal user interface for model management and deployment configuration.
Groups multiple incoming requests into single execution passes to maximize GPU utilization and reduce token latency.
C++
alirezadir/Production-Level-Deep-Learning
4,647 · View on GitHub
This project is an MLOps architectural guide and framework for designing and deploying deep learning systems in production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and building production-grade machine learning architectures. The project is distinguished by an emphasis on distributed deep learning and edge AI. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets and applies techniques such as quantization and distillation to reduce model size for embedded hardware. The capability surface extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses data workflow orchestration, dataset versioning via object stores, and handling high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
ai · artificial-intelligence · deep-learning
turboderp/exllamav2
4,553 · View on GitHub
exllamav2 is a high-performance inference library designed for running large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools to enable model execution without reliance on cloud-based computing services. The project features a quantization utility that compresses models into mixed bitrates between two and eight bits to reduce video RAM requirements. It distinguishes itself through a batched text generator that handles grouped requests and deduplicates cache data to increase throughput. The library covers a broad capability surface in
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Python
turboderp-org/exllamav2
4,552 · View on GitHub
exllamav2 is a high-performance inference engine and framework for executing large language models locally on consumer-class GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project features a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer capable of converting models into mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The en
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.
Python
