Why is netflix/hystrix a recommended request batching repository?

Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.

Why is apple/foundationdb a recommended request batching repository?

Groups multiple read requests into a single server call to reduce network overhead and improve throughput.

Why is hanxiao/bert-as-service a recommended request batching repository?

Groups individual requests into optimized batches to maximize GPU throughput during inference.

Why is cumulo-autumn/streamdiffusion a recommended request batching repository?

Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.

Why is fminference/flexllmgen a recommended request batching repository?

Processes multiple generation requests together in large batches to maximize throughput on a single GPU.

Why is voicepaw/so-vits-svc-fork a recommended request batching repository?

Groups multiple audio segments into single GPU execution passes to accelerate batch inference.

Why is hashicorp/raft a recommended request batching repository?

Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.

Why is graphql-hive/graphql-yoga a recommended request batching repository?

Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.

Why is wang-xinyu/tensorrtx a recommended request batching repository?

Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.

Why is burnash/gspread a recommended request batching repository?

Implements request batching to group multiple data updates into single network calls for improved performance.
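Several of the answers above mention dynamic batching that trades throughput against latency. The following is only a minimal sketch of that general idea, not the implementation used by any listed project: the class name `DynamicBatcher` and the parameters `max_batch_size` and `max_wait_s` are hypothetical. A batch is released either when it is full or when its oldest item has waited too long. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class DynamicBatcher(Generic[T]):
    """Hypothetical batcher: flush on size limit or on wait-time limit."""

    def __init__(self, max_batch_size: int, max_wait_s: float) -> None:
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        if max_wait_s < 0:
            raise ValueError("max_wait_s must not be negative")
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending: List[T] = []
        self._oldest: Optional[float] = None  # arrival time of first pending item

    def add(self, item: T, now: float) -> Optional[List[T]]:
        """Queue an item; return a full batch if the size limit is reached."""
        if not self._pending:
            self._oldest = now
        self._pending.append(item)
        if len(self._pending) >= self.max_batch_size:
            return self._flush()
        return None

    def poll(self, now: float) -> Optional[List[T]]:
        """Return a partial batch if the oldest item has waited long enough."""
        if self._pending and self._oldest is not None:
            if now - self._oldest >= self.max_wait_s:
                return self._flush()
        return None

    def _flush(self) -> List[T]:
        batch, self._pending = self._pending, []
        self._oldest = None
        return batch
```

A larger `max_batch_size` favors throughput, while a smaller `max_wait_s` bounds the extra latency any single request can incur while waiting for companions.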

Awesome GitHub Repositories · Request Batching

Techniques for grouping multiple small data operations into a single larger request to increase throughput.
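As a rough illustration of this definition, the sketch below groups individual operations into fixed-size batches and sends each batch with one call. The helper names `chunked` and `run_batched`, and the `send_batch` callback, are hypothetical and stand in for whatever bulk endpoint a real system offers.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def run_batched(
    items: Iterable[T],
    send_batch: Callable[[List[T]], List[R]],
    batch_size: int,
) -> List[R]:
    """Call send_batch once per batch instead of once per item."""
    results: List[R] = []
    for batch in chunked(items, batch_size):
        results.extend(send_batch(batch))
    return results
```

With seven items and a batch size of three, the backend is called three times instead of seven, which is where the reduction in per-request overhead comes from.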

Distinct from Obsolete Entry Clearing: that category focuses on log inspection or cleanup, whereas this one covers a performance optimization for processing multiple log entries together.

Explore 38 awesome GitHub repositories matching data & databases · Request Batching.

Netflix/Hystrix
24,461
Hystrix is a latency and fault tolerance library designed to prevent cascading failures in distributed systems. It functions as a circuit breaker implementation that monitors failure thresholds and opens circuits to isolate remote calls when downstream services degrade. The project distinguishes itself by providing multiple isolation mechanisms, utilizing dedicated thread pools and semaphores to ensure that latency in one dependency does not saturate the entire system. It also features a request collapsing and batching engine that groups concurrent calls into single executions to reduce the t
Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.
Java
apple/foundationdb
16,446
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Groups multiple read requests into a single server call to reduce network overhead and improve throughput.
C++ · acid · distributed-database · foundationdb
hanxiao/bert-as-service
12,831
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from compute-intensive tasks. The system uses a ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It integrates server-side batching and GPU workload scaling to maximize hardware utilization and handle high request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for text and images within a shared vector space. This enables cross-modal search, relevance ranking of content, and re-ranking of results based on the semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket protocols, and offers non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
Python
cumulo-autumn/StreamDiffusion
10,770
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
Python
FMInference/FlexLLMGen
9,362
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
Python · deep-learning · gpt-3 · high-throughput
voicepaw/so-vits-svc-fork
9,318
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Groups multiple audio segments into single GPU execution passes to accelerate batch inference.
Python · contentvec · deep-learning · gan
hashicorp/raft
9,037
This is a Raft consensus library and distributed consensus engine implemented in Go. It provides the primitives necessary to build fault-tolerant distributed services by implementing a replicated state machine that ensures a group of servers agree on a shared system state through leader election and log replication. The project distinguishes itself through a pluggable architecture for storage backends and snapshot storage, decoupling the consensus logic from physical persistence. It includes specialized mechanisms for leadership transfer, protocol version management to support rolling upgrade
Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.
Go
graphql-hive/graphql-yoga
8,523
Yoga is a GraphQL server framework and runtime-agnostic HTTP handler used to build and deploy GraphQL APIs. It functions as a toolkit for managing schemas and resolvers, providing a spec-compliant environment for hosting APIs across diverse JavaScript runtimes, including Node.js, Deno, Bun, and serverless cloud environments. The project distinguishes itself through its ability to act as an Apollo Federation gateway, composing multiple subgraphs into a single unified supergraph. It also serves as a dedicated subscription server, delivering real-time data streaming via both WebSockets and Serve
Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.
TypeScript · bun · deno · fetch
wang-xinyu/tensorrtx
7,802
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
C++ · arcface · crnn · detr
burnash/gspread
7,479
gspread is a Python client library and API wrapper designed for programmatically interacting with Google Sheets. It serves as a spreadsheet automation library that enables the creation, organization, and management of cloud-based spreadsheets via Python scripts. The library provides a simplified interface for Google Sheets automation, allowing users to read, write, and update data without writing raw HTTP requests. It supports cloud spreadsheet integration, enabling external Python applications to use Google Sheets as a data storage layer. The project covers a broad range of capabilities inc
Implements request batching to group multiple data updates into single network calls for improved performance.
Python
Infrasys-AI/AIInfra
7,414
Combines short requests into batches and splits long sequences across GPUs for balanced throughput.
Jupyter Notebook · aiinfra · aisystem
NVIDIA/Isaac-GR00T
6,222
Combines dynamic batching and concurrent execution to maximize hardware utilization during model serving.
Jupyter Notebook
kserve/kserve
5,576
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
Go
kubeflow/kfserving
5,576
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
Go
imoneoi/openchat
5,481
OpenChat is a framework for training, fine-tuning, and deploying large language models optimized for conversational and mathematical reasoning tasks. It provides a comprehensive lifecycle for these models, from training pipelines and deployment stacks to a web-based chat interface. The project focuses on enabling high-performance model execution on consumer hardware without the need for enterprise accelerators. It includes a production-ready inference server that implements the OpenAI chat completion protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization and model fine-tuning via padding-free training and reinforcement learning. It extends this with API hosting with key-based authentication and a graphical user interface for real-time human interaction.
Uses dynamic request batching to group multiple API requests into a single inference pass for higher throughput.
Python
middleapi/orpc
4,862
orpc is a contract-first API development framework for TypeScript that starts with a shared contract definition and generates type-safe clients and servers from that single source of truth. It guarantees end-to-end type safety, meaning inputs, outputs, errors, and streaming data are all checked at compile time across the client–server boundary. What distinguishes orpc from typical RPC frameworks is its ability to export contracts as OpenAPI specifications, to optimize server-side rendering by calling API handlers directly inside the server process, and to support real‑time bidirectional commu
Groups multiple API requests into a single call to reduce network overhead and improve efficiency.
TypeScript · api · bunjs · cloudflare-worker
ztxz16/fastllm
4,779
fastllm is a set of specialized software components for model weight conversion, Mixture-of-Experts runtimes, and tensor parallelism. It provides an OpenAI compatible API server to expose large language model capabilities through a standardized request format. The project features a tensor parallelism framework that splits computational workloads across multiple GPUs to accelerate execution. It includes a dedicated runtime optimized for Mixture-of-Experts architectures and a quantization tool to convert model weights into lower precision formats to reduce memory usage and increase throughput.
Groups multiple incoming requests into single execution passes to maximize GPU utilization and reduce token latency.
C++
alirezadir/Production-Level-Deep-Learning
4,647
This project is an MLOps architecture guide and framework for designing and deploying deep learning systems in production environments. It offers a structured approach to deploying model inference, orchestrating ML pipelines, and building production-level machine learning architectures. The project is distinguished by a focus on distributed deep learning and edge AI optimization. It covers methods for parallelizing model training across multiple GPUs to process large datasets, and applies techniques such as quantization and distillation to reduce model size for embedded hardware. Its feature surface extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses the orchestration of data workflows, dataset versioning via object stores, and the handling of high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
ai · artificial-intelligence · deep-learning
turboderp/exllamav2
4,553
exllamav2 is a high-performance inference library developed for running large language models locally on consumer GPUs. It provides a GPU-accelerated runner and quantization tools to enable model execution without depending on cloud computing services. The project includes a quantization utility that compresses models to mixed bitrates between two and eight bits to reduce VRAM requirements. It is distinguished by a batched text generator that processes grouped requests and deduplicates cache data to increase throughput. The library covers a broad range of functionality, including asynchronous token streaming for real-time output, custom GPU kernel execution for linear algebra operations, and local memory mapping for low-latency access to model weights.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
Python
turboderp-org/exllamav2
4,552
exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project includes a multi-GPU inference framework that distributes workloads across several graphics cards to run models that exceed the memory capacity of a single device. It contains a GPU model quantizer that can convert models into mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These features are backed by custom CUDA kernels and cache deduplication to optimize hardware utilization and reduce latency during token generation.
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.
Python