38 repositories
Techniques for grouping multiple small data operations into a single larger request to increase throughput.
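As a rough illustration of the idea, the sketch below groups small operations into fixed-size batches and sends each batch in one call. The names `chunked`, `process_in_batches`, and `fake_send` are hypothetical and chosen for this example only; `fake_send` stands in for whatever bulk endpoint a real system would expose.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def process_in_batches(
    items: Iterable[T],
    batch_size: int,
    send_batch: Callable[[List[T]], List[R]],
) -> List[R]:
    """Call send_batch once per batch instead of once per item."""
    results: List[R] = []
    for batch in chunked(items, batch_size):
        results.extend(send_batch(batch))
    return results


# Hypothetical stand-in for a bulk request; it records each call it receives.
calls: List[List[int]] = []


def fake_send(batch: List[int]) -> List[int]:
    calls.append(list(batch))
    return [x * 2 for x in batch]


doubled = process_in_batches(range(7), 3, fake_send)
print(len(calls), doubled)
```

Seven items with a batch size of 3 result in three calls rather than seven. The right batch size depends on per-request overhead and payload limits of the target system.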
Distinct from Obsolete Entry Clearing: that category focuses on log inspection or cleanup, whereas this one is a performance optimization for processing multiple log entries together.
38 GitHub repositories matching data & databases · Request Batching.
Hystrix is a latency and fault tolerance library designed to prevent cascading failures in distributed systems. It functions as a circuit breaker implementation that monitors failure thresholds and opens circuits to isolate remote calls when downstream services degrade. The project distinguishes itself by providing multiple isolation mechanisms, utilizing dedicated thread pools and semaphores to ensure that latency in one dependency does not saturate the entire system. It also features a request collapsing and batching engine that groups concurrent calls into single executions to reduce the t
Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.
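The collapsing of concurrent calls can be sketched generically as follows. This is not Hystrix's API (Hystrix is a Java library); it is a minimal asyncio illustration in which the class `Collapser`, its `window` parameter, and the `fetch_many` backend are all assumptions made for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Tuple


class Collapser:
    """Collect calls made within a short window and run them as one batch."""

    def __init__(
        self,
        batch_fn: Callable[[List[Any]], Awaitable[List[Any]]],
        window: float = 0.01,
    ) -> None:
        self._batch_fn = batch_fn
        self._window = window
        self._pending: List[Tuple[Any, asyncio.Future]] = []
        self._timer: Optional[asyncio.Task] = None

    async def get(self, key: Any) -> Any:
        loop = asyncio.get_running_loop()
        fut: asyncio.Future = loop.create_future()
        self._pending.append((key, fut))
        if self._timer is None:
            # The first caller in a window starts the flush timer.
            self._timer = loop.create_task(self._flush_later())
        return await fut

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._window)
        pending, self._pending = self._pending, []
        self._timer = None
        keys = [key for key, _ in pending]
        try:
            results = await self._batch_fn(keys)
        except Exception as exc:
            # A failed batch fails every caller that was collapsed into it.
            for _, fut in pending:
                if not fut.done():
                    fut.set_exception(exc)
            return
        for (_, fut), result in zip(pending, results):
            if not fut.done():
                fut.set_result(result)


# Hypothetical backend that answers many keys in one call.
batch_calls: List[List[int]] = []


async def fetch_many(keys: List[int]) -> List[str]:
    batch_calls.append(list(keys))
    return [f"user-{k}" for k in keys]


async def demo() -> List[str]:
    collapser = Collapser(fetch_many, window=0.01)
    return await asyncio.gather(*(collapser.get(i) for i in range(5)))


results = asyncio.run(demo())
print(len(batch_calls), results)
```

Five concurrent `get` calls reach the backend as a single `fetch_many` call. The trade-off is that each caller waits up to `window` seconds before its request is dispatched.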
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Groups multiple read requests into a single server call to reduce network overhead and improve throughput.
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses a ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It incorporates server-side batching and GPU workload scaling to maximize hardware utilization and handle high request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for text and images within a shared vector space. This enables cross-modal search, content relevance ranking, and result reranking based on semantic alignment between visual content and textual descriptions. The service can be deployed as an elastic microservice accessible via gRPC, HTTP, or WebSocket protocols, with non-blocking duplex streaming to handle large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.
This is a Raft consensus library and distributed consensus engine implemented in Go. It provides the primitives necessary to build fault-tolerant distributed services by implementing a replicated state machine that ensures a group of servers agree on a shared system state through leader election and log replication. The project distinguishes itself through a pluggable architecture for storage backends and snapshot storage, decoupling the consensus logic from physical persistence. It includes specialized mechanisms for leadership transfer, protocol version management to support rolling upgrade
Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.
Yoga is a GraphQL server framework and runtime-agnostic HTTP handler used to build and deploy GraphQL APIs. It functions as a toolkit for managing schemas and resolvers, providing a spec-compliant environment for hosting APIs across diverse JavaScript runtimes, including Node.js, Deno, Bun, and serverless cloud environments. The project distinguishes itself through its ability to act as an Apollo Federation gateway, composing multiple subgraphs into a single unified supergraph. It also serves as a dedicated subscription server, delivering real-time data streaming via both WebSockets and Serve
Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
gspread is a Python client library and API wrapper designed for programmatically interacting with Google Sheets. It serves as a spreadsheet automation library that enables the creation, organization, and management of cloud-based spreadsheets via Python scripts. The library provides a simplified interface for Google Sheets automation, allowing users to read, write, and update data without writing raw HTTP requests. It supports cloud spreadsheet integration, enabling external Python applications to use Google Sheets as a data storage layer. The project covers a broad range of capabilities inc
Implements request batching to group multiple data updates into single network calls for improved performance.
Combines short requests into batches and splits long sequences across GPUs for balanced throughput.
Combines dynamic batching and concurrent execution to maximize hardware utilization during model serving.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
OpenChat is a framework for training, fine-tuning, and deploying large language models optimized for conversational and mathematical reasoning tasks. It provides a complete lifecycle for these models, from training pipelines and deployment stacks to a web chat interface. The project focuses on enabling high-performance model execution on consumer hardware without the need for enterprise-class accelerators. It includes a production-ready inference server that implements the OpenAI chat completion protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization and model fine-tuning via padding-free training and reinforcement learning. It also extends to API hosting with key-based authentication and a graphical interface for real-time human interaction.
Uses dynamic request batching to group multiple API requests into a single inference pass for higher throughput.
orpc is a contract-first API development framework for TypeScript that starts with a shared contract definition and generates type-safe clients and servers from that single source of truth. It guarantees end-to-end type safety, meaning inputs, outputs, errors, and streaming data are all checked at compile time across the client–server boundary. What distinguishes orpc from typical RPC frameworks is its ability to export contracts as OpenAPI specifications, to optimize server-side rendering by calling API handlers directly inside the server process, and to support real‑time bidirectional commu
Groups multiple API requests into a single call to reduce network overhead and improve efficiency.
fastllm is a set of specialized software components for model weight conversion, Mixture-of-Experts runtimes, and tensor parallelism. It provides an OpenAI-compatible API server to expose large language model capabilities through a standardized request format. The project features a tensor parallelism framework that splits computational workloads across multiple GPUs to accelerate execution. It includes a dedicated runtime optimized for Mixture-of-Experts architectures and a quantization tool for converting model weights to lower-precision formats to reduce memory usage and increase throughput. The system covers high-level workflows for distributed inference, including device-mapped memory management, dynamic batching, and mixed-mode execution. It also provides a command-line interface and a terminal user interface for model management and deployment configuration.
Groups multiple incoming requests into single execution passes to maximize GPU utilization and reduce token latency.
This project is an MLOps architectural guide and framework for designing and deploying deep learning systems in production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and building production-grade machine learning architectures. The project is distinguished by its emphasis on distributed deep learning and edge AI. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets and applies techniques such as quantization and distillation to reduce model size for embedded hardware. The capability surface extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses data workflow orchestration, dataset versioning via object stores, and handling high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
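One simplified way to picture batching under a latency limit is a policy that closes a batch either when it is full or when the oldest request in it has waited too long. The sketch below simulates that policy on a list of arrival times; it is not taken from the project above, and `form_batches`, `max_batch_size`, and `max_wait` are hypothetical names. Timestamps are plain integers (for example, milliseconds).

```python
from typing import Any, List, Sequence, Tuple


def form_batches(
    arrivals: Sequence[Tuple[int, Any]],
    max_batch_size: int,
    max_wait: int,
) -> List[List[Any]]:
    """Group (timestamp, request) pairs, sorted by timestamp, into batches.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait after the batch's first request.
    """
    batches: List[List[Any]] = []
    current: List[Any] = []
    deadline = 0
    for timestamp, request in arrivals:
        if current and timestamp > deadline:
            # Latency bound hit: ship the partial batch rather than wait.
            batches.append(current)
            current = []
        if not current:
            # The first request of a batch sets that batch's deadline.
            deadline = timestamp + max_wait
        current.append(request)
        if len(current) == max_batch_size:
            # Size bound hit: the batch is full.
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [
    (0, "a"), (2, "b"), (4, "c"),                  # slow trickle of requests
    (30, "d"), (31, "e"), (32, "f"), (33, "g"),    # burst of requests
    (34, "h"),
]
batches = form_batches(arrivals, max_batch_size=4, max_wait=10)
print(batches)
```

Under light load the deadline keeps latency bounded with a partial batch, while under a burst the size cap produces full batches. A production server would apply the same two bounds to a live queue with a timer instead of a pre-recorded list.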
exllamav2 is a high-performance inference library designed for running large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools to enable model execution without reliance on cloud-based computing services. The project features a quantization utility that compresses models into mixed bitrates between two and eight bits to reduce video RAM requirements. It distinguishes itself through a batched text generator that handles grouped requests and deduplicates cache data to increase throughput. The library covers a broad capability surface in
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
exllamav2 is a high-performance inference engine and framework for executing large language models locally on consumer-class GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project features a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer capable of converting models into mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The en
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.