38 repositories
Techniques for grouping multiple small data operations into a single larger request to increase throughput.
Distinct from Obsolete Entry Clearing: The candidates focus on log inspection or cleanup; this is a performance optimization for processing multiple log entries together.
Explore 38 awesome GitHub repositories matching data & databases · Request Batching. Refine with filters or upvote what's useful.
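As a rough illustration of the technique this category covers, the sketch below groups many small operations into fixed-size batches and sends each batch as one request. The names `chunked`, `run_batched`, and `send_batch` are hypothetical and do not come from any repository listed here.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def run_batched(
    items: Iterable[T],
    batch_size: int,
    send_batch: Callable[[List[T]], List[R]],
) -> List[R]:
    """Call send_batch once per batch instead of once per item."""
    results: List[R] = []
    for batch in chunked(items, batch_size):
        results.extend(send_batch(batch))
    return results
```

With five items and a batch size of 2, `send_batch` is called three times rather than five, which is where the throughput gain comes from when each call carries a fixed cost.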
Hystrix is a latency and fault tolerance library designed to prevent cascading failures in distributed systems. It functions as a circuit breaker implementation that monitors failure thresholds and opens circuits to isolate remote calls when downstream services degrade. The project distinguishes itself by providing multiple isolation mechanisms, utilizing dedicated thread pools and semaphores to ensure that latency in one dependency does not saturate the entire system. It also features a request collapsing and batching engine that groups concurrent calls into single executions to reduce the t
Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.
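The idea of collapsing concurrent calls into one batch execution can be sketched as follows. This is not Hystrix's API (Hystrix is a Java library); `Collapser`, `window_s`, and `fetch_many` are hypothetical names, and the sketch assumes the batch function returns a dictionary keyed by the requested keys.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple


class Collapser:
    """Collects calls made within a short window and runs them as one batch."""

    def __init__(
        self,
        batch_fn: Callable[[List[str]], Awaitable[Dict[str, str]]],
        window_s: float = 0.01,
    ) -> None:
        self._batch_fn = batch_fn
        self._window_s = window_s
        self._pending: List[Tuple[str, asyncio.Future]] = []
        self._timer: Optional[asyncio.Task] = None

    async def get(self, key: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((key, fut))
        if self._timer is None:  # the first caller in a window starts the timer
            self._timer = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._window_s)
        pending, self._pending = self._pending, []
        self._timer = None
        try:
            results = await self._batch_fn([key for key, _ in pending])
            for key, fut in pending:
                if not fut.done():
                    fut.set_result(results[key])
        except Exception as exc:  # propagate a batch failure to every waiter
            for _, fut in pending:
                if not fut.done():
                    fut.set_exception(exc)


async def demo() -> Tuple[List[str], List[List[str]]]:
    calls: List[List[str]] = []

    async def fetch_many(keys: List[str]) -> Dict[str, str]:
        calls.append(list(keys))  # record each downstream call
        return {k: k.upper() for k in keys}

    collapser = Collapser(fetch_many)
    values = await asyncio.gather(
        collapser.get("a"), collapser.get("b"), collapser.get("c")
    )
    return values, calls
```

In `demo`, three concurrent `get` calls result in a single downstream call, at the cost of up to `window_s` of added latency per request.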
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Groups multiple read requests into a single server call to reduce network overhead and improve throughput.
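This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numeric vectors. It acts as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system uses ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and inference servers. It combines server-side batching with GPU workload scaling to maximize hardware utilization and manage high request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for text and images in a shared vector space. This enables cross-modal search, content relevance ranking, and result re-ranking based on the semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket, with non-blocking duplex streaming for processing large datasets.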
Groups individual requests into optimized batches to maximize GPU throughput during inference.
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.
This is a Raft consensus library and distributed consensus engine implemented in Go. It provides the primitives necessary to build fault-tolerant distributed services by implementing a replicated state machine that ensures a group of servers agree on a shared system state through leader election and log replication. The project distinguishes itself through a pluggable architecture for storage backends and snapshot storage, decoupling the consensus logic from physical persistence. It includes specialized mechanisms for leadership transfer, protocol version management to support rolling upgrade
Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.
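Applying committed log entries in batches can be sketched roughly as below. This is a Python illustration rather than the Go library's API; `KVStateMachine`, `apply_committed`, and `max_batch` are hypothetical names, and log entries are assumed to be `(index, key, value)` tuples.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str, str]  # (log index, key, value), assumed layout


class KVStateMachine:
    """Toy replicated state machine that accepts entries in batches."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.last_applied = 0
        self.apply_calls = 0  # counts batch invocations, for illustration

    def apply_batch(self, entries: List[LogEntry]) -> None:
        self.apply_calls += 1
        for index, key, value in entries:
            self.data[key] = value
            self.last_applied = index


def apply_committed(
    log: List[LogEntry],
    commit_index: int,
    sm: KVStateMachine,
    max_batch: int = 64,
) -> None:
    """Apply every committed but unapplied entry, max_batch entries per call."""
    pending = [e for e in log if sm.last_applied < e[0] <= commit_index]
    for start in range(0, len(pending), max_batch):
        sm.apply_batch(pending[start:start + max_batch])
```

With three committed entries and `max_batch=2`, the state machine is invoked twice instead of three times; entries beyond `commit_index` are left untouched.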
Yoga is a GraphQL server framework and runtime-agnostic HTTP handler used to build and deploy GraphQL APIs. It functions as a toolkit for managing schemas and resolvers, providing a spec-compliant environment for hosting APIs across diverse JavaScript runtimes, including Node.js, Deno, Bun, and serverless cloud environments. The project distinguishes itself through its ability to act as an Apollo Federation gateway, composing multiple subgraphs into a single unified supergraph. It also serves as a dedicated subscription server, delivering real-time data streaming via both WebSockets and Serve
Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
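One common way to trade throughput against latency in dynamic batching is to cap both the batch size and the time spent waiting for a batch to fill. The sketch below illustrates that idea only; it does not use TensorRT or tensorrtx APIs, and `next_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names.

```python
import queue
import time
from typing import Any, List


def next_batch(q: "queue.Queue[Any]", max_batch_size: int, max_wait_s: float) -> List[Any]:
    """Block for the first item, then gather more until a limit is reached."""
    batch = [q.get()]  # wait indefinitely for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent: send what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

A larger `max_batch_size` favors throughput, while a smaller `max_wait_s` bounds how long the first request in a batch can be delayed.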
gspread is a Python client library and API wrapper designed for programmatically interacting with Google Sheets. It serves as a spreadsheet automation library that enables the creation, organization, and management of cloud-based spreadsheets via Python scripts. The library provides a simplified interface for Google Sheets automation, allowing users to read, write, and update data without writing raw HTTP requests. It supports cloud spreadsheet integration, enabling external Python applications to use Google Sheets as a data storage layer. The project covers a broad range of capabilities inc
Implements request batching to group multiple data updates into single network calls for improved performance.
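A minimal sketch of grouping several range updates into one call, assuming the `worksheet` argument is a gspread worksheet whose `batch_update` method accepts a list of dictionaries with `range` and `values` keys. The helper names `build_batch` and `write_updates` are hypothetical, and the example needs real credentials and a spreadsheet to run against Google Sheets.

```python
from typing import Any, Dict, List, Tuple

Update = Tuple[str, List[List[Any]]]  # (A1-notation range, rows of cell values)


def build_batch(updates: List[Update]) -> List[Dict[str, Any]]:
    """Convert (range, values) pairs into the list-of-dicts batch payload."""
    return [{"range": rng, "values": values} for rng, values in updates]


def write_updates(worksheet: Any, updates: List[Update]) -> None:
    """Send all updates in one call instead of one call per range."""
    # Assumption: worksheet is a gspread Worksheet (or an object with the
    # same batch_update method).
    worksheet.batch_update(build_batch(updates))
```

For example, `write_updates(ws, [("A1:B1", [["name", "score"]]), ("A2:B2", [["ada", 42]])])` would issue a single request for both ranges.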
Combines short requests into batches and splits long sequences across GPUs for balanced throughput.
Combines dynamic batching and concurrent execution to maximize hardware utilization during model serving.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
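OpenChat is a framework for training, fine-tuning, and deploying large language models, optimized for conversation and mathematical reasoning tasks. It provides full lifecycle management, from the training pipeline and deployment stack to a web-based chat interface. The project focuses on high-performance model execution on consumer-grade hardware, without requiring enterprise-grade accelerators. It includes a production-ready inference server that implements the OpenAI chat completions protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization, model fine-tuning via padding-free training, and reinforcement learning. It also extends to API hosting with key-based authentication and provides a graphical user interface for real-time human-computer interaction.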
Uses dynamic request batching to group multiple API requests into a single inference pass for higher throughput.
orpc is a contract-first API development framework for TypeScript that starts with a shared contract definition and generates type-safe clients and servers from that single source of truth. It guarantees end-to-end type safety, meaning inputs, outputs, errors, and streaming data are all checked at compile time across the client–server boundary. What distinguishes orpc from typical RPC frameworks is its ability to export contracts as OpenAPI specifications, to optimize server-side rendering by calling API handlers directly inside the server process, and to support real‑time bidirectional commu
Groups multiple API requests into a single call to reduce network overhead and improve efficiency.
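fastllm is a set of specialized software components for model weight conversion, a mixture-of-experts (MoE) runtime, and tensor parallelism. It provides an OpenAI-compatible API server that exposes large language model functionality through standardized request formats. The project features a tensor-parallel framework that splits computational workloads across multiple GPUs to accelerate execution. It includes a dedicated runtime optimized for mixture-of-experts architectures, and a quantization tool that converts model weights to low-precision formats to reduce memory usage and improve throughput. The system covers advanced workflows for distributed inference, including device-mapped memory management, dynamic batching, and mixed-mode execution. It also provides a command-line interface and a terminal user interface for model management and deployment configuration.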
Groups multiple incoming requests into single execution passes to maximize GPU utilization and reduce token latency.
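This project is a set of MLOps architecture guidelines and frameworks for designing deep learning systems and deploying them to production. It provides a structured approach to model inference deployment, machine learning pipeline orchestration, and building production-grade machine learning architectures. The project stands out for its focus on distributed deep learning and edge AI optimization. It covers methods for parallelizing model training across multiple GPUs to handle large datasets, and applies techniques such as quantization and distillation to shrink models for embedded hardware. Its scope also extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. In addition, it addresses data workflow orchestration, dataset versioning via object storage, and the management of high-concurrency inference requests using adaptive batching and containerized orchestration.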
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
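exllamav2 is a high-performance inference library designed to run large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools, so model execution does not depend on cloud-based compute services. The project features a quantization utility that compresses models to mixed bitrates between 2 and 8 bits to reduce VRAM requirements. It distinguishes itself through a batched text generator that processes grouped requests and deduplicates cached data, improving throughput. The library covers a broad range of capabilities, including asynchronous token streaming for real-time output, custom GPU kernel execution for linear algebra operations, and local memory mapping for low-latency access to model weights.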
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
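exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer-grade GPUs. It provides a complete local model deployment system, including a dedicated inference engine and model quantization tools. The project features a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed a single device's memory capacity. It includes a GPU model quantizer that converts models to mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These features are backed by custom CUDA kernels and cache deduplication to optimize hardware utilization and reduce latency during token generation.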
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.