38 Repos
Techniques for grouping multiple small data operations into a single larger request to increase throughput.
Distinct from Obsolete Entry Clearing: The candidates focus on log inspection or cleanup; this is a performance optimization for processing multiple log entries together.
Explore 38 awesome GitHub repositories matching data & databases · Request Batching.
Hystrix is a latency and fault tolerance library designed to prevent cascading failures in distributed systems. It functions as a circuit breaker implementation that monitors failure thresholds and opens circuits to isolate remote calls when downstream services degrade. The project distinguishes itself by providing multiple isolation mechanisms, utilizing dedicated thread pools and semaphores to ensure that latency in one dependency does not saturate the entire system. It also features a request collapsing and batching engine that groups concurrent calls into single executions to reduce the t
Groups multiple concurrent calls into a single batch execution to reduce the total load on downstream systems.
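The collapsing idea described above can be sketched in a few lines. This is a minimal illustration in Python with asyncio, not Hystrix's Java API: the `Collapser` class, its `window_s` parameter, and the `fetch_many` batch function are all hypothetical names chosen for the example. Calls that arrive within the same short window are queued and then served by one batch execution.

```python
import asyncio


class Collapser:
    """Hypothetical helper: collects calls made within a short window
    and serves them all with a single batch execution."""

    def __init__(self, batch_fn, window_s=0.01):
        self.batch_fn = batch_fn    # assumed: async fn(list_of_keys) -> list_of_results
        self.window_s = window_s    # how long the first caller waits for others
        self._pending = []          # (key, future) pairs awaiting the next flush
        self._timer = None

    async def submit(self, key):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((key, fut))
        if self._timer is None:
            # The first call in a window schedules the flush for everyone.
            self._timer = loop.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.window_s)
        batch, self._pending = self._pending, []
        self._timer = None
        try:
            results = await self.batch_fn([key for key, _ in batch])
        except Exception as exc:
            # A failed batch fails every caller that was collapsed into it.
            for _, fut in batch:
                if not fut.done():
                    fut.set_exception(exc)
            return
        for (_, fut), result in zip(batch, results):
            if not fut.done():
                fut.set_result(result)


async def demo():
    calls = []  # records each downstream batch call

    async def fetch_many(keys):
        calls.append(list(keys))
        return [k * 2 for k in keys]

    collapser = Collapser(fetch_many)
    results = await asyncio.gather(*(collapser.submit(i) for i in range(5)))
    return results, calls
```

Running `asyncio.run(demo())` issues five concurrent `submit` calls, and the downstream function is invoked once with all five keys. The trade-off is that every caller waits up to `window_s` before its request is sent.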
FoundationDB is an ACID-compliant distributed transactional key-value store. It functions as a scalable database engine that ensures strict serializability and data consistency across a cluster of servers using a shared-nothing architecture. The system is distinguished by its multi-region replication capabilities, allowing data to be synchronized across different datacenters for high availability and disaster recovery. It utilizes optimistic concurrency control to manage distributed transactions and employs a majority-based coordination system to maintain cluster state. The platform provides
Groups multiple read requests into a single server call to reduce network overhead and improve throughput.
This project is a high-performance BERT embedding service and inference server designed to map text sequences to fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from compute-intensive tasks. The system uses a ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It integrates server-side batching and GPU workload scaling to maximize hardware utilization and handle high request volumes. The platform supports semantic search infrastructure by generating cross-modal embeddings for text and images in a shared vector space. This enables cross-modal search, content relevance ranking, and re-ranking of results based on the semantic alignment between visual content and text descriptions. The service can be deployed as an elastic microservice accessible over gRPC, HTTP, or WebSocket protocols, and offers non-blocking duplex streaming for handling large datasets.
Groups individual requests into optimized batches to maximize GPU throughput during inference.
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
Implements batching of inference requests to maximize GPU throughput and minimize computational overhead.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Processes multiple generation requests together in large batches to maximize throughput on a single GPU.
This project is an AI singing voice conversion system and vocal processor used for training generative voice models and converting vocal recordings or live input into a target voice. It functions as a VITS model trainer and a real-time voice changer that transforms vocal timbre and pitch to change the identity of a singer. The system provides a graphical management dashboard for controlling training hyperparameters and voice conversion presets. It supports low-latency audio streaming for live microphone input and employs pitch estimation to ensure precise matching between source and target vo
Implements grouping of multiple audio segments into single GPU execution passes to accelerate batch inference throughput.
This is a Raft consensus library and distributed consensus engine implemented in Go. It provides the primitives necessary to build fault-tolerant distributed services by implementing a replicated state machine that ensures a group of servers agree on a shared system state through leader election and log replication. The project distinguishes itself through a pluggable architecture for storage backends and snapshot storage, decoupling the consensus logic from physical persistence. It includes specialized mechanisms for leadership transfer, protocol version management to support rolling upgrade
Processes multiple committed log entries in a single operation to improve throughput and reduce system overhead.
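As a rough illustration of applying committed entries in batches, the sketch below is written in Python rather than Go and does not reproduce the library's API. `KVStateMachine`, `apply_batch`, `apply_committed`, and `max_batch` are hypothetical names. The `flushes` counter stands in for whatever per-operation cost (a lock acquisition, a disk sync) the batching is meant to amortize.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Hypothetical key-value state machine fed by a replicated log."""
    data: dict = field(default_factory=dict)
    flushes: int = 0  # counts batch applications, a stand-in for per-call overhead

    def apply_batch(self, entries):
        # Stage every write in the batch, then commit them in one step.
        staged = {}
        for key, value in entries:
            staged[key] = value
        self.data.update(staged)
        self.flushes += 1


def apply_committed(log, last_applied, commit_index, sm, max_batch=64):
    """Apply log[last_applied:commit_index] in chunks of at most max_batch.

    Returns the new last_applied index.
    """
    while last_applied < commit_index:
        end = min(last_applied + max_batch, commit_index)
        sm.apply_batch(log[last_applied:end])
        last_applied = end
    return last_applied
```

With ten entries in the log, a commit index of 7, and `max_batch=4`, the state machine is invoked twice (four entries, then three) instead of seven times.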
Yoga is a GraphQL server framework and runtime-agnostic HTTP handler used to build and deploy GraphQL APIs. It functions as a toolkit for managing schemas and resolvers, providing a spec-compliant environment for hosting APIs across diverse JavaScript runtimes, including Node.js, Deno, Bun, and serverless cloud environments. The project distinguishes itself through its ability to act as an Apollo Federation gateway, composing multiple subgraphs into a single unified supergraph. It also serves as a dedicated subscription server, delivering real-time data streaming via both WebSockets and Serve
Allows combining multiple GraphQL requests into a single network call to reduce overhead and round trips.
tensorrtx is a computer vision inference engine and model implementation library designed for graphics processor acceleration. It provides a framework for optimizing deep learning models through a GPU inference optimizer, a deep learning model converter for transforming weights from frameworks like TensorFlow and PyTorch, and a custom plugin library to implement operations not natively supported by the TensorRT API. The project distinguishes itself through a comprehensive collection of pre-defined network implementations, ranging from various YOLO versions and DETR transformers for object det
Implements dynamic batching for inference workloads to optimize the balance between throughput and latency.
gspread is a Python client library and API wrapper designed for programmatically interacting with Google Sheets. It serves as a spreadsheet automation library that enables the creation, organization, and management of cloud-based spreadsheets via Python scripts. The library provides a simplified interface for Google Sheets automation, allowing users to read, write, and update data without writing raw HTTP requests. It supports cloud spreadsheet integration, enabling external Python applications to use Google Sheets as a data storage layer. The project covers a broad range of capabilities inc
Implements request batching to group multiple data updates into single network calls for improved performance.
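A minimal sketch of the buffering pattern behind batched updates follows. It deliberately avoids calling gspread itself: `UpdateBuffer`, `send`, and `max_pending` are hypothetical names, and the payload shape (a list of dicts with `range` and `values` keys) is an assumption modeled on the style of batch update calls for spreadsheets. Check the library's own documentation before wiring `send` to a real worksheet.

```python
class UpdateBuffer:
    """Hypothetical buffer that coalesces cell updates into one request."""

    def __init__(self, send, max_pending=100):
        self.send = send                # assumed: callable taking a list of update dicts
        self.max_pending = max_pending  # flush automatically at this many ranges
        self._pending = {}              # range -> values, insertion-ordered

    def set(self, a1_range, values):
        # A later write to the same range replaces the earlier one,
        # so only the final value is sent.
        self._pending[a1_range] = values
        if len(self._pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Send all pending updates as one call; returns how many ranges were sent."""
        if not self._pending:
            return 0
        payload = [{"range": r, "values": v} for r, v in self._pending.items()]
        self._pending = {}
        self.send(payload)
        return len(payload)
```

Three `set` calls that touch two distinct ranges produce one outgoing request carrying two updates, rather than three separate network calls.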
Combines short requests into batches and splits long sequences across GPUs for balanced throughput.
Combines dynamic batching and concurrent execution to maximize hardware utilization during model serving.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Groups multiple prediction requests into a single batch to improve throughput on GPU and CPU runtimes.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Accumulates multiple prediction requests and processes them together to increase throughput.
OpenChat is a framework for training, fine-tuning, and deploying large language models optimized for conversational and mathematical reasoning tasks. It covers the full lifecycle of these models, from training pipelines and deployment stacks to a web-based chat interface. The project focuses on enabling high-performance model execution on consumer hardware without the need for enterprise accelerators. It includes a production-ready inference server that implements the OpenAI chat completion protocol and uses dynamic request batching to optimize hardware throughput. The system covers the entire operational workflow, including dataset tokenization and model fine-tuning via padding-free training and reinforcement learning. It extends this with API hosting with key-based authentication and a graphical user interface for real-time human interaction.
Uses dynamic request batching to group multiple API requests into a single inference pass for higher throughput.
orpc is a contract-first API development framework for TypeScript that starts with a shared contract definition and generates type-safe clients and servers from that single source of truth. It guarantees end-to-end type safety, meaning inputs, outputs, errors, and streaming data are all checked at compile time across the client–server boundary. What distinguishes orpc from typical RPC frameworks is its ability to export contracts as OpenAPI specifications, to optimize server-side rendering by calling API handlers directly inside the server process, and to support real‑time bidirectional commu
Groups multiple API requests into a single call to reduce network overhead and improve efficiency.
fastllm is a set of specialized software components for model weight conversion, Mixture-of-Experts runtimes, and tensor parallelism. It provides an OpenAI compatible API server to expose large language model capabilities through a standardized request format. The project features a tensor parallelism framework that splits computational workloads across multiple GPUs to accelerate execution. It includes a dedicated runtime optimized for Mixture-of-Experts architectures and a quantization tool to convert model weights into lower precision formats to reduce memory usage and increase throughput.
Groups multiple incoming requests into single execution passes to maximize GPU utilization and reduce token latency.
This project is an MLOps architecture guide and framework for designing and deploying deep learning systems in production environments. It offers a structured approach to deploying model inference, orchestrating ML pipelines, and building production-grade machine learning architectures. The project is distinguished by its focus on distributed deep learning and edge AI optimization. It covers methods for parallelizing model training across multiple GPUs to process large datasets, and applies techniques such as quantization and distillation to reduce model size for embedded hardware. Its feature surface extends to monitoring and observability, including tracking model performance, data drift, and experiment metrics. It also addresses data workflow orchestration, dataset versioning via object stores, and the management of high-volume inference requests using adaptive batching and container-based orchestration.
Implements adaptive batching to maximize GPU throughput while maintaining latency limits for model inference.
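One common way to bound latency while batching is to close a batch when it is full or when a deadline passes, whichever comes first. The sketch below shows that fixed-deadline form using only the Python standard library. It is not the project's implementation, and a truly adaptive scheme would also tune `max_size` or `max_wait_s` from observed load. `next_batch` and its parameters are hypothetical names.

```python
import queue
import time


def next_batch(q, max_size=8, max_wait_s=0.02):
    """Block for the first item, then keep filling the batch until it
    reaches max_size or max_wait_s has elapsed since the first item."""
    batch = [q.get()]  # wait indefinitely for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent: ship what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch
```

Under heavy load the batch fills immediately and is dispatched at `max_size`. Under light load a lone request waits at most `max_wait_s` before it is processed on its own.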
exllamav2 is a high-performance inference library built for running large language models locally on consumer GPUs. It provides a GPU-accelerated runner and quantization tools to enable model execution without depending on cloud computing services. The project includes a quantization utility that compresses models to mixed bitrates between two and eight bits to reduce VRAM requirements. It is distinguished by a batched text generator that processes grouped requests and deduplicates cache data to increase throughput. The library covers a broad range of functionality, including asynchronous token streaming for real-time output, custom GPU kernel execution for linear algebra operations, and local memory mapping for low-latency access to model weights.
Groups multiple model inference requests into a single hardware execution pass to maximize GPU throughput.
exllamav2 is a high-performance inference engine and framework for running large language models locally on consumer GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project includes a multi-GPU inference framework that distributes workloads across several graphics cards to run models that exceed the memory capacity of a single device. It contains a GPU model quantizer that can convert models to mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The engine supports high-throughput text generation through batch-based parallel inference and asynchronous output streaming. These features are backed by custom CUDA kernels and cache deduplication to optimize hardware utilization and reduce latency during token generation.
Executes multiple text completion prompts simultaneously using batch-based parallel inference to maximize GPU utilization.