2 Repos
Techniques and strategies for maximizing throughput and reducing latency in model serving environments.
Distinguishing note: Focuses on serving-level performance rather than model architecture.
Explore 2 awesome GitHub repositories matching devops & infrastructure · Inference Optimization. Refine with filters or upvote what's useful.
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Maximizes token generation rates using data-parallel attention and tensor parallelism.
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to
Implements continuous batching to maximize hardware utilization and reduce latency in production.