1 repo
Components that manage and prioritize incoming inference requests to optimize throughput and latency.
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen
Decouples request ingestion from the inference loop, so incoming traffic can be queued and prioritized independently of model execution, supporting high concurrency and low latency.
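To make the decoupling idea concrete, here is a minimal, generic sketch in Python. It is not vLLM's implementation: the `Scheduler` and `Request` classes, the `max_batch_size` parameter, and the priority values are all hypothetical names chosen for illustration. Producers push requests into an `asyncio.PriorityQueue`, and a separate consumer step pulls the most urgent requests into a batch.

```python
import asyncio
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    """Hypothetical request record; a lower priority value is served first."""
    priority: int
    seq: int  # arrival order, used as a tie-breaker (FIFO within a priority)
    prompt: str = field(compare=False)


class Scheduler:
    """Illustrative scheduler: ingestion and batch selection are separate calls."""

    def __init__(self, max_batch_size: int = 2) -> None:
        self._queue = asyncio.PriorityQueue()
        self._counter = itertools.count()
        self.max_batch_size = max_batch_size

    async def submit(self, prompt: str, priority: int = 0) -> None:
        # Ingestion side: enqueue and return without waiting for inference.
        await self._queue.put(Request(priority, next(self._counter), prompt))

    async def next_batch(self) -> list[Request]:
        # Inference side: wait for at least one request, then take whatever
        # else is already queued, up to max_batch_size.
        batch = [await self._queue.get()]
        while len(batch) < self.max_batch_size and not self._queue.empty():
            batch.append(self._queue.get_nowait())
        return batch


async def demo() -> list[list[str]]:
    sched = Scheduler(max_batch_size=2)
    await sched.submit("background summarization", priority=5)
    await sched.submit("interactive chat turn", priority=0)
    await sched.submit("batch embedding job", priority=5)
    batches = []
    for _ in range(2):
        batch = await sched.next_batch()
        batches.append([r.prompt for r in batch])
    return batches


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the interactive request is scheduled first even though it arrived second, and the two lower-priority requests keep their arrival order. A production scheduler would also need to account for things this example ignores, such as memory limits and request cancellation.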