1 repo
Techniques for combining multiple operations into single execution units to reduce overhead.
1 GitHub repository matching DevOps & Infrastructure · Kernel Fusion Strategies.
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen
Combines multiple operations into single GPU kernels to reduce memory overhead and improve computational throughput.
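As a rough illustration of the idea, the sketch below contrasts an unfused pipeline (three separate passes, each writing an intermediate buffer) with a fused one (a single pass with no intermediates). It is plain Python standing in for GPU kernels, not code from vLLM; the function names and the scale/shift/ReLU sequence are hypothetical and chosen only for the example.

```python
# Conceptual sketch only: each list comprehension stands in for one "kernel".
# Function names and the scale -> shift -> ReLU sequence are hypothetical.

def unfused_scale_shift_relu(xs, scale, shift):
    """Three passes over the data, materializing two intermediate buffers."""
    scaled = [x * scale for x in xs]         # pass 1: writes an intermediate
    shifted = [s + shift for s in scaled]    # pass 2: reads it, writes another
    return [max(v, 0.0) for v in shifted]    # pass 3: reads it, writes output

def fused_scale_shift_relu(xs, scale, shift):
    """One pass: every element is read once and written once."""
    return [max(x * scale + shift, 0.0) for x in xs]

xs = [-2.0, -0.5, 0.0, 1.5]
unfused = unfused_scale_shift_relu(xs, 2.0, 1.0)
fused = fused_scale_shift_relu(xs, 2.0, 1.0)
print(unfused == fused)  # both paths compute the same values
```

On a real GPU the saving comes from skipping the round trips to device memory for the intermediate buffers and from launching one kernel instead of three; the arithmetic itself is unchanged.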