2 repositories
Techniques for overlapping the transfer of model weights from host memory to GPU with active computation.
Distinct from GPU Tensor Mapping: Existing candidates for asynchronous loading are related to UI or data tables, not GPU tensor transfers for ML inference.
Explore 2 GitHub repositories matching Artificial Intelligence & ML · Asynchronous Tensor Loading.
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
Overlaps weight transfers from host memory to GPU with the computation of current model layers.
PowerInfer is a high-performance local large language model inference engine. It provides a runtime for executing models on consumer-grade hardware, with a GPU acceleration backend that optimizes tensor operations. The system distinguishes itself through a sparse inference framework that increases generation speed by exploiting activation sparsity to skip computations for inactive neurons. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API-compatible server for inte
Implements asynchronous loading of model weights to overlap data transfer with active GPU computation.
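
The overlap that both entries describe can be illustrated with a small, CPU-only sketch. It is not taken from FlexGen or PowerInfer: `load_weights` and `compute_layer` are hypothetical stand-ins that use `time.sleep` to simulate a host-to-GPU copy and a layer's kernel, and a background thread plays the role that a separate copy stream would play on a real GPU. The point is only the scheduling pattern: while layer i is being computed, the weights for layer i+1 are already in flight.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_weights(layer_idx):
    """Hypothetical stand-in for a host-to-GPU weight copy."""
    time.sleep(0.05)  # simulated transfer latency (assumption)
    return [float(layer_idx + 1)] * 4


def compute_layer(weights, activations):
    """Hypothetical stand-in for running one layer on the GPU."""
    time.sleep(0.05)  # simulated compute time (assumption)
    return [a * w for a, w in zip(activations, weights)]


def run_sequential(num_layers, activations):
    """Baseline: each layer waits for its own weights before computing."""
    for i in range(num_layers):
        weights = load_weights(i)
        activations = compute_layer(weights, activations)
    return activations


def run_overlapped(num_layers, activations):
    """Prefetch the next layer's weights while the current layer computes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_weights, 0)
        for i in range(num_layers):
            weights = pending.result()  # block until this layer's weights arrive
            if i + 1 < num_layers:
                # Start the next transfer before computing the current layer.
                pending = pool.submit(load_weights, i + 1)
            activations = compute_layer(weights, activations)
    return activations


if __name__ == "__main__":
    for fn in (run_sequential, run_overlapped):
        start = time.perf_counter()
        result = fn(4, [1.0] * 4)
        print(fn.__name__, result, f"{time.perf_counter() - start:.2f}s")
```

Both functions produce the same activations; the overlapped version hides most of the transfer time behind compute, so only the first load remains exposed. In a GPU runtime the same structure is commonly built from pinned host memory and asynchronous copies on a dedicated stream, but those details are outside this sketch.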