13 repositories
Specialized computational kernels designed to accelerate the token generation and decoding phases of large language models.
Distinguishing note: Focuses specifically on low-level kernel optimization for inference speed, distinct from general model training or high-level API wrappers.
Explore 13 awesome GitHub repositories matching artificial intelligence & ml · Inference Optimization Kernels.
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
Decodes tokens using optimized kernels that reduce processing delays during the autoregressive generation phase of highly compressed language models.
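To make the idea of packed integer arithmetic concrete, here is a minimal pure-Python sketch. It assumes ternary weights (-1, 0, +1) stored as 2-bit codes, four per byte; the function names and the packing layout are hypothetical illustrations and are not taken from the BitNet code base.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte.

    Assumed layout (illustrative only): code = weight + 1, lowest bits first.
    """
    codes = [w + 1 for w in weights]          # -1, 0, +1 -> 0, 1, 2
    codes += [1] * (-len(codes) % 4)          # pad with the code for weight 0
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed)


def ternary_dot(packed, activations):
    """Dot product against packed ternary weights using only adds and subtracts."""
    total = 0
    for i, x in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 2:      # weight +1
            total += x
        elif code == 0:    # weight -1
            total -= x
        # code == 1 is weight 0 and contributes nothing
    return total


weights = [1, -1, 0, 1, -1]
activations = [3, 5, 7, 2, 4]
packed = pack_ternary(weights)
result = ternary_dot(packed, activations)
```

Five weights fit in two bytes here, and the inner loop needs no multiplications. A real kernel would do the same work with vector instructions over many bytes at once.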
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
Registers and selects specialized compute kernels at runtime to optimize execution paths for inference.
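Registering and selecting kernels at runtime can be sketched with a small lookup table. This is an assumption-laden illustration in plain Python: the names `register_kernel` and `select_kernel` and the backend labels are hypothetical and do not correspond to PyTorch's actual dispatcher API.

```python
_KERNELS = {}  # (op name, backend label) -> callable


def register_kernel(op, backend):
    """Decorator that records an implementation under (op, backend)."""
    def decorator(fn):
        _KERNELS[(op, backend)] = fn
        return fn
    return decorator


@register_kernel("dot", "generic")
def dot_generic(a, b):
    return sum(x * y for x, y in zip(a, b))


@register_kernel("dot", "unrolled")
def dot_unrolled(a, b):
    # Hypothetical "specialized" variant: handles two elements per loop step.
    total = 0
    even = len(a) - len(a) % 2
    for i in range(0, even, 2):
        total += a[i] * b[i] + a[i + 1] * b[i + 1]
    if len(a) % 2:
        total += a[-1] * b[-1]
    return total


def select_kernel(op, preferred, fallback="generic"):
    """Return the preferred implementation if registered, else the fallback."""
    return _KERNELS.get((op, preferred)) or _KERNELS[(op, fallback)]


fast = select_kernel("dot", "unrolled")
missing = select_kernel("dot", "cuda")  # not registered, falls back to generic
```

The fallback path is the main design point: callers ask for the fastest variant they can use and still get a correct result when it is unavailable.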
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
Utilizes specialized computational kernels to maximize throughput and minimize latency during text generation.
Ktransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
Implements specialized computational kernels to accelerate token generation and decoding phases of large language models.
Mamba is a deep learning framework designed for building and training sequence models that process long-range data dependencies with linear-time computational efficiency. By utilizing selective state space modeling, the library enables the construction of neural network architectures that replace traditional attention mechanisms with high-performance state space operations. The framework distinguishes itself through the use of data-dependent state gating, which allows the model to dynamically filter information flow based on the input sequence. To ensure high throughput, it incorporates hardw
Includes optimized hardware-specific kernels for executing complex state space calculations during model training and inference.
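The data-dependent gating described above can be illustrated with a scalar recurrence that runs in time linear in the sequence length. This is a deliberately simplified sketch: the parameters `w_gate` and `decay` and the softplus step size are assumptions for illustration, not Mamba's actual parameterization or kernel.

```python
import math


def selective_scan(xs, w_gate=1.0, decay=1.0):
    """Scan a sequence once, keeping or overwriting state based on each input."""
    h = 0.0
    ys = []
    for x in xs:
        # Step size depends on the input itself (softplus keeps it positive).
        delta = math.log1p(math.exp(w_gate * x))
        keep = math.exp(-decay * delta)      # fraction of old state retained
        h = keep * h + (1.0 - keep) * x      # blend old state with new input
        ys.append(h)
    return ys


ys_const = selective_scan([2.0] * 50)
ys_zero = selective_scan([0.0, 0.0, 0.0])
```

Each step touches only the previous state, so cost grows linearly with sequence length rather than quadratically as in full attention.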
FlashMLA is an LLM attention kernel library and inference acceleration library providing a collection of high-performance CUDA kernels. It implements multi-head latent attention mechanisms designed to reduce memory overhead and increase throughput during the forward and backward passes of large language model inference. The library utilizes quantized cache attention kernels to improve computation efficiency across both sparse and dense token processing. It specifically optimizes the prefill and decoding phases of model inference through these latent attention implementations. The project cov
Improves speed and memory efficiency of LLM decoding and prefill stages using specialized kernels.
Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments. The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
Utilizes hand-optimized low-level compute kernels to accelerate transformer model inference operations.
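Continuous batching, mentioned in the description above, can be sketched as a scheduler that refills free batch slots after every decoding step. The function name, the request format, and the one-token-per-step model below are assumptions for illustration, not Text Generation Inference's implementation.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate step-by-step scheduling.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns the list of request ids that ran at each decoding step.
    """
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    schedule = []
    while waiting or running:
        # Admit waiting requests as soon as a slot is free.
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        schedule.append(sorted(running))
        # Every running request produces one token this step.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                del running[rid]
    return schedule


schedule = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch=2)
```

In this run, request "c" starts as soon as "a" finishes instead of waiting for "b", which is how the technique keeps the hardware busy.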
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
Uses advanced execution kernels to increase requests per second and process model data more efficiently.
Cactus is an on-device AI inference engine designed to run large language models (LLMs), vision models, and speech-to-text systems on mobile and wearable hardware. It provides a programmable tensor computation graph for defining sequences of matrix operations and activation functions, along with a local retrieval augmented generation (RAG) framework that grounds model responses in local text files. The project includes a cross-platform SDK with language bindings for integrating AI capabilities into mobile applications and a model conversion system that transforms external formats for optimized local execution. It uses a hybrid routing system to redirect workloads between on-device execution and cloud providers, depending on hardware capability. The engine covers a broad capability surface, including on-device audio processing for voice activity detection and transcription, embedding vector generation for similarity search, and tool integration for parsing model outputs into external function calls. These processes are supported by optimized native kernels tuned for low-latency performance on mobile hardware.
Utilizes native kernels tuned for low-latency, energy-efficient mathematical operations on mobile hardware.
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Ensures bitwise identical log-probability calculations by standardizing kernels and disabling non-deterministic optimizations.
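Bitwise reproducibility matters because floating-point addition is not associative, so a kernel that changes its reduction order can change the last bits of a log-probability. The sketch below, in plain Python with hypothetical helper names, fixes the summation order and compares raw bytes; it illustrates the principle only and is not the repository's method.

```python
import math
import struct


def log_softmax_fixed_order(logits):
    """Log-softmax whose reduction always runs left to right."""
    m = max(logits)
    total = 0.0
    for v in logits:                 # fixed order, no parallel reordering
        total += math.exp(v - m)
    lse = m + math.log(total)
    return [v - lse for v in logits]


def bits(x):
    """Raw IEEE-754 bytes of a float, for exact comparison."""
    return struct.pack("<d", x)


# Reordering a sum changes the result at the bit level.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

logits = [1.5, -0.25, 3.0, 0.0]
run_a = [bits(v) for v in log_softmax_fixed_order(logits)]
run_b = [bits(v) for v in log_softmax_fixed_order(logits)]
```

With the order pinned, `run_a` and `run_b` match byte for byte, whereas the two differently grouped sums do not.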
AutoGPTQ is a model compression toolkit and post-training quantization framework designed to reduce the memory footprint of large language models. It uses the GPTQ algorithm to compress neural network weights, reducing hardware requirements and VRAM usage. The project serves as an inference accelerator by providing optimized kernels that increase token generation speed. It features model architecture extensibility, allowing quantization capabilities to be added to new model structures through configurable templates. The framework covers a comprehensive quantization pipeline, including layer-wise weight compression, calibration-based scale estimation, and precision-specific memory mapping. It also includes systems for evaluating model performance to measure the impact of quantization on accuracy in language and summarization tasks.
Uses specialized computational kernels to accelerate the token generation and decoding phases of quantized LLMs.
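As a rough illustration of what a scale-based low-bit weight format looks like, the sketch below quantizes one group of weights to signed 4-bit integers with a single scale. Note that this is plain round-to-nearest with an absolute-maximum scale, a much simpler technique than GPTQ, and the function names are hypothetical rather than part of the AutoGPTQ API.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest symmetric quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0  # one scale per group
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [v * scale for v in q]


weights = [0.70, -0.32, 0.11, 0.0, -0.55]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error per weight stays within half a quantization step. GPTQ improves on this baseline by adjusting the remaining weights to compensate for each rounding decision.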
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Implements custom low-level kernels to accelerate the token generation and decoding phases.
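The paged memory system described above can be sketched as a block table that maps each sequence to fixed-size cache blocks allocated on demand. The class and method names below are hypothetical, and the sketch tracks block bookkeeping only, not the key and value tensors themselves; it is not tiny-llm's implementation.

```python
class PagedKVCache:
    """Bookkeeping for a key-value cache split into fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block id, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:         # current block full, or first token
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]
blocks_in_use = len(cache.tables["a"])
free_during = len(cache.free)
cache.release("a")
free_after = len(cache.free)
```

Because blocks need not be contiguous, a sequence wastes at most one partially filled block, which is how paging avoids fragmentation during token generation.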
ComfyUI-nunchaku is a 4-bit diffusion inference engine and a set of nodes for running low-precision quantized diffusion models within ComfyUI visual workflows. It provides a backend that reduces memory overhead and increases generation speed for transformer models. The project includes specialized tools for identity-preserving generation and an image-to-image guidance toolkit that uses depth maps and reference images. It also features a multimodal visual question answering implementation and a utility for merging multiple quantized model files into single unified files. The engine covers a b
Implements fused kernel projections and rotations to accelerate transformer model inference speed.