13 repositories
Specialized computational kernels designed to accelerate the token generation and decoding phases of large language models.
Distinguishing note: Focuses specifically on low-level kernel optimization for inference speed, distinct from general model training or high-level API wrappers.
Explore 13 GitHub repositories matching Artificial Intelligence & ML · Inference Optimization Kernels.
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
Decode tokens using optimized kernels that reduce processing delays during the autoregressive generation phase of highly compressed language models.
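As a rough illustration of what "packed integer arithmetic" on low-precision weights can look like, here is a minimal pure-Python sketch. It assumes ternary weights in {-1, 0, 1} stored as 2-bit codes, four per byte; the function names `pack_ternary` and `packed_dot` are hypothetical and this is not BitNet's actual kernel code, which relies on native processor instructions.

```python
def pack_ternary(weights):
    """Pack weights from {-1, 0, 1} into bytes, four 2-bit codes per byte."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0, or 1")
    codes = [w + 1 for w in weights]        # map -1, 0, 1 -> 0, 1, 2
    codes += [1] * (-len(codes) % 4)        # pad with code 1 (weight 0)
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)         # place each code in its 2-bit lane
        packed.append(byte)
    return bytes(packed)


def packed_dot(packed, activations):
    """Dot product of packed ternary weights with integer activations.

    Because every weight is -1, 0, or 1, no multiplications are needed:
    each activation is added, subtracted, or skipped.
    """
    acc = 0
    for i, x in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 2:
            acc += x
        elif code == 0:
            acc -= x
    return acc
```

Six weights fit in two bytes here, versus 24 bytes as 32-bit floats, which is the kind of memory reduction the description refers to.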
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
Registers and selects specialized compute kernels at runtime to optimize execution paths for inference.
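Runtime kernel registration and selection can be sketched as a small dispatch table. The snippet below is a generic, hypothetical illustration (the names `register_kernel`, `select_kernel`, and the backend labels are assumptions made for this example), not the PyTorch dispatcher or any API from the repository.

```python
_KERNELS = {}  # op name -> list of (priority, backend, function)


def register_kernel(op, backend, priority=0):
    """Decorator that records an implementation of `op` for a backend."""
    def decorator(fn):
        _KERNELS.setdefault(op, []).append((priority, backend, fn))
        return fn
    return decorator


def select_kernel(op, available_backends):
    """Return (backend, fn) for the highest-priority usable kernel."""
    candidates = [entry for entry in _KERNELS.get(op, [])
                  if entry[1] in available_backends]
    if not candidates:
        raise LookupError(f"no kernel registered for {op!r}")
    _, backend, fn = max(candidates, key=lambda entry: entry[0])
    return backend, fn


@register_kernel("matmul", "reference", priority=0)
def matmul_reference(a, b):
    # Straightforward triple loop expressed with comprehensions.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


@register_kernel("matmul", "transposed", priority=10)
def matmul_transposed(a, b):
    # Stand-in for an optimized path: transpose b once, then reuse it.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]
```

A caller would query `select_kernel("matmul", backends)` once at startup and reuse the returned function, so the choice of execution path costs nothing per call.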
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
Utilizes specialized computational kernels to maximize throughput and minimize latency during text generation.
Ktransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
Implements specialized computational kernels to accelerate token generation and decoding phases of large language models.
Mamba is a deep learning framework designed for building and training sequence models that process long-range data dependencies with linear-time computational efficiency. By utilizing selective state space modeling, the library enables the construction of neural network architectures that replace traditional attention mechanisms with high-performance state space operations. The framework distinguishes itself through the use of data-dependent state gating, which allows the model to dynamically filter information flow based on the input sequence. To ensure high throughput, it incorporates hardw
Includes optimized hardware-specific kernels for executing complex state space calculations during model training and inference.
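The idea of data-dependent state gating with linear-time cost can be shown with a toy scalar recurrence. This is only a sketch under simplifying assumptions: the sigmoid gate and the parameters `w_gate`, `w_in`, and `w_out` are hypothetical, and the code is not Mamba's actual parameterization or its hardware-aware scan.

```python
import math


def selective_scan(xs, w_gate=1.0, w_in=1.0, w_out=1.0):
    """Toy selective state space recurrence over a scalar sequence.

    h_t = a_t * h_{t-1} + (1 - a_t) * w_in * x_t
    y_t = w_out * h_t
    The decay a_t depends on the current input, so the model decides per
    step how much past state to keep. One pass over xs gives O(n) time.
    """
    h = 0.0
    ys = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-w_gate * x))  # gate in (0, 1)
        h = a * h + (1.0 - a) * w_in * x
        ys.append(w_out * h)
    return ys
```

Unlike attention, the state `h` has fixed size, so memory does not grow with sequence length during generation.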
FlashMLA is an LLM attention kernel library and inference acceleration library providing a collection of high-performance CUDA kernels. It implements multi-head latent attention mechanisms designed to reduce memory overhead and increase throughput during the forward and backward passes of large language model inference. The library utilizes quantized cache attention kernels to improve computation efficiency across both sparse and dense token processing. It specifically optimizes the prefill and decoding phases of model inference through these latent attention implementations. The project cov
Improves speed and memory efficiency of LLM decoding and prefill stages using specialized kernels.
Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments. The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
Utilizes hand-optimized low-level compute kernels to accelerate transformer model inference operations.
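The continuous batching mentioned in the description above can be illustrated with a small scheduling simulation. This is a hypothetical sketch (the function name and the request format are assumptions), not Text Generation Inference's scheduler: it only shows that a waiting request is admitted as soon as a running one finishes, instead of waiting for the whole batch to drain.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate decode steps for (request_id, tokens_to_generate) pairs.

    Returns one sorted list of active request ids per decode step.
    """
    waiting = deque(requests)
    running = {}      # request id -> tokens still to generate
    schedule = []
    while waiting or running:
        # Fill free slots before every step, not only between batches.
        while waiting and len(running) < max_batch_size:
            rid, remaining = waiting.popleft()
            if remaining > 0:
                running[rid] = remaining
        if not running:
            break
        schedule.append(sorted(running))
        # Each active request produces one token in this step.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
    return schedule
```

With requests needing 1, 3, and 2 tokens and a batch size of 2, the third request starts at step 2 and everything finishes in 3 steps; a static batch would need 3 steps for the first pair and 2 more for the last request.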
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
Uses advanced execution kernels to increase requests per second and process model data more efficiently.
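Cactus is an on-device AI inference engine designed to run large language models, vision models, and speech-to-text systems on mobile and wearable hardware. It provides a programmable tensor computation graph for defining sequences of matrix operations and activation functions, along with a local retrieval-augmented generation (RAG) framework that grounds model responses in local text files. The project includes a multi-platform SDK with language bindings for integrating AI features into mobile applications, and a model conversion system that converts external model formats for optimized local execution. It uses a hybrid routing system that redirects workloads between on-device execution and cloud providers based on hardware capacity. The engine covers a broad range of capabilities, including on-device audio processing for voice activity detection and transcription, vector embedding generation for similarity search, and tool integration that parses model outputs into external function calls. These processes are powered by native kernels optimized for low-latency performance on mobile hardware.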
Utilizes native kernels tuned for low-latency, energy-efficient mathematical operations on mobile hardware.
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Ensures bitwise identical log-probability calculations by standardizing kernels and disabling non-deterministic optimizations.
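Why kernel choice affects bitwise reproducibility comes down to floating-point addition not being associative: a kernel that reduces in a different order can produce a slightly different sum. The sketch below is a generic illustration, not code from the repository, and `log_softmax_fixed_order` is a hypothetical name; it pins the reduction to a strict left-to-right order so repeated runs on the same input agree exactly.

```python
import math


def log_softmax_fixed_order(logits):
    """Log-softmax with a strictly left-to-right reduction.

    Fixing the accumulation order removes one source of run-to-run
    variation that parallel or reordered reductions can introduce.
    """
    m = max(logits)                 # subtract the max for numerical stability
    total = 0.0
    for x in logits:                # fixed, sequential accumulation order
        total += math.exp(x - m)
    log_z = m + math.log(total)
    return [x - log_z for x in logits]


# Same three numbers, different grouping, different floating-point result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
```

Here `left_to_right` and `right_to_left` differ in the last bit, which is exactly the kind of discrepancy that standardizing kernels is meant to rule out.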
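AutoGPTQ is a model compression toolkit and post-training quantization framework designed to reduce the memory footprint of large language models. It uses the GPTQ algorithm to compress neural network weights, lowering hardware requirements and reducing VRAM usage. The project also acts as an inference accelerator by providing optimized kernels that increase token generation speed. It offers model architecture extensibility, allowing quantization support to be added to new model structures through configurable schemas. The framework covers a complete quantization pipeline, including layer-wise weight compression, calibration-based scale estimation, and precision-specific memory mapping. It also includes systems for model performance evaluation that measure the impact of quantization on accuracy in language and summarization tasks.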
Uses specialized computational kernels to accelerate the token generation and decoding phases of quantized LLMs.
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Implements custom low-level kernels to accelerate the token generation and decoding phases.
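The paged key-value cache described above can be sketched as a block table that maps each sequence to fixed-size physical blocks. The class below is a hypothetical illustration of the bookkeeping only (the name `PagedKVCache` and its methods are assumptions, not tiny-llm's API), and it stores no actual tensors.

```python
class PagedKVCache:
    """Track which fixed-size blocks hold each sequence's KV entries."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (block_id, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # All allocated blocks are full, so take one from the free pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated one at a time and need not be contiguous, a sequence wastes at most one partially filled block, which is how paging avoids the fragmentation of reserving a maximum-length buffer per request.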
ComfyUI-nunchaku is a 4-bit diffusion inference engine and a set of nodes for running low-precision quantized diffusion models within ComfyUI visual workflows. It provides a backend that reduces memory overhead and increases generation speed for transformer models. The project includes specialized tools for identity-preserving generation and an image-to-image guidance toolkit that uses depth maps and reference images. It also features a multimodal visual question answering implementation and a utility for merging multiple quantized model files into single unified files. The engine covers a b
Implements fused kernel projections and rotations to accelerate transformer model inference speed.