13 repositories

Awesome GitHub Repositories: Inference Optimization Kernels

Specialized computational kernels designed to accelerate the token generation and decoding phases of large language models.

Distinguishing note: Focuses specifically on low-level kernel optimization for inference speed, distinct from general model training or high-level API wrappers.

13 GitHub repositories in Artificial Intelligence & ML · Inference Optimization Kernels.


microsoft/BitNet
39,327 stars
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
Decodes tokens with optimized kernels that reduce latency during the autoregressive generation phase of highly compressed language models.
Python
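
As a rough illustration of the packed integer arithmetic mentioned in the BitNet description, the sketch below stores ternary weights (-1, 0, 1) as 2-bit codes, four per byte, and computes a dot product directly from the packed bytes. The encoding table and the function names `pack_ternary` and `packed_dot` are hypothetical and chosen for this example; they are not BitNet's actual storage format or API.

```python
# Illustrative only: this 2-bit layout is an assumption, not BitNet's real format.
CODES = {-1: 0b00, 0: 0b01, 1: 0b10}          # hypothetical encoding of ternary weights
VALUES = {code: w for w, code in CODES.items()}


def pack_ternary(weights):
    """Pack ternary weights into bytes, four 2-bit codes per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= CODES[w] << (2 * (i % 4))
    return bytes(packed)


def packed_dot(packed, activations):
    """Dot product of packed ternary weights with integer activations.

    Only len(activations) weights are read, so padding bits in the
    last byte are never touched.
    """
    total = 0
    for i, a in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        total += VALUES[code] * a              # reduces to add, subtract, or skip
    return total


weights = [1, -1, 0, 1, 0, -1]
activations = [3, 5, 7, 2, 9, 4]
packed = pack_ternary(weights)
result = packed_dot(packed, activations)
print(len(packed), result)
```

Six weights fit in two bytes here. A production kernel would typically process many packed values per instruction using vector hardware rather than looping in Python.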

pytorch/examples
23,752 stars
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
Registers and selects specialized compute kernels at runtime to optimize execution paths for inference.
Python

liguodongiot/llm-action
23,169 stars
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
Utilizes specialized computational kernels to maximize throughput and minimize latency during text generation.
HTML · llm · llm-inference · llm-serving

kvcache-ai/ktransformers
17,288 stars
Ktransformers is a comprehensive framework designed for the operation, fine-tuning, and serving of large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to bypass local memory constraints, making it possible to run and train models that exceed the capacity of a single device. The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
Implements specialized computational kernels to accelerate token generation and decoding phases of large language models.
Python

state-spaces/mamba
17,215 stars
Mamba is a deep learning framework designed for building and training sequence models that process long-range data dependencies with linear-time computational efficiency. By utilizing selective state space modeling, the library enables the construction of neural network architectures that replace traditional attention mechanisms with high-performance state space operations. The framework distinguishes itself through the use of data-dependent state gating, which allows the model to dynamically filter information flow based on the input sequence. To ensure high throughput, it incorporates hardw
Includes optimized hardware-specific kernels for executing complex state space calculations during model training and inference.
Python

deepseek-ai/FlashMLA
12,706 stars
FlashMLA is an LLM attention kernel library and inference acceleration library providing a collection of high-performance CUDA kernels. It implements multi-head latent attention mechanisms designed to reduce memory overhead and increase throughput during the forward and backward passes of large language model inference. The library utilizes quantized cache attention kernels to improve computation efficiency across both sparse and dense token processing. It specifically optimizes the prefill and decoding phases of model inference through these latent attention implementations. The project cov
Improves speed and memory efficiency of LLM decoding and prefill stages using specialized kernels.
C++

huggingface/text-generation-inference
10,775 stars
Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments. The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
Utilizes hand-optimized low-level compute kernels to accelerate transformer model inference operations.
Python · bloom · deep-learning · falcon

InternLM/lmdeploy
7,903 stars
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
Uses advanced execution kernels to increase requests per second and process model data more efficiently.
Python · codellama · cuda-kernels · deepspeed

cactus-compute/cactus
5,363 stars
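Cactus is an on-device AI inference engine designed to run large language models, vision models, and speech-to-text systems on mobile and wearable hardware. It provides a programmable tensor computation graph for defining sequences of matrix operations and activation functions, and a local retrieval-augmented generation (RAG) framework that grounds model responses in local text files. The project has a multi-platform SDK with language bindings for integrating AI features into mobile applications, and a model conversion system that converts external model formats for optimized local execution. It uses a hybrid routing system that redirects workloads between on-device execution and cloud providers based on hardware capacity. The engine covers a broad range of capabilities, including on-device audio processing for voice activity detection and transcription, vector embedding generation for similarity search, and tool integration for parsing model outputs into external function calls. These processes are powered by native kernels optimized for low-latency performance on mobile hardware.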
Utilizes native kernels tuned for low-latency, energy-efficient mathematical operations on mobile hardware.
C++ · ai · android · arm

zhaochenyang20/Awesome-ML-SYS-Tutorial
5,371 stars
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Ensures bitwise identical log-probability calculations by standardizing kernels and disabling non-deterministic optimizations.
Python
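
One reason bitwise identical results require standardized kernels is that floating-point addition is not associative, so two kernels that reduce the same values in a different order can disagree in the last bits. The snippet below is a minimal demonstration of that general fact in plain Python; it is not taken from the tutorial repository, and the helper `fixed_order_sum` is a hypothetical name used only here.

```python
values = [0.1, 0.2, 0.3]

# Same numbers, two reduction orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)   # the two orders do not match bit for bit


def fixed_order_sum(xs):
    """Reduce strictly left to right so every run uses the same order."""
    total = 0.0
    for x in xs:
        total += x
    return total


# A fixed reduction order gives the same bits on every call.
print(fixed_order_sum(values) == fixed_order_sum(values))
```

Parallel reductions on accelerators may change their summation order between runs or batch sizes, which is why pinning the kernel and its reduction strategy matters when exact reproducibility is the goal.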

AutoGPTQ/AutoGPTQ
5,070 stars
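AutoGPTQ is a model compression toolkit and post-training quantization framework designed to reduce the memory footprint of large language models. It uses the GPTQ algorithm to compress neural network weights, lowering hardware requirements and reducing VRAM usage. The project also acts as an inference accelerator by providing optimized kernels that increase token generation speed. It offers model architecture extensibility, allowing quantization support to be added to new model structures through configurable patterns. The framework covers a complete quantization pipeline, including layer-wise weight compression, calibration-based scale estimation, and precision-specific memory mapping. It also includes systems for model performance evaluation that measure the impact of quantization on accuracy in language and summarization tasks.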
Uses specialized computational kernels to accelerate the token generation and decoding phases of quantized LLMs.
Python

skyzh/tiny-llm
4,304 stars
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
Implements custom low-level kernels to accelerate the token generation and decoding phases.
Python · course · large-language-model · llm
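
The paged key-value cache idea in the tiny-llm description can be sketched as a pool of fixed-size blocks plus a per-sequence block table. The class below is a toy model of that bookkeeping only, written under the assumption of a simple free list; `PagedKVCache`, `append_token`, and `release` are hypothetical names, not tiny-llm's API, and no tensors are stored.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size blocks handed out from a shared free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one token and return (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:    # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]
slots_b = [cache.append_token("b") for _ in range(2)]
print(slots_a, slots_b)

cache.release("a")
print(len(cache.free_blocks))
```

Because every sequence grows one block at a time, a sequence never needs a large contiguous region, and blocks freed by a finished sequence can be reused by any other.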

nunchaku-ai/ComfyUI-nunchaku
2,901 stars
ComfyUI-nunchaku is a 4-bit diffusion inference engine and a set of nodes for running low-precision quantized diffusion models within ComfyUI visual workflows. It provides a backend that reduces memory overhead and increases generation speed for transformer models. The project includes specialized tools for identity-preserving generation and an image-to-image guidance toolkit that uses depth maps and reference images. It also features a multimodal visual question answering implementation and a utility for merging multiple quantized model files into single unified files. The engine covers a b
Implements fused kernel projections and rotations to accelerate transformer model inference speed.
Python · comfyui · diffusion · flux
