3 Repos
High-performance computational kernels for accelerating neural network operations.
Distinguishing note: Focuses on custom mathematical kernels, distinct from general model parallelism.
Explore 3 awesome GitHub repositories matching artificial intelligence & ml · Kernel Optimization Libraries. Refine with filters or upvote what's useful.
ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput. The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale genera
Replaces standard operations with custom high-performance kernels to accelerate mathematical calculations.
FlashAttention is an attention mechanism optimization library and machine learning acceleration framework designed to increase training speed and reduce memory footprint for large-scale neural network models. It functions as a collection of low-level CUDA kernels that optimize memory-bound operations to improve hardware utilization on graphics processing units. The library distinguishes itself through an input-output-aware algorithm design that minimizes data movement between different levels of memory. By employing kernel fusion and tiled matrix multiplication, it combines sequential operati
Provides a collection of low-level CUDA kernels that optimize memory-bound operations for deep learning.
This project is a distributed training infrastructure designed for aligning large language models through reinforcement learning. It functions as an end-to-end engine for complex alignment tasks, including proximal policy optimization, direct preference optimization, and iterative self-play. By providing a unified framework for multi-turn interactions and tool-use scenarios, it enables the development of models capable of reasoning and external environment engagement. The framework distinguishes itself through a decoupled architecture that separates model training from sample generation. This
Integrates high-performance computational kernels for accelerating neural network operations.