LeetCUDA | Awesome Repository

LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows.

The project focuses on low-level hardware optimizations: tensor cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shared memory swizzling to avoid bank conflicts. It also includes FlashAttention and FlashAttention-2 implementations that use fine-grained tiling and low-level assembly.
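To make the bank-conflict point concrete, here is a minimal CPU-side model in Python, not code from the repository. It assumes a 32x32 tile of 4-byte words and 32 shared memory banks, and it uses an XOR swizzle (physical column = col XOR row), which is one common scheme; the function names are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
TILE = 32       # assumption: a 32x32 tile of 4-byte elements


def bank_plain(row, col):
    """Bank hit by element (row, col) in a plain row-major layout."""
    return (row * TILE + col) % NUM_BANKS


def bank_swizzled(row, col):
    """Bank hit when the column index is XOR-swizzled with the row index."""
    return (row * TILE + (col ^ row)) % NUM_BANKS


def max_conflict(bank_fn, col):
    """Largest number of lanes hitting one bank when lane r reads (r, col)."""
    hits = [0] * NUM_BANKS
    for row in range(TILE):
        hits[bank_fn(row, col)] += 1
    return max(hits)


if __name__ == "__main__":
    print("plain layout, column read:   ", max_conflict(bank_plain, 5))
    print("swizzled layout, column read:", max_conflict(bank_swizzled, 5))
```

In this model a column read in the plain layout sends all 32 lanes to the same bank, while the swizzled layout spreads them over 32 distinct banks. Row reads stay conflict-free in both layouts because XOR with a fixed row is a bijection on the column indices.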

The library covers a broad surface of GPU primitives, including a variety of activation functions, normalization layers, and matrix operations. It also implements parallel patterns such as warp-level reductions, dot products, and matrix transpositions. For computer vision tasks, it includes a GPU-accelerated implementation of non-maximum suppression.
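The warp-level reduction pattern mentioned above can be sketched on the CPU as follows. This is a Python model rather than the repository's kernels: it assumes a warp of 32 lanes and imitates the behavior of a shuffle-down exchange, where lane i reads the value held by lane i + offset and out-of-range lanes keep their own value. All names are hypothetical.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp


def shfl_down(values, offset):
    """Model of a shuffle-down: lane i receives the value of lane i + offset.

    Lanes whose source would fall outside the warp keep their own value.
    """
    return [
        values[i + offset] if i + offset < WARP_SIZE else values[i]
        for i in range(WARP_SIZE)
    ]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


def warp_dot(a, b):
    """Dot product: each lane multiplies one pair, then the warp reduces."""
    return warp_reduce_sum([x * y for x, y in zip(a, b)])


if __name__ == "__main__":
    print(warp_reduce_sum(range(WARP_SIZE)))
    print(warp_dot([1] * WARP_SIZE, list(range(WARP_SIZE))))
```

The reduction finishes in log2(32) = 5 exchange steps and needs no shared memory, which is the main appeal of the pattern on real hardware.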

The repository also includes tools for checking kernel performance through assembly inspection, kernel profiling, and throughput benchmarking against standard libraries.

Features

GPU Kernel Implementations - Provides a comprehensive collection of custom-written CUDA kernels for accelerated parallel computing and neural network operations.
Activation Functions - Ships a suite of optimized CUDA kernels for element-wise activation functions like ReLU, GELU, and Sigmoid.
Attention Kernel Libraries - Provides optimized CUDA kernels for scaled dot-product attention using tensor cores and fine-grained tiling.
FlashAttention - Implements FlashAttention using low-level assembly and fine-grained tiling to reduce memory traffic.
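The tiling idea behind the attention entries above can be illustrated with a small Python sketch of the online softmax recurrence for a single query row. It is a simplified CPU illustration under stated assumptions, not the repository's implementation: the function names, the tile size, and the list-of-lists layout are all hypothetical, and no tensor cores or assembly are involved.

```python
import math


def attention_row_tiled(q, keys, values, tile=2):
    """One row of softmax(q K^T / sqrt(d)) V, visiting keys tile by tile."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")              # running maximum of the scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = [0.0] * len(values[0])   # running weighted sum of value rows
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new maximum.
        correction = math.exp(m - m_new)  # 0.0 on the first tile
        l *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - m_new)
            l += p
            acc = [x + p * vi for x, vi in zip(acc, v)]
        m = m_new
    return [x / l for x in acc]


def attention_row_reference(q, keys, values):
    """Same result computed with the full score vector held at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    dim = len(values[0])
    return [sum(pi * v[j] for pi, v in zip(p, values)) / total for j in range(dim)]
```

Because only the running maximum, the running sum, and the accumulator are carried between tiles, the full score row never has to be stored, which is what lets a tiled kernel keep its working set in fast on-chip memory.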
