LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows.
The project is distinguished by its focus on low-level hardware optimization: Tensor Cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shared memory swizzling to avoid bank conflicts. It also includes advanced attention implementations, such as FlashAttention and FlashAttention-2, built on fine-grained tiling and low-level assembly.
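To make the swizzling idea concrete, here is a minimal sketch of a tiled matrix transpose whose shared-memory column index is XORed with the row index. It is not taken from the LeetCUDA source: the names `transpose_swizzled`, `transpose_on_gpu`, and `TILE` are hypothetical, the matrix is assumed to be square with a side that is a multiple of 32, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile size: one warp per tile row

// Transpose an n x n row-major matrix (n assumed to be a multiple of TILE).
// Logical tile element (r, c) is stored at physical column (c ^ r), so both
// the row-wise store and the column-wise load touch 32 distinct banks.
__global__ void transpose_swizzled(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE];
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Coalesced read from global memory, swizzled store into shared memory.
    const int x = blockIdx.x * TILE + tx;
    const int y = blockIdx.y * TILE + ty;
    tile[ty][tx ^ ty] = in[y * n + x];
    __syncthreads();

    // Read logical element (tx, ty). Without the XOR, every lane of a warp
    // would hit bank `ty`, which is a 32-way bank conflict.
    const int ox = blockIdx.y * TILE + tx;
    const int oy = blockIdx.x * TILE + ty;
    out[oy * n + ox] = tile[tx][ty ^ tx];
}

// Hypothetical host wrapper: copies the input to the device, launches the
// kernel, and returns the transposed matrix.
std::vector<float> transpose_on_gpu(const std::vector<float>& h_in, int n) {
    const size_t bytes = h_in.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid(n / TILE, n / TILE);
    transpose_swizzled<<<grid, block>>>(d_in, d_out, n);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The more common alternative is to pad the tile to `TILE + 1` columns. XOR swizzling gets the same conflict-free access without spending extra shared memory, which matters when tiles are sized to fit Tensor Core fragments.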
The library covers a broad range of GPU primitives, including activation functions, normalization layers, and matrix operations. It also implements common parallel patterns such as warp-level reductions, dot products, and matrix transposes. For computer vision workloads, it includes a GPU-accelerated implementation of non-maximum suppression.
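As an illustration of the warp-level reduction pattern mentioned above, the following sketch sums an array using `__shfl_down_sync`. It is an assumption-laden example rather than the repository's kernel: `warp_reduce_sum`, `sum_kernel`, and `gpu_sum` are hypothetical names, the block size is assumed to be a multiple of 32, and the launch configuration is arbitrary.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int WARP_SIZE = 32;

// Tree reduction inside one warp: each step adds the value held by the lane
// `offset` positions higher. After the loop, lane 0 holds the warp's sum.
__device__ float warp_reduce_sum(float v) {
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;
}

// Each thread accumulates a grid-strided slice, each warp reduces its lanes
// through registers, and lane 0 of every warp adds its partial sum to the
// global result with an atomic.
__global__ void sum_kernel(const float* x, float* result, int n) {
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc += x[i];
    }
    acc = warp_reduce_sum(acc);
    if (threadIdx.x % WARP_SIZE == 0) {
        atomicAdd(result, acc);
    }
}

// Hypothetical host wrapper; expects a non-empty input vector.
float gpu_sum(const std::vector<float>& h_x) {
    const int n = static_cast<int>(h_x.size());
    float* d_x = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    const int threads = 256;  // multiple of WARP_SIZE so every warp is full
    const int blocks = 4;     // arbitrary; the grid-stride loop covers any n
    sum_kernel<<<blocks, threads>>>(d_x, d_result, n);

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_result);
    return h_result;
}
```

Because the shuffle exchanges values through registers, the warp-level step needs no shared memory and no `__syncthreads()`. A dot product follows the same shape, with the per-thread accumulation replaced by `acc += a[i] * b[i]`.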
The repository includes tools for checking kernel performance through assembly inspection, kernel profiling, and throughput benchmarking against standard libraries.