# xlite-dev/leetcuda

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/xlite-dev-leetcuda).**

9,694 stars · 962 forks · Cuda · gpl-3.0

## Links

- GitHub: https://github.com/xlite-dev/LeetCUDA
- Homepage: https://github.com/xlite-dev/LeetCUDA
- awesome-repositories: https://awesome-repositories.com/repository/xlite-dev-leetcuda.md

## Topics

`cuda` `cuda-12` `cuda-cpp` `cuda-demo` `cuda-kernel` `cuda-kernels` `cuda-library` `cuda-toolkit` `flash-attention` `hgemm` `learn-cuda` `leet-cuda`

## Description

LeetCUDA is a collection of high-performance GPU kernel libraries focusing on memory optimization, activation functions, and attention mechanisms. It serves as a reference library for CUDA kernel implementations, ranging from basic element-wise operations to complex neural network components, and provides Python bindings to integrate these kernels into deep learning workflows.

The project stands out for its low-level hardware optimizations: tensor cores for half-precision matrix multiplication, asynchronous data pipelining with double buffering, and shared memory swizzling to avoid bank conflicts. It also includes FlashAttention and FlashAttention-2 implementations built on fine-grained tiling and low-level assembly.
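To make the bank-conflict point concrete, here is a minimal host-side model in Python. It is not taken from the repository. It assumes 32 shared memory banks of 4 bytes each and a 32x32 tile of 4-byte elements, and it uses a generic XOR swizzle (`col ^ row`); the function names are hypothetical and the kernels' actual layouts may differ.

```python
NUM_BANKS = 32  # assumption: 32 banks, each 4 bytes wide
TILE = 32       # assumption: 32x32 tile of 4-byte elements


def bank_plain(row, col):
    """Bank holding element (row, col) of a plain row-major tile."""
    return (row * TILE + col) % NUM_BANKS


def bank_swizzled(row, col):
    """Bank after XOR-ing the column index with the row index."""
    return (row * TILE + (col ^ row)) % NUM_BANKS


def max_conflict(bank_fn, col):
    """Worst number of lanes hitting one bank when lane t reads (t, col)."""
    hits = [0] * NUM_BANKS
    for lane in range(32):
        hits[bank_fn(lane, col)] += 1
    return max(hits)


# Column access: every lane of a warp reads the same column of a different row.
plain_worst = max(max_conflict(bank_plain, c) for c in range(TILE))
swizzled_worst = max(max_conflict(bank_swizzled, c) for c in range(TILE))
print(plain_worst, swizzled_worst)
```

In this model the plain layout sends all 32 lanes of a column access to the same bank, while the swizzled layout spreads them over 32 distinct banks.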

The library covers a broad surface of GPU primitives, including a variety of activation functions, normalization layers, and matrix operations. It also implements parallel patterns such as warp-level reductions, dot products, and matrix transpositions. For computer vision tasks, it includes a GPU-accelerated implementation of non-maximum suppression.
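As an illustration of the warp-level reduction pattern, the sketch below models a butterfly (XOR shuffle) reduction in plain Python. It is a host-side model, not the repository's kernel code: `shfl_xor` is a hypothetical stand-in for a warp shuffle instruction, and a warp size of 32 lanes is assumed.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp


def shfl_xor(values, mask):
    """Model of an XOR shuffle: lane i receives the value held by lane i ^ mask."""
    return [values[lane ^ mask] for lane in range(len(values))]


def warp_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 steps every lane holds the sum."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    mask = WARP_SIZE // 2
    while mask > 0:
        partner = shfl_xor(vals, mask)
        vals = [a + b for a, b in zip(vals, partner)]
        mask //= 2
    return vals


lanes = warp_reduce_sum(list(range(WARP_SIZE)))
print(lanes[0])
```

Each step halves the exchange distance, so a 32-lane sum takes five exchange-and-add steps and needs no shared memory in this model.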

The repository includes tools for verifying hardware performance through assembly inspection, kernel profiling, and throughput benchmarking against standard libraries.

## Tags

### Artificial Intelligence & ML

- [GPU Kernel Implementations](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations.md) — Provides a comprehensive collection of custom-written CUDA kernels for accelerated parallel computing and neural network operations.
- [Activation Functions](https://awesome-repositories.com/f/artificial-intelligence-ml/activation-functions.md) — Ships a suite of optimized CUDA kernels for element-wise activation functions like ReLU, GELU, and Sigmoid. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/elu/elu.cu))
- [Attention Kernel Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-kernel-libraries.md) — Provides optimized CUDA kernels for scaled dot-product attention using tensor cores and fine-grained tiling.
- [FlashAttention](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/flashattention.md) — Implements FlashAttention using low-level assembly and fine-grained tiling for extreme memory optimization. ([source](https://github.com/xlite-dev/LeetCUDA))
- [FlashAttention-2](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/flashattention-2.md) — Implements FlashAttention-2 utilizing tensor cores and hardware instructions for high-dimensional tensor operations. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/flash-attn/mma/basic/flash_attn_mma_split_kv.cu))
- [Half-Precision Matrix Multiplications](https://awesome-repositories.com/f/artificial-intelligence-ml/half-precision-inference/half-precision-matrix-multiplications.md) — Provides half-precision matrix multiplication utilizing shared memory swizzling and asynchronous copy instructions for maximum throughput. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/hgemm))
- [Normalization Layers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/model-construction/neural-network-layers/normalization-layers.md) — Implements layer normalization to standardize vectors and stabilize training in neural networks. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/openai-triton/layer-norm))
- [Python Bindings](https://awesome-repositories.com/f/artificial-intelligence-ml/pytorch-backends/python-bindings.md) — Provides Python bindings that wrap custom CUDA kernels for use within PyTorch deep learning workflows. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/README.md))
- [Attention State Merging](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/attention-state-merging.md) — Implements attention state merging for split-KV scenarios to reduce inference latency. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/openai-triton/merge-attn-states))
- [Activation Functions](https://awesome-repositories.com/f/artificial-intelligence-ml/convolutional-neural-networks/activation-functions.md) — Implements the Gaussian Error Linear Unit (GELU) activation for multiple precision types with vectorized memory access. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/gelu))
- [Python Bindings](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-libraries/cuda-accelerated-libraries/python-bindings.md) — Wraps custom CUDA kernels into Python bindings for seamless integration with PyTorch deep learning workflows.
- [Embedding Lookup Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/embedding-lookup-kernels.md) — Implements GPU kernels for embedding lookups at multiple precision levels with vectorized memory access. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/embedding))
- [Triton Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/triton-kernels.md) — Implements fused operations and attention state merging using the Triton domain-specific language. ([source](https://github.com/xlite-dev/LeetCUDA))
- [Normalization Gradient Computations](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-computation/normalization-gradient-computations.md) — Computes vector-jacobian products for weights and biases in layer normalization using parallel reduction. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/openai-triton/layer-norm))
- [Half-Precision Matrix-Vector Operations](https://awesome-repositories.com/f/artificial-intelligence-ml/half-precision-inference/half-precision-matrix-vector-operations.md) — Executes half-precision matrix-vector multiplication using tiling strategies and hardware acceleration. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/hgemv))
- [Online Softmax Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/online-softmax-kernels.md) — Implements softmax on the GPU using the online technique, which tracks the running maximum and normalizer in a single pass. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/softmax))
- [Rotary Positional Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/positional-embedding-techniques/rotary-positional-embeddings.md) — Computes rotary positional embeddings to support relative token positioning in language model sequences. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/rope))
- [RMS Normalizations](https://awesome-repositories.com/f/artificial-intelligence-ml/rms-normalizations.md) — Provides root mean square layer normalization across various floating-point precisions and vectorization strategies. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/rms-norm))
- [Tensor Reductions](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-reductions.md) — Performs sum reductions across warps and blocks for floating-point and integer data types. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/reduce))
- [Warp-Level Reductions](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-reductions/warp-level-reductions.md) — Implements warp-level reductions using shuffle instructions to aggregate values across threads.
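
The attention state merging entry in the list above can be illustrated with a small reference model. The sketch below is an assumption-laden illustration, not the repository's Triton or CUDA code: it uses scalar values instead of vectors, and the helper names `attend` and `merge_states` are hypothetical. It shows the usual log-sum-exp merge of two partial attention results computed over disjoint key/value chunks.

```python
import math


def attend(scores, values):
    """Softmax-weighted average of scalar values, plus the log-sum-exp of scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)


def merge_states(o1, lse1, o2, lse2):
    """Combine two partial results as if attention had run over both chunks."""
    m = max(lse1, lse2)
    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
    out = (w1 * o1 + w2 * o2) / (w1 + w2)
    return out, m + math.log(w1 + w2)


scores = [0.5, -1.0, 2.0, 0.1, 1.5, -0.3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

full_out, full_lse = attend(scores, values)        # attention over all keys
o1, l1 = attend(scores[:3], values[:3])            # first KV chunk
o2, l2 = attend(scores[3:], values[3:])            # second KV chunk
merged_out, merged_lse = merge_states(o1, l1, o2, l2)
```

Because each chunk carries its own log-sum-exp, the chunks can be processed independently and merged afterwards, which is what makes split-KV execution possible.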

### Part of an Awesome List

- [Tensor Core Optimization](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization.md) — Leverages tensor cores for hardware-accelerated half-precision matrix multiplications and tensor operations.

### Data & Databases

- [Matrix Multiplication Utilities](https://awesome-repositories.com/f/data-databases/batch-processing/batch-matrix-multiplication-utilities/matrix-multiplication-utilities.md) — Implements single-precision matrix multiplication optimized with shared memory tiling and double buffering. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/sgemm/sgemm_async.cu))
- [Vectorized Memory Access](https://awesome-repositories.com/f/data-databases/vector-data-processing/vectorized-memory-access.md) — Increases throughput by loading multiple data elements simultaneously using vectorized types. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/hgemv/hgemv.cu))
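
The shared memory tiling mentioned in the matrix multiplication entry above can be sketched as a blocked loop nest. This is a plain Python model under stated assumptions, not the repository's kernel: the tile size of 2 and the name `matmul_tiled` are hypothetical, and double buffering (prefetching the next tile while computing on the current one) is not modeled.

```python
def matmul_tiled(A, B, tile=2):
    """Blocked matrix multiply: C = A @ B, processed tile by tile."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # On a GPU, these two tiles would be staged in shared memory
                # and reused by every thread of the block.
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile))
                        )
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
C = matmul_tiled(A, B)
print(C)
```

Each tile of `A` and `B` is loaded once per step and reused for every output element in the block, which is the reuse that shared memory tiling exploits.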

### Operating Systems & Systems Programming

- [Asynchronous Data Pipelining](https://awesome-repositories.com/f/operating-systems-systems-programming/asynchronous-data-pipelining.md) — Implements asynchronous data pipelining to overlap global memory loads with computation using double buffering.
- [GPU Memory Optimizations](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-optimizations.md) — Implements shared memory swizzling, double buffering, and vectorized access to maximize GPU memory throughput.
- [C-Bindings](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/memory-allocation-libraries/low-level-system-operations/c-bindings.md) — Ships Python wrappers providing direct access to low-level C++ CUDA kernels for performance-critical operations.
- [Shared Memory Swizzling](https://awesome-repositories.com/f/operating-systems-systems-programming/shared-memory-swizzling.md) — Prevents shared memory bank conflicts by rearranging data access patterns via swizzling. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/flash-attn/mma/swizzle/flash_attn_mma_share_kv_swizzle_q.cu))
- [Warp-Level Primitives](https://awesome-repositories.com/f/operating-systems-systems-programming/warp-level-primitives.md) — Implements high-performance warp-level reduction and shuffle instructions for parallel data aggregation on GPUs. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/hgemv/hgemv.cu))

### Scientific & Mathematical Computing

- [GPU Matrix Operation Implementations](https://awesome-repositories.com/f/scientific-mathematical-computing/gpu-matrix-operation-implementations.md) — Provides high-performance GPU implementations of matrix operations utilizing specialized libraries like CUTLASS and CuTe. ([source](https://github.com/xlite-dev/LeetCUDA))
- [Matrix-Vector Products](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/scientific-computing-platforms/scientific-computing/matrix-operations/matrix-vector-products.md) — Implements single-precision matrix-vector products using tiling strategies and kernel optimizations. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/hgemv/hgemv.cu))
- [Vectorized Memory Primitives](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/scientific-computing-platforms/scientific-computing/vectorized-array-operations/vectorized-memory-primitives.md) — Utilizes vectorized memory primitives to load multiple data elements simultaneously and increase global memory throughput.
- [Element-wise Array Operations](https://awesome-repositories.com/f/scientific-mathematical-computing/element-wise-array-operations.md) — Performs element-wise addition across float32 and float16 arrays using vectorized memory access. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/elementwise))
- [Matrix Transposition Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/matrix-transposition-kernels.md) — Transposes matrices, swapping rows and columns, using vectorized access and shared memory. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/mat-transpose))
- [Online Softmax](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/statistics-probability/probability-distributions/softmax-normalization/online-softmax.md) — Implements an online softmax approach to compute probabilities across tokens while reducing memory overhead.
- [Block-Level Reductions](https://awesome-repositories.com/f/scientific-mathematical-computing/prefix-calculations/parallel-prefix-sum-calculators/block-level-reductions.md) — Calculates total sums across a block using warp-level primitives and shared memory. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/reduce/block_all_reduce.cu))
- [Vector Dot Product Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/vector-dot-product-kernels.md) — Computes the scalar product of two vectors at multiple floating-point precisions with vectorized memory access. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/dot-product))
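
The online softmax entry in the list above refers to computing the maximum and the normalizer in a single pass over the data. A minimal reference sketch in Python follows. It illustrates the general technique and is not the repository's kernel; the function names are hypothetical.

```python
import math


def online_softmax(xs):
    """Softmax with the max and the normalizer accumulated in one pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


def two_pass_softmax(xs):
    """Conventional softmax: one pass for the max, one for the sum."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


data = [1.0, 3.0, -2.0, 0.5, 3.0]
online = online_softmax(data)
reference = two_pass_softmax(data)
```

The rescaling step is what lets the normalizer be updated as new elements arrive, so the input does not need a separate pass to find its maximum.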

### Software Engineering & Architecture

- [Tiled Memory Access Patterns](https://awesome-repositories.com/f/software-engineering-architecture/shared-memory-management/memory-access-profilers/tiled-memory-access-patterns.md) — Implements fine-grained tiling to keep on-chip memory usage bounded by a constant. ([source](https://github.com/xlite-dev/LeetCUDA/blob/main/kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qkv_swizzle_qk.cu))
