# tile-ai/tilelang

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/tile-ai-tilelang).**

5,226 stars · 454 forks · Python · other

## Links

- GitHub: https://github.com/tile-ai/tilelang
- Homepage: https://tilelang.com/
- awesome-repositories: https://awesome-repositories.com/repository/tile-ai-tilelang.md

## Description

TileLang is a Python-embedded domain-specific language compiler that JIT-compiles and autotunes GPU kernels. It uses a tile-based DSL, automatic software pipelining, and parallel autotuning to generate optimized GPU kernels at runtime.
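
To make the tile-based idea concrete, here is a minimal pure-Python sketch of a tiled matrix multiply. It does not use TileLang's API and runs on the CPU; the names `tiled_matmul`, `naive_matmul`, `block_m`, `block_n`, and `block_k` are hypothetical and chosen only for illustration. Each output tile stands in for the work a single GPU thread block might own.

```python
def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply a (M x K) by b (K x N) one tile at a time (illustrative only)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Each (i0, j0) pair plays the role of one thread block's output tile.
    for i0 in range(0, m, block_m):
        for j0 in range(0, n, block_n):
            # Walk the reduction dimension in chunks of block_k.
            for k0 in range(0, k, block_k):
                # min() clamps tile edges when sizes do not divide evenly.
                for i in range(i0, min(i0 + block_m, m)):
                    for j in range(j0, min(j0 + block_n, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + block_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def naive_matmul(a, b):
    """Reference result used to check the tiled version."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]
```

Changing the block sizes changes only the order in which partial products are accumulated, not the result, which is what leaves a compiler free to pick tile shapes for performance.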

It supports tensor core operations with Pythonic syntax, automatic memory management, and thread mapping. The compiler searches over tile sizes, thread counts, and scheduling policies, compiling and benchmarking candidates in parallel to find the fastest kernel. It also caches compiled binaries and tuning results to disk for reuse across sessions.
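
The search-and-cache loop described above can be sketched with the standard library alone. This is an assumption-laden illustration, not TileLang's autotuner: `autotune`, `toy_cost`, the JSON cache format, and the parameter names `block_m`, `block_n`, and `threads` are all hypothetical, and `toy_cost` is an analytic stand-in for a real compile-and-benchmark step.

```python
import itertools
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def autotune(benchmark, search_space, cache_path, max_workers=4):
    """Return the lowest-cost config, reusing a cached result if one exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A previous session already tuned this kernel: skip the search.
        return json.loads(cache_path.read_text())

    names = sorted(search_space)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*(search_space[n] for n in names))]

    # Evaluate candidates concurrently; map() preserves candidate order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        costs = list(pool.map(benchmark, candidates))

    # Pair each cost with its index so ties resolve to the earliest candidate.
    best = min(zip(costs, range(len(candidates))))[1]
    cache_path.write_text(json.dumps(candidates[best]))
    return candidates[best]


def toy_cost(config):
    """Hypothetical cost model standing in for a measured kernel latency."""
    tile_penalty = abs(config["block_m"] - 64) + abs(config["block_n"] - 128)
    thread_penalty = abs(config["threads"] - 256) / 32
    return tile_penalty + thread_penalty
```

A second call with the same `cache_path` returns the stored configuration without invoking `benchmark` again, which mirrors the reuse of tuning results across sessions.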

TileLang includes optimizations for attention, convolution, and reduction operators, with multi-level tiling, software pipelining, and warp specialization. It manages memory across global, shared, and register levels, supports synchronization barriers, and provides debugging and diagnostic tools.
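
As a rough illustration of software pipelining, the sketch below builds the order in which tile loads and tile computations could be issued when loads run ahead of compute. It is a scheduling toy, not TileLang code: `pipelined_schedule`, `num_tiles`, and `num_stages` are hypothetical names, and real hardware overlaps the two operations in time rather than listing them as steps.

```python
def pipelined_schedule(num_tiles, num_stages):
    """List (load, compute) tile indices per step; None marks an idle slot."""
    prefetch = num_stages - 1  # how many tiles the loads run ahead of compute
    steps = []
    # Prologue: fill the pipeline with loads before any compute starts.
    for t in range(min(prefetch, num_tiles)):
        steps.append((t, None))
    # Steady state, then epilogue: compute tile i while loading tile i + prefetch.
    # With num_stages == 1 the load and compute of the same tile share a step,
    # which models the unpipelined case where nothing overlaps.
    for i in range(num_tiles):
        nxt = i + prefetch
        steps.append((nxt if nxt < num_tiles else None, i))
    return steps
```

For example, `pipelined_schedule(4, 3)` yields two load-only steps, two steps that pair a load with a compute, and two compute-only steps that drain the pipeline.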

## Tags

### Programming Languages & Runtimes

- [Python GPU Kernels](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels.md) — Provides a Python-embedded DSL for writing high-performance GPU kernels with automatic thread mapping and shared memory management.
- [Autotuning GPU Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels/autotuning-gpu-kernel-compilers.md) — Automatically searches over tile sizes and thread counts to optimize GPU kernel performance.
- [GPU Kernel DSL Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels/gpu-kernel-dsl-compilers.md) — Provides the core Pythonic DSL for structuring high-performance kernels with automatic memory management and thread mapping. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/mma_macro_generator/index.html))
- [Buffer Reshaping and Reinterpretation](https://awesome-repositories.com/f/programming-languages-runtimes/buffer-reshaping-and-reinterpretation.md) — Reinterprets a buffer with a new shape and, optionally, a new data type without copying the underlying data. ([source](https://tilelang.com/autoapi/tilelang/carver/template/conv/index.html))
- [GPU Architecture Identifications](https://awesome-repositories.com/f/programming-languages-runtimes/compilation-target-specifications/architecture-specific-generators/gpu-architecture-identifications.md) — Identifies GPU generation to guide architecture-specific kernel optimizations during compilation. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/nvrtc/wrapper/index.html))
- [JIT Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers.md) — Compiles GPU kernels on the fly and executes them immediately or returns a reusable object. ([source](https://tilelang.com/autoapi/tilelang/carver/arch/arch_base/index.html))
- [CUDA Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/cuda-kernel-compilers.md) — Compiles GPU kernels to CUDA device binaries with configurable architecture options. ([source](https://tilelang.com/autoapi/tilelang/carver/arch/rdna/index.html))
- [GPU Backend Selections](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels/per-kernel-compilation-options/gpu-backend-selections.md) — Selects GPU or CPU backend for kernel compilation with optional architecture specification for tuning. ([source](https://tilelang.com/deeplearning_operators/matmul.html))
- [Runtime GPU Kernel Compilation Libraries](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/runtime-gpu-kernel-compilation-libraries.md) — Compiles and caches GPU kernels at runtime for dynamic specialization without a separate build step.
- [Compiled Kernel Executions](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/runtime-gpu-kernel-compilation-libraries/compiled-kernel-executions.md) — Executes compiled GPU kernels on the active stream with automatic pointer binding. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/threadblock_swizzle/index.html))
- [TVM JIT Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/tvm-jit-compilers.md) — Compiles high-level kernel descriptions into device binaries at runtime using TVM's IR and LLVM backends.
- [Dense and Sparse MMA Executions](https://awesome-repositories.com/f/programming-languages-runtimes/dense-and-sparse-mma-executions.md) — Executes both dense and sparse matrix multiply-accumulate operations on Tensor Cores. ([source](https://tilelang.com/deeplearning_operators/gemv.html))
- [GPU Kernel Code Generators](https://awesome-repositories.com/f/programming-languages-runtimes/gpu-kernel-code-generators.md) — Generates executable GPU code from intermediate representations for target platforms. ([source](https://tilelang.com/autoapi/tilelang/engine/lower/index.html))
- [Grid-Wide Synchronizations](https://awesome-repositories.com/f/programming-languages-runtimes/grid-wide-synchronizations.md) — Provides grid-wide thread block synchronization for cooperative kernel launches. ([source](https://tilelang.com/programming_guides/control_flow.html))
- [Kernel Autotuning Frameworks](https://awesome-repositories.com/f/programming-languages-runtimes/kernel-autotuning-frameworks.md) — Automatically searches over tile sizes, thread counts, and scheduling policies to find the fastest kernel configuration.
- [Kernel Fusion Operations](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/runtime-environments/runtimes/graph-symbolic-execution-engines/operation-kernels/kernel-fusion-operations.md) — Combines successive operations into fused kernels to minimize global memory traffic. ([source](https://tilelang.com/programming_guides/language_basics.html))
- [Scoped GPU Memory Allocations](https://awesome-repositories.com/f/programming-languages-runtimes/scoped-gpu-memory-allocations.md) — Allocates on-chip shared memory, per-thread fragments, or scalar variables within a block-scoped region. ([source](https://tilelang.com/compiler_internals/inject_fence_proxy.html))
- [Software Pipelining Optimizations](https://awesome-repositories.com/f/programming-languages-runtimes/software-pipelining-optimizations.md) — Implements automatic multistage software pipelining to overlap memory loads with computation in GPU kernels.
- [TMA Data Loads](https://awesome-repositories.com/f/programming-languages-runtimes/tma-descriptor-creation/tma-data-loads.md) — Uses dedicated TMA hardware to load multidimensional tensors efficiently from global memory into shared memory. ([source](https://tilelang.com/autoapi/tilelang/language/customize/index.html))
- [TMA Data Stores](https://awesome-repositories.com/f/programming-languages-runtimes/tma-descriptor-creation/tma-data-stores.md) — Uses dedicated TMA hardware to write data from shared memory back to global memory, optionally performing an atomic reduction. ([source](https://tilelang.com/autoapi/tilelang/language/customize/index.html))
- [TMA Data Transfer Initiations](https://awesome-repositories.com/f/programming-languages-runtimes/tma-descriptor-creation/tma-data-transfer-initiations.md) — Initiates an asynchronous data load from global memory to shared memory using a TMA descriptor, supporting standard and 2SM configurations. ([source](https://tilelang.com/autoapi/tilelang/language/annotations/index.html))
- [C++ Kernel Launcher Generators](https://awesome-repositories.com/f/programming-languages-runtimes/c-kernel-launcher-generators.md) — Generates C++ launcher code for GPU kernels with automatic memory management. ([source](https://tilelang.com/autoapi/tilelang/carver/arch/arch_base/index.html))
- [Cluster Thread Block Synchronizations](https://awesome-repositories.com/f/programming-languages-runtimes/cluster-thread-block-synchronizations.md) — Coordinates thread blocks within GPU clusters using barrier primitives. ([source](https://tilelang.com/autoapi/tilelang/language/cluster/index.html))
- [Async Proxy Operation Fencings](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-fence-proxy-extensions/async-proxy-operation-fencings.md) — Guarantees correct ordering between generic and async memory accesses in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/cuda/pipeline/index.html))
- [AMD GPU Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/amd-gpu-kernel-compilers.md) — Compiles GPU kernels for AMD architectures with automatic device library and linker location. ([source](https://tilelang.com/autoapi/tilelang/contrib/rocm/index.html))
- [HIP Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/hip-kernel-compilers.md) — Compiles GPU kernels into HIP device binaries for AMD architectures. ([source](https://tilelang.com/autoapi/tilelang/contrib/hipcc/index.html))
- [Kernel Artifact Caches](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/kernel-artifact-caches.md) — Caches compiled kernel binaries and source code to disk for reuse across sessions. ([source](https://tilelang.com/programming_guides/instructions.html))
- [NVRTC Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/nvrtc-kernel-compilers.md) — Uses NVRTC to compile GPU kernels at runtime for immediate execution. ([source](https://tilelang.com/autoapi/tilelang/cuda/op/gemm/gemm_tcgen05/index.html))
- [Batch Kernel Compilations](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels/autotuning-gpu-kernel-compilers/batch-kernel-compilations.md) — Batch-compiles kernel configurations in parallel to accelerate autotuning. ([source](https://tilelang.com/autoapi/tilelang/language/builtin/index.html))
- [Parallel Kernel Compilations](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels/autotuning-gpu-kernel-compilers/parallel-kernel-compilations.md) — Compiles kernel parameterizations in parallel using multiple workers. ([source](https://tilelang.com/autoapi/tilelang/language/allocate/index.html))
- [Kernel Performance Benchmarks](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/runtime-gpu-kernel-compilation-libraries/compiled-kernel-executions/kernel-performance-benchmarks.md) — Benchmarks compiled GPU kernel execution time with test tensors to measure latency. ([source](https://tilelang.com/programming_guides/language_basics.html))
- [GPU Asynchronous Data Transfers](https://awesome-repositories.com/f/programming-languages-runtimes/gpu-asynchronous-data-transfers.md) — Moves data from global to shared memory without blocking, so computation can overlap with data movement. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/math/index.html))
- [Inter-Level GPU Memory Copies](https://awesome-repositories.com/f/programming-languages-runtimes/inter-level-gpu-memory-copies.md) — Copies data between specified memory buffers or address spaces, such as between global and shared memory in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/checks/index.html))
- [Kernel Launch Context Setups](https://awesome-repositories.com/f/programming-languages-runtimes/kernel-launch-context-setups.md) — Configures grid and thread dimensions for GPU kernel launches. ([source](https://tilelang.com/))
- [Kernel Library Compilations](https://awesome-repositories.com/f/programming-languages-runtimes/library-compilation/kernel-library-compilations.md) — Compiles kernel source code into shared libraries for target devices and manages their lifecycle. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/libgen/index.html))
- [Kernel Shared Library Generators](https://awesome-repositories.com/f/programming-languages-runtimes/library-compilation/kernel-shared-library-generators.md) — Generates compiled shared libraries from kernel specifications for integration with Python. ([source](https://tilelang.com/autoapi/tilelang/jit/index.html))
- [Matrix Fragment Transfers](https://awesome-repositories.com/f/programming-languages-runtimes/matrix-fragment-transfers.md) — Transfers matrix fragments by loading or storing 8×8 fragments between shared memory and registers using PTX matrix instructions. ([source](https://tilelang.com/autoapi/tilelang/autotuner/tuner/index.html))
- [Memory Hierarchy Data Movements](https://awesome-repositories.com/f/programming-languages-runtimes/memory-hierarchy-data-movements.md) — Moves data between global, shared, and fragment memories and overlaps computation with communication via software pipelining. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/mma_macro_generator/index.html))
- [MMA Shape Policies](https://awesome-repositories.com/f/programming-languages-runtimes/mma-shape-policies.md) — Configures tensor core shape policies for generating efficient MMA instructions. ([source](https://tilelang.com/autoapi/tilelang/carver/roller/hint/index.html))
- [Multi-Pass Compiler Pipelines](https://awesome-repositories.com/f/programming-languages-runtimes/multi-pass-compiler-pipelines.md) — Defines a sequence of compiler passes to lower IR modules for specific hardware targets. ([source](https://tilelang.com/autoapi/tilelang/backend/pass_pipeline/pipeline/index.html))
- [Multi-Platform Code Generators](https://awesome-repositories.com/f/programming-languages-runtimes/multi-platform-code-generators.md) — Generates GPU kernel code for multiple hardware backends through a unified interface. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/wrapper/index.html))
- [Thread and Block Index Queries](https://awesome-repositories.com/f/programming-languages-runtimes/thread-and-block-index-queries.md) — Provides built-in primitives to query thread and block indices within GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/wgmma_macro_generator/index.html))
- [Tile-Based GPU Memory Movements](https://awesome-repositories.com/f/programming-languages-runtimes/tile-based-gpu-memory-movements.md) — Copies tiles between global, shared, and register-file memory scopes, with an asynchronous variant for manual prefetch pipelining. ([source](https://tilelang.com/programming_guides/python_compatibility.html))
- [TMA Descriptor Creation](https://awesome-repositories.com/f/programming-languages-runtimes/tma-descriptor-creation.md) — Provides TMA descriptor creation for asynchronous data movement in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/carver/utils/index.html))
- [Parameterized Numeric Types](https://awesome-repositories.com/f/programming-languages-runtimes/type-name-assignment/unique-type-identifiers/custom-data-type-declarations/parameterized-numeric-types.md) — Defines parameterized numeric types with configurable precision and FP4 detection for GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/cuda/op/gemm/gemm_mma/index.html))
- [Warp-Group Matrix Multiply-Accumulates](https://awesome-repositories.com/f/programming-languages-runtimes/warp-group-matrix-multiply-accumulates.md) — Performs warp-group level matrix multiply-accumulate for high-performance GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/mma_sm75_macro_generator/index.html))

### Artificial Intelligence & ML

- [Attention Kernel Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/attention-kernel-configurations/attention-kernel-optimizers.md) — Implements efficient attention mechanisms for transformers with custom tiling, pipelining, and memory access patterns.
- [Memory-Compute Overlaps](https://awesome-repositories.com/f/artificial-intelligence-ml/communication-computation-overlap/memory-compute-overlaps.md) — Overlaps memory copies with computation in GPU kernels to hide memory latency. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/layout/mma_sp_layout/index.html))
- [Convolution Kernel Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/convolutional-kernel-optimizations/convolution-kernel-configurators.md) — Defines matrix-matrix convolution computations with configurable dimensions, data types, and optional bias. ([source](https://tilelang.com/programming_guides/python_compatibility.html))
- [GEMV Shared Memory Tilers](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/tile-based-kernel-authoring/gemv-shared-memory-tilers.md) — Writes GEMV kernels that tile the vector and matrix into shared memory for efficient data reuse. ([source](https://tilelang.com/deeplearning_operators/gemv.html))
- [Python Tile Kernel Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/tile-based-kernel-authoring/python-tile-kernel-interfaces.md) — Provides a Pythonic DSL to express tile-based GPU kernels with automatic thread mapping and memory management.
- [Tile-Based GPU Kernel Programming](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/tile-based-kernel-authoring/tile-based-gpu-kernel-programming.md) — Provides a tile-level programming model that abstracts individual thread management. ([source](https://tilelang.com/programming_guides/control_flow.html))
- [GPU Hardware Architecture Modelings](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-device-management/gpu-hardware-architecture-modelings.md) — Models GPU hardware specifications like register capacity and warp size to guide code generation. ([source](https://tilelang.com/programming_guides/control_flow.html))
- [Hardware-Specific](https://awesome-repositories.com/f/artificial-intelligence-ml/kernel-optimizers/hardware-specific.md) — Provides hardware-specific kernel configuration retrieval for automatic tuning across GPU architectures. ([source](https://tilelang.com/programming_guides/language_basics.html))
- [Attention Compute Graph Builders](https://awesome-repositories.com/f/artificial-intelligence-ml/sparse-attention-kernels/attention-compute-graph-builders.md) — Sets up the compute graph for attention kernels including matrix multiplication, bias, and type casting. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/nvrtc/index.html))
- [Multi-SM Attention Parallelizers](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-parallelism/attention-parallelism-optimizers/multi-sm-attention-parallelizers.md) — Divides attention computation across multiple streaming multiprocessors for parallel execution and merges results. ([source](https://tilelang.com/autoapi/tilelang/language/builtin/index.html))
- [Tile Data Loaders](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-processors/tile-data-loaders.md) — Copies tiles between global, shared, and register memory with bounds checking. ([source](https://tilelang.com/programming_guides/instructions.html))
- [Automatic Tiling Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-processors/tile-reductions/automatic-tiling-configurators.md) — Automatically generates and ranks tiling configurations for matrix and reduction kernels. ([source](https://tilelang.com/programming_guides/autotuning.html))
- [Heuristic Tiling Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-processors/tile-reductions/automatic-tiling-configurators/heuristic-tiling-optimizations.md) — Ships a heuristic policy that automatically selects optimal tile configurations for GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/cpasync/index.html))
- [Multi-Level Tiling Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-processors/tile-reductions/automatic-tiling-configurators/multi-level-tiling-optimizations.md) — Implements multi-level tiling to maximize bandwidth and reduce latency across memory hierarchy. ([source](https://tilelang.com/programming_guides/language_basics.html))
- [GPU Warp Specializations](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-field-estimation/optical-flow-computation/feature-warping-modules/gpu-warp-specializations.md) — Specializes GPU warps into producer-consumer configurations for automatic synchronization. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/base/index.html))
- [Hardware-Specific Attention Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms/attention-kernel-configurations/hardware-specific-attention-configurators.md) — Fetches hardware-specific configuration hints for attention kernels on the target GPU. ([source](https://tilelang.com/autoapi/tilelang/contrib/nvcc/index.html))
- [Hardware-Aware Reduction Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/context-aware-code-generators/hardware-aware-generation/hardware-aware-reduction-configurators.md) — Generates hardware-aware configurations for reduction kernels tailored to the target architecture. ([source](https://cdn.jsdelivr.net/gh/tile-ai/tilelang@main/README.md))
- [Hardware-Aware Convolution Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/convolutional-kernel-optimizations/convolution-kernel-configurators/hardware-aware-convolution-configurators.md) — Retrieves optimized convolution configurations tailored to the target GPU architecture. ([source](https://tilelang.com/compiler_internals/tensor_checks.html))
- [Dynamic Tensor Shapes](https://awesome-repositories.com/f/artificial-intelligence-ml/dynamic-tensor-shapes.md) — Accepts symbolic dimension values so a single compiled kernel adapts to varying input sizes without recompilation. ([source](https://tilelang.com/autoapi/tilelang/language/dtypes/index.html))
- [Tensor Instruction Shape Queries](https://awesome-repositories.com/f/artificial-intelligence-ml/dynamic-tensor-shapes/tensor-shape-inferences/tensor-instruction-shape-queries.md) — Queries supported tensor instruction shapes from the target architecture for instruction selection. ([source](https://tilelang.com/programming_guides/overview.html))
- [Elementwise Tile Math Operations](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/tile-based-kernel-authoring/elementwise-tile-math-operations.md) — Applies elementwise math functions like exp, log, and sigmoid on tile fragments. ([source](https://tilelang.com/deeplearning_operators/gemv.html))
- [Elementary Math Functions](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/tile-based-kernel-authoring/elementwise-tile-math-operations/elementary-math-functions.md) — Computes exponential, logarithmic, trigonometric, and other elementary math operations within GPU/CPU kernel code. ([source](https://tilelang.com/autoapi/tilelang/carver/utils/index.html))
- [FP8 Data Type Resolutions](https://awesome-repositories.com/f/artificial-intelligence-ml/half-precision-inference/half-precision-matrix-multiplications/fp8-scaling/fp8-data-type-resolutions.md) — Resolves the correct FP8 data type representation for the current GPU platform architecture. ([source](https://tilelang.com/programming_guides/autotuning.html))
- [Low-Precision Decoding](https://awesome-repositories.com/f/artificial-intelligence-ml/half-precision-inference/low-precision-decoding.md) — Decodes low-precision integer or floating-point data into half-precision formats using inline PTX operations. ([source](https://tilelang.com/deeplearning_operators/deepseek_mla.html))
- [Kernel Caching Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/kernel-optimizations/kernel-caching-systems.md) — Persists GPU binaries and launcher code to disk for reuse across sessions. ([source](https://tilelang.com/autoapi/tilelang/language/builtin/index.html))
- [Cached Kernel Loading](https://awesome-repositories.com/f/artificial-intelligence-ml/kernel-optimizations/kernel-caching-systems/cached-kernel-loading.md) — Loads compiled kernels from cache to avoid recompilation across sessions. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/nvrtc/adapter/index.html))
- [Sparse Metadata Loading Implementations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/hardware-and-acceleration/tensor-computing-libraries/tensor-operations/sparse-tensor-representations/sparse-metadata-loading-implementations.md) — Loads sparsity metadata into shared memory for use with sparse tensor core instructions. ([source](https://tilelang.com/deeplearning_operators/deepseek_mla.html))
- [PyTorch Kernel Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/pytorch-backends/pytorch-tensor-interoperabilities/pytorch-kernel-integrations.md) — Wraps native function calls to execute in the current PyTorch GPU stream and device context. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/tvm_ffi/index.html))
- [Cumulative Sum Calculators](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-computation-primitives/cumulative-sum-calculators.md) — Calculates cumulative sum or maximum along a sequence in one or two dimensions, optionally in reverse. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/index.html))
- [Warp-Level Reductions](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-reductions/warp-level-reductions.md) — Provides warp and block-level reduction operations for aggregating values across GPU threads. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/gemm_v2/index.html))
- [Tile Reductions](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-operations/tile-reductions.md) — Computes sum, min, max, and cumulative reductions on tile fragments. ([source](https://tilelang.com/deeplearning_operators/gemv.html))
- [Tile Iteration Pattern Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/tiled-processing/cooperative-tile-processors/tile-reductions/automatic-tiling-configurators/tile-iteration-pattern-configurations.md) — Provides configurable tile traversal patterns to optimize memory access patterns. ([source](https://tilelang.com/deeplearning_operators/deepseek_mla.html))
- [Buffer Layout Inferences](https://awesome-repositories.com/f/artificial-intelligence-ml/training-memory-management/memory-layout-optimizations/buffer-layout-inferences.md) — Automatically infers buffer layouts from operator specifications to generate efficient GPU kernel code. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/mma_sm70_macro_generator/index.html))

### Part of an Awesome List

- [Tensor Core Programming Frameworks](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks.md) — Provides a Pythonic framework for writing tensor core matrix multiply-accumulate operations.
- [Tensor Core Matrix Multiplications](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks/tensor-core-matrix-multiplications.md) — Performs matrix multiply-accumulate operations using tensor core instructions and shared memory. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/gemm_tcgen05/index.html))
- [Sparse Tensor Core Instruction Generation](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks/sparse-tensor-core-instruction-generation.md) — Generates low-level sparse tensor core instructions for matrix multiply-accumulate. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/adapter/index.html))
- [Tensor Core Instruction Generation](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks/tensor-core-instruction-generation.md) — Produces the final tensor core machine instructions from shared memory fragments. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/layout/mma_sm70_layout/index.html))
- [Tensor Core Operand Loadings](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks/tensor-core-operand-loadings.md) — Loads matrix operands from shared memory into registers for tensor core matrix multiply-accumulate. ([source](https://tilelang.com/autoapi/tilelang/language/experimental/gemm_sp_op/index.html))
- [Tensor Core Result Storage Implementations](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization/tensor-core-programming-frameworks/tensor-core-result-storage-implementations.md) — Implements the final write-back step for tensor core matrix results into shared memory. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/torch/metal/index.html))
- [DSL Compilers](https://awesome-repositories.com/f/awesome-lists/devtools/common-lisp-libraries/dsl-compilers.md) — Converts DSL source code into a compiled object executable on GPU or CPU hardware. ([source](https://tilelang.com/autoapi/tilelang/language/eager/utils/index.html))

### Data & Databases

- [Thread-Level All-Reducers](https://awesome-repositories.com/f/data-databases/collection-reducers/thread-level-all-reducers.md) — Replaces global atomic adds with a thread-level all-reduce to reduce synchronization overhead in GPU kernels. ([source](https://tilelang.com/tutorials/debug_tools_for_tilelang.html))
- [GPU Buffer Allocators](https://awesome-repositories.com/f/data-databases/data-buffering/gpu-buffer-allocators.md) — Allocates a multi-dimensional buffer in the shared, local, fragment, or global memory space for use inside a GPU kernel. ([source](https://tilelang.com/deeplearning_operators/matmul.html))
- [Buffer Initialization Operations](https://awesome-repositories.com/f/data-databases/data-buffering/gpu-buffer-allocators/buffer-initialization-operations.md) — Fills every element of a buffer with a specified constant value, including a dedicated operation to zero it out. ([source](https://tilelang.com/autoapi/tilelang/contrib/dlpack/index.html))
- [Single-Element Buffer Allocations](https://awesome-repositories.com/f/data-databases/data-buffering/gpu-buffer-allocators/single-element-buffer-allocations.md) — Allocates a single-element buffer in local or variable memory, optionally with an initial value. ([source](https://tilelang.com/programming_guides/control_flow.html))
- [GPU Availability Detections](https://awesome-repositories.com/f/data-databases/feature-availability-comparisons/gpu-availability-detections.md) — Checks GPU acceleration availability and queries target-specific hardware features for kernel compilation. ([source](https://tilelang.com/autoapi/tilelang/cuda/op/gemm/gemm_wgmma/index.html))
- [Split-K Reduction Distributors](https://awesome-repositories.com/f/data-databases/top-k-element-extraction/split-k-reduction-distributors.md) — Distributes the reduction along the K dimension across multiple threads and combines results with atomic adds. ([source](https://tilelang.com/deeplearning_operators/gemv.html))
- [Tensor Core Layout Translations](https://awesome-repositories.com/f/data-databases/memory-mapping-utilities/tensor-mappings/tensor-core-layout-translations.md) — Translates between shared memory tile layouts and MMA register file layouts for tensor cores. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/threadblock_swizzle/index.html))
- [DLPack Protocols](https://awesome-repositories.com/f/data-databases/serialization-frameworks/zero-copy/dlpack-protocols.md) — Wraps tensors from DLPack-compatible frameworks for direct kernel execution. ([source](https://tilelang.com/autoapi/tilelang/contrib/msvc/index.html))
- [Vectorized Memory Access](https://awesome-repositories.com/f/data-databases/vector-data-processing/vectorized-memory-access.md) — Provides vectorized memory reads to maximize memory bandwidth in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/ptx_mma/index.html))
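
The split-K entry in the list above describes splitting a reduction along K and combining partial results with atomic adds. The following is a minimal host-side sketch of that idea in plain Python, not TileLang code: the function name `split_k_dot` and the `num_splits` parameter are hypothetical, and the `total += partial` line only stands in for the atomic add that a real GPU kernel would issue.

```python
def split_k_dot(a, b, num_splits):
    """Dot product computed as num_splits partial sums over chunks of K.

    Illustrative only: each loop iteration plays the role of one worker
    that owns a contiguous slice of the K dimension.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")

    k = len(a)
    chunk = (k + num_splits - 1) // num_splits  # ceiling division

    total = 0
    for s in range(num_splits):
        lo = s * chunk
        hi = min(lo + chunk, k)
        # Partial reduction over this worker's slice (empty if lo >= k).
        partial = sum(a[i] * b[i] for i in range(lo, hi))
        # Stand-in for an atomic add into the shared output element.
        total += partial
    return total


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [2, 2, 2, 2, 2]
    print(split_k_dot(a, b, num_splits=3))
```

With integer inputs the result is independent of `num_splits`. With floating-point inputs, the order in which partial sums are accumulated can change the rounding, which is one trade-off of atomic-add based split-K.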

### DevOps & Infrastructure

- [Multi-Stage Memory Pipelines](https://awesome-repositories.com/f/devops-infrastructure/cli-job-runners/multi-stage-pipeline-orchestrators/configurable-stage-pipelines/multi-stage-memory-pipelines.md) — Configures multi-stage pipelines to overlap memory access with computation in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/adapter/index.html))
- [Annotated Kernel Definitions](https://awesome-repositories.com/f/devops-infrastructure/function-as-a-service-platforms/gpu-kernel-function-wrappers/annotated-kernel-definitions.md) — Defines GPU or CPU kernels by annotating functions with tensor shapes and data types. ([source](https://tilelang.com/autoapi/tilelang/analysis/fragment_loop_checker/index.html))
- [GPU Kernel Configuration Generators](https://awesome-repositories.com/f/devops-infrastructure/hardware-configuration-tools/hardware-specific-boot-configurators/gpu-kernel-configuration-generators.md) — Generates top-ranked GPU kernel configurations tailored to the target architecture for automatic tuning. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/wrapper/index.html))
- [Asynchronous Operation Synchronizations](https://awesome-repositories.com/f/devops-infrastructure/distributed-synchronization/barrier-synchronization/asynchronous-operation-synchronizations.md) — Synchronizes asynchronous data movement with computation using barrier primitives. ([source](https://tilelang.com/autoapi/tilelang/language/customize/index.html))
- [Synchronization Barrier Allocations](https://awesome-repositories.com/f/devops-infrastructure/distributed-synchronization/barrier-synchronization/synchronization-barrier-allocations.md) — Allocates synchronization barriers for coordinating thread arrivals in GPU kernels. ([source](https://tilelang.com/programming_guides/control_flow.html))
- [Math Library Accelerators](https://awesome-repositories.com/f/devops-infrastructure/gpu-acceleration-libraries/math-library-accelerators.md) — Provides accelerated implementations of common math functions on GPU and CPU. ([source](https://tilelang.com/autoapi/tilelang/language/fastmath/index.html))

### Operating Systems & Systems Programming

- [Architecture Detection](https://awesome-repositories.com/f/operating-systems-systems-programming/architecture-detection.md) — Checks target device architecture to guide GPU kernel compilation and optimization. ([source](https://tilelang.com/programming_guides/overview.html))
- [GPU Architecture Detections](https://awesome-repositories.com/f/operating-systems-systems-programming/architecture-detection/gpu-architecture-detections.md) — Detects GPU architecture and compute capability to correctly target kernel compilation at runtime. ([source](https://tilelang.com/autoapi/tilelang/contrib/rocm/index.html))
- [Atomic Memory Operations](https://awesome-repositories.com/f/operating-systems-systems-programming/atomic-memory-operations.md) — Provides atomic read-modify-write operations on shared and global GPU memory for safe concurrency. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/atomic/index.html))
- [GPU Device Property Queries](https://awesome-repositories.com/f/operating-systems-systems-programming/virtualization-emulation/hardware-emulators/device-property-querying/gpu-device-property-queries.md) — Models GPU resources such as register capacity and shared memory limits for compiler tuning. ([source](https://tilelang.com/programming_guides/overview.html))
- [Kernel-to-Template Bindings](https://awesome-repositories.com/f/operating-systems-systems-programming/hardware-interfacing-drivers/hardware-acceleration/gpu-acceleration/gpu-accelerated-compilers/pure-function-kernels/kernel-to-template-bindings.md) — Binds kernel functions to hardware-aware templates for targeted configuration generation. ([source](https://tilelang.com/autoapi/tilelang/jit/index.html))
- [GPU Kernel Assertions](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-component-debugging/gpu-kernel-assertions.md) — Implements GPU kernel assertions that halt execution and optionally print a message when a condition fails. ([source](https://tilelang.com/autoapi/tilelang/language/fill_op/index.html))
- [Persistent Thread Block Patterns](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-process-internals/kernel-thread-distinctions/educational-kernel-thread-implementations/persistent-thread-block-patterns.md) — Implements persistent thread-block patterns for dynamic work distribution across GPU thread blocks. ([source](https://tilelang.com/autoapi/tilelang/carver/arch/cuda/index.html))
- [Shared Memory Swizzling](https://awesome-repositories.com/f/operating-systems-systems-programming/shared-memory-swizzling.md) — Provides shared memory swizzling to reduce bank conflicts in GPU kernel execution. ([source](https://tilelang.com/autoapi/tilelang/language/annotations/index.html))
- [Sparse MMA Layout Conversions](https://awesome-repositories.com/f/operating-systems-systems-programming/shared-memory-swizzling/sparse-mma-layout-conversions.md) — Implements shared memory to MMA sparse layout conversion for tensor core instructions. ([source](https://tilelang.com/autoapi/tilelang/jit/adapter/cutedsl/adapter/index.html))
- [Warp-Level Matrix Multiply-Accumulates](https://awesome-repositories.com/f/operating-systems-systems-programming/warp-level-primitives/warp-level-matrix-multiply-accumulates.md) — Generates warp-level tensor core instructions for matrix multiply-accumulate operations. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/reduce/index.html))

### Scientific & Mathematical Computing

- [Configurable Matrix Multiplication Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/generalized-matrix-multiplications/configurable-matrix-multiplication-kernels.md) — Configures and generates high-performance matrix multiplication kernels with adjustable transposition, data types, and bias. ([source](https://tilelang.com/autoapi/tilelang/carver/template/base/index.html))
- [Tile-Based Matrix Multiplications](https://awesome-repositories.com/f/scientific-mathematical-computing/generalized-matrix-multiplications/tile-based-matrix-multiplications.md) — Performs tile-sized matrix multiplication using shared memory and tensor cores. ([source](https://tilelang.com/programming_guides/instructions.html))
- [Sparse Matrix Tile Loaders](https://awesome-repositories.com/f/scientific-mathematical-computing/generalized-matrix-multiplications/sparse-matrix-multiplications/sparse-matrix-tile-loaders.md) — Loads and transposes matrix tiles from shared memory for sparse tensor core operations. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/mma_sp_macro_generator/index.html))
- [Structured](https://awesome-repositories.com/f/scientific-mathematical-computing/generalized-matrix-multiplications/sparse-matrix-multiplications/structured.md) — Implements 2:4 structured sparse matrix multiplication using specialized tensor core instructions. ([source](https://tilelang.com/autoapi/tilelang/engine/param/index.html))
- [Multi-Dimensional Buffer Accesses](https://awesome-repositories.com/f/scientific-mathematical-computing/multi-dimensional-arrays/multi-dimensional-buffer-accesses.md) — Supports multi-dimensional indexing and slicing of buffers within GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/analysis/fragment_loop_checker/index.html))
- [Arithmetic Operations](https://awesome-repositories.com/f/scientific-mathematical-computing/numerical-mathematical-foundations/arithmetic-number-types/arithmetic-operations.md) — Translates Python arithmetic operators into device-side computations in GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/carver/template/matmul/index.html))
- [Sparse Matrix Tile Store Operations](https://awesome-repositories.com/f/scientific-mathematical-computing/sparse-matrix-storage/sparse-matrix-tile-store-operations.md) — Stores matrix tiles in swizzled shared memory layout for sparse tensor core use. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/layout/mma_sp_layout/index.html))
- [Vector Dot Product Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/vector-dot-product-kernels.md) — Performs packed 4-element dot product accumulation for efficient integer matrix multiply. ([source](https://tilelang.com/autoapi/tilelang/carver/template/elementwise/index.html))

### Software Engineering & Architecture

- [Kernel Scheduling Hints](https://awesome-repositories.com/f/software-engineering-architecture/inline-data-structures/inlining/function-inlining-controls/optimization-hints/kernel-scheduling-hints.md) — Specifies tiling and tensor core hints to guide optimized GPU kernel compilation. ([source](https://tilelang.com/autoapi/tilelang/carver/arch/rdna/index.html))
- [Threadblock Swizzling Configurations](https://awesome-repositories.com/f/software-engineering-architecture/memory-management-utilities/memory-coalescing-utilities/threadblock-swizzling-configurations.md) — Remaps the scheduling order of threadblocks using configurable patterns to improve L2 cache hit rates. ([source](https://tilelang.com/autoapi/tilelang/language/customize/index.html))
- [Memory Access Pattern Optimizers](https://awesome-repositories.com/f/software-engineering-architecture/shared-memory-management/memory-access-profilers/tiled-memory-access-patterns/memory-access-pattern-optimizers.md) — Optimizes memory access patterns using layout annotations, swizzling, and pipelining for GPU kernels. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/cpasync/index.html))
- [GPU Compute Capability Inspections](https://awesome-repositories.com/f/software-engineering-architecture/software-architecture-education/gpu-architecture-education/gpu-compute-capability-inspections.md) — Inspects GPU compute capability and instruction sets to specialize kernels for different hardware generations. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/macro/tcgen05_macro_generator/index.html))
- [Hardware-Specific Loop Structuring](https://awesome-repositories.com/f/software-engineering-architecture/compile-time-code-generation/iterative-code-generation/iterative-loop-constructs/hardware-specific-loop-structuring.md) — Structures loop iteration with serial, unrolled, parallel, and software-pipelined constructs for GPU hardware. ([source](https://tilelang.com/programming_guides/python_compatibility.html))
- [Architecture-Specific](https://awesome-repositories.com/f/software-engineering-architecture/inline-data-structures/inlining/function-inlining-controls/optimization-hints/architecture-specific.md) — Analyzes kernel functions to suggest top-k optimization hints for the target architecture. ([source](https://tilelang.com/programming_guides/software_pipeline.html))
- [Buffer Performance Hints](https://awesome-repositories.com/f/software-engineering-architecture/inline-data-structures/inlining/function-inlining-controls/optimization-hints/buffer-performance-hints.md) — Annotates buffers with layout and cache hints to guide compiler optimization of GPU kernels. ([source](https://tilelang.com/deeplearning_operators/elementwise.html))
- [Tuning Result Persistence](https://awesome-repositories.com/f/software-engineering-architecture/job-result-persistence/feature-retrieval-result-persistence/tuning-result-persistence.md) — Persists tuning results to disk so the best configuration can be reused across sessions. ([source](https://tilelang.com/deeplearning_operators/deepseek_mla.html))
- [Block Index Swizzlings](https://awesome-repositories.com/f/software-engineering-architecture/memory-management-utilities/memory-coalescing-utilities/threadblock-swizzling-configurations/block-index-swizzlings.md) — Maps block indices to row-major or column-major swizzled rasterization coordinates to improve memory coalescing. ([source](https://tilelang.com/autoapi/tilelang/cuda/intrinsics/layout/utils/index.html))
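
The threadblock and block-index swizzling entries in the list above refer to remapping a linear block index to tile coordinates so that blocks scheduled close together touch nearby data. The sketch below shows one common grouped rasterization scheme in plain Python. It is an assumption for illustration and is not taken from TileLang's implementation: the function name `grouped_block_coords` and the `group_m` parameter are hypothetical.

```python
def grouped_block_coords(block_id, grid_m, grid_n, group_m):
    """Map a linear block id to (row, col) tile coordinates.

    Blocks are visited in panels of `group_m` rows. Inside a panel the
    order is column by column, so consecutive block ids share the same
    few rows and stay within a narrow band of columns.
    """
    blocks_per_group = group_m * grid_n
    group_id = block_id // blocks_per_group
    first_row = group_id * group_m
    # The last panel may be shorter when grid_m is not a multiple of group_m.
    rows_in_group = min(grid_m - first_row, group_m)

    offset = block_id % blocks_per_group
    row = first_row + offset % rows_in_group
    col = offset // rows_in_group
    return row, col


if __name__ == "__main__":
    grid_m, grid_n, group_m = 4, 4, 2
    for block_id in range(grid_m * grid_n):
        print(block_id, grouped_block_coords(block_id, grid_m, grid_n, group_m))
```

The mapping is a bijection over the grid, so every tile is still computed exactly once. Only the visiting order changes, which is what makes this kind of remapping safe to apply as a scheduling hint.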

### User Interface & Experience

- [GPU Capability Queries](https://awesome-repositories.com/f/user-interface-experience/hardware-capabilities-detection/gpu-capability-queries.md) — Detects GPU compute capability, tensor core availability, and data type support at compile time. ([source](https://tilelang.com/autoapi/tilelang/contrib/cutedsl/gemm_v1/index.html))
- [GPU Kernel Code Block Definitions](https://awesome-repositories.com/f/user-interface-experience/content-block-editors/style-injection-blocks/reusable-style-block-definition/gpu-kernel-code-block-definitions.md) — Marks code blocks with decorators for compile-time inlining as reusable device functions. ([source](https://tilelang.com/autoapi/tilelang/carver/template/base/index.html))

### Development Tools & Productivity

- [GPU Kernel Print Debugging](https://awesome-repositories.com/f/development-tools-productivity/print-based-debugging-workflows/pipeline-print-based-debugging/gpu-kernel-print-debugging.md) — Provides assertion and print functions that operate on the GPU device for kernel debugging. ([source](https://tilelang.com/))

### Hardware & IoT

- [GPU Occupancy Controllers](https://awesome-repositories.com/f/hardware-iot/crowd-occupancy-analyzers/home-occupancy-aggregators/gpu-occupancy-controllers.md) — Controls GPU occupancy by hinting minimum thread blocks per SM to optimize register usage. ([source](https://tilelang.com/autoapi/tilelang/jit/kernel/index.html))

### Networking & Communication

- [Schedule Block Retrievals](https://awesome-repositories.com/f/networking-communication/http-gateways/block-retrieval/schedule-block-retrievals.md) — Retrieves specific schedule blocks from the compiler for targeted optimization and analysis. ([source](https://tilelang.com/autoapi/tilelang/jit/diagnostics/index.html))

### System Administration & Monitoring

- [GPU Kernel Profilers](https://awesome-repositories.com/f/system-administration-monitoring/execution-time-profilers/gpu-kernel-profilers.md) — Measures kernel latency using a built-in profiler to evaluate GPU kernel performance. ([source](https://tilelang.com/deeplearning_operators/matmul.html))