TileLang is a Python-embedded domain-specific language compiler that JIT-compiles and autotunes GPU kernels. It uses a tile-based DSL, automatic software pipelining, and parallel autotuning to generate optimized GPU kernels at runtime.
It supports tensor core operations with Pythonic syntax, automatic memory management, and thread mapping. The compiler searches over tile sizes, thread counts, and scheduling policies, compiling and benchmarking candidates in parallel to find the fastest kernel. It also caches compiled binaries and tuning results to disk for reuse across sessions.
TileLang includes optimizations for attention, convolution, and reduction operators, with multi-level tiling, software pipelining, and warp specialization. It manages memory across global, shared, and register levels, supports synchronization barriers, and provides debugging and diagnostic tools.