Flash Linear Attention is a training framework and inference engine for sequence models built on linear attention and state space mechanisms. It is designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that reduce memory traffic and kernel launch overhead across different GPU architectures. It also includes a causal inference engine that generates text token by token, reusing cached hidden states for efficient autoregressive decoding.
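To make the cached-state decoding idea concrete, here is a minimal sketch of unnormalized linear attention in its recurrent form. It is written in pure Python for readability and is not the project's API or kernel code: the names `init_state`, `decode_step`, and `causal_reference` are hypothetical, and the update rule (state plus the outer product of key and value) is the textbook formulation, assumed here for illustration. The point is that the cache is a fixed-size matrix, so the cost of each decoding step does not grow with the number of tokens already generated.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def init_state(d_k: int, d_v: int) -> Matrix:
    """Create an empty d_k x d_v state; its size never depends on sequence length."""
    return [[0.0] * d_v for _ in range(d_k)]


def decode_step(state: Matrix, q: Vector, k: Vector, v: Vector) -> Vector:
    """Consume one token: update the cached state in place, then read it with q."""
    # State update: S <- S + outer(k, v)
    for i, k_i in enumerate(k):
        row = state[i]
        for j, v_j in enumerate(v):
            row[j] += k_i * v_j
    # Output: o = q @ S
    return [
        sum(q[i] * state[i][j] for i in range(len(q)))
        for j in range(len(v))
    ]


def causal_reference(qs: List[Vector], ks: List[Vector], vs: List[Vector]) -> List[Vector]:
    """Quadratic-time reference: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, ks[s]))
            for j, v_j in enumerate(vs[s]):
                out[j] += score * v_j
        outputs.append(out)
    return outputs


# Toy inputs: 3 tokens, key dimension 2, value dimension 3.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]

state = init_state(d_k=2, d_v=3)
recurrent_outputs = [decode_step(state, q, k, v) for q, k, v in zip(qs, ks, vs)]
```

On these toy inputs the token-by-token outputs match the quadratic reference exactly, which is the property that lets a decoder keep only the state rather than the full history of keys and values.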
The project supports hybrid sequence models that interleave standard attention with linear attention and state space layers, balancing efficiency against global context. It includes a distributed checkpoint manager that splits model weights across multiple files so they can be loaded and saved in parallel during multi-node training, and a weight format conversion utility that translates between Hugging Face and distributed checkpoint formats. The framework also provides hardware-aware kernel dispatch, which selects optimized CUDA kernels at runtime based on GPU architecture and tensor shapes.
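As an illustration of what interleaving can look like, the sketch below builds a layer layout in which every n-th layer is standard attention and the rest are an efficient mixer. The function name `hybrid_layer_pattern`, its parameters, and the layer labels are hypothetical and chosen for this example only; the project's actual configuration format may differ.

```python
from typing import List


def hybrid_layer_pattern(
    num_layers: int,
    attention_every: int,
    efficient_kind: str = "linear_attention",
) -> List[str]:
    """Return one label per layer, placing standard attention at every
    `attention_every`-th position and `efficient_kind` everywhere else."""
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if attention_every < 1:
        raise ValueError("attention_every must be at least 1")
    return [
        "attention" if (index + 1) % attention_every == 0 else efficient_kind
        for index in range(num_layers)
    ]


# Example: an 8-layer stack with one standard attention layer per 4 layers.
layout = hybrid_layer_pattern(num_layers=8, attention_every=4)
```

A smaller `attention_every` gives the model more global-context layers at a higher cost, while a larger value leans toward the efficient layers; the right ratio is an empirical choice rather than something this sketch decides.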
The training workflows cover training models from scratch, continuing pretraining from checkpoints, launching multi-node training, and automatically resuming interrupted training from the last saved checkpoint. The project includes a streaming dataset pipeline that reads training data from disk or network incrementally, so the entire dataset never has to be loaded into memory.
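The streaming idea can be sketched with plain Python generators. This is an assumption-laden illustration, not the project's pipeline: `stream_lines` and `batched` are hypothetical helpers, the data is assumed to be newline-delimited text files on local disk, and network sources, shuffling, and tokenization are left out. Only one line and one batch are held in memory at a time.

```python
import itertools
from pathlib import Path
from typing import Iterable, Iterator, List


def stream_lines(paths: Iterable[Path]) -> Iterator[str]:
    """Lazily yield non-empty lines from each file, one file at a time."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a stream into lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)  # ensure islice advances a single shared iterator
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A training loop would iterate over `batched(stream_lines(paths), batch_size)` and process each batch as it arrives; because both helpers are generators, nothing is read from a file until the loop asks for the next batch.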