Flash Linear Attention

Features

Linear Attention Training Frameworks - Provides a complete training framework for sequence models using linear attention and state space mechanisms with hardware-optimized GPU kernels.

Autoregressive Inference Engines - Ships a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding.

Distributed Model Checkpointing - Ships a distributed checkpoint manager that splits model weights across multiple files for parallel loading and saving in multi-node training.

Fused GPU Kernel Composition - Provides fused CUDA kernels that combine multiple neural operations into single GPU kernels to reduce memory bandwidth and launch overhead.

Selective State Space Models - Processes sequences using recurrent state updates that capture long-range dependencies at a cost that grows linearly with sequence length.

Hybrid Model Architectures - Builds large language models that interleave standard attention with linear attention and state space layers to balance efficiency and global context.

Hybrid Layer Compositions - Supports building hybrid sequence models that interleave standard attention with linear attention and state space layers.

Hybrid Architecture Builders - Provides a system for building hybrid sequence models that interleave standard attention with linear attention and state space layers within a single model architecture.

Long-Context Sequence Processors - Trains and deploys sequence models that process long contexts with reduced memory and compute overhead using linear attention and state space mechanisms.

Token Mixing Accelerators - Provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures.

Fused Token Mixing Kernels - Runs fused GPU kernels for token mixing operations that minimize memory bandwidth and launch overhead across different GPU architectures.

Streaming Dataset Loaders - Includes a streaming dataset pipeline that feeds training data from disk or network on the fly, without loading the entire dataset into memory.

Kernel Dispatchers - Implements hardware-aware kernel dispatch that selects optimized CUDA kernels at runtime based on GPU architecture and tensor shapes.

Multi-Node Training Scaling - Distributes training across multiple GPU nodes, with inter-node communication configured through environment variables.

Hybrid State Space Toolkits - Provides a toolkit for constructing hybrid sequence models that combine standard attention with linear and state space layers.

Model Checkpoint Converters - Transforms Hugging Face format checkpoints into the distributed format needed for training.

Domain-Adaptive Continued Pretraining - Supports continuing training from a pretrained checkpoint using fresh data.

Checkpoint Format Transpilations - Provides a weight format transpilation utility for converting between Hugging Face and distributed checkpoint formats.

Training Checkpointers - Automatically continues training from the last saved checkpoint after an interruption.

From-Scratch Training - Trains a new sequence model from scratch using a configurable script with optimizers and schedulers.

Model Weight Conversions - Transforms model checkpoints between Hugging Face format and distributed training formats by remapping tensor layouts and metadata.

Flash Linear Attention is a training framework and inference engine for sequence models that use linear attention and state space mechanisms, designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures, and includes a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding.
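
The idea of decoding from a cached state can be illustrated with the textbook unnormalized linear attention recurrence. The sketch below is a minimal pure-Python illustration, not the project's kernels or API: the function names (`step`, `decode`, `causal_reference`) are hypothetical, and feature maps, normalization, gating, and decay are deliberately left out. It assumes the state is a d_k x d_v matrix updated as S_t = S_{t-1} + k_t v_t^T and read as o_t = q_t^T S_t.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def step(state: Matrix, q: Vector, k: Vector, v: Vector) -> Tuple[Matrix, Vector]:
    """One decoding step: fold (k, v) into the cached state, then read it with q."""
    d_k, d_v = len(k), len(v)
    # S_t = S_{t-1} + k_t v_t^T  (outer-product update, constant size per step)
    new_state = [[state[i][j] + k[i] * v[j] for j in range(d_v)] for i in range(d_k)]
    # o_t = q_t^T S_t
    out = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, out


def decode(qs: List[Vector], ks: List[Vector], vs: List[Vector]) -> List[Vector]:
    """Token-by-token decoding that only ever keeps the fixed-size state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        state, out = step(state, q, k, v)
        outputs.append(out)
    return outputs


def causal_reference(qs: List[Vector], ks: List[Vector], vs: List[Vector]) -> List[Vector]:
    """Quadratic-time reference: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            weight = sum(a * b for a, b in zip(q, ks[s]))
            for j in range(len(out)):
                out[j] += weight * vs[s][j]
        outputs.append(out)
    return outputs
```

Both functions produce the same outputs, but `decode` carries only a fixed-size state between tokens instead of revisiting every earlier key and value, which is what makes cached autoregressive decoding cheap for this family of models.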

The project supports building hybrid sequence models that interleave standard attention with linear attention and state space layers, balancing efficiency with global context. It includes a distributed checkpoint manager that splits model weights across multiple files for parallel loading and saving in multi-node training, and a weight format transpilation utility for converting between Hugging Face and distributed checkpoint formats. The framework also provides hardware-aware kernel dispatch that selects optimized CUDA kernels at runtime based on GPU architecture and tensor shapes.
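
Hardware-aware dispatch of this kind can be pictured as a lookup over a kernel registry. The following is a rough sketch under stated assumptions, not the project's dispatcher: the kernel names, the capability thresholds, and the shape rules are all invented for illustration, and the registry returns a name rather than launching anything.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class KernelEntry:
    name: str                          # hypothetical kernel identifier
    min_capability: Tuple[int, int]    # lowest (major, minor) compute capability accepted
    supports: Callable[[Shape], bool]  # predicate on the input tensor shape


# Ordered from most to least specialized; the last entry is a catch-all fallback.
# All names and thresholds here are illustrative assumptions.
REGISTRY: List[KernelEntry] = [
    KernelEntry("chunked_sm90", (9, 0), lambda shape: shape[-1] % 64 == 0),
    KernelEntry("chunked_sm80", (8, 0), lambda shape: shape[-1] % 32 == 0),
    KernelEntry("generic_fallback", (0, 0), lambda shape: True),
]


def select_kernel(capability: Tuple[int, int], shape: Shape) -> str:
    """Return the first registered kernel whose hardware and shape constraints hold."""
    for entry in REGISTRY:
        # Tuples compare lexicographically, so (8, 6) >= (8, 0) is True.
        if capability >= entry.min_capability and entry.supports(shape):
            return entry.name
    raise LookupError("no kernel registered for this device and shape")
```

For example, `select_kernel((9, 0), (2, 1024, 96))` skips the first entry because 96 is not a multiple of 64 and falls through to the second. In a PyTorch setting the capability pair could come from `torch.cuda.get_device_capability()`, though whether the project queries the device that way is an assumption.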

The training workflow covers training models from scratch, continuing pretraining from checkpoints, launching multi-node runs, and automatically resuming interrupted training from the last saved checkpoint. The project includes a streaming dataset pipeline that feeds training data from disk or network on the fly, without loading the entire dataset into memory.
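
The streaming idea can be sketched with plain Python generators. This is a minimal illustration that assumes a local text file with one sample per line; the helper names `stream_lines` and `batched` are hypothetical and do not come from the project, whose pipeline would also handle tokenization, shuffling, and remote sources.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def stream_lines(path: str) -> Iterator[str]:
    """Yield non-empty lines one at a time; the file is never read in full."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


def batched(samples: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a lazy stream into lists of at most batch_size samples."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A training loop would iterate over `batched(stream_lines(path), batch_size)`, so memory use is bounded by one batch rather than by the size of the dataset.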
