# Dao-AILab/flash-attention

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/dao-ailab-flash-attention).**

22,302 stars · 2,393 forks · Python · bsd-3-clause

## Links

- GitHub: https://github.com/Dao-AILab/flash-attention
- awesome-repositories: https://awesome-repositories.com/repository/dao-ailab-flash-attention.md

## Description

FlashAttention is a library that speeds up attention in large neural network models and reduces its memory footprint during training. It is implemented as a collection of low-level CUDA kernels that optimize memory-bound operations to improve GPU utilization.

The library is built on an IO-aware algorithm design that minimizes data movement between levels of the GPU memory hierarchy. Kernel fusion combines sequential operations into a single kernel, and tiling processes the attention matrices in blocks that fit within high-speed on-chip memory. Together with recomputing intermediate attention scores during the backward pass instead of storing them, these techniques allow transformer models to train with larger batch sizes and longer sequence lengths.
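The blockwise processing described above can be illustrated with a minimal NumPy sketch. It is not the library's CUDA implementation: the function names, the `block_size` parameter, and the single-head 2-D shapes are assumptions chosen for illustration. It shows how attention can be computed one key/value block at a time while keeping only a running maximum and running sum per query row, so the full score matrix is never held at once.

```python
import numpy as np


def reference_attention(q, k, v):
    """Standard attention: materializes the full (n_q, n_k) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (probs / probs.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block_size=32):
    """Illustrative tiled attention (hypothetical helper, not the library API).

    Keys and values are visited in blocks of `block_size` rows. Only a
    running row maximum, a running row sum, and the unnormalized output
    are carried between blocks.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)
    row_sum = np.zeros((n_q, 1))

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        # Scores for this block only: shape (n_q, rows_in_block).
        scores = (q @ k_blk.T) * scale

        # Update the running maximum and rescale what was accumulated so far.
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # 0 on the first block
        probs = np.exp(scores - new_max)

        row_sum = row_sum * correction + probs.sum(axis=-1, keepdims=True)
        out = out * correction + probs @ v_blk
        row_max = new_max

    # Normalize once at the end.
    return out / row_sum
```

Under these assumptions the tiled result matches the standard computation up to floating-point rounding, which is why the blockwise approach does not change what the attention layer computes.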

The kernels cover attention layers and other transformer architecture components, targeting the memory and throughput constraints of training large language models.
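To give a sense of the memory constraint involved, the following back-of-the-envelope sketch compares the size of a fully materialized attention score matrix with the size of one saved statistic per query row. The helper names, the example shapes, and the choice of 2-byte scores and a single 4-byte statistic per row are assumptions for illustration, not measurements of the library.

```python
def score_matrix_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes needed to store one (seq_len x seq_len) score matrix per head."""
    return batch * heads * seq_len * seq_len * bytes_per_element


def row_statistics_bytes(batch, heads, seq_len, bytes_per_element=4):
    """Bytes needed to store a single statistic per query row per head."""
    return batch * heads * seq_len * bytes_per_element


# Hypothetical training shape: batch 8, 16 heads, 4096 tokens.
full = score_matrix_bytes(batch=8, heads=16, seq_len=4096)
stats = row_statistics_bytes(batch=8, heads=16, seq_len=4096)

print(f"full score matrix: {full / 2**30:.1f} GiB")   # 4.0 GiB
print(f"row statistics:    {stats / 2**20:.1f} MiB")  # 2.0 MiB
```

The first quantity grows quadratically with sequence length and the second linearly, which is the gap that makes longer sequences and larger batches feasible when the score matrix is not stored.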

## Tags

### Artificial Intelligence & ML

- [Attention Backends](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/attention-backends.md) — Provides optimized computational backends specifically designed to accelerate attention mechanisms in transformer models. ([source](https://github.com/Dao-AILab/flash-attention/tree/main/flash_attn/cute))
- [GPU Training Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-and-accelerated-compute/training-acceleration-tools/gpu-training-accelerators.md) — Acts as a machine learning acceleration framework providing computational primitives to increase training speed.
- [Language Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/model-fine-tuning-adaptation/language-model-training.md) — Accelerates the training process for massive transformer models by optimizing memory access and computation speed.
- [Memory Optimization Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization-techniques.md) — Reduces the memory footprint of deep learning models to enable larger batch sizes and longer sequence lengths.
- [Transformer Training Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-training-accelerators.md) — Improves the performance of attention mechanisms to enable faster training and inference for transformer architectures.
- [Deep Learning Compute Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-compute-kernels.md) — Optimizes low-level kernel operations on graphics hardware to maximize throughput for deep learning.
- [Kernel Optimization Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/kernel-optimization-libraries.md) — Provides a collection of low-level CUDA kernels that optimize memory-bound operations for deep learning.

### Programming Languages & Runtimes

- [Kernel Fusion Operations](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/runtime-environments/runtimes/graph-symbolic-execution-engines/operation-kernels/kernel-fusion-operations.md) — Combines multiple sequential operations into single GPU kernels to reduce intermediate memory overhead.

### Operating Systems & Systems Programming

- [Gradient Checkpointing](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/buffer-and-cache-management/gradient-checkpointing.md) — Reduces memory usage by discarding intermediate attention scores and recomputing them during the backward pass.
- [SRAM-Aware](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/sram-aware.md) — Explicitly manages data movement between high-speed on-chip memory and main GPU memory to maximize throughput.

### Data & Databases

- [Batch Matrix Multiplication Utilities](https://awesome-repositories.com/f/data-databases/batch-processing/batch-matrix-multiplication-utilities.md) — Divides large attention matrices into smaller blocks that fit into fast on-chip memory to minimize global memory access.

### Education & Learning Resources

- [Memory-Efficient Algorithms](https://awesome-repositories.com/f/education-learning-resources/technical-domain-education/technical-academic-domains/algorithmic-design-analysis/algorithms-and-design-patterns/memory-efficient-algorithms.md) — Implements IO-aware algorithms that minimize memory reads and writes between different levels of memory.
