xFormers

xFormers is a library of optimized, interoperable building blocks for constructing and experimenting with Transformer architectures. It covers fused GPU operators, sparse attention mechanisms, modular Transformer components, and performance benchmarking tools.

The project includes a fused CUDA operator library that combines common layer sequences into single GPU kernels to increase throughput. It also includes a sparse attention framework and memory-efficient attention kernels, which use tiling strategies and structured sparsity patterns to reduce both computation and memory usage.
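As a rough illustration of the tiling idea described above, the sketch below computes scaled dot-product attention one key/value block at a time, keeping only a running maximum, a running normalizer, and a running weighted sum per query. It is plain Python on nested lists, not xFormers code, and it runs on the CPU; the function names and the `block_size` parameter are assumptions made for this example.

```python
import math


def naive_attention(q, k, v):
    """Reference version: builds the full score row for each query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Processes keys/values in blocks; never stores a full score row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this is exp(-inf), which is 0.0.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for d in range(dim_v):
                    acc[d] += w * vj[d]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding. The tiled version holds at most `block_size` scores per query at any time, which is the property that lets a GPU kernel keep its working set in fast on-chip memory.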

The toolkit's performance tooling includes kernel fusion and an operator benchmarking framework that measures the execution latency and memory footprint of individual model components. It also supports composable block assembly and custom component extensions to facilitate architectural experimentation.
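The kind of per-component measurement described above can be sketched with the Python standard library alone. The `benchmark` helper and the `square_all` workload below are hypothetical names for this example, not part of xFormers. The helper reports host wall-clock time and peak Python heap usage. A real GPU benchmark would also need to synchronize the device before reading the clock, and would read device memory counters instead.

```python
import time
import tracemalloc


def benchmark(fn, *args, repeats=5, **kwargs):
    """Return the best wall-clock time (seconds) and peak heap use (bytes)."""
    fn(*args, **kwargs)  # warm-up call, excluded from timing

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)

    # Measure memory in a separate run so tracing overhead
    # does not distort the timing above.
    tracemalloc.start()
    fn(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"seconds": best, "peak_bytes": peak}


def square_all(n):
    """Stand-in workload for the component under test."""
    return [i * i for i in range(n)]


if __name__ == "__main__":
    result = benchmark(square_all, 100_000)
    print(f"best time: {result['seconds']:.6f} s, peak: {result['peak_bytes']} bytes")
```

Taking the minimum over several repeats is one common way to reduce noise from other processes. Reporting the median or the full distribution is an equally reasonable choice.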

Features

Fused GPU Kernel Composition - Provides a library of pre-optimized fused GPU kernels that combine common layers such as softmax and linear operations.

Attention Kernel Optimizers - Implements memory-efficient attention kernels using tiling and optimized memory access patterns.

Block-Wise Attention - Uses block-wise tiling to compute attention without materializing the full attention matrix, reducing memory overhead.

Memory-Efficient Deep Learning - Reduces GPU memory usage and increases speed for scaled dot-product attention in large-scale models.

Modular Layer Assembly - Enables the assembly of Transformer models by combining interoperable and pre-optimized building blocks.

Modular Architectures - Provides a system for assembling Transformer models using interchangeable and pre-optimized modular blocks.

Sparse Attention Kernels - Provides specialized kernels for sparse attention using structured sparsity patterns to handle long sequences.

Block-Sparse Attention Kernels - Implements block-sparse attention kernels that use structured masks to reduce computational complexity.

Sparse Attention Modules - Implements a framework of sparse attention modules and patterns to reduce computational overhead.

Operation Fusion Optimizations - Optimizes processing throughput by merging multiple neural network operations into single fused CUDA kernels.

Transformer Blocks - Ships a collection of optimized and interoperable Transformer blocks for modular model construction.

Transformer Models - Offers a framework for constructing custom Transformer models using optimized modular building blocks.

Layer-Level Performance Benchmarking - Provides tools for comparing the speed and memory overhead of individual model layers to guide optimization.

Operator Benchmarking Frameworks - Provides a framework to measure the execution speed and memory footprint of individual transformer building blocks.

Model Execution Benchmarks - Includes a framework for benchmarking the execution speed and memory consumption of individual model blocks.

Architectural Block Extensions - Provides an interface for integrating locally defined Transformer blocks for architectural experimentation.

Model Component Extensions - Provides an interface for adding custom Transformer blocks that integrate with existing optimized components.

Neural Network Operation Benchmarking - Ships a framework to measure the execution speed and memory footprint of individual neural network operations.
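To make the block-sparse entries in the list above more concrete, the sketch below builds one possible structured pattern, a local window over blocks, and counts how many attention scores it computes compared with dense attention. The helper names and the choice of a local-window pattern are assumptions for illustration only. They do not reflect the patterns or API that xFormers actually ships.

```python
def local_block_layout(num_blocks, window=1):
    """layout[i][j] is True when query block i may attend to key block j.

    Each query block sees only key blocks within `window` positions of it.
    """
    return [[abs(i - j) <= window for j in range(num_blocks)]
            for i in range(num_blocks)]


def count_score_entries(layout, block_size):
    """Number of attention scores computed when masked blocks are skipped."""
    active_blocks = sum(cell for row in layout for cell in row)
    return active_blocks * block_size * block_size


if __name__ == "__main__":
    num_blocks, block_size = 8, 64           # 512 tokens in total
    layout = local_block_layout(num_blocks, window=1)
    sparse = count_score_entries(layout, block_size)
    dense = (num_blocks * block_size) ** 2
    print(sparse, dense)                     # 90112 262144
```

With a fixed window, the number of active blocks grows linearly with sequence length rather than quadratically, which is why structured patterns of this kind help with long sequences.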

facebookresearch/xformers
