# nvidia/cutlass

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/nvidia-cutlass).**

9,904 stars · 1,909 forks · C++ · NOASSERTION

## Links

- GitHub: https://github.com/NVIDIA/cutlass
- Homepage: https://docs.nvidia.com/cutlass/index.html
- awesome-repositories: https://awesome-repositories.com/repository/nvidia-cutlass.md

## Topics

`cpp` `cuda` `deep-learning` `deep-learning-library` `gpu` `nvidia` `python`

## Description

CUTLASS is a collection of C++ templates and Python interfaces for implementing high-performance linear algebra operations on NVIDIA GPUs. It provides a kernel composition framework for designing custom GPU kernels and a mixed-precision tensor library that executes operations across data formats ranging from 64-bit floating point to 4-bit integers.

The project features a toolkit for operator fusion that integrates activation functions and bias calculations directly into matrix multiplication kernels to reduce memory passes. It also includes a Python-based domain-specific language for defining high-performance GPU operations, which eliminates the need for C++ glue code.
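To make the fusion idea concrete, the following is a minimal pure-Python sketch, not CUTLASS code and not its API. The function names (`gemm_then_epilogue`, `fused_gemm_bias_relu`) are hypothetical, and ReLU is assumed as the activation purely for illustration. It contrasts a matrix multiply followed by separate bias and activation passes with a version that applies both to each accumulator before the result is stored.

```python
def relu(x):
    """Rectified linear unit, used here as an example activation."""
    return x if x > 0.0 else 0.0


def gemm_then_epilogue(a, b, bias):
    """Unfused reference: multiply, then make two more passes over the output."""
    m, k, n = len(a), len(b), len(b[0])
    # Pass 1: plain matrix multiplication, result fully written out.
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    # Pass 2: re-read the output to add a per-column bias.
    c = [[c[i][j] + bias[j] for j in range(n)] for i in range(m)]
    # Pass 3: re-read the output again to apply the activation.
    return [[relu(v) for v in row] for row in c]


def fused_gemm_bias_relu(a, b, bias):
    """Fused: bias and activation are applied before each element is stored."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Epilogue runs on the accumulator; the output is written once.
            out[i][j] = relu(acc + bias[j])
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
print(fused_gemm_bias_relu(a, b, bias))
```

Both functions compute the same values; the fused form simply touches each output element once instead of three times, which is the memory-pass saving the description refers to.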

The framework also covers GPU memory layout optimization, hierarchical tiling strategies, and the development of specialized CUDA kernels through a modular software hierarchy.
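As a rough illustration of tiling, here is a single-level blocked matrix multiply in plain Python. It is a conceptual sketch only: the name `tiled_matmul` and the `tile` parameter are hypothetical, and a real GPU kernel would nest several such levels (for example per thread block, per warp, and per thread) rather than the one level shown here.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the problem tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Outer loops walk over tiles of the output and of the reduction dimension.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Inner loops work on one small block; min() handles edge
                # tiles when a dimension is not a multiple of the tile size.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
print(tiled_matmul(a, b, tile=2))
```

The result is identical to an untiled multiply; only the traversal order changes, so that each small block of the operands is reused while it is still close to the compute units.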

## Tags

### Artificial Intelligence & ML

- [GPU Kernel Implementations](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations.md) — Provides a framework for implementing custom-written hardware-level kernels for accelerated parallel computing on NVIDIA GPUs.
- [CUDA-Accelerated Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-libraries/cuda-accelerated-libraries.md) — A CUDA-accelerated library of C++ templates and Python interfaces for high-performance matrix operations.
- [Kernel Composition Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/gpu-kernel-implementations/kernel-composition-frameworks.md) — Provides a modular software hierarchy for composing specialized GPU kernels by tuning tiling sizes and data types. ([source](https://github.com/nvidia/cutlass#readme))
- [Compute Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-and-accelerated-compute/training-acceleration-tools/mixed-precision-training/compute-engines.md) — Implements a mixed-precision tensor library supporting data formats from 64-bit floating point down to 4-bit integers. ([source](https://github.com/nvidia/cutlass#readme))
- [Mixed-Precision Compute Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-and-accelerated-compute/training-acceleration-tools/mixed-precision-training/mixed-precision-compute-engines.md) — Provides a unified interface for hardware-accelerated tensor cores supporting numerical formats from 64-bit float to 4-bit integers.
- [Mixed-Precision Computing](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-and-accelerated-compute/training-acceleration-tools/mixed-precision-training/mixed-precision-computing.md) — Executes operations across various data types from 64-bit floating point to 4-bit formats to balance precision and speed.
- [Matrix Operation Fusions](https://awesome-repositories.com/f/artificial-intelligence-ml/matrix-operation-fusions.md) — Integrates activation functions and bias calculations directly into matrix multiplication kernels to reduce memory passes. ([source](https://github.com/nvidia/cutlass#readme))
- [Gathered Matrix Multiplication](https://awesome-repositories.com/f/artificial-intelligence-ml/matrix-operation-fusions/gathered-matrix-multiplication.md) — Implements the fusion of gathering and matrix multiplication into a single optimized operation to reduce memory overhead.
- [Tensor Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/tensor-libraries.md) — Provides a framework for creating and manipulating multidimensional arrays across diverse numerical precisions.

### Data & Databases

- [Tensor Mappings](https://awesome-repositories.com/f/data-databases/memory-mapping-utilities/tensor-mappings.md) — Maps multi-dimensional tensor data into linear memory to optimize direct hardware access and movement.
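
One common way to express such a mapping is a shape paired with per-dimension strides, where the linear offset is the dot product of the coordinate and the strides. The sketch below assumes that model; the helper names (`offset`, `row_major_stride`, `column_major_stride`) are hypothetical and are not taken from the CUTLASS API.

```python
def offset(coord, stride):
    """Linear memory offset of a coordinate under the given strides."""
    return sum(c * s for c, s in zip(coord, stride))


def row_major_stride(shape):
    """Strides for which the last dimension is contiguous in memory."""
    stride, running = [], 1
    for extent in reversed(shape):
        stride.append(running)
        running *= extent
    return tuple(reversed(stride))


def column_major_stride(shape):
    """Strides for which the first dimension is contiguous in memory."""
    stride, running = [], 1
    for extent in shape:
        stride.append(running)
        running *= extent
    return tuple(stride)


shape = (2, 3)
print(row_major_stride(shape), offset((1, 1), row_major_stride(shape)))
print(column_major_stride(shape), offset((1, 1), column_major_stride(shape)))
```

The same logical coordinate lands at a different linear offset under each layout, which is why the choice of layout determines which accesses end up adjacent in memory.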

### Programming Languages & Runtimes

- [Python GPU Kernels](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/python-gpu-kernels.md) — Provides a domain-specific language to define high-performance GPU kernels directly in Python, eliminating the need for C++ glue code.
- [Multidimensional Arrays](https://awesome-repositories.com/f/programming-languages-runtimes/multidimensional-arrays.md) — Defines multidimensional array structures to simplify indexing and data movement across GPU threads. ([source](https://github.com/nvidia/cutlass#readme))

### Scientific & Mathematical Computing

- [GPU Linear Algebra Libraries](https://awesome-repositories.com/f/scientific-mathematical-computing/gpu-linear-algebra-libraries.md) — Implements high-performance matrix multiplication and tensor computations on NVIDIA hardware using modular templates.
- [GPU Matrix Operation Implementations](https://awesome-repositories.com/f/scientific-mathematical-computing/gpu-matrix-operation-implementations.md) — Builds high-performance matrix multiplication and related GPU computations using modular templates and language interfaces. ([source](https://github.com/nvidia/cutlass#readme))

### Software Engineering & Architecture

- [Memory Layout Optimizations](https://awesome-repositories.com/f/software-engineering-architecture/memory-layout-optimizations.md) — Optimizes the organization of multidimensional tensors to improve cache locality and reduce memory overhead.
- [Template-Based Kernel Composition](https://awesome-repositories.com/f/software-engineering-architecture/performance-reliability/performance-optimization/computational-efficiency/custom-kernel-accelerators/custom-c-kernels/template-based-kernel-composition.md) — Uses C++ templates to generate specialized GPU kernels by combining modular software components and hardware tuning parameters.
- [Tiled Memory Access Patterns](https://awesome-repositories.com/f/software-engineering-architecture/shared-memory-management/memory-access-profilers/tiled-memory-access-patterns.md) — Organizes data movement into structured blocks to maximize cache locality across the GPU memory hierarchy.

### Operating Systems & Systems Programming

- [Kernel Configuration DSLs](https://awesome-repositories.com/f/operating-systems-systems-programming/systems-programming/c-interoperability-layers/python-c-interfaces/kernel-configuration-dsls.md) — Provides a high-level Python DSL to define kernel configurations and layouts without requiring C++ glue code.

### Part of an Awesome List

- [AI & Machine Learning](https://awesome-repositories.com/f/awesome-lists/ai/ai-machine-learning.md) — CUDA templates for high-performance linear algebra
- [Tensor Core Optimization](https://awesome-repositories.com/f/awesome-lists/ai/tensor-core-optimization.md) — Template-based library for implementing high-performance tensor computations on NVIDIA GPUs.
