# deepspeedai/DeepSpeed

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/deepspeedai-deepspeed).**

41,638 stars · 4,724 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/deepspeedai/DeepSpeed
- Homepage: https://www.deepspeed.ai/
- awesome-repositories: https://awesome-repositories.com/repository/deepspeedai-deepspeed.md

## Topics

`billion-parameters` `compression` `data-parallelism` `deep-learning` `gpu` `inference` `machine-learning` `mixture-of-experts` `model-parallelism` `pipeline-parallelism` `pytorch` `trillion-parameters` `zero`

## Description

DeepSpeed is a high-performance library for scaling deep learning training and inference across large clusters of GPUs and compute nodes. It provides a suite of tools for distributed training that lets models exceeding the memory of a single device run through parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading.
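
To make the effect of partitioning concrete, the sketch below estimates per-GPU memory for model states under mixed-precision Adam. It follows the accounting used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), which is an assumption brought in from outside this page. The function name and its arguments are illustrative and not part of the DeepSpeed API; activations and temporary buffers are ignored.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU bytes for model states under mixed-precision Adam.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights, momentum, variance

    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical workload: 7.5 billion parameters on 64 GPUs.
    for s in range(4):
        gb = zero_model_state_bytes(7_500_000_000, 64, s) / 1e9
        print(f"stage {s}: {gb:.2f} GB per GPU")
```

Under these assumptions, each additional stage divides one more component by the number of GPUs, which is why the later stages allow much larger models per device.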

The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization, and custom-compiled kernels, it minimizes network bandwidth bottlenecks and maximizes computational throughput. It further supports complex architectures like mixture-of-experts and long-context models by integrating sequence parallelism and sparse attention mechanisms, ensuring efficient resource utilization across heterogeneous hardware topologies.
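
As an illustration of the gradient compression idea mentioned above, here is a minimal sketch of 1-bit sign compression with error feedback, the general technique behind communication-efficient optimizers of this kind. It is written in plain Python for clarity and is not DeepSpeed's implementation; the function names are hypothetical.

```python
def compress_1bit(values, error):
    """Compress a vector to signs plus one scale, carrying over the residual.

    values: the tensor to send (for example gradients or momentum), as a list
    error:  the compression residual left over from the previous step
    """
    # Add back what was lost in the previous round (error feedback).
    corrected = [v + e for v, e in zip(values, error)]
    # One shared magnitude: the mean absolute value of the corrected vector.
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # One bit per element: the sign.
    signs = [1 if c >= 0 else -1 for c in corrected]
    # The residual is stored locally and re-added on the next step.
    new_error = [c - scale * s for c, s in zip(corrected, signs)]
    return signs, scale, new_error


def decompress_1bit(signs, scale):
    """Rebuild the approximate vector on the receiving side."""
    return [scale * s for s in signs]


if __name__ == "__main__":
    grad = [0.5, -1.5, 1.0, -1.0]
    error = [0.0] * len(grad)
    signs, scale, error = compress_1bit(grad, error)
    print(signs, scale, decompress_1bit(signs, scale), error)
```

Only the sign bits and a single scale cross the network, while the residual kept in `error` ensures that information dropped in one step is reintroduced in the next.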

Beyond core training, the project includes utilities for automated performance tuning, model profiling, and universal checkpointing. It also supports diverse processor architectures and cloud-based cluster deployment, with targeted kernel compilation and diagnostic monitoring for tuning the execution environment.

## Tags

### Artificial Intelligence & ML

- [Distributed Memory Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-memory-optimizers.md) — The framework partitions model states across available devices to reduce memory consumption and enable the training of massive models on distributed hardware. ([source](https://www.deepspeed.ai/tutorials/zero/))
- [Distributed Training Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-frameworks.md) — The framework supports scaling model training across multiple compute nodes and GPUs by integrating existing architectures with distributed training capabilities. ([source](https://www.deepspeed.ai/tutorials/megatron/))
- [Distributed Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-optimizers.md) — The framework minimizes data transfer between distributed nodes by using compressed algorithms to reduce communication overhead during optimizer updates. ([source](https://www.deepspeed.ai/tutorials/zero-one-adam/))
- [Distributed Training Sharding](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-sharding.md) — Model states and optimizer parameters are sharded across multiple compute nodes to enable training of models exceeding single-device memory.
- [Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-engines.md) — The framework provides tools to inject optimized kernels and configure tensor parallelism to accelerate transformer model execution across multiple hardware devices. ([source](https://www.deepspeed.ai/tutorials/inference-tutorial/))
- [Large-Scale Training Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-training-frameworks.md) — Scaling neural network training across multiple compute nodes and GPUs to handle massive datasets and complex model architectures.
- [Memory Optimization Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-optimization-techniques.md) — Reduces GPU memory consumption during large-scale training by offloading optimizer states to the host CPU. ([source](https://www.deepspeed.ai/tutorials/zero-offload/))
- [Universal Checkpointing](https://awesome-repositories.com/f/artificial-intelligence-ml/universal-checkpointing.md) — The framework standardizes model, optimizer, and scheduler states into a unified format to enable consistent checkpointing across varying model sizes, topologies, and hardware. ([source](https://www.deepspeed.ai/tutorials/universal-checkpointing/))
- [Inference Acceleration Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-acceleration-engines.md) — Accelerating the execution of large transformer models by injecting optimized kernels and utilizing tensor parallelism for low latency.
- [Long Context Training Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/long-context-training-optimizations.md) — The framework reduces memory usage while maintaining precision by chunking input sequences and offloading activations between GPU and host memory. ([source](https://www.deepspeed.ai/tutorials/ulysses-offload/))
- [Model Parallelism Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/model-parallelism-frameworks.md) — Neural network layers are partitioned into sequential stages across multiple devices to distribute memory load and enable large-scale model training.
- [Model Serialization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-serialization.md) — Model, optimizer, and scheduler states are normalized into a consistent format to facilitate seamless saving and loading across heterogeneous hardware topologies.
- [Parallelism Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/parallelism-engines.md) — A runtime environment that partitions massive model states and activations across multiple hardware devices to overcome memory constraints.
- [Pre-training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-training-pipelines.md) — The framework provides optimized modeling code and data pipelines to configure dataset paths and hyperparameters for initial BERT model training. ([source](https://www.deepspeed.ai/tutorials/bert-pretraining/))
- [Transformer Training Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-training-accelerators.md) — The framework accelerates transformer training by applying specialized GPU kernels that improve throughput on single devices and scale across multi-GPU clusters. ([source](https://www.deepspeed.ai/tutorials/transformer_kernel/))
- [Distributed Training Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-utilities.md) — Reducing memory consumption and communication overhead by partitioning model states and gradients across multiple hardware devices.
- [Attention Mechanisms](https://awesome-repositories.com/f/artificial-intelligence-ml/attention-mechanisms.md) — Attention mechanisms utilize block-sparse matrix operations to reduce computational complexity and memory footprint when processing long input sequences.
- [Fine-tuning Scripts](https://awesome-repositories.com/f/artificial-intelligence-ml/fine-tuning-scripts.md) — The framework includes reference scripts to adapt pre-trained BERT models to specific datasets using distributed training modes for improved performance. ([source](https://www.deepspeed.ai/tutorials/bert-finetuning/))
- [Memory Management Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/memory-management-utilities.md) — Memory-intensive states are dynamically moved between GPU memory and host CPU RAM to balance compute speed with available hardware capacity.
- [Model Quantization Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/model-quantization-tools.md) — The framework defines bit-precision schedules, quantization algorithms, and grouping parameters to reduce model size during the training process. ([source](https://www.deepspeed.ai/tutorials/MoQ-tutorial/))
- [Optimization Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/optimization-techniques.md) — A collection of memory and compute efficiency techniques designed to accelerate training and inference for large-scale neural networks.
- [Pipeline Parallelism Partitioners](https://awesome-repositories.com/f/artificial-intelligence-ml/pipeline-parallelism-partitioners.md) — The framework enables efficient pipeline parallel training by partitioning large neural networks across multiple GPUs as a sequential list of layers. ([source](https://www.deepspeed.ai/tutorials/pipeline/))
- [Resource Optimization Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/resource-optimization-tools.md) — The framework optimizes memory and compute efficiency by automatically tuning batch sizes and memory configurations based on model and system heuristics. ([source](https://www.deepspeed.ai/tutorials/autotuning/))
- [Sequence Parallelism Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-parallelism-frameworks.md) — The framework distributes long sequences across multiple GPU devices by registering custom attention layers and adapting data loaders for transformer models. ([source](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism))
- [Training Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/training-optimizations.md) — The framework accelerates convergence and reduces training time by dynamically dropping transformer layers during the training process using command-line flags. ([source](https://www.deepspeed.ai/tutorials/progressive_layer_dropping/))
- [Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/training-optimizers.md) — Optimizing system efficiency by automatically adjusting batch sizes, memory configurations, and learning schedules to improve convergence and throughput.
- [Curriculum Learning Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/curriculum-learning-frameworks.md) — The framework provides curriculum learning tools that define difficulty metrics and training schedules to improve model convergence and stability through progressive data complexity. ([source](https://www.deepspeed.ai/tutorials/curriculum-learning/))
- [Gradient Compression Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-compression-techniques.md) — Gradient data is compressed and quantized before network transmission to minimize bandwidth bottlenecks during large-scale distributed training sessions.
- [Learning Rate Schedulers](https://awesome-repositories.com/f/artificial-intelligence-ml/learning-rate-schedulers.md) — The framework improves convergence speeds during large-batch training by applying cyclic learning rate and momentum schedules to the optimization process. ([source](https://www.deepspeed.ai/tutorials/one-cycle/))
- [Mixed Precision Training Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/mixed-precision-training-utilities.md) — The framework improves memory and communication efficiency during training by applying block-based weight quantization and hierarchical parameter partitioning across all passes. ([source](https://www.deepspeed.ai/tutorials/mixed_precision_zeropp/))
- [Mixture-of-Experts Inference Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/mixture-of-experts-inference-optimizers.md) — The framework achieves low latency and high throughput for mixture-of-experts models by using specialized parallelization techniques that avoid traditional dense model trade-offs. ([source](https://www.deepspeed.ai/tutorials/mixture-of-experts-inference/))
- [Sparse Attention Modules](https://awesome-repositories.com/f/artificial-intelligence-ml/sparse-attention-modules.md) — The framework reduces computational overhead in pre-trained models by replacing dense self-attention layers with optimized sparse attention modules. ([source](https://www.deepspeed.ai/tutorials/sparse-attention/))
- [Training Checkpointing](https://awesome-repositories.com/f/artificial-intelligence-ml/training-checkpointing.md) — The framework enables non-blocking model checkpointing by copying parameters and optimizer states while they are immutable, so the transfer overlaps with large-scale training instead of pausing it. ([source](https://www.deepspeed.ai/tutorials/datastates-async-checkpointing/))
- [Cloud Training Orchestrators](https://awesome-repositories.com/f/artificial-intelligence-ml/cloud-training-orchestrators.md) — The framework automates distributed training jobs on managed cloud services using provided configuration recipes and integration examples for consistent model tuning. ([source](https://www.deepspeed.ai/tutorials/azure/))
- [Distributed Communication Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-communication-optimizers.md) — The framework reduces total communication volume between compute nodes by applying weight quantization and hierarchical parameter partitioning during distributed training. ([source](https://www.deepspeed.ai/tutorials/zeropp/))
- [Expert Parallelism Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/expert-parallelism-configurations.md) — The framework distributes model parameters across multiple process groups by specifying the number of experts and the degree of expert parallelism. ([source](https://www.deepspeed.ai/tutorials/mixture-of-experts/))
- [Gradient Management Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/gradient-management-techniques.md) — The framework reduces communication overhead by updating only critical gradients during training steps while offloading the remaining gradient updates to the CPU. ([source](https://www.deepspeed.ai/tutorials/zenflow/))
- [Hardware Acceleration Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-acceleration-kernels.md) — Custom-compiled kernels optimize mathematical operations for specific hardware architectures to maximize throughput and reduce computational latency.
- [Hardware Acceleration Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-acceleration-toolkits.md) — A set of specialized kernels and configuration tools that optimize neural network execution for diverse processor architectures and accelerators.
- [Model Pruning Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/model-pruning-techniques.md) — The framework decreases inference latency by reducing the number of hidden layers in a neural network while maintaining consistent layer width. ([source](https://www.deepspeed.ai/tutorials/model-compression/))
- [Sparse Computing Kernels](https://awesome-repositories.com/f/artificial-intelligence-ml/sparse-computing-kernels.md) — Improving computational speed and memory usage by replacing dense operations with specialized sparse kernels and attention mechanisms.
- [Communication Layers](https://awesome-repositories.com/f/artificial-intelligence-ml/communication-layers.md) — A communication layer that reduces network overhead during multi-node training through gradient compression and efficient parameter synchronization.
- [Computer Vision Training](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-training.md) — The framework includes standard training scripts for image datasets to verify model performance and establish baseline accuracy metrics for neural networks. ([source](https://www.deepspeed.ai/tutorials/cifar-10/))
- [NPU Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/npu-accelerators.md) — The framework supports hardware-accelerated training and inference workflows on specialized neural processing units by managing required drivers, firmware, and toolkits. ([source](https://www.deepspeed.ai/tutorials/accelerator-setup-guide/))
- [Training Diagnostic Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/training-diagnostic-tools.md) — The framework runs a learning rate range test to identify the maximum stable learning rate for model training, enabling faster convergence and effective use of large batch sizes. ([source](https://www.deepspeed.ai/tutorials/lrrt/))
- [XPU Accelerators](https://awesome-repositories.com/f/artificial-intelligence-ml/xpu-accelerators.md) — The framework supports runtime compilation of hardware-specific kernels for accelerated computing by installing compatible framework variants and matching compilers. ([source](https://www.deepspeed.ai/tutorials/accelerator-setup-guide/))
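
Several of the features listed above (state partitioning, optimizer offloading to the host CPU, mixed precision) are typically switched on through a JSON configuration. The sketch below builds such a configuration as a Python dictionary. The key names follow the DeepSpeed configuration documentation as best understood here; the values, `WORLD_SIZE`, and the helper function are placeholders chosen for illustration and should be checked against the official documentation before use.

```python
import json

# Placeholder values for illustration; tune them for your model and cluster.
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def effective_batch_size(config: dict, world_size: int) -> int:
    """Global batch size implied by micro-batch size, accumulation, and GPU count."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


WORLD_SIZE = 8  # hypothetical number of GPUs

# The three batch-related settings must agree with each other.
assert effective_batch_size(ds_config, WORLD_SIZE) == ds_config["train_batch_size"]

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The resulting JSON could be saved to a file and supplied as the DeepSpeed configuration for a training run.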

### Scientific & Mathematical Computing

- [Mathematical Optimization Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/mathematical-optimization-kernels.md) — The framework reduces memory usage and increases training speed for structural biology models using specialized kernels designed for large-scale sequence computations. ([source](https://www.deepspeed.ai/tutorials/ds4sci_evoformerattention/))
- [Sparse Attention Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/sparse-attention-kernels.md) — The framework processes sequences efficiently by computing self-attention outputs using sparse kernels that support relative position embeddings and attention masks. ([source](https://www.deepspeed.ai/tutorials/sparse-attention/))
- [Sparse Matrix Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/sparse-matrix-kernels.md) — The framework optimizes memory usage and computational efficiency in transformer models by executing block-sparse matrix multiplication patterns. ([source](https://www.deepspeed.ai/tutorials/sparse-attention/))
- [Sparse Softmax Kernels](https://awesome-repositories.com/f/scientific-mathematical-computing/sparse-softmax-kernels.md) — The framework maintains sparsity constraints within attention mechanisms by applying block-sparse softmax operations during forward and backward passes. ([source](https://www.deepspeed.ai/tutorials/sparse-attention/))
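
The block-sparse softmax described above can be sketched in plain Python: attention scores are normalized only over the positions that fall inside active blocks of a layout, and everything else stays at zero. This is a readability-oriented illustration, not the optimized kernel; the function name and the layout representation are assumptions.

```python
import math


def block_sparse_softmax(scores, layout, block):
    """Row-wise softmax restricted to active blocks.

    scores: square list-of-lists of attention scores
    layout: list-of-lists of 0/1 flags, one entry per (row block, column block)
    block:  block edge length in elements
    """
    out = []
    for i, row in enumerate(scores):
        # Columns whose block is marked active for this row's block.
        active = [j for j in range(len(row)) if layout[i // block][j // block]]
        if not active:
            out.append([0.0] * len(row))
            continue
        # Subtract the maximum for numerical stability.
        m = max(row[j] for j in active)
        exps = {j: math.exp(row[j] - m) for j in active}
        total = sum(exps.values())
        out.append([exps[j] / total if j in exps else 0.0 for j in range(len(row))])
    return out


if __name__ == "__main__":
    scores = [[0.0] * 4 for _ in range(4)]
    layout = [[1, 0], [1, 1]]  # the first row block sees only the first column block
    for row in block_sparse_softmax(scores, layout, block=2):
        print(row)
```

Because inactive blocks are never exponentiated, the work scales with the number of active blocks rather than with the full sequence length squared.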

### System Administration & Monitoring

- [Performance Profilers](https://awesome-repositories.com/f/system-administration-monitoring/performance-profilers.md) — The framework calculates floating-point operations, latency, and throughput for individual modules and entire models to measure computational efficiency. ([source](https://www.deepspeed.ai/tutorials/flops-profiler/))
- [Training Metrics Exporters](https://awesome-repositories.com/f/system-administration-monitoring/training-metrics-exporters.md) — The framework records model and system performance data in real-time to external logging backends to ensure efficient hardware resource utilization. ([source](https://www.deepspeed.ai/tutorials/monitor))
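
To show the kind of arithmetic a FLOPs profiler performs, the sketch below counts floating-point operations for a single dense layer and converts a measured latency into throughput. It assumes the common convention that one multiply-accumulate counts as two operations and ignores the bias term; the function names are hypothetical and unrelated to the DeepSpeed profiler API.

```python
def linear_flops(batch: int, in_features: int, out_features: int) -> int:
    """FLOPs for a dense layer's matrix multiply (bias ignored).

    Each output element needs in_features multiply-accumulates,
    and each multiply-accumulate is counted as 2 FLOPs.
    """
    macs = batch * in_features * out_features
    return 2 * macs


def throughput_tflops(flops: float, latency_seconds: float) -> float:
    """Achieved throughput in TFLOPS for a given operation count and latency."""
    return flops / latency_seconds / 1e12


if __name__ == "__main__":
    flops = linear_flops(batch=8, in_features=1024, out_features=4096)
    # The latency here is a made-up figure, not a measurement.
    print(flops, throughput_tflops(flops, latency_seconds=1e-4))
```

Summing such per-module counts over a model and dividing by measured step time gives the model-level efficiency figure described above.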

### Testing & Quality Assurance

- [Execution Tracers](https://awesome-repositories.com/f/testing-quality-assurance/execution-tracers.md) — The framework records execution steps and exports performance data by wrapping training code in context managers that schedule tracing intervals. ([source](https://www.deepspeed.ai/tutorials/pytorch-profiler/))
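
The pattern of wrapping training steps in a context manager that records only during scheduled intervals can be sketched with the standard library alone. The example loosely mirrors the wait, warmup, and active phases used by scheduled profilers such as the one in PyTorch, but every name here is hypothetical and it records only wall-clock time per step.

```python
import time
from contextlib import contextmanager


def phase(step: int, wait: int, warmup: int, active: int) -> str:
    """Return the tracing phase for a step within a repeating cycle."""
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "idle"
    if pos < wait + warmup:
        return "warmup"
    return "record"


class StepTracer:
    """Times training steps, keeping results only for steps in the record phase."""

    def __init__(self, wait: int, warmup: int, active: int):
        self.wait, self.warmup, self.active = wait, warmup, active
        self.step_num = 0
        self.records = []  # list of (step index, elapsed seconds)

    @contextmanager
    def step(self):
        current = phase(self.step_num, self.wait, self.warmup, self.active)
        start = time.perf_counter()
        try:
            yield current
        finally:
            elapsed = time.perf_counter() - start
            if current == "record":
                self.records.append((self.step_num, elapsed))
            self.step_num += 1


tracer = StepTracer(wait=1, warmup=1, active=2)
for _ in range(8):
    with tracer.step():
        sum(range(1000))  # stand-in for one training step

print([step for step, _ in tracer.records])
```

Skipping the idle and warmup steps keeps tracing overhead and startup noise out of the exported measurements.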