DeepSpeed

DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading.
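The memory benefit of parameter partitioning can be illustrated with a rough estimate. The sketch below assumes the mixed-precision Adam accounting commonly used to describe ZeRO (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state). The function name `model_state_bytes` is hypothetical and is not part of the DeepSpeed API; activations and buffers are ignored.

```python
def model_state_bytes(num_params: float, num_devices: int, stage: int) -> float:
    """Approximate per-device model-state memory in bytes (illustrative helper).

    stage 0: nothing partitioned (plain data parallelism)
    stage 1: optimizer state partitioned
    stage 2: optimizer state and gradients partitioned
    stage 3: optimizer state, gradients, and weights partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params      # fp16 weights
    grads = 2.0 * num_params        # fp16 gradients
    optimizer = 12.0 * num_params   # fp32 weights, momentum, variance (Adam)
    if stage >= 1:
        optimizer /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        weights /= num_devices
    return weights + grads + optimizer


# Example: a 7.5B-parameter model on 64 devices.
for s in range(4):
    gb = model_state_bytes(7.5e9, 64, s) / 1e9
    print(f"stage {s}: {gb:.1f} GB per device")
```

Under these assumptions the per-device footprint falls from about 120 GB with no partitioning to under 2 GB when all three state types are partitioned, which is why models larger than a single device's memory become trainable.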

The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization, and custom-compiled kernels, it minimizes network bandwidth bottlenecks and maximizes computational throughput. It further supports complex architectures like mixture-of-experts and long-context models by integrating sequence parallelism and sparse attention mechanisms, ensuring efficient resource utilization across heterogeneous hardware topologies.
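As a rough illustration of how gradient compression can cut communication volume, the sketch below implements sign-based (1-bit) compression with error feedback in plain Python. It is a simplified, single-process approximation of the general idea, not DeepSpeed's implementation; the function name `one_bit_compress` is hypothetical.

```python
from typing import List, Tuple


def one_bit_compress(grad: List[float], error: List[float]) -> Tuple[List[float], List[float]]:
    """Compress a gradient to sign * scale and carry the quantization error forward."""
    # Add the error left over from the previous step (error feedback).
    compensated = [g + e for g, e in zip(grad, error)]
    # A single scale (mean magnitude) is sent alongside one sign bit per element.
    scale = sum(abs(c) for c in compensated) / len(compensated)
    compressed = [scale if c >= 0 else -scale for c in compensated]
    # Whatever the compression lost is remembered for the next step.
    new_error = [c - q for c, q in zip(compensated, compressed)]
    return compressed, new_error


grad = [0.5, -1.5, 1.0, -1.0]
error = [0.0] * len(grad)
compressed, error = one_bit_compress(grad, error)
print(compressed)  # [1.0, -1.0, 1.0, -1.0]
print(error)       # [-0.5, -0.5, 0.0, 0.0]
```

Only the signs and one scale value would need to cross the network in this scheme; the retained error is added back on the next step so that information lost to compression is not discarded permanently.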

Beyond its core training capabilities, the project includes a robust set of utilities for automated performance tuning, model profiling, and universal checkpointing. It provides infrastructure support for diverse processor architectures and cloud-based cluster deployment, allowing users to optimize execution environments through targeted kernel compilation and diagnostic monitoring.
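To give a sense of what model profiling measures, the sketch below hand-computes forward-pass FLOPs for a small stack of linear layers and converts a measured latency into throughput. It is a back-of-the-envelope illustration only: the helper names, layer sizes, and latency are assumptions, and it does not use DeepSpeed's profiler.

```python
from typing import List, Tuple


def linear_flops(batch: int, in_features: int, out_features: int) -> int:
    """Forward FLOPs of a dense layer: one multiply and one add per weight per sample."""
    return 2 * batch * in_features * out_features


def achieved_tflops(flops: int, latency_seconds: float) -> float:
    """Throughput in TFLOP/s given a FLOP count and a measured latency."""
    return flops / latency_seconds / 1e12


# Hypothetical two-layer MLP block: 1024 -> 4096 -> 1024, batch of 8.
layers: List[Tuple[int, int]] = [(1024, 4096), (4096, 1024)]
batch = 8
total_flops = sum(linear_flops(batch, i, o) for i, o in layers)

print(total_flops)                         # 134217728
print(achieved_tflops(total_flops, 1e-3))  # about 0.134 TFLOP/s at 1 ms latency
```

A profiler automates this bookkeeping per module, which makes it easier to spot layers whose achieved throughput is far below what the hardware can deliver.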

Features

Distributed Memory Optimizers - The framework partitions model states across available devices to reduce memory consumption and enable the training of massive models on distributed hardware.
Distributed Training Frameworks - The framework supports scaling model training across multiple compute nodes and GPUs by integrating existing architectures with distributed training capabilities.
Distributed Training Optimizers - The framework minimizes data transfer between distributed nodes by using compression algorithms that reduce communication overhead during optimizer updates.
Distributed Training Sharding - Model states and optimizer parameters are sharded across multiple compute nodes to enable training of models exceeding single-device memory.
Inference Engines - The framework provides tools to inject optimized kernels and configure tensor parallelism to accelerate transformer model execution across multiple hardware devices.
Large-Scale Training Frameworks - The framework scales neural network training across multiple compute nodes and GPUs to handle massive datasets and complex model architectures.
Memory Optimization Techniques - The framework reduces GPU memory consumption during large-scale training by offloading optimizer states to the host CPU.
Universal Checkpointing - The framework standardizes model, optimizer, and scheduler states into a unified format to enable consistent checkpointing across varying model sizes, topologies, and hardware.
Inference Acceleration Engines - The framework accelerates large transformer models by injecting optimized kernels and using tensor parallelism to achieve low latency.
Long Context Training Optimizations - The framework reduces memory usage while maintaining precision by chunking input sequences and offloading activations between GPU and host memory.
Model Parallelism Frameworks - Neural network layers are partitioned into sequential stages across multiple devices to distribute memory load and enable large-scale model training.
Model Serialization - Model, optimizer, and scheduler states are normalized into a consistent format to facilitate seamless saving and loading across heterogeneous hardware topologies.
Parallelism Engines - A runtime environment that partitions massive model states and activations across multiple hardware devices to overcome memory constraints.
Pre-training Pipelines - The framework provides optimized modeling code and data pipelines to configure dataset paths and hyperparameters for initial BERT model training.
Transformer Training Accelerators - The framework accelerates transformer training by applying specialized GPU kernels that improve throughput on single devices and scale across multi-GPU clusters.
Distributed Training Utilities - The framework reduces memory consumption and communication overhead by partitioning model states and gradients across multiple hardware devices.
Attention Mechanisms - Attention mechanisms utilize block-sparse matrix operations to reduce computational complexity and memory footprint when processing long input sequences.
Fine-tuning Scripts - The framework includes reference scripts to adapt pre-trained BERT models to specific datasets using distributed training modes for improved performance.
Memory Management Utilities - Memory-intensive states are dynamically moved between GPU memory and host CPU RAM to balance compute speed with available hardware capacity.
Model Quantization Tools - The framework defines bit-precision schedules, quantization algorithms, and grouping parameters to reduce model size during the training process.
Optimization Techniques - A collection of memory and compute efficiency techniques designed to accelerate training and inference for large-scale neural networks.
Pipeline Parallelism Partitioners - The framework enables efficient pipeline parallel training by partitioning large neural networks across multiple GPUs as a sequential list of layers.
Resource Optimization Tools - The framework optimizes memory and compute efficiency by automatically tuning batch sizes and memory configurations based on model and system heuristics.
Sequence Parallelism Frameworks - The framework distributes long sequences across multiple GPU devices by registering custom attention layers and adapting data loaders for transformer models.
Training Optimizations - The framework accelerates convergence and reduces training time by progressively dropping transformer layers during training, a behavior enabled through command-line flags.
Training Optimizers - The framework improves convergence and throughput by automatically adjusting batch sizes, memory configurations, and learning schedules.
Mathematical Optimization Kernels - The framework reduces memory usage and increases training speed for structural biology models using specialized kernels designed for large-scale sequence computations.
Performance Profilers - The framework calculates floating-point operations, latency, and throughput for individual modules and entire models to measure computational efficiency.
Curriculum Learning Frameworks - The framework provides curriculum learning tools that define difficulty metrics and training schedules to improve model convergence and stability through progressive data complexity.
Gradient Compression Techniques - Gradient data is compressed and quantized before network transmission to minimize bandwidth bottlenecks during large-scale distributed training sessions.
Learning Rate Schedulers - The framework improves convergence speeds during large-batch training by applying cyclic learning rate and momentum schedules to the optimization process.
Mixed Precision Training Utilities - The framework improves memory and communication efficiency during training by applying block-based weight quantization and hierarchical parameter partitioning across all passes.
Mixture-of-Experts Inference Optimizers - The framework achieves low latency and high throughput for mixture-of-experts models by using specialized parallelization techniques that avoid traditional dense model trade-offs.
Sparse Attention Modules - The framework reduces computational overhead in pre-trained models by replacing dense self-attention layers with optimized sparse attention modules.
Training Checkpointing - The framework enables non-blocking model checkpointing by leveraging immutable parameters and optimizer states to transfer data during large-scale training sessions.
Model Training - Optimization library for distributed training and inference.
Natural Language Processing - Listed in the “Natural Language Processing” section of the FunNLP awesome list.
Computation and Optimization - Optimization library for efficient distributed training and inference.
Cloud Training Orchestrators - The framework automates distributed training jobs on managed cloud services using provided configuration recipes and integration examples for consistent model tuning.
Distributed Communication Optimizers - The framework reduces total communication volume between compute nodes by applying weight quantization and hierarchical parameter partitioning during distributed training.
Expert Parallelism Configurations - The framework distributes model parameters across multiple process groups by specifying the number of experts and the degree of expert parallelism.
Gradient Management Techniques - The framework reduces communication overhead by updating only critical gradients during training steps while offloading remaining computations to CPU memory.
Hardware Acceleration Kernels - Custom-compiled kernels optimize mathematical operations for specific hardware architectures to maximize throughput and reduce computational latency.
Hardware Acceleration Toolkits - A set of specialized kernels and configuration tools that optimize neural network execution for diverse processor architectures and accelerators.
Model Pruning Techniques - The framework decreases inference latency by reducing the number of hidden layers in a neural network while maintaining consistent layer width.
Sparse Attention Kernels - The framework processes sequences efficiently by computing self-attention outputs using sparse kernels that support relative position embeddings and attention masks.
Sparse Computing Kernels - The framework improves computational speed and memory usage by replacing dense operations with specialized sparse kernels and attention mechanisms.
Linear Algebra - The framework optimizes memory usage and computational efficiency in transformer models by executing block-sparse matrix multiplication patterns.
Training Metrics Exporters - The framework records model and system performance data in real-time to external logging backends to ensure efficient hardware resource utilization.
Communication Layers - A communication layer that reduces network overhead during multi-node training through gradient compression and efficient parameter synchronization.
Computer Vision Training - The framework includes standard training scripts for image datasets to verify model performance and establish baseline accuracy metrics for neural networks.
NPU Accelerators - The framework supports hardware-accelerated training and inference workflows on specialized neural processing units by managing required drivers, firmware, and toolkits.
Sparse Softmax Kernels - The framework maintains sparsity constraints within attention mechanisms by applying block-sparse softmax operations during forward and backward passes.
Training Diagnostic Tools - The framework identifies the maximum stable learning rate for model training to enable faster convergence and effective use of large batch sizes.
XPU Accelerators - The framework supports runtime compilation of hardware-specific kernels for accelerated computing by installing compatible framework variants and matching compilers.
Execution Tracers - The framework records execution steps and exports performance data by wrapping training code in context managers that schedule tracing intervals.
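Several of the features above (model-state partitioning, optimizer-state offloading to the host CPU, and mixed precision) are typically switched on through a JSON configuration. The sketch below builds such a configuration as a Python dictionary. The key names follow DeepSpeed's documented configuration schema as far as the editor is aware; the specific values are illustrative assumptions, not recommendations.

```python
import json

# Illustrative DeepSpeed-style configuration; all values are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        # Stage 2 partitions optimizer states and gradients across devices.
        "stage": 2,
        # Keep optimizer states in host CPU memory instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to the JSON text that would be written to a config file.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

In a typical setup, a dictionary or file like this is handed to the library when the training engine is created (for example through the `config` argument of `deepspeed.initialize`); the exact wiring depends on the training script.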

zhaochenyang20/Awesome-ML-SYS-Tutorial

5,371 stars on GitHub

This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr

microsoft/DeepSpeed

42,533 stars on GitHub

DeepSpeed is a distributed deep learning optimization library and framework designed for the training and inference of massive AI models. It serves as a model parallelism orchestrator and a toolkit for scaling large language models across multiple GPUs and compute nodes. The project distinguishes itself through 3D parallelism orchestration, which combines data, pipeline, and tensor parallelism. It utilizes ZeRO-based memory partitioning to eliminate redundant storage and employs CPU-offload memory management to move weights and optimizer states to system RAM. Additionally, it provides special

huggingface/peft

21,274 stars on GitHub

This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin

axolotl-ai-cloud/axolotl

12,059 stars on GitHub

Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies. The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
