Composer

Composer is a PyTorch distributed training framework designed for scaling large models across multi-node GPU clusters. It functions as a large language model trainer, a distributed model optimizer, and a training lifecycle manager.

The project differentiates itself as a deep learning regularization library, providing specialized optimization techniques such as Sharpness Aware Minimization, MixUp, and CutMix to improve model generalization. It further distinguishes its training flow through the use of sequence length warmup, progressive layer freezing, and sharded-state checkpointing for large-scale model recovery.
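As an illustration of sequence length warmup, the sketch below grows the sequence length linearly over the first steps of training and truncates each batch to that length. It is a minimal, framework-independent example rather than Composer's implementation; the names `warmup_seq_len` and `truncate_batch`, the linear schedule, and the `step_size` rounding are all assumptions made for illustration.

```python
def warmup_seq_len(step, warmup_steps, min_len, max_len, step_size=8):
    """Return the sequence length to use at a given training step.

    Hypothetical helper: grows linearly from min_len to max_len over
    warmup_steps, rounded down to a multiple of step_size.
    """
    if step >= warmup_steps:
        return max_len
    frac = step / warmup_steps
    raw = min_len + frac * (max_len - min_len)
    # Round down to a multiple of step_size, then clamp to [min_len, max_len].
    length = int(raw) // step_size * step_size
    return max(min_len, min(length, max_len))


def truncate_batch(batch, seq_len):
    """Truncate every token sequence in the batch to at most seq_len tokens."""
    return [tokens[:seq_len] for tokens in batch]


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, warmup_seq_len(step, warmup_steps=100, min_len=8, max_len=128))
```

Shorter sequences early in training make each step cheaper and, as the feature list notes, are intended to improve stability; the schedule shape here is only one plausible choice.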

The framework covers a broad capability surface including distributed training orchestration, mixed-precision hardware management, and cloud-native data streaming. It also provides extensive monitoring and observability tools for GPU memory diagnostics, training divergence detection, and throughput tracking.
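To make the throughput-tracking idea concrete, here is a small sketch of a sliding-window samples-per-second monitor. It does not use Composer's API; the class name `ThroughputTracker`, its methods, and the window-based averaging are assumptions chosen for illustration.

```python
from collections import deque


class ThroughputTracker:
    """Hypothetical sliding-window throughput monitor (not Composer's API)."""

    def __init__(self, window=100):
        # Keep window + 1 points so the rate spans `window` intervals.
        self._history = deque(maxlen=window + 1)

    def update(self, timestamp, samples_seen):
        """Record the cumulative number of samples processed at a timestamp."""
        self._history.append((timestamp, samples_seen))

    def samples_per_second(self):
        """Return the average rate over the window, or None if undefined."""
        if len(self._history) < 2:
            return None
        (t0, n0), (t1, n1) = self._history[0], self._history[-1]
        elapsed = t1 - t0
        if elapsed <= 0:
            return None
        return (n1 - n0) / elapsed


if __name__ == "__main__":
    import time

    tracker = ThroughputTracker(window=10)
    seen = 0
    for _ in range(5):
        time.sleep(0.01)  # stand-in for one training batch
        seen += 32
        tracker.update(time.monotonic(), seen)
    print(tracker.samples_per_second())
```

The same pattern extends to batches or tokens per second by recording a different cumulative counter.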

The project includes a command-line launcher to automate the execution of multi-GPU training jobs across nodes.
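A launcher of this kind typically starts one process per GPU and gives each process its rank through environment variables. The sketch below builds those per-process environments. It follows the variable names conventionally read by `torch.distributed` (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`); whether Composer's launcher sets exactly this set is an assumption, and `build_rank_envs` is a hypothetical helper.

```python
def build_rank_envs(nproc_per_node, num_nodes=1, node_rank=0,
                    master_addr="127.0.0.1", master_port=29500):
    """Return one environment dict per local process on this node.

    Hypothetical helper: the variable names follow common torch.distributed
    conventions and are not taken from Composer's launcher source.
    """
    world_size = nproc_per_node * num_nodes
    envs = []
    for local_rank in range(nproc_per_node):
        # Global rank is unique across all nodes; local rank is per node.
        global_rank = node_rank * nproc_per_node + local_rank
        envs.append({
            "RANK": str(global_rank),
            "LOCAL_RANK": str(local_rank),
            "WORLD_SIZE": str(world_size),
            "LOCAL_WORLD_SIZE": str(nproc_per_node),
            "NODE_RANK": str(node_rank),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
        })
    return envs


if __name__ == "__main__":
    # Second node (node_rank=1) of a 2-node job with 4 GPUs per node.
    for env in build_rank_envs(4, num_nodes=2, node_rank=1):
        print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

A real launcher would merge each dict into a copy of `os.environ` and pass it to `subprocess.Popen` together with the training script, then wait on all children and propagate any failure.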

Features

Language Model Trainers - Provides a comprehensive system for training LLMs with sharded checkpoints, sequence length warmup, and distributed data parallelism.
Large-Scale Training Frameworks - Scales model training across multi-node GPU clusters using PyTorch.
Dynamic Batch Size Adjustment - Adjusts microbatch sizes and gradient accumulation rates dynamically to prevent out-of-memory errors.
Deep Learning Regularization Libraries - Provides a collection of training algorithms including MixUp and Sharpness Aware Minimization to improve generalization.
Data-Parallel Training - Synchronizes training workloads across multi-node clusters by managing data samplers and global batch sizes.
Distributed Training - Executes model training with configurable hardware selection, mixed-precision acceleration, and reproducible seeding.
Distributed Training Metadata - Provides access to global rank and world size to implement rank-specific logic across multi-node GPU clusters.
Distributed Training Optimizers - Improves training efficiency through mixed-precision, gradient accumulation, and communication-efficient optimization.
Distributed Training Orchestration - Orchestrates training across multi-node clusters by abstracting parallelism and distributed data loading.
Gradient Clipping Utilities - Limits gradient values or norms to prevent unstable updates and exploding gradients.
Hardware Acceleration Backends - Directs model execution across various hardware backends including CPUs, GPUs, and TPUs.
Standardized Training Workflows - Implements consistent loops and interfaces for forward passes, loss computation, and evaluation across different model architectures.
Training Lifecycle Management - Provides a comprehensive manager for handling model checkpoints, automated training stops, and cloud storage synchronization.
Progressive Layer Freezing - Freezes network layers incrementally during training to improve stability.
Model Parallelism - Distributes large models across multiple GPUs or nodes using data, shard, or tensor parallelism.
Progressive Sequence Length Warmup - Increases input sequence length progressively during early training to improve stability.
Progressive Input Scaling - Gradually increases input resolution or sequence length during training to accelerate convergence.
Sharpness-Aware Minimization - Minimizes the sharpness of the loss landscape to improve model generalization and robustness.
Tensor Parallelism - Shards individual tensors across multiple devices according to a layer plan to train large models.
Training Checkpoint Persistence - Persists training state or weights to storage at specified intervals using customizable naming.
Automatic Failure Resumption - Automatically detects the latest saved checkpoint and restarts training after a failure to maintain progress.
Training Loop Control - Manages training flow using progressive layer freezing, sequence length warmup, and weight maintenance.
Mixup Augmentations - Creates combinations of training examples and targets to reduce generalization error.
Sharded Checkpoint Storage - Saves and restores model weights as distributed shards to support large-scale models across varying GPU counts.
Length Extrapolation Biases - Biases attention matrices to favor nearby tokens to improve extrapolation to unseen sequence lengths.
Checkpoint-Based Recovery - Provides the ability to restore model states from the latest checkpoint to recover from training failures.
Cloud Dataset Streaming - Downloads training data from cloud blob storage on the fly to avoid local disk bottlenecks.
Cloud-Native Data Streaming - Downloads training datasets from remote object storage on the fly to bypass local disk capacity limits.
Distributed Data Coordination - Ensures different devices receive unique data batches using compatible distributed samplers.
Distributed Model Checkpointing - Supports saving and loading large-scale model states in sharded formats across multiple compute nodes.
Distributed Training Managers - Launches multi-GPU training jobs by setting environment variables and managing process execution across nodes.
Generative Model Evaluation - Produces periodic sample outputs from prompts to assess and monitor generative model progress.
Gradient Norm Monitors - Computes and logs gradient L2 norms to monitor training stability and weight updates.
Hardware Device Management - Manages the placement of models, optimizers, and data batches across specific CPU and GPU hardware.
Training Parameter Averaging - Averages model weights sampled near the end of training to improve final accuracy.
Gradient Flow Stabilizers - Clips gradients and manages layer freezing to stabilize and accelerate the training process.
Experiment Tracking Integrations - Integrates training data and metrics with external platforms for automated experiment tracking and analysis.
Training Safety Monitors - Monitors loss values and automatically detects NaN events to prevent unstable model convergence.
Model Checkpointing - Implements systems for saving and restoring neural network states to preserve training progress.
Memory Layout Optimizers - Optimizes GPU utilization by setting the model memory format to channels-last.
Training Task Automation - Halts training automatically when specific metrics reach a threshold or stop improving.
Parameter Weight Smoothing - Tracks a secondary set of model weights using an exponential moving average to improve evaluation.
Optimization Algorithm Injections - Integrates specialized optimization techniques and regularization methods directly into the core training execution flow.
Training Lifecycle Hooks - Allows users to inject custom logic into the training loop by triggering functions at specific lifecycle events.
Stochastic Depth Regularization - Provides stochastic depth regularization to randomly drop network paths during training and prevent overfitting in deep models.
Mixed-Precision Orchestration - Manages tensor casting and memory formats across different accelerators to balance computational speed and numerical accuracy.
Trainer State Serialization - Captures the internal trainer state into portable dictionaries to enable seamless resumption after hardware failures.
Gradient Accumulation Strategies - Simulates larger batch sizes by accumulating gradients over multiple micro-batches to reduce memory overhead.
Training Lifecycle Hooks - Injects custom logic into the training loop by hooking into events like epoch starts or batch ends.
Training Progress Monitors - Measures progress across epochs and batches to trigger specific behaviors at defined training intervals.
Standardized Evaluation Harnesses - Converts diverse data sources into consistent evaluator objects to ensure standardized and reproducible model benchmarking.
Elastic Resumption - Allows resuming sharded model states across different hardware configurations and GPU counts.
Training Memory Optimizers - Lowers the peak memory footprint by freeing training metric memory immediately after loss calculation.
NaN Loss Training Halts - Detects NaN values in loss computations to immediately halt training and prevent corrupted model weights.
Cloud Synchronization - Enables storing and loading model checkpoints directly from remote cloud buckets or object storage.
GPU Device Assignment - Manages the selection of GPU devices for training and configures TF32 matrix multiplications.
Model Export Formats - Converts trained models into portable formats optimized for production deployment and storage.
Tensor Layout Optimizations - Specifies the order of tensor dimensions in memory to improve hardware processing unit utilization.
Floating-Point Precision Conversions - Casts tensor operations to specific floating-point precisions to balance computational speed and numerical accuracy.
Throughput and ETA Monitors - Tracks real-time processing speed in terms of batches, samples, and tokens per second.
Training Progress Recording - Captures training metrics and checkpoints by routing them to local files, consoles, and monitoring platforms.
Hardware Monitoring Utilities - Tracks GPU and CPU metrics including occupancy, temperature, and power consumption during training.
Memory Usage Analyzers - Captures memory snapshots and visualizes allocation states to identify the causes of memory errors.
Memory Allocation Visualizers - Generates memory snapshots and flamegraphs to visualize and diagnose GPU memory allocation issues.
GPU Memory Monitors - Logs CUDA memory statistics during training batches to track VRAM utilization and detect memory leaks.
Memory Snapshotting - Records tensor memory allocations and out-of-memory events to facilitate GPU debugging.
Training Metric Monitors - Provides tools to track machine learning performance indicators like throughput and memory usage during training.
Computation and Optimization - PyTorch library for faster neural network training with higher accuracy.
Machine Learning Frameworks - Library for efficient and scalable model training.
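
Several of the features above (lifecycle hooks, NaN loss halts, training progress recording) share one mechanism: the training loop fires named events and callbacks react to them. The sketch below shows that pattern in plain Python. It is not Composer's API; the `Callback` base class, the event names, and the dictionary-based `state` are simplifications assumed for illustration.

```python
import math


class Callback:
    """Hypothetical base class: subclasses react to named loop events."""

    def on_event(self, event, state):
        pass


class NaNHalt(Callback):
    """Requests a stop as soon as a batch produces a NaN loss."""

    def on_event(self, event, state):
        if event == "batch_end" and math.isnan(state["loss"]):
            state["stop"] = True
            state["stop_reason"] = f"NaN loss at batch {state['batch']}"


class LossRecorder(Callback):
    """Records the loss after every batch."""

    def __init__(self):
        self.losses = []

    def on_event(self, event, state):
        if event == "batch_end":
            self.losses.append(state["loss"])


def run_loop(losses, callbacks):
    """Drive a toy loop; each value in `losses` stands in for one batch."""
    state = {"batch": 0, "loss": None, "stop": False, "stop_reason": None}

    def fire(event):
        for cb in callbacks:
            cb.on_event(event, state)

    fire("fit_start")
    for i, loss in enumerate(losses):
        state["batch"] = i
        fire("batch_start")
        state["loss"] = loss  # placeholder for forward, backward, and step
        fire("batch_end")
        if state["stop"]:
            break
    fire("fit_end")
    return state


if __name__ == "__main__":
    recorder = LossRecorder()
    final = run_loop([1.0, 0.5, float("nan"), 0.2], [NaNHalt(), recorder])
    print(final["stop_reason"], len(recorder.losses))
```

Keeping the loop itself free of feature-specific logic is what lets behaviors such as early stopping, logging, or checkpointing be added or removed independently.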

Infrasys-AI/AISystem

17,017

AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo

facebookresearch/flashlight

5,443

Flashlight is a C++ machine learning library and deep learning framework designed for building and training neural networks. It functions as a tensor manipulation library and an automatic differentiation engine that tracks operations to calculate gradients via backpropagation for model optimization. The project is distinguished by its role as a distributed training framework, utilizing all-reduce gradient synchronization and distributed environments to scale machine learning workloads across multiple nodes and devices. It features a backend-agnostic memory interface and RAII-based management

fastai/course-v3

4,914

This repository is a comprehensive educational program and deep learning framework designed to teach practical deep learning using PyTorch through notebooks and code examples. It serves as a high-level library for building, training, and deploying neural networks, acting as a model training orchestrator that coordinates PyTorch models, optimizers, and loss functions. The project provides specialized toolkits for computer vision, natural language processing, and tabular data preprocessing. It distinguishes itself through advanced training controls such as discriminative learning rates, a two-w

pytorch/torchtitan

5,084

Torchtitan is a reference implementation for distributed deep learning built within the PyTorch ecosystem. It provides a framework for training large neural network models across multiple GPUs and nodes by combining several parallelism techniques, including fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism, making it possible to train models that exceed the memory capacity of a single device. The system distinguishes itself through asynchronous checkpointing, which saves model and optimizer state to persistent storage without pausing the training loop, enabli
