Megatron-LM

Megatron-LM is a GPU-optimized library and framework for training large transformer language models across thousands of GPUs. It also serves as a scaling engine for mixture-of-experts (MoE) architectures, enabling the training of models with hundreds of billions of parameters.

The project implements multi-dimensional model parallelism, combining tensor, pipeline, data, expert, and context parallelism to distribute the workload. For mixture-of-experts architectures it adds integrated memory and communication optimizations to handle very large parameter counts.
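As a rough illustration of how these dimensions combine, the sketch below computes how many data-parallel replicas remain once a GPU pool is divided among tensor, pipeline, and context parallelism. It assumes the common convention that the data-parallel size is the world size divided by the product of the other sizes. The function name `data_parallel_size` is hypothetical and not part of Megatron-LM, and expert parallelism is left out because how it nests with the other dimensions depends on configuration.

```python
def data_parallel_size(world_size: int, tensor: int, pipeline: int, context: int = 1) -> int:
    """Return the number of data-parallel replicas (hypothetical helper).

    Assumes each model replica occupies tensor * pipeline * context GPUs
    and that the remaining factor of the world size is used for data parallelism.
    """
    gpus_per_replica = tensor * pipeline * context
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor*pipeline*context={gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# 1024 GPUs, 8-way tensor, 4-way pipeline, 2-way context parallelism
print(data_parallel_size(1024, tensor=8, pipeline=4, context=2))  # 16
```

The divisibility check reflects a practical constraint: every replica needs the same number of GPUs, so the parallel sizes must multiply to a divisor of the total GPU count.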

The framework covers a broad range of capabilities, including faster model convergence, hybrid architecture composition, and training state management. It uses mixed-precision training with formats such as FP8 and BF16, and provides utilities for converting model weights between framework formats for interoperability.
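To make the precision trade-off concrete, the sketch below rounds a value to BF16 using only the Python standard library. BF16 keeps the sign and 8-bit exponent of a 32-bit float and only the top 7 mantissa bits, so it preserves range while giving up precision. This illustrates the number format only and is not how Megatron-LM performs the conversion. The helper name `to_bf16` is hypothetical, and NaN and infinity are not handled.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value and return it as a Python float.

    Illustrative only: NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that BF16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bf16(3.14159))  # 3.140625
print(to_bf16(1e30))     # still finite: BF16 shares float32's exponent range
```

The second call shows why BF16 is attractive for training: a magnitude like 1e30 stays representable, at the cost of roughly two to three significant decimal digits.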

Features

Distributed Training - Provides a framework for running large-scale language model training using GPU-optimized building blocks and pre-configured scripts.

Scaling Engines - Implements a specialized scaling engine to train mixture-of-experts models with hundreds of billions of parameters across thousands of GPUs.

Communication-Computation Overlap - Hides synchronization delays by overlapping gradient reduction and parameter gathering with active computation.

Deep Learning Toolkits - Provides a GPU-optimized toolkit for accelerating model convergence and throughput using low-precision formats.

Distributed GPU Computing - Manages complex tensor, pipeline, and data parallelism strategies to maximize hardware utilization.

Expert Parallelism Configurations - Distributes the experts of a mixture-of-experts architecture across GPUs to handle massive parameter counts.

Pipeline Stage Sharding - Divides model layers into sequential stages placed on different GPUs so that several micro-batches can be in flight at once.

Large-Scale Model Training - Distributes the training of massive language models across thousands of GPUs to handle billions of parameters.

Large Scale Training - Distributes transformer training across thousands of GPUs to handle models with hundreds of billions of parameters.

Mixed Precision Training - Uses lower-precision numerical formats such as FP8 or BF16 to reduce memory footprint and increase compute throughput.

Scaling Optimizations - Scales mixture-of-experts (MoE) architectures to high parameter counts using specialized memory and communication improvements.

Training Optimizations - Scales mixture-of-experts architectures using integrated memory, communication, and computation improvements.

Large Language Model Training Frameworks - Provides a framework for training massive transformer models across GPU clusters using advanced distributed parallelism.

Sequence Parallelism Frameworks - Divides long input sequences across multiple GPUs to manage memory constraints while maintaining causal attention dependencies.

Tensor Parallelism - Splits large model weight matrices across multiple GPUs to compute partial results in parallel.

Parallelism Integrators - Combines tensor, pipeline, data, expert, and context parallelism to distribute workloads across GPU clusters.

Hybrid Layer Compositions - Enables the composition of diverse model structures, such as transformer and Mamba layers, into a unified network.

Model Training Optimizers - Integrates advanced optimization algorithms to accelerate model convergence and reduce the compute required.

Adaptive Context Parallelism - Increases training throughput for variable length sequences by adaptively sizing the context parallelism.

Training Checkpointing - Provides a system for saving and restoring training progress through a fault-tolerant pipeline.

Training Convergence Optimization - Applies advanced optimization algorithms and precision formats to reduce the time and compute required for convergence.

Communication Overlap Strategies - Implements communication-computation overlap to hide network latency during gradient synchronization and parameter updates.

Fault Tolerance - Saves periodic snapshots of the optimizer and model weights to allow training to resume from the last stable state.

Distributed Parallelism - Framework for model, tensor, and context parallelism.

Frontier Reasoning Models - Efficient reasoning model framework.

Language Model Development - Optimized library for training large-scale language models.

Language Model Libraries - Framework for training multi-billion parameter models using model parallelism.

Large Language Models - Research-focused transformer training at scale.

Model Training - Research-focused framework for training transformer models at scale.

Model Training Frameworks - Research-focused library for training transformers at scale.

Transformer Implementations - Research framework for training large-scale transformer language models.

Vision Language Models - Frontier-class multimodal models with flexible architectural configurations.
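The tensor parallelism entry in the list above can be illustrated with a small pure-Python sketch. It splits a weight matrix by columns into shards, multiplies the input by each shard independently (each product standing in for the work done on one GPU), and concatenates the partial outputs. This is a toy model of the column-split idea under simplifying assumptions, not Megatron-LM's implementation, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split matrix w into `parts` equal column shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Compute x @ w by multiplying x with each column shard separately."""
    # Each partial product would be computed on a different GPU.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating the partial rows plays the role of an all-gather.
    return [[value for p in partials for value in p[r]] for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, parts=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Because each shard only needs the full input and its own slice of the weights, no communication is required until the partial outputs are combined.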

NVIDIA/Megatron-LM
