Megatron-LM is a GPU-optimized framework for distributed training of transformer-based large language models, designed to scale across thousands of GPUs. It also serves as a scaling engine for mixture-of-experts architectures, enabling the training of models with hundreds of billions of parameters.
The project implements multi-dimensional parallelism, distributing the workload across tensor, pipeline, data, expert, and context parallel dimensions. For mixture-of-experts models, it integrates memory and communication optimizations to handle very large parameter counts.
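As a rough illustration of how these parallel dimensions combine, the sketch below derives the data-parallel size from a total GPU count. The `ParallelLayout` class and its field names are hypothetical and are not part of Megatron-LM. It assumes the common convention that the tensor, pipeline, and context sizes multiply to form one model replica, and that expert-parallel groups are carved out of the data-parallel ranks. The real framework's group construction may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for parallel group sizes (not a Megatron-LM class)."""

    tensor: int = 1
    pipeline: int = 1
    context: int = 1
    expert: int = 1

    def data_parallel_size(self, world_size: int) -> int:
        """Return how many model replicas fit into `world_size` GPUs."""
        # GPUs needed to hold and run one replica of the model.
        gpus_per_replica = self.tensor * self.pipeline * self.context
        if world_size % gpus_per_replica != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by "
                f"tensor*pipeline*context={gpus_per_replica}"
            )
        data_parallel = world_size // gpus_per_replica
        # Assumption: experts are sharded across data-parallel ranks,
        # so the expert size must divide the data-parallel size.
        if data_parallel % self.expert != 0:
            raise ValueError(
                f"expert={self.expert} does not divide "
                f"data-parallel size {data_parallel}"
            )
        return data_parallel


if __name__ == "__main__":
    layout = ParallelLayout(tensor=8, pipeline=4, context=2, expert=4)
    print(layout.data_parallel_size(1024))  # 1024 / (8 * 4 * 2) = 16
```

Under these assumptions, 1024 GPUs with tensor, pipeline, and context sizes of 8, 4, and 2 leave 16 data-parallel replicas, which an expert-parallel size of 4 divides evenly.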
The framework also covers hybrid architecture composition and training state management, with an emphasis on training performance and model convergence. It supports mixed-precision training with formats such as FP8 and BF16, and provides utilities for converting model weights to and from other framework formats for interoperability.
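To make the precision trade-off of BF16 concrete, here is a minimal standard-library sketch that rounds a float32 value to BF16 and back. BF16 keeps float32's sign bit and 8 exponent bits but only 7 mantissa bits, so it preserves range while losing precision. The function names are illustrative and do not come from Megatron-LM; the sketch handles finite inputs only and does not treat NaN specially.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for a finite x (round-to-nearest-even)."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest,
    # with ties going to the value whose kept low bit is even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    # BF16 is the upper half of a float32, so pad the low 16 bits with zeros.
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 1.0 + 2**-7, 1.0 + 2**-8):
        print(value, "->", from_bf16_bits(to_bf16_bits(value)))
```

With 7 mantissa bits, the spacing between representable values near 1.0 is 2^-7, so 1.0 + 2^-7 survives the round trip while 1.0 + 2^-8 rounds back to 1.0.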