Torchtitan is a reference implementation for distributed deep learning built within the PyTorch ecosystem. It provides a framework for training large neural network models across multiple GPUs and nodes by combining several parallelism techniques, including fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism, making it possible to train models that exceed the memory capacity of a single device.
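To make the memory argument concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from Torchtitan itself. It assumes mixed-precision training with Adam at about 16 bytes of model state per parameter, assumes that state is sharded evenly across devices, and ignores activations and temporary buffers. The function name `per_device_state_gib` and the constant are hypothetical.

```python
# Assumed byte counts for mixed-precision Adam (illustrative, not read from Torchtitan):
# 2 (bf16 weights) + 2 (bf16 gradients) + 12 (fp32 master weights, momentum, variance).
BYTES_PER_PARAM = 16


def per_device_state_gib(num_params: int, shard_degree: int,
                         bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the GiB of model state each device holds under even sharding."""
    if num_params < 0 or shard_degree < 1:
        raise ValueError("num_params must be >= 0 and shard_degree >= 1")
    return num_params * bytes_per_param / shard_degree / 2**30


if __name__ == "__main__":
    # Compare an unsharded 7-billion-parameter model with two sharded layouts.
    for devices in (1, 8, 64):
        gib = per_device_state_gib(7_000_000_000, devices)
        print(f"{devices:>3} devices: {gib:.1f} GiB of model state per device")
```

Under these assumptions a 7-billion-parameter model carries about 104 GiB of state unsharded, which is more than a single 80 GB accelerator holds, but only about 13 GiB per device when sharded across eight.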
The system distinguishes itself through asynchronous checkpointing, which writes model and optimizer state to persistent storage in the background rather than blocking the training loop for the duration of the write, so a run can resume from a recent checkpoint after a failure or be branched for further experiments. A unified composable parallelism scheduler allows data, tensor, and pipeline parallelism to be orchestrated from a single configuration, while a real-time monitoring tool logs loss, throughput, memory usage, and other metrics during training runs. The checkpoint format is designed to be directly loadable into conversion tools for subsequent fine-tuning.
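The general idea behind asynchronous checkpointing can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Torchtitan's implementation: the helper `async_save` is hypothetical, a plain dictionary stands in for model and optimizer state, and `pickle` stands in for a real checkpoint format. The state is copied in a short blocking step, and the slow write to disk then happens on a background thread.

```python
import copy
import os
import pickle
import tempfile
import threading


def async_save(state: dict, path: str) -> threading.Thread:
    """Snapshot `state` now, then write the snapshot to `path` in the background."""
    # Short blocking step: after this copy the caller may keep mutating `state`.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the write finishes so readers never see a partial file.
        os.replace(tmp_path, path)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


if __name__ == "__main__":
    state = {"step": 100, "weights": [0.1, 0.2]}
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, "step_100.pkl")
        writer = async_save(state, path)
        state["step"] = 101  # training continues while the write is in flight
        writer.join()        # wait before relying on the file (or before the next save)
        with open(path, "rb") as f:
            print(pickle.load(f)["step"])
```

The split into a brief snapshot and a deferred write is the design point: only the copy sits on the critical path of training, and the caller joins the writer before starting the next save so that two writes never overlap.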
Additional capabilities include memory profile–driven autotuning that recommends optimal parallelism configurations, an elastic training coordinator that manages dynamic membership changes in the worker pool, and pipeline execution scheduling that minimizes bubble time, the periods in which a pipeline stage sits idle waiting for work from a neighboring stage. Together, these components support large-scale distributed training that is both efficient and operationally flexible.
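Bubble time can be quantified with a standard textbook estimate. The sketch below is not taken from Torchtitan; it assumes a GPipe-style or non-interleaved 1F1B schedule in which every stage takes the same time per microbatch, and the function name `bubble_fraction` is hypothetical. With p stages and m microbatches, each stage is idle for p - 1 of the m + p - 1 time slots in a step.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of a training step each stage spends idle, assuming equal stage times."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # More microbatches per step amortize the fixed fill-and-drain cost.
    for microbatches in (4, 16, 64):
        share = bubble_fraction(4, microbatches)
        print(f"4 stages, {microbatches:>2} microbatches: {share:.1%} idle")
```

The estimate shows why schedulers favor many small microbatches: the idle share shrinks as m grows relative to p. Interleaved schedules can reduce it further, which this simple formula does not model.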