Lightning is a PyTorch training framework that separates core research logic (the model, loss, and optimizer) from the engineering boilerplate around it, such as the training loop, device placement, and distributed setup. It also orchestrates distributed training and automates pretraining and finetuning workflows across different compute environments.
Training is hardware-agnostic: the same model code runs on CPUs, GPUs, or TPUs without modification. Lightning also scales workloads from a single device to multi-node clusters, and the project serves as a cloud GPU infrastructure manager with integrated autoscaling and monitoring.
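As a rough illustration of hardware-agnostic training, the sketch below keeps every hardware decision in a small dictionary of Trainer arguments, so the model code never mentions a device. The `accelerator`, `devices`, and `num_nodes` arguments are real `Trainer` parameters; the helpers `hardware_args` and `build_trainer` are hypothetical names introduced here, and the 8-GPU, 4-node layout is an assumed example rather than a recommendation.

```python
from typing import Any, Dict

VALID_TARGETS = {"cpu", "gpu", "tpu", "auto"}


def hardware_args(target: str, devices_per_node: int = 1, num_nodes: int = 1) -> Dict[str, Any]:
    """Return Trainer keyword arguments for a hardware target (hypothetical helper)."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    return {
        "accelerator": target,
        # With "auto", let Lightning pick the device count as well.
        "devices": "auto" if target == "auto" else devices_per_node,
        "num_nodes": num_nodes,
    }


def build_trainer(**trainer_args: Any):
    """Construct a Trainer from the arguments above.

    Lightning is imported lazily so the helper above works without it installed.
    """
    import lightning as L  # assumes `pip install lightning`

    return L.Trainer(**trainer_args)


# Only these dictionaries differ between environments; the model stays the same.
laptop = hardware_args("cpu")
cluster = hardware_args("gpu", devices_per_node=8, num_nodes=4)
```

Calling `build_trainer(**laptop)` or `build_trainer(**cluster)` would then yield a Trainer for either environment, and the same `LightningModule` could be passed to its `fit` method in both cases.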
The framework covers a broad range of training capabilities, including distributed data parallelism, automatic mixed precision, and automatic checkpointing of training state. It also provides tools for exporting models for production use and primitives for building custom training loops for specialized model architectures.
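The capabilities listed above are typically switched on through Trainer configuration rather than model code. The following is a minimal sketch under stated assumptions: `strategy`, `precision`, and `callbacks` are real `Trainer` parameters and `ModelCheckpoint` is a real callback, but the helper names, the choice of `"16-mixed"`, and the monitored metric `val_loss` (which the model would need to log itself) are illustrative choices, not defaults from the project.

```python
from typing import Any, Dict


def scaling_args(world_size: int, use_gpu: bool) -> Dict[str, Any]:
    """Pick a strategy and precision for a run (hypothetical helper)."""
    return {
        # Distributed data parallelism only matters with more than one process.
        "strategy": "ddp" if world_size > 1 else "auto",
        # Mixed precision is assumed here to be used on GPUs only.
        "precision": "16-mixed" if use_gpu else "32-true",
    }


def checkpoint_args(metric: str = "val_loss", mode: str = "min") -> Dict[str, Any]:
    """Arguments for ModelCheckpoint: keep the best checkpoint and the latest one."""
    return {"monitor": metric, "mode": mode, "save_top_k": 1, "save_last": True}


def build_scaled_trainer(world_size: int, use_gpu: bool):
    """Construct a Trainer with checkpointing enabled.

    Lightning is imported lazily so the helpers above work without it installed.
    """
    import lightning as L  # assumes `pip install lightning`
    from lightning.pytorch.callbacks import ModelCheckpoint

    checkpoint = ModelCheckpoint(**checkpoint_args())
    return L.Trainer(callbacks=[checkpoint], **scaling_args(world_size, use_gpu))


multi_gpu = scaling_args(world_size=8, use_gpu=True)
single_cpu = scaling_args(world_size=1, use_gpu=False)
```

Keeping these settings in plain dictionaries makes them easy to log or override per experiment, while the Trainer construction stays in one place.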