Horovod

Horovod is a distributed deep learning framework that scales model training across multiple GPUs and compute nodes. It acts as a gradient synchronizer, a distributed training orchestrator, and an elastic training engine, and it uses MPI collective communication to synchronize weights and gradients for TensorFlow, PyTorch, Keras, and MXNet models.
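To make the gradient synchronization step concrete, here is a minimal sketch of a ring all-reduce that averages gradients across simulated workers. It is a single-process illustration in plain Python, not Horovod's implementation or API; the function name `ring_allreduce` and the list-of-lists representation of per-worker gradients are assumptions made for this example.

```python
def ring_allreduce(worker_grads):
    """Average equally sized gradient vectors across simulated workers.

    Hypothetical single-process simulation: worker_grads[r] is the gradient
    vector held by worker r. Returns one averaged vector per worker.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split each vector into n contiguous chunks (sizes may differ by one).
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(g) for g in worker_grads]  # copy so inputs are not mutated

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            for i, value in enumerate(payload):
                data[dst][bounds[c] + i] += value

    # Worker r now holds the fully summed chunk (r + 1) % n.
    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            data[dst][bounds[c]:bounds[c] + len(payload)] = payload

    # Divide the sums by the worker count to obtain the average.
    return [[value / n for value in vec] for vec in data]


if __name__ == "__main__":
    grads = [[float(10 * r + i) for i in range(6)] for r in range(4)]
    for rank, vec in enumerate(ring_allreduce(grads)):
        print(rank, vec)
```

Every worker ends up with the same averaged vector. In each of the 2(n - 1) steps a worker sends only one chunk (about 1/n of the vector), so per-worker traffic stays close to twice the vector size however many workers join the ring.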

The system distinguishes itself through dynamic elastic scaling, which allows it to adjust the number of active workers at runtime and recover from node failures. It optimizes communication efficiency using tensor fusion batching and half-precision gradient compression to reduce network bandwidth requirements.
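The two communication optimizations named above can be sketched in a few lines. The example below packs several small tensors into one flat buffer (fusion) and encodes that buffer as IEEE half-precision values (compression) before it would be sent. It uses only the Python standard library; the helper names `fuse`, `unfuse`, `compress_fp16`, and `decompress_fp16` are hypothetical and do not correspond to Horovod functions.

```python
import struct


def fuse(tensors):
    """Concatenate small tensors into one flat buffer and record their sizes."""
    sizes = [len(t) for t in tensors]
    flat = [value for t in tensors for value in t]
    return flat, sizes


def unfuse(flat, sizes):
    """Split a flat buffer back into tensors of the recorded sizes."""
    tensors, pos = [], 0
    for size in sizes:
        tensors.append(flat[pos:pos + size])
        pos += size
    return tensors


def compress_fp16(values):
    """Encode floats as little-endian half precision (2 bytes per value).

    Values outside the float16 range raise OverflowError, and values that
    float16 cannot represent exactly are rounded.
    """
    return struct.pack(f"<{len(values)}e", *values)


def decompress_fp16(payload):
    """Decode a half-precision payload back into Python floats."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))


if __name__ == "__main__":
    gradients = [[0.5, -1.25], [2.0], [0.125, 3.0, -4.0]]
    flat, sizes = fuse(gradients)              # one buffer instead of three
    payload = compress_fp16(flat)              # 12 bytes instead of 24 in float32
    restored = unfuse(decompress_fp16(payload), sizes)
    print(len(payload), restored)
```

Fusion trades a small copy for fewer, larger messages, while half-precision encoding halves the payload relative to float32 at the cost of rounding.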

The framework covers a broad set of capabilities including cluster orchestration across Kubernetes, Spark, and Ray, as well as hardware-aware resource mapping for CPUs and GPUs. It provides tools for distributed data management, such as parallel loading from Parquet files and offloaded preprocessing. Performance is further supported by RDMA network optimization, execution tracing, and Bayesian training optimization to maximize throughput.
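As an illustration of hardware-aware resource mapping, the sketch below assigns a global rank, a local rank, and a device label to each process from a list of hosts and their slot counts. The host-by-host ordering, the function name `assign_ranks`, and the `gpu:<local_rank>` labels are assumptions chosen for this example rather than documented Horovod behaviour.

```python
def assign_ranks(hosts):
    """Map processes to devices.

    hosts: list of (hostname, slots) pairs, where slots is the number of
    GPUs (or CPU slots) to use on that host. Hypothetical helper.
    """
    placements = []
    rank = 0
    for hostname, slots in hosts:
        for local_rank in range(slots):
            placements.append({
                "rank": rank,              # unique across the whole job
                "host": hostname,
                "local_rank": local_rank,  # unique within one host
                "device": f"gpu:{local_rank}",
            })
            rank += 1
    return placements


if __name__ == "__main__":
    for placement in assign_ranks([("node-a", 2), ("node-b", 3)]):
        print(placement)
```

Pinning each process to the device that matches its local rank keeps two processes on the same host from competing for one GPU.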

Deployment is supported through containerized training images and orchestrated environments for high-performance compute clusters.

Features

Distributed Deep Learning Frameworks - Functions as a unified platform for scaling deep learning model training across multiple GPUs and compute nodes.

Distributed Gradient Synchronization - Implements all-reduce collective communication to synchronize gradients across distributed workers.

Communication Layers - Provides a communication abstraction layer that decouples the distributed training logic from specific network hardware and libraries.

Distributed Deep Learning - Scales the training of deep learning models across multiple GPUs and compute nodes to accelerate convergence.

Initial State Broadcasting - Broadcasts initial variable states from a lead process to all workers to ensure consistent model weight initialization.

Distributed Training Orchestration - Provides orchestration for distributed training jobs specifically integrated with Apache Spark clusters.

Distributed Training Orchestrators - Coordinates process ranks and resource allocation to parallelize model training across multiple processors and nodes.

Communication Optimization - Optimizes collective communication using all-reduce and all-gather operations to synchronize gradients efficiently.

Distributed Training Rate Scaling - Adjusts the learning rate based on the number of active workers to compensate for increased global batch sizes.

Elastic Scaling - Provides a dynamic elastic training engine that adjusts worker counts at runtime and recovers from host failures.

Collective Communication Operations - Implements critical communication patterns like all-reduce and all-gather to synchronize gradients across distributed GPUs.

Backend Selection - Selects specific backend implementations for communication operations based on available network hardware and vendor optimizations.

Distributed Script Launchers - Launches training scripts across multiple GPUs on a single machine or across a cluster of multiple machines.

Distributed Task Orchestration - Orchestrates the launch of distributed tasks by managing worker registration and execution across the cluster.

MPI Cluster Orchestrators - Detects hostnames and GPU resources within a cluster to initialize MPI-based distributed training processes.

ML Infrastructure Managers - Provides infrastructure management for deploying and scaling distributed deep learning environments on Kubernetes.

Multi-node Orchestration - Coordinates multiple containers across different machines to scale model training across distributed nodes.

Resource Allocation - Allocates specific CPU and GPU hardware resources to training processes to optimize hardware utilization.

Training Orchestrators - Coordinates distributed training jobs with the ability to dynamically adjust worker counts and recover from node failures.

Elastic Training Scaling - Allows adjusting the number of active workers at runtime and recovering from failed hosts without stopping training.

Distributed Coordination Primitives - Coordinates execution between process ranks using barriers to ensure all workers reach a consistent state.

Accelerator Device Mapping - Maps training processes to specific CPU or GPU devices to optimize hardware utilization based on cluster allocation.

Spark Integrations - Enables executing distributed training jobs on Apache Spark clusters to leverage existing resource management.

Distributed Model Checkpointing - Manages the persistence of model states across distributed nodes while preventing concurrent write corruption.

Gradient Compression Techniques - Reduces network bandwidth by compressing tensor gradients using half-precision formats during synchronization.

Offloaded Preprocessing - Offloads CPU-intensive dataset transformations to a dedicated cluster of processes to prevent data bottlenecks during GPU training.

Worker-Specific Serialization - Prevents filesystem corruption by restricting model checkpoint saving to a single designated worker.

Performance Tuning - Provides methods for optimizing computational throughput and network bandwidth in distributed machine learning pipelines.

Training Checkpointing - Persists intermediate training data, model checkpoints, and metric logs to local or distributed filesystems.

Tensor Communication Batching - Groups multiple small tensors into larger buffers to reduce network overhead during gradient synchronization.

Collective Operation Isolation - Runs distinct collective communication tasks on specific subsets of worker processes to enable concurrent operations.

Parallel Data Loaders - Implements parallel shuffling and memory caching for large datasets to maximize data throughput.

Custom Container Images - Generates container images tailored to specific deep learning frameworks and GPU drivers using configurable build arguments.

Containerized Training Environments - Provides pre-configured container images to standardize the training environment across different hardware configurations.

Kubernetes Job Orchestration - Bootstraps distributed deep learning environments using Kubernetes statefulsets and jobs for workers and drivers.

Ray Cluster Integrations - Distributes deep learning workloads across Ray clusters by wrapping stateful processes as actors.

Worker Groupings - Allows running collective operations on specific groups of workers to isolate communication and reduce overhead.

Communication Compression - Reduces the volume of data transferred between nodes using half-precision compression for gradients and weights.

RDMA Networking - Integrates network interface cards via memory registration to enable low-latency direct memory access between nodes.

MPI Communication - Uses MPI collective communication protocols like all-reduce and all-gather to synchronize parallel training tasks.

Master-Worker Coordination - Uses rank-based identifiers to coordinate barriers and broadcasting between a master process and worker nodes.

GPU Profilers - Analyzes GPU hardware utilization and kernel execution performance during distributed training.

Execution Tracing - Records the sequence of worker activities and tensor operations to analyze synchronization efficiency.

Distributed Training Timelines - Records detailed timelines of communication and computation events to identify bottlenecks in distributed training.

Distributed Scalability Analysis - Provides tools to evaluate scalability and throughput using synthetic or real datasets to baseline performance and diagnose distributed bottlenecks.

Deep Learning Frameworks - Supports distributed training across multiple deep learning frameworks.

Optimization Tools - Provides a framework for distributed deep learning training across clusters.

Computation and Optimization - Provides a distributed training framework for TensorFlow, Keras, and PyTorch.

Parallel Programming Frameworks - Provides a distributed deep learning training framework for major ML libraries.
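Two entries in the list above, learning-rate scaling and single-worker checkpoint saving, lend themselves to a short sketch. The example assumes a linear scaling rule (base rate multiplied by the worker count) and treats rank 0 as the designated writer; both choices, along with the function names `scaled_learning_rate` and `save_checkpoint` and the JSON checkpoint format, are illustrative assumptions rather than Horovod's API.

```python
import json
import os


def scaled_learning_rate(base_lr, num_workers):
    """Linear scaling: the global batch grows with the worker count,
    so the learning rate is multiplied by the same factor (assumed rule)."""
    return base_lr * num_workers


def save_checkpoint(state, path, rank):
    """Write the checkpoint only from rank 0 to avoid concurrent writes.

    Returns True if this process wrote the file, False otherwise.
    """
    if rank != 0:
        return False
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    # Atomic rename so readers never observe a half-written file.
    os.replace(tmp_path, path)
    return True


if __name__ == "__main__":
    lr = scaled_learning_rate(0.01, num_workers=8)
    for rank in range(3):
        wrote = save_checkpoint({"epoch": 1, "lr": lr}, "checkpoint.json", rank)
        print(rank, wrote)
```

Only the designated worker touches the filesystem; the other ranks skip the write and would typically receive the state by broadcast when training resumes.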
