30 open-source projects similar to deepspeedai/deepspeed, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best DeepSpeed alternative.
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
DeepSpeed is a distributed deep learning optimization library and framework designed for the training and inference of massive AI models. It serves as a model parallelism orchestrator and a toolkit for scaling large language models across multiple GPUs and compute nodes. The project distinguishes itself through 3D parallelism orchestration, which combines data, pipeline, and tensor parallelism. It utilizes ZeRO-based memory partitioning to eliminate redundant storage and employs CPU-offload memory management to move weights and optimizer states to system RAM. Additionally, it provides special
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies. The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
DeepSpeedExamples is a collection of reference implementations and scripts for training, fine-tuning, and executing inference on large-scale AI models using DeepSpeed optimization. It provides a distributed model training guide and practical workflows for adapting large language models through memory-efficient techniques. The repository includes specialized implementations for pipeline parallelism to handle models exceeding single GPU memory and a suite of examples for ZeRO memory optimization to reduce per-device overhead. It also features standardized test suites for benchmarking the throug
Megatron-LM is a distributed transformer training library and large language model training framework designed to scale models across thousands of GPUs. It functions as a GPU-optimized deep learning toolkit and a scaling engine for mixture-of-experts architectures, enabling the training of models with hundreds of billions of parameters. The project implements multi-dimensional model parallelism, combining tensor, pipeline, data, expert, and context-based workload distribution. It specifically optimizes mixture-of-experts architectures through integrated memory and communication improvements t
DeepSpeedExamples is a collection of reference implementations for training and deploying large scale AI models using the DeepSpeed optimization library. It provides Python code examples for training massive models across multiple GPUs through distributed optimization techniques. The repository includes optimized patterns for deploying and running large language model predictions in production environments. It also serves as a guide for model compression to reduce memory footprints and as a source for performance benchmarks to measure execution speed and resource utilization. The project cov
Composer is a PyTorch distributed training framework designed for scaling large-scale models across multi-node GPU clusters. It functions as a large language model trainer, a distributed model optimizer, and a training lifecycle manager. The project differentiates itself as a deep learning regularization library, providing specialized optimization techniques such as Sharpness Aware Minimization, MixUp, and CutMix to improve model generalization. It further distinguishes its training flow through the use of sequence length warmup, progressive layer freezing, and sharded-state checkpointing for
Torchtune is a PyTorch-native library for fine-tuning, aligning, and quantizing large language models. It provides a configurable training pipeline orchestrated through YAML recipes, with CLI overrides and component swapping, distributed training via FSDP2, memory optimizations, and parameter-efficient fine-tuning methods like LoRA, DoRA, and QLoRA. The library distinguishes itself through its YAML-driven configuration system that defines all training parameters and instantiates components from config files, with full CLI override capability for any field or component at launch time. It suppo
XGBoost is a distributed machine learning library for implementing scalable gradient boosting decision trees used for regression, classification, and ranking. It functions as a predictive model framework and a cross-language toolkit, providing a core implementation with native bindings for Python, R, Java, Scala, and C++. The system is designed as a GPU-accelerated library that utilizes CUDA and NCCL to speed up the training of decision tree ensembles. It operates as a distributed framework capable of scaling training and prediction across multi-node clusters and GPU environments to process m
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mec
Accelerate is a PyTorch distributed training library that abstracts the boilerplate required to run models across multiple GPUs, TPUs, and CPUs. It functions as a deep learning model scaler and distributed hardware orchestrator, allowing the same training script to run on different hardware backends without modifying the core logic. The project provides a distributed training command line interface for configuring compute environments and launching jobs across single or multi-node clusters. It includes a mixed precision training framework to implement FP16 and BF16 precision, reducing memory
ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput. The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale genera
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
This project is a neural machine translation system used to build models that automatically translate text from one language to another. It utilizes sequence-to-sequence modeling to transform variable-length input sequences into corresponding output sequences. The system implements bidirectional recurrent neural network encoding and attention mechanisms to capture contextual information and focus on specific parts of the source text during translation. To manage training and inference, it employs separate computational graphs and supports distributing model layers across multiple GPU devices.
Paddle is a deep learning framework designed for building, training, and deploying neural networks. It provides a platform for constructing models using tensor-based computations and supports both dynamic and static execution graphs to facilitate research and production workflows. The platform functions as a distributed machine learning system, enabling the scaling of training workloads across multiple nodes and hardware clusters. It includes a comprehensive toolkit for model deployment and optimization, allowing users to convert external model formats, compress trained models for resource-co
PaddleDetection is an object detection framework designed for the end-to-end development, training, and deployment of computer vision models. It provides a comprehensive library of modular neural network architectures and pipelines that support object detection, instance segmentation, and multi-object tracking tasks. The project distinguishes itself through a configuration-driven approach that decouples model components like backbones and heads, allowing for the flexible assembly of custom vision workflows. It incorporates advanced techniques such as anchor-free detection logic, joint detecti
Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with sliding window self-attention mechanisms. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales. The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
Horovod is a distributed deep learning framework and gradient synchronizer designed to scale model training across multiple GPUs and compute nodes. It functions as a distributed training orchestrator and an elastic training engine, utilizing an MPI collective communication library to synchronize weights and gradients across TensorFlow, PyTorch, Keras, and MXNet models. The system distinguishes itself through dynamic elastic scaling, which allows it to adjust the number of active workers at runtime and recover from node failures. It optimizes communication efficiency using tensor fusion batchi
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
This project is a natural language processing framework focused on a generalized autoregressive pretrainer designed for unsupervised language representation. It implements a language model that combines permutation-based training with a Transformer-XL backbone to function as a long-context text processor. The system is distinguished by its ability to handle text sequences that exceed standard length limits through the use of segment-level recurrence and relative positional encoding. It scales high-performance pretraining across multiple GPUs and TPU clusters using distributed training impleme
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
This project is a low-dependency engine designed for training large language models using native C and CUDA. It provides a bare-metal environment for tensor computation, allowing for the execution of neural network operations directly on hardware accelerators without the overhead of high-level software abstractions. The framework distinguishes itself by implementing manual gradient backpropagation and custom hardware-specific kernels, providing granular control over memory mapping and computational precision. It supports distributed training across multiple graphics processors and compute nod
Flashlight is a C++ machine learning library and deep learning framework designed for building and training neural networks. It functions as a tensor manipulation library and an automatic differentiation engine that tracks operations to calculate gradients via backpropagation for model optimization. The project is distinguished by its role as a distributed training framework, utilizing all-reduce gradient synchronization and distributed environments to scale machine learning workloads across multiple nodes and devices. It features a backend-agnostic memory interface and RAII-based management
Torchtitan is a reference implementation for distributed deep learning built within the PyTorch ecosystem. It provides a framework for training large neural network models across multiple GPUs and nodes by combining several parallelism techniques, including fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism, making it possible to train models that exceed the memory capacity of a single device. The system distinguishes itself through asynchronous checkpointing, which saves model and optimizer state to persistent storage without pausing the training loop, enabli
FlashAttention is an attention mechanism optimization library and machine learning acceleration framework designed to increase training speed and reduce memory footprint for large-scale neural network models. It functions as a collection of low-level CUDA kernels that optimize memory-bound operations to improve hardware utilization on graphics processing units. The library distinguishes itself through an input-output-aware algorithm design that minimizes data movement between different levels of memory. By employing kernel fusion and tiled matrix multiplication, it combines sequential operati
This project is a comprehensive library of state-of-the-art neural network architectures designed for image classification and feature extraction. It provides a complete deep learning training framework that supports distributed execution, allowing users to build, train, and fine-tune vision models using optimized schedulers and pre-configured training recipes. The library distinguishes itself through a modular backbone architecture that treats neural networks as decoupled feature extractors, enabling the retrieval of multi-scale outputs for downstream tasks like object detection and segmenta