30 open-source projects similar to microsoft/deepspeed, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best DeepSpeed alternative.
Megatron-LM is a distributed transformer training library and large language model training framework designed to scale models across thousands of GPUs. It functions as a GPU-optimized deep learning toolkit and a scaling engine for mixture-of-experts architectures, enabling the training of models with hundreds of billions of parameters. The project implements multi-dimensional model parallelism, combining tensor, pipeline, data, expert, and context-based workload distribution. It specifically optimizes mixture-of-experts architectures through integrated memory and communication improvements t
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that integrates with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput. The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale genera
Horovod is a distributed deep learning framework and gradient synchronizer designed to scale model training across multiple GPUs and compute nodes. It functions as a distributed training orchestrator and an elastic training engine, utilizing an MPI collective communication library to synchronize weights and gradients across TensorFlow, PyTorch, Keras, and MXNet models. The system distinguishes itself through dynamic elastic scaling, which allows it to adjust the number of active workers at runtime and recover from node failures. It optimizes communication efficiency using tensor fusion batchi
Metaseq is a transformer sequence modeling toolkit designed for training, fine-tuning, and deploying sequence-to-sequence models using open pre-trained weights. It provides a comprehensive framework for large language model training, including dedicated tools for sequence dataset processing and a standalone inference server for generating text via API requests. The project features specialized utilities for model quantization to reduce parameter precision to eight bits, which lowers memory usage and increases inference speed. It also includes a checkpoint conversion pipeline to transform mode
DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading. The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization
gpt-neox is a distributed training system and framework for building large-scale autoregressive language models. It implements the transformer architecture and provides a toolkit for training models with billions of parameters by distributing weights across compute clusters. The framework distinguishes itself through extensive support for distributed model parallelism, including pipeline and sequence parallelism, to overcome single-device memory limits. It further supports sparse model architectures using a mixture of experts system with Sinkhorn-based routing. The project covers a broad ran
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
Grok-1 is an open-weights large language model implementation featuring a sparse mixture-of-experts architecture. It is designed for high-performance text generation and natural language processing by activating only a subset of specialized expert layers per token. The model utilizes 8-bit weight quantization to reduce memory overhead and accelerate loading. To manage its high parameter count, the implementation supports activation sharding, which distributes the memory load across multiple hardware devices during execution. The project covers large-scale model inference, including text comp
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on automotive-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
xtuner is a comprehensive training engine for large language models, offering a toolkit for pre-training, supervised fine-tuning, and the optimization of vision-language multimodal models. It serves as a distributed training accelerator and a specialized framework for scaling Mixture-of-Experts models and aligning model behavior through reinforcement learning from human feedback. The project distinguishes itself through advanced memory and compute optimizations, such as sequence parallelism for ultra-long context windows and interleaved pipeline parallelism to reduce GPU idle time. It provide
DeepSpeedExamples is a collection of reference implementations for training and deploying large-scale AI models using the DeepSpeed optimization library. It provides Python code examples for training massive models across multiple GPUs through distributed optimization techniques. The repository includes optimized patterns for deploying and running large language model predictions in production environments. It also serves as a guide for model compression to reduce memory footprints and as a source for performance benchmarks to measure execution speed and resource utilization. The project cov
Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies. The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
Accelerate is a PyTorch distributed training library that abstracts the boilerplate required to run models across multiple GPUs, TPUs, and CPUs. It functions as a deep learning model scaler and distributed hardware orchestrator, allowing the same training script to run on different hardware backends without modifying the core logic. The project provides a distributed training command line interface for configuring compute environments and launching jobs across single or multi-node clusters. It includes a mixed precision training framework to implement FP16 and BF16 precision, reducing memory
bitsandbytes is a quantization library for large language models that reduces memory footprints using k-bit quantization. It provides a framework for 4-bit low-rank adaptation, tools for 8-bit model compression, and memory-efficient optimizer extensions for PyTorch. The project enables the training of large models on limited hardware through 4-bit quantization and low-rank adaptation weights. It also facilitates faster inference by compressing models to 8-bit precision using vector-wise quantization. The library covers a range of memory optimization capabilities, including optimizer memory r
LLaMA-Factory is a comprehensive suite for dataset preparation, model fine-tuning, memory optimization, and standardized API deployment. It provides a unified platform for the supervised and reward-based fine-tuning of large language models and vision-language models. The framework includes a specialized toolkit for training vision-language models and a model serving interface that deploys trained models through high-performance APIs. It utilizes precision tuning and quantization techniques to reduce the hardware requirements and memory footprint of large models. The system covers data pipel
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fin
DeepSpeedExamples is a collection of reference implementations and scripts for training, fine-tuning, and executing inference on large-scale AI models using DeepSpeed optimization. It provides a distributed model training guide and practical workflows for adapting large language models through memory-efficient techniques. The repository includes specialized implementations for pipeline parallelism to handle models exceeding single GPU memory and a suite of examples for ZeRO memory optimization to reduce per-device overhead. It also features standardized test suites for benchmarking the throug
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parall
Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with sliding window self-attention mechanisms. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales. The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
llm-foundry is a training framework for large language models, providing a system for foundation model pre-training and supervised fine-tuning. It includes a distributed trainer for scaling workloads across multiple nodes and GPUs, a dataset streaming pipeline for loading data from cloud storage, and a parameter-efficient fine-tuning implementation. The framework distinguishes itself through its use of parameter sharding and high-throughput data streaming to maintain stability during large-scale training. It incorporates low-rank adaptation to reduce computational costs and uses eight-bit flo
Ludwig is a multimodal machine learning platform and low-code framework designed for building, training, and deploying neural networks. It enables the construction of models that process text, images, audio, and tabular data through a unified interface using declarative configuration files rather than custom code. The system features a specialized low-code framework for large language models, supporting supervised fine-tuning, preference alignment, and a constrained decoding tool to force structured data output via logit extraction. It also includes an automated model architecture search to i
ERNIE is a development toolkit for training, fine-tuning, and deploying large language models built on the PaddlePaddle deep learning platform. It provides a comprehensive suite of core components, including an inference server for vision and language models, a training and fine-tuning toolkit, and a framework for building retrieval-augmented generation systems using private knowledge bases. The project features multimodal AI models capable of reasoning across text, images, and video to perform complex visual understanding and information extraction. It distinguishes itself through specialize
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed