# Distributed Training Frameworks for LLMs

> Search results for `distributed training framework for large models` on awesome-repositories.com. 118 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/distributed-training-framework-for-large-models

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/distributed-training-framework-for-large-models).**

## Results

- [distribution/distribution](https://awesome-repositories.com/repository/distribution-distribution.md) (10,479 ⭐) — Distribution is an open-source container image registry that implements the OCI Distribution Specification, enabling any OCI-compatible client to push, pull, and manage container images over standard protocols. It serves as a content distribution toolkit for packaging, shipping, storing, and delivering container content across networked environments, storing and retrieving content by its cryptographic hash for integrity and deduplication.

The registry separates image metadata from bulk data to enable efficient validation and partial pulls, and supports resumable blob uploads with chunked tran
- [datawhalechina/so-large-lm](https://awesome-repositories.com/repository/datawhalechina-so-large-lm.md) (7,400 ⭐) — This project is a comprehensive educational curriculum and structured learning path covering the full lifecycle of large language models. It provides a guided progression through the theory, architecture, training, and deployment of these models.

The curriculum includes specialized guides on transformer architecture, model training tutorials, and frameworks for designing autonomous agents. It also provides dedicated resources for studying model safety and ethics.

The material covers a wide range of technical capabilities, including distributed training strategies, parameter-efficient fine-tu
- [kyegomez/openmythos](https://awesome-repositories.com/repository/kyegomez-openmythos.md) (14,176 ⭐) — OpenMythos is a framework for implementing recurrent large language model architectures. It utilizes recurrent transformer blocks to enable compute-adaptive reasoning and variable processing depth through multiple iterative passes over the same weights.

The system features a mixture of experts framework that routes tokens between shared and specialized layers to optimize parameter usage. It also includes parameter-efficient fine-tuning tools using low-rank adaptation modules to modify model behavior with minimal weight updates.

The framework covers distributed training pipelines using data p
- [peremartra/large-language-model-notebooks-course](https://awesome-repositories.com/repository/peremartra-large-language-model-notebooks-course.md) (1,808 ⭐) — Practical course about Large Language Models.
- [eleutherai/gpt-neo](https://awesome-repositories.com/repository/eleutherai-gpt-neo.md) (8,275 ⭐) — GPT-Neo is an open-source distributed training framework designed for scaling GPT-2 and GPT-3-style language models across multiple devices using mesh-tensorflow for model parallelism. It provides the infrastructure to train transformer-based language models with billions of parameters across distributed computing environments, making large-scale language model research accessible outside of proprietary systems.

The framework supports training both autoregressive GPT-style models and masked language models like BERT or RoBERTa, with configurable masking strategies and token handling. It inclu
- [deepspeedai/deepspeed](https://awesome-repositories.com/repository/deepspeedai-deepspeed.md) (42,528 ⭐) — DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading.

The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization
- [modelscope/modelscope](https://awesome-repositories.com/repository/modelscope-modelscope.md) (8,718 ⭐) — ModelScope is a comprehensive machine learning platform that functions as a model hub, training framework, inference engine, and cloud development environment. It provides a centralized repository for discovering, downloading, and managing pre-trained models and datasets across multiple modalities, including natural language, vision, and speech.

The platform features a unified interface for multimodal model inference and a standardized framework for fine-tuning and evaluating large-scale models. It supports distributed training to scale workloads across multiple processors and provides contai
- [apache/mxnet](https://awesome-repositories.com/repository/apache-mxnet.md) (20,829 ⭐) — This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs.

The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
- [lambdalabsml/distributed-training-guide](https://awesome-repositories.com/repository/lambdalabsml-distributed-training-guide.md) (0 ⭐)
- [lightgbm-org/lightgbm](https://awesome-repositories.com/repository/lightgbm-org-lightgbm.md) (18,460 ⭐) — LightGBM is a gradient boosting framework used to train decision tree ensembles for classification, regression, and ranking tasks. It functions as a distributed machine learning library and a decision tree ensemble implementation that utilizes leaf-wise growth and histogram-based feature binning.

The framework is distinguished by its ability to offload heavy computations to CUDA or OpenCL devices for GPU acceleration and its capacity to parallelize training across multiple nodes using sockets, MPI, or Dask. It includes a specialized categorical feature processor that optimizes partitions for
- [autogluon/autogluon](https://awesome-repositories.com/repository/autogluon-autogluon.md) (9,997 ⭐) — AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning.

The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
- [bradyfu/awesome-multimodal-large-language-models](https://awesome-repositories.com/repository/bradyfu-awesome-multimodal-large-language-models.md) (17,892 ⭐) — Latest Advances on Multimodal Large Language Models.
- [fastai/fastai](https://awesome-repositories.com/repository/fastai-fastai.md) (27,862 ⭐) — Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models.

The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
- [skindhu/build-a-large-language-model-cn](https://awesome-repositories.com/repository/skindhu-build-a-large-language-model-cn.md) (3,242 ⭐) — This project is a generative AI educational resource and natural language processing course. It serves as a technical implementation guide for building, pre-training, and fine-tuning a large language model from scratch using PyTorch.

The curriculum provides a step-by-step tutorial on large language model development, focusing specifically on the design of transformer-based text generation models. It includes dedicated instruction on parameter-efficient fine-tuning to optimize training by updating only a small subset of model weights.

The material covers the end-to-end generative AI training
- [ymcui/chinese-llama-alpaca](https://awesome-repositories.com/repository/ymcui-chinese-llama-alpaca.md) (18,944 ⭐) — This project is a comprehensive toolkit for adapting large language models to the Chinese language, providing a specialized framework for fine-tuning, inference, and local deployment. It serves as a coordinated suite for language-specific adaptation, including tools for expanding tokenizers and implementing retrieval-augmented generation.

The project distinguishes itself through a complete pipeline for model adaptation, featuring multilingual tokenizer expansion and a fine-tuning framework that supports instruction-based supervised training and adapter merging. It also includes a dedicated de
- [cmusatyalab/openface](https://awesome-repositories.com/repository/cmusatyalab-openface.md) (15,398 ⭐) — OpenFace is a deep learning toolkit designed for facial recognition and identity verification. It provides a comprehensive pipeline for detecting faces, aligning landmarks, and transforming facial images into compact numerical vectors. Using these embeddings, the system enables identity classification and similarity comparison through geometric distance calculations.

The project distinguishes itself by integrating research-oriented diagnostic tools alongside its core recognition capabilities. It includes utilities for visualizing high-dimensional feature clusters, inspecting internal c
- [fchollet/deep-learning-models](https://awesome-repositories.com/repository/fchollet-deep-learning-models.md) (7,349 ⭐) — This project is a collection of deep learning tools for image classification and audio tagging, providing a repository of pre-trained model weights and architectures. It serves as a Keras model zoo that enables the immediate use of established neural networks for inference and transfer learning.

The library includes a music tagging framework that classifies audio recordings using convolutional recurrent neural networks and mel-spectrograms. For visual data, it provides implementations of architectures such as ResNet, VGG, and Xception, alongside a repository of weights trained on large datase
- [microsoft/swin-transformer](https://awesome-repositories.com/repository/microsoft-swin-transformer.md) (15,715 ⭐) — Swin-Transformer is a deep learning framework designed for training and deploying hierarchical vision transformer models. It serves as a research library and toolkit for computer vision tasks, providing the infrastructure to build models that replace standard convolution operations with sliding window self-attention mechanisms. By utilizing a multi-scale feature hierarchy, the framework enables the processing of visual data at varying resolutions and spatial scales.

The project distinguishes itself through its implementation of shifted window partitioning, which facilitates global information
- [handsonllm/hands-on-large-language-models](https://awesome-repositories.com/repository/handsonllm-hands-on-large-language-models.md) (27,059 ⭐) — This project is an educational resource focused on the internal mechanics and design principles of transformer-based neural networks. It provides a structured guide to the fundamental components of generative artificial intelligence, including sequence modeling, semantic embeddings, and the mathematical foundations of large language models.

The repository distinguishes itself through a heavy emphasis on visual documentation, utilizing diagrams and step-by-step explanations to clarify how data flows through complex neural architectures. It serves as a technical reference for developers seeking
- [dair-ai/prompt-engineering-guide](https://awesome-repositories.com/repository/dair-ai-prompt-engineering-guide.md) (75,678 ⭐) — This project is a comprehensive educational resource and technical guide focused on the development, optimization, and application of large language models. It provides a structured curriculum for mastering prompt engineering, ranging from foundational principles of instruction design to advanced techniques for improving model reasoning, accuracy, and reliability.

The guide distinguishes itself by offering deep technical insights into agentic workflows and autonomous system design. It covers the implementation of multi-step reasoning chains, tool integration through function calling, and stat
- [yangjianxin1/firefly](https://awesome-repositories.com/repository/yangjianxin1-firefly.md) (6,642 ⭐) — Firefly is a training framework and inference engine for large language models. It functions as a toolkit for pre-training and fine-tuning various open-weight architectures, providing a system for model alignment and parameter-efficient fine-tuning.

The project includes utilities for merging adapter weights back into base models to create standalone files. It also provides a model alignment toolkit to format training data according to specific prompt templates, ensuring conversational consistency across different models.

The framework supports distributed model training and preference-based
- [sciruby/distribution](https://awesome-repositories.com/repository/sciruby-distribution.md) (51 ⭐) — Probability distributions for Ruby.
- [dask/distributed](https://awesome-repositories.com/repository/dask-distributed.md) (1,671 ⭐) — A distributed task scheduler for Dask.
- [lllyasviel/framepack](https://awesome-repositories.com/repository/lllyasviel-framepack.md) (17,028 ⭐) — FramePack is a neural video synthesis engine and generation framework designed to produce long, temporally consistent video sequences. It functions as a diffusion model optimizer, providing a suite of techniques to manage the computational demands of high-parameter video models while maintaining visual stability during extended generation tasks.

The system distinguishes itself through a hierarchical approach to frame prediction, which plans distant anchor frames before filling in intermediate content to prevent cumulative temporal drift. By utilizing constant-length context compression and to
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [microsoft/lightgbm](https://awesome-repositories.com/repository/microsoft-lightgbm.md) (18,096 ⭐) — LightGBM is a high-performance machine learning framework designed for constructing gradient-boosted decision tree ensembles. It provides a platform for training classification, regression, and ranking models, with a focus on memory efficiency and large-scale distributed computing.

The framework distinguishes itself through specialized algorithmic strategies, including leaf-wise tree growth and histogram-based decision learning, which prioritize convergence speed. It optimizes memory usage by bundling mutually exclusive features and employs gradient-based sampling to reduce training complexit
- [crewaiinc/crewai](https://awesome-repositories.com/repository/crewaiinc-crewai.md) (53,687 ⭐) — CrewAI is a multi-agent orchestration framework designed for building autonomous systems that execute complex, multi-step workflows. It provides a development platform where specialized agents are defined with specific roles, goals, and tool sets to perform tasks collaboratively. By leveraging a declarative workflow engine, the system manages task dependencies, state transitions, and execution logic, allowing for the creation of structured, stateful sequences of operations.

The framework distinguishes itself through its hierarchical management capabilities, which utilize manager agents to coo
- [microsoft/deepspeedexamples](https://awesome-repositories.com/repository/microsoft-deepspeedexamples.md) (6,822 ⭐) — DeepSpeedExamples is a collection of reference implementations for training and deploying large scale AI models using the DeepSpeed optimization library. It provides Python code examples for training massive models across multiple GPUs through distributed optimization techniques.

The repository includes optimized patterns for deploying and running large language model predictions in production environments. It also serves as a guide for model compression to reduce memory footprints and as a source for performance benchmarks to measure execution speed and resource utilization.

The project cov
- [b4rtaz/distributed-llama](https://awesome-repositories.com/repository/b4rtaz-distributed-llama.md) (2,837 ⭐) — Distributed-llama is a distributed inference engine and command line tool for running large language models across multiple networked machines. It functions as a compute cluster manager that coordinates worker nodes to share the computational load of a single model.

The system utilizes tensor parallelism to shard model weights across different hosts, allowing the execution of models that exceed the memory capacity of a single piece of hardware. It includes a dedicated format converter to transform standard model files into a compatible binary layout optimized for distributed loading.

The eng
- [dandavison/delta](https://awesome-repositories.com/repository/dandavison-delta.md) (31,136 ⭐) — Delta is a command-line pager that enhances the readability of terminal output by applying syntax highlighting and structured formatting to text streams. It functions as a specialized interface for version control systems, transforming standard output into color-coded, human-readable views.

The tool distinguishes itself through its ability to render side-by-side diff comparisons and visualize merge conflicts with clear, semantic highlighting. It dynamically calculates column widths and text alignment to fit complex file comparisons within the constraints of a terminal window, while allowing u
- [buoyancy99/large-video-planner](https://awesome-repositories.com/repository/buoyancy99-large-video-planner.md) (250 ⭐) — This repo provides training and inference code for the paper "Large Video Planner Enables Generalizable Robot Control".
- [microsoft/ai-edu](https://awesome-repositories.com/repository/microsoft-ai-edu.md) (14,065 ⭐) — ai-edu is a comprehensive AI education curriculum and machine learning courseware collection. It provides theoretical tutorials, deep learning lab exercises, and project blueprints designed to teach artificial intelligence fundamentals through a combination of study and practical implementation.

The project focuses on a learning-by-doing approach, guiding users from Python programming and neural network basics to advanced topics. It includes specialized instructional content on distributed AI training, MLOps educational guides for model quantization and pruning, and detailed frameworks for im
- [microsoft/agent-lightning](https://awesome-repositories.com/repository/microsoft-agent-lightning.md) (15,047 ⭐) — Agent Lightning is an optimization framework designed to refine the performance of individual AI agents within complex multi-agent systems. It provides a platform for improving decision-making and task execution by applying reinforcement learning, supervised fine-tuning, and automated prompt optimization.

The framework distinguishes itself through its ability to isolate specific agents for targeted tuning, allowing developers to enhance individual behaviors while maintaining the stability of the broader system architecture. By utilizing a modular interface, it integrates with diverse agent fr
- [jiutian-vl/large-vlm-based-vla-for-robotic-manipulation](https://awesome-repositories.com/repository/jiutian-vl-large-vlm-based-vla-for-robotic-manipulation.md) (415 ⭐) — A curated list of large VLM-based VLA models for robotic manipulation.
- [eto-ai/lance](https://awesome-repositories.com/repository/eto-ai-lance.md) (6,671 ⭐) — Lance is a versioned columnar data format and storage engine designed as a multimodal AI lakehouse. It serves as a vector database storage engine and a cloud object store dataset manager, organizing images, video, audio, and embeddings into a unified format optimized for machine learning workflows.

The project distinguishes itself by combining a columnar layout for structured data with a specialized blob store for large multimodal tensors. It implements a hybrid search engine that integrates vector similarity search, full-text search, and SQL analytics on a single dataset, supported by a stor
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [huggingface/llm_training_handbook](https://awesome-repositories.com/repository/huggingface-llm-training-handbook.md) (563 ⭐) — An open collection of methodologies to help with successful training of large language models.
- [baidu/paddle](https://awesome-repositories.com/repository/baidu-paddle.md) (23,959 ⭐) — Paddle is a deep learning framework designed for building, training, and deploying large-scale machine learning models. It incorporates a distributed training engine for optimizing performance across multiple chips and a model inference engine for transforming trained models into production-ready formats for cross-platform execution.

The platform features a heterogeneous hardware abstraction and a standardized software stack that allows models to run across diverse hardware architectures through a common interface. It also includes a scientific computing library capable of solving complex dif
- [yqyang2233/large-language-model-break-ai](https://awesome-repositories.com/repository/yqyang2233-large-language-model-break-ai.md) (7 ⭐) — We are Team LlaXa from the School of Cyber Science and Technology at Zhejiang University. In response to the competition organizers' call, we are open-sourcing our solution. This repository contains the code for reproducing our submission to the Competition for LLM and Agent Safety 2024…
- [xming521/weclone](https://awesome-repositories.com/repository/xming521-weclone.md) (18,028 ⭐) — WeClone is an end-to-end framework designed for the creation, training, and deployment of personalized conversational AI digital twins. By fine-tuning large language models on individual chat history, the platform enables the replication of unique communication styles, speech patterns, and conversational habits. The system manages the entire lifecycle of these digital avatars, from initial data preparation to final integration into messaging platforms for real-time interaction.

The platform distinguishes itself through a comprehensive suite of data processing utilities that prepare raw messag
- [anomalyco/models.dev](https://awesome-repositories.com/repository/anomalyco-models-dev.md) (2,694 ⭐) — models.dev is a directory and intelligence system for large language models that provides a standardized catalog of technical specifications, provider mappings, and pricing data. It serves as a central index for model metadata, including context windows, output limits, and release dates.

The project functions as a capability index and pricing comparison tool, allowing for the analysis of token costs across different hosting providers. It maps generic model names to the specific API identifiers required by various third-party platforms and tracks support for functional features such as tool ca
- [bytedance/monolith](https://awesome-repositories.com/repository/bytedance-monolith.md) (9,271 ⭐) — Monolith is a distributed recommendation model framework and asynchronous training engine designed to build and train large-scale deep learning architectures. It functions as a distributed model trainer that processes massive datasets across multiple compute nodes using asynchronous update mechanisms.

The system features a dedicated embedding table manager that creates unique, feature-isolated tables to prevent representation collisions. It also includes a real-time weight updater to capture immediate changes in user interest and data hotspots through continuous parameter synchronization.

Th
- [raaminz/training](https://awesome-repositories.com/repository/raaminz-training.md) (28 ⭐) — This repository is all about my training classes.
- [allegroai/clearml](https://awesome-repositories.com/repository/allegroai-clearml.md) (6,733 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving.

The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
- [volcengine/verl](https://awesome-repositories.com/repository/volcengine-verl.md) (22,015 ⭐) — verl is a distributed training system designed for large language model alignment and reinforcement learning. It provides a framework for executing post-training pipelines, including supervised fine-tuning and reinforcement learning from human feedback, to refine model behavior and agentic capabilities.

The system utilizes a hybrid training and inference engine that optimizes memory and communication when switching between model generation and gradient updates. It supports multi-modal reinforcement learning for models processing both image and text data, and implements algorithms such as PPO
- [nodesource/distributions](https://awesome-repositories.com/repository/nodesource-distributions.md) (13,834 ⭐) — This project is a Node.js binary distribution repository and Linux package repository. It provides a hosted set of pre-compiled JavaScript runtime binaries for various Linux distributions to simplify installation and version management through native package managers.

The project includes a Node.js observability toolset and security policy manager. These components enable the gathering of runtime telemetry to monitor application health and performance via diagnostic dashboards, while providing a resource restriction layer that intercepts system calls to prevent unauthorized modules from acces
- [clearml/clearml](https://awesome-repositories.com/repository/clearml-clearml.md) (6,740 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts.

The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
- [esri/spatial-framework-for-hadoop](https://awesome-repositories.com/repository/esri-spatial-framework-for-hadoop.md) (376 ⭐) — The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
- [facebookresearch/audiocraft](https://awesome-repositories.com/repository/facebookresearch-audiocraft.md) (23,379 ⭐) — Audiocraft is a deep learning audio library and machine learning framework designed for training, fine-tuning, and evaluating generative models for music and sound effects. It functions as a text-to-music generative model and a neural audio codec, providing the tools necessary to compress audio signals into discrete representations and synthesize high-fidelity waveforms from textual descriptions.

The framework is distinguished by its ability to combine multiple conditioning signals, allowing for the generation of audio based on text prompts, melodic excerpts, or style-based audio clips. It al
- [zyds/transformers-code](https://awesome-repositories.com/repository/zyds-transformers-code.md) (3,782 ⭐) — This project is a collection of scripts and workflows for training, fine-tuning, and deploying large language models using the Hugging Face Transformers toolkit. It functions as a distributed training framework, a library for natural language processing task implementations, and a system for building retrieval-augmented generation chatbots.

The repository includes specialized tools for model optimization, such as a Bayesian hyperparameter optimizer for automatically tuning model settings. It provides implementations for scaling model training across multiple graphics processors using data par