# Distributed Model Inference Frameworks

> Search results for `distributed inference to split a large model across machines` on awesome-repositories.com. 112 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/distributed-inference-to-split-a-large-model-across-machines

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/distributed-inference-to-split-a-large-model-across-machines).**

## Results

- [b4rtaz/distributed-llama](https://awesome-repositories.com/repository/b4rtaz-distributed-llama.md) (2,837 ⭐) — Distributed-llama is a distributed inference engine and command line tool for running large language models across multiple networked machines. It functions as a compute cluster manager that coordinates worker nodes to share the computational load of a single model.

The system utilizes tensor parallelism to shard model weights across different hosts, allowing the execution of models that exceed the memory capacity of a single piece of hardware. It includes a dedicated format converter to transform standard model files into a compatible binary layout optimized for distributed loading.

The eng
- [huggingface/text-generation-inference](https://awesome-repositories.com/repository/huggingface-text-generation-inference.md) (10,775 ⭐) — Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments.

The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
- [distribution/distribution](https://awesome-repositories.com/repository/distribution-distribution.md) (10,479 ⭐) — Distribution is an open-source container image registry that implements the OCI Distribution Specification, enabling any OCI-compatible client to push, pull, and manage container images over standard protocols. It serves as a content distribution toolkit for packaging, shipping, storing, and delivering container content across networked environments, storing and retrieving content by its cryptographic hash for integrity and deduplication.

The registry separates image metadata from bulk data to enable efficient validation and partial pulls, and supports resumable blob uploads with chunked tran
- [handsonllm/hands-on-large-language-models](https://awesome-repositories.com/repository/handsonllm-hands-on-large-language-models.md) (27,059 ⭐) — This project is an educational resource focused on the internal mechanics and design principles of transformer-based neural networks. It provides a structured guide to the fundamental components of generative artificial intelligence, including sequence modeling, semantic embeddings, and the mathematical foundations of large language models.

The repository distinguishes itself through a heavy emphasis on visual documentation, utilizing diagrams and step-by-step explanations to clarify how data flows through complex neural architectures. It serves as a technical reference for developers seeking
- [deepspeedai/deepspeed](https://awesome-repositories.com/repository/deepspeedai-deepspeed.md) (42,528 ⭐) — DeepSpeed is a high-performance library designed to scale deep learning model training and inference across massive clusters of GPUs and compute nodes. It provides a comprehensive suite of tools for distributed training, enabling the execution of models that exceed the memory capacity of single devices through advanced parameter partitioning, pipeline-based model parallelism, and memory-efficient state offloading.

The framework distinguishes itself through specialized communication-efficient optimizers and hardware-aware acceleration techniques. By utilizing gradient compression, quantization
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [skindhu/build-a-large-language-model-cn](https://awesome-repositories.com/repository/skindhu-build-a-large-language-model-cn.md) (3,242 ⭐) — This project is a generative AI educational resource and natural language processing course. It serves as a technical implementation guide for building, pre-training, and fine-tuning a large language model from scratch using PyTorch.

The curriculum provides a step-by-step tutorial on large language model development, focusing specifically on the design of transformer-based text generation models. It includes dedicated instruction on parameter-efficient fine-tuning to optimize training by updating only a small subset of model weights.

The material covers the end-to-end generative AI training
- [intel-analytics/ipex-llm](https://awesome-repositories.com/repository/intel-analytics-ipex-llm.md) (8,836 ⭐) — ipex-llm is an acceleration library and inference engine designed to optimize the execution and finetuning of large language models on Intel GPUs and NPUs. It provides a HuggingFace compatible model backend and a dedicated quantization toolkit for converting model weights into low-bit precision formats.

The project facilitates distributed inference by splitting large model workloads across multiple accelerators using pipeline and tensor parallelism. It enables the deployment of models on Intel Arc, Flex, and Max GPUs to increase throughput and reduce latency.

The library covers a broad range
- [gokumohandas/made-with-ml](https://awesome-repositories.com/repository/gokumohandas-made-with-ml.md) (48,343 ⭐) — Made-With-ML is an automated documentation generator and developer experience platform designed to transform source code into structured, searchable reference websites. It functions as a codebase intelligence tool that parses implementation details to provide clear explanations of logic and data requirements.

The system distinguishes itself by leveraging language-level type annotations and structured code comments to generate interface specifications. By utilizing static analysis to extract metadata, it automates the transformation of docstrings into web-ready documentation, ensuring that tec
- [tiiny-ai/powerinfer](https://awesome-repositories.com/repository/tiiny-ai-powerinfer.md) (8,714 ⭐) — PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors.

The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
- [goharbor/harbor](https://awesome-repositories.com/repository/goharbor-harbor.md) (28,761 ⭐) — Harbor is a self-hosted, enterprise-grade container registry platform designed to store, sign, and scan container images and cloud-native artifacts. It provides a centralized repository that integrates directly with Kubernetes environments to manage the full lifecycle of software artifacts, from initial storage to production deployment.

The platform distinguishes itself through a focus on security, governance, and multi-site availability. It features a pluggable vulnerability scanning framework that allows for the integration of various security engines, alongside content trust mechanisms tha
- [sgl-project/sglang](https://awesome-repositories.com/repository/sgl-project-sglang.md) (29,079 ⭐) — SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems.

The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
- [duplicati/duplicati](https://awesome-repositories.com/repository/duplicati-duplicati.md) (14,283 ⭐) — Duplicati is a self-hosted backup server designed to perform encrypted, incremental, and compressed backups to a wide range of local, network, and cloud-based storage providers. It functions as a background service that automates recurring data protection tasks, ensuring that only changed data blocks are stored to maximize efficiency and minimize bandwidth usage.

The project distinguishes itself through a centralized management console that allows for the orchestration of multiple distributed backup agents from a single web-based dashboard. It supports multi-tenant management, enabling the or
- [aria42/infer](https://awesome-repositories.com/repository/aria42-infer.md) (176 ⭐) — Inference and machine learning in Clojure.
- [peremartra/large-language-model-notebooks-course](https://awesome-repositories.com/repository/peremartra-large-language-model-notebooks-course.md) (1,808 ⭐) — Practical course about Large Language Models.
- [eleutherai/gpt-neo](https://awesome-repositories.com/repository/eleutherai-gpt-neo.md) (8,275 ⭐) — GPT-Neo is an open-source distributed training framework designed for scaling GPT-2 and GPT-3-style language models across multiple devices using mesh-tensorflow for model parallelism. It provides the infrastructure to train transformer-based language models with billions of parameters across distributed computing environments, making large-scale language model research accessible outside of proprietary systems.

The framework supports training both autoregressive GPT-style models and masked language models like BERT or RoBERTa, with configurable masking strategies and token handling. It inclu
- [exo-explore/exo](https://awesome-repositories.com/repository/exo-explore-exo.md) (45,380 ⭐) — Exo is a distributed inference engine designed to run machine learning models across local hardware. It functions as a network orchestration layer that automatically discovers available devices to form a unified computing cluster, allowing users to scale artificial intelligence workloads by distributing computational tasks across multiple machines.

The platform distinguishes itself through its ability to manage the entire lifecycle of local models while providing a standardized gateway for external applications. By translating local model outputs into industry-standard formats, it enables exi
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through ad
- [h2oai/h2o-3](https://awesome-repositories.com/repository/h2oai-h2o-3.md) (7,493 ⭐) — h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel.

The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
- [fishaudio/fish-speech](https://awesome-repositories.com/repository/fishaudio-fish-speech.md) (24,928 ⭐) — This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns.

The platform distinguishes itself through a production-ready inference server that employs continuous batching to
- [bradyfu/awesome-multimodal-large-language-models](https://awesome-repositories.com/repository/bradyfu-awesome-multimodal-large-language-models.md) (17,892 ⭐) — Latest Advances on Multimodal Large Language Models
- [xorbitsai/inference](https://awesome-repositories.com/repository/xorbitsai-inference.md) (9,358 ⭐) — This project is a platform for the deployment of open source large language and multimodal models. It provides a unified interface to serve text, image, and speech models across local or cloud hardware.

The system enables distributed AI inference by orchestrating model workloads across multiple nodes and devices. It includes a unified API adapter layer to standardize inputs and outputs, as well as tools for multimodal chat and structural image generation.

The platform covers a broad capability surface including request batching for throughput optimization, dynamic model loading, and integrat
- [openrlhf/openrlhf](https://awesome-repositories.com/repository/openrlhf-openrlhf.md) (9,675 ⭐) — OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO.

The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism.

The project
- [google-research/google-research](https://awesome-repositories.com/repository/google-research-google-research.md) (38,139 ⭐) — This repository serves as a comprehensive research platform and toolkit for advancing machine learning, quantum computing, and large-scale scientific data analysis. It provides foundational frameworks for developing complex algorithmic systems, offering the necessary infrastructure for distributed training, computational graph execution, and high-performance model development.

The project distinguishes itself by integrating specialized research domains with robust, privacy-preserving methodologies. It supports diverse scientific discovery through tools for quantum simulation, physics-informed
- [bigscience-workshop/petals](https://awesome-repositories.com/repository/bigscience-workshop-petals.md) (10,208 ⭐) — Petals is a decentralized framework and inference engine for running large language models across a peer-to-peer network. It enables the execution of models that exceed the memory of any single machine by splitting computations and model layers across a collaborative swarm of GPUs.

The system functions as a collaborative compute network where participants share local GPU resources and host model weights. It supports distributed prompt-tuning to adapt massive models to specific tasks and allows for the establishment of private compute swarms to process sensitive data within restricted, trusted
- [nathancahill/split](https://awesome-repositories.com/repository/nathancahill-split.md) (6,278 ⭐) — Unopinionated utilities for resizeable split views
- [trainindata/deploying-machine-learning-models](https://awesome-repositories.com/repository/trainindata-deploying-machine-learning-models.md) (895 ⭐) — Accompanying repo for the online course Deployment of Machine Learning Models.
- [awesomedata/awesome-public-datasets](https://awesome-repositories.com/repository/awesomedata-awesome-public-datasets.md) (75,979 ⭐) — This project is a community-maintained, open-access directory of high-quality public datasets. It serves as a centralized reference point for researchers, developers, and data scientists to locate reliable information sources across a wide spectrum of industries and scientific fields. By providing a structured index, the repository facilitates the discovery of data necessary for exploratory analysis, machine learning model training, and the development of data-intensive applications.

The directory distinguishes itself through a lightweight, platform-agnostic approach to resource indexing that
- [intel/ipex-llm](https://awesome-repositories.com/repository/intel-ipex-llm.md) (8,836 ⭐) — Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats.

The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and
- [ml-explore/mlx](https://awesome-repositories.com/repository/ml-explore-mlx.md) (27,047 ⭐) — This project is a machine learning array framework and tensor computation library designed for high-performance numerical computing. It provides a comprehensive suite of tools for constructing and training neural networks, featuring an automatic differentiation engine that facilitates gradient-based optimization and complex mathematical modeling.

The library distinguishes itself through a unified memory architecture that allows data to be shared across CPU and GPU devices without explicit copies, significantly reducing data movement overhead. Its execution model relies on a lazy evaluation en
- [dask/distributed](https://awesome-repositories.com/repository/dask-distributed.md) (1,671 ⭐) — A distributed task scheduler for Dask
- [facebookresearch/fairseq](https://awesome-repositories.com/repository/facebookresearch-fairseq.md) (32,228 ⭐) — Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning.

The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
- [apache/mxnet](https://awesome-repositories.com/repository/apache-mxnet.md) (20,829 ⭐) — This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs.

The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
- [google/flax](https://awesome-repositories.com/repository/google-flax.md) (7,238 ⭐) — Flax is a deep learning framework and JAX neural network library designed for building complex machine learning models. It functions as a distributed training library and model state manager, providing a toolkit for defining flexible neural network architectures and scaling their training across multiple hardware devices.

The project is characterized by a design that separates network logic from parameter values to remain compatible with pure functions. It uses hierarchical module composition to organize networks as trees of nested modules and employs a reference-based state management system
- [mlcommons/inference](https://awesome-repositories.com/repository/mlcommons-inference.md) (1,582 ⭐) — Reference implementations of MLPerf® inference benchmarks
- [bertrandg/angular-split](https://awesome-repositories.com/repository/bertrandg-angular-split.md) (930 ⭐) — 🍌 Angular UI library to split views and allow dragging to resize areas using CSS grid layout.
- [infrasys-ai/aiinfra](https://awesome-repositories.com/repository/infrasys-ai-aiinfra.md) (7,414 ⭐)
- [braydie/howtobeaprogrammer](https://awesome-repositories.com/repository/braydie-howtobeaprogrammer.md) (16,218 ⭐) — HowToBeAProgrammer is a comprehensive software engineering career guide and professional development framework. It serves as a curated-knowledge repository and handbook designed to help programmers acquire technical habits and social competencies necessary for professional advancement.

The project distinguishes itself by integrating technical craftsmanship with a detailed manual for technical leadership and organizational navigation. It provides specific strategies for career progression, such as compensation negotiation, promotion readiness, and the management of professional boundaries to p
- [tensorflow/tensorflow](https://awesome-repositories.com/repository/tensorflow-tensorflow.md) (195,697 ⭐) — TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics.

The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads acr
- [avajs/ava](https://awesome-repositories.com/repository/avajs-ava.md) (20,849 ⭐) — Ava is a test runner for JavaScript and TypeScript designed to execute test suites with a focus on concurrency and isolation. It serves as a concurrent test executor that runs test files in parallel across multiple processes to reduce total runtime and prevent state leakage between suites.

The project features a built-in snapshot testing framework that saves large data structures to disk and compares subsequent executions to detect regressions via diffs. It is also compatible with the Test Anything Protocol, allowing it to export results for use with external reporting tools.

Its capability
- [abacaj/mpt-30b-inference](https://awesome-repositories.com/repository/abacaj-mpt-30b-inference.md) (574 ⭐) — Run inference on the MPT-30B model using your CPU. The inference code uses a ggml-quantized model and runs it through ctransformers, a library that provides Python bindings to ggml.
- [triton-inference-server/server](https://awesome-repositories.com/repository/triton-inference-server-server.md) (10,768 ⭐) — Triton Inference Server is a high-performance server designed to deploy machine learning models from multiple frameworks across GPUs and CPUs. It functions as a hardware-accelerated inference engine and a gRPC inference gateway, providing a standardized communication layer for transmitting binary tensor data with low latency.

The system acts as a multi-framework model orchestrator, allowing users to link multiple AI models into ensembles and scripts to create complex inference pipelines. It also serves as a model lifecycle manager, providing controls to load, unload, and monitor the performan
- [meta-llama/llama3](https://awesome-repositories.com/repository/meta-llama-llama3.md) (29,254 ⭐) — Llama 3 is a collection of pretrained, autoregressive transformer-based models designed for natural language generation, reasoning, and complex instruction following. It functions as a generative AI framework that provides the infrastructure for managing model weights, executing neural network inference, and handling computational workloads across diverse knowledge domains.

The project distinguishes itself through an integrated AI safety toolkit that employs secondary classification filtering to inspect inputs and outputs, ensuring adherence to usage compliance and safety standards. It suppor
- [state-spaces/mamba](https://awesome-repositories.com/repository/state-spaces-mamba.md) (17,215 ⭐) — Mamba is a deep learning framework designed for building and training sequence models that process long-range data dependencies with linear-time computational efficiency. By utilizing selective state space modeling, the library enables the construction of neural network architectures that replace traditional attention mechanisms with high-performance state space operations.

The framework distinguishes itself through the use of data-dependent state gating, which allows the model to dynamically filter information flow based on the input sequence. To ensure high throughput, it incorporates hardw
- [sciruby/distribution](https://awesome-repositories.com/repository/sciruby-distribution.md) (51 ⭐) — Probability distributions for Ruby.
- [eczarny/spectacle](https://awesome-repositories.com/repository/eczarny-spectacle.md) (13,631 ⭐) — Spectacle is a keyboard-driven window manager and organizer that uses system accessibility frameworks to manipulate window coordinates and dimensions. It allows for the arrangement, resizing, and movement of application windows across multiple displays using global keyboard shortcuts.

The tool focuses on multi-monitor layout management, enabling users to shift active windows between connected displays and snap windows into predefined screen regions such as halves, thirds, or corners. It also provides the ability to center and maximize windows to optimize screen real estate without using a mou
- [maxogden/binary-split](https://awesome-repositories.com/repository/maxogden-binary-split.md) (79 ⭐) — A fast newline (or any delimiter) splitter stream, like `require('split')` but specific to binary data.
- [hpcaitech/colossalai](https://awesome-repositories.com/repository/hpcaitech-colossalai.md) (41,395 ⭐) — ColossalAI is a distributed deep learning framework designed for training and deploying massive artificial intelligence models across clusters of hardware accelerators. It functions as a parallel computing engine that partitions model workloads and data across multiple processors to maximize memory efficiency and throughput.

The platform distinguishes itself through a comprehensive suite of parallelization strategies, including multi-dimensional tensor parallelism and pipeline-based model parallelism, which segment neural network layers and stages across devices. To support large-scale genera
- [a-m-team/a-m-models](https://awesome-repositories.com/repository/a-m-team-a-m-models.md) (196 ⭐)
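
Several entries above (distributed-llama, text-generation-inference, ipex-llm) describe tensor parallelism as sharding model weights across hosts or accelerators. The idea can be illustrated with a minimal, single-process sketch: a weight matrix is split column-wise, each simulated worker multiplies the input by its own shard, and the partial outputs are concatenated. This is not the API of any listed project. The function names `shard_columns` and `tensor_parallel_matmul` are hypothetical, and real engines exchange the partial results over a network or an interconnect rather than inside one Python process.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def shard_columns(w: Matrix, num_workers: int) -> List[Matrix]:
    """Split the weight matrix column-wise into contiguous shards, one per worker."""
    cols = len(w[0])
    base, extra = divmod(cols, num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        # The first `extra` workers take one additional column each.
        width = base + (1 if rank < extra else 0)
        shards.append([row[start:start + width] for row in w])
        start += width
    return shards


def tensor_parallel_matmul(x: Matrix, w: Matrix, num_workers: int) -> Matrix:
    """Hypothetical helper: each simulated worker holds only its shard of w."""
    # Every worker sees the full input but only a slice of the weights.
    partials = [matmul(x, shard) for shard in shard_columns(w, num_workers)]
    # Concatenating the per-worker outputs row by row rebuilds the full result.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
    # The sharded result should match the single-device result.
    print(tensor_parallel_matmul(x, w, num_workers=2) == matmul(x, w))
```

Because each worker stores only its slice of the weights, the per-worker memory footprint shrinks roughly in proportion to the number of workers, which is what lets a model exceed the memory of a single device.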