Distributed Model Inference Frameworks

Open-source tools and libraries for splitting and running large machine learning models across multiple networked machines.


intel/ipex-llm
GitHub stars: 8,836
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XPU backends, including the ability to execute large Mixture-of-Experts models on consumer-grade hardware and perform NPU-specific model conversion. The library covers a broad range of capabilities, including inference optimization via speculative decoding and KV-cache compression, workload distribution through tensor and pipeline parallelism, and the deployment of local retrieval-augmented generation pipelines. It also supports multimodal execution for visual question answering and audio transcription, alongside OpenAI-compatible API serving.
This library provides distributed inference capabilities including tensor and pipeline parallelism for scaling large models across multiple Intel-based accelerators, fitting the requirements for distributed model execution.
Python · Tensor Parallelism
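The tensor parallelism mentioned in this entry can be illustrated with a minimal pure-Python sketch. It is not ipex-llm code and uses none of its APIs: the helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical, and the "devices" are simulated by list slices. The idea shown is the column-parallel split, where each device holds a slice of the weight matrix's columns and the partial outputs are concatenated afterwards.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w column-wise into `parts` contiguous shards (one per simulated device)."""
    n_cols = len(w[0])
    bounds = [n_cols * i // parts for i in range(parts + 1)]
    return [[row[lo:hi] for row in w] for lo, hi in zip(bounds, bounds[1:])]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Column-parallel linear layer: every shard sees the full input x."""
    shards = split_columns(w, parts)
    # In a real system each product would run on a different accelerator.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for an all-gather: concatenate partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
    print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

The sharded result matches the unsharded product because each output column depends on only one column of the weight matrix, which is why this split needs communication only when the outputs are gathered.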
EleutherAI/gpt-neox
GitHub stars: 7,392
gpt-neox is a distributed training system and framework for building large-scale autoregressive language models. It implements the transformer architecture and provides a toolkit for training models with billions of parameters by distributing weights across compute clusters. The framework distinguishes itself through extensive support for distributed model parallelism, including pipeline and sequence parallelism, to overcome single-device memory limits. It further supports sparse model architectures using a mixture of experts system with Sinkhorn-based routing. The project covers a broad range of capabilities, including data processing for dataset blending and tokenization, RLHF model alignment, and text generation with stochastic sampling. It also includes tools for transformer representation analysis, model checkpoint conversion, and hardware-specific performance optimizations such as fused-kernel attention mechanisms. Monitoring and observability are handled through integrated training metrics logging, resource utilization profiling, and standardized language model evaluation.
This is a distributed training framework designed for building and training large-scale language models, rather than a tool specifically optimized for serving or running inference across multiple nodes.
Python · Model Parallelism Frameworks · Pipeline Parallelism Partitioners
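Pipeline parallelism, named in this entry, places consecutive groups of layers on different devices and streams micro-batches through them. The sketch below is a toy simulation under stated assumptions, not gpt-neox code: `run_pipeline` is a hypothetical helper, each stage is an ordinary Python function, and a "tick" stands in for one time step in which every stage may work on a different micro-batch.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[float], float]


def run_pipeline(stages: Sequence[Stage], micro_batches: Sequence[float]) -> Tuple[List[float], int]:
    """Simulate a simple forward-only pipeline schedule.

    At tick t, stage s processes micro-batch t - s, so micro-batch m visits
    stage 0 at tick m, stage 1 at tick m + 1, and so on.
    """
    n_stages, n_micro = len(stages), len(micro_batches)
    in_flight = list(micro_batches)   # current activation of each micro-batch
    ticks = n_micro + n_stages - 1    # time to fill and then drain the pipeline
    for t in range(ticks):
        for s in range(n_stages):
            m = t - s
            if 0 <= m < n_micro:
                in_flight[m] = stages[s](in_flight[m])
    return in_flight, ticks


if __name__ == "__main__":
    stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    outputs, ticks = run_pipeline(stages, [1.0, 2.0, 3.0, 4.0])
    print(outputs, ticks)
```

With 3 stages and 4 micro-batches the schedule takes 6 ticks rather than the 12 a strictly sequential run would need, which is the motivation for splitting a batch into micro-batches.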
EricLBuehler/mistral.rs
GitHub stars: 6,597
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool execution framework that runs server-side tools like code execution, shell commands, and web search in an automated loop during model generation, with session state persistence. It provides an in-process inference engine that can be embedded directly into Rust or Python applications without a separate server process, and includes an in-situ quantization engine that converts model weights to lower precision at load time with per-layer tuning. The system supports structured output constraints, forcing model output to conform to JSON Schema or grammar specifications during decoding, and offers automatic architecture detection that identifies model type, quantization format, and chat template from a Hugging Face model ID. The platform includes capabilities for managing LoRA adapters, composing models as mixture-of-experts configurations, and running distributed inference across multiple GPUs or nodes using tensor parallelism and ring transport. It provides a built-in web chat interface, supports speculative decoding with a smaller assistant model, and offers benchmarking, logging, and Prometheus metrics for monitoring. The project can be run from a configuration file, with options for customizing build processes, tuning hardware settings automatically, and managing model caches.
This inference engine supports distributed model inference through tensor parallelism and ring transport across multiple nodes, providing the core orchestration and partitioning capabilities required for multi-machine model execution.
Rust · Local Model Serving · Adapter-Aware Runtimes · Agent Tool Execution
redis/rueidis
GitHub stars: 2,899
Rueidis is a high-performance Redis client library for Go that provides a type-safe and asynchronous interface for interacting with Redis servers. It includes a full implementation of the Redis serialization protocol and a dedicated connection manager to handle pooling, multiplexing, and automatic pipelining. The library is distinguished by its support for RDMA connectivity to reduce latency and CPU overhead. It features a distributed lock manager that implements majority-based locking and optimistic concurrency control, as well as client-side caching with invalidation signals to minimize network round trips. The project covers a wide range of capabilities, including the management of complex data structures such as Bloom filters, bitmaps, and JSON documents. It provides integrated support for Pub/Sub messaging, indexed object search, and reliability features like exponential backoff retries and cache stampede prevention. Observability is integrated through command performance tracing, network instrumentation, and cache monitoring.
This is a high-performance Redis client library for Go, which serves as a building block for distributed systems but does not provide the model partitioning or orchestration capabilities required for distributed machine learning inference.
Go · RDMA Networking · RDMA Protocol Implementations
Tiiny-AI/PowerInfer
GitHub stars: 8,714
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by exploiting activation sparsity to skip computations for inactive neurons. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for integrating local models with existing chat clients. The project covers broad capability areas including distributed model inference across multiple nodes, GPU hardware acceleration for Apple Metal and other processors, and structured text generation using formal grammars to constrain outputs. It also implements memory management techniques such as hybrid memory offloading, weight quantization, and CPU core affinity binding.
PowerInfer is a high-performance inference engine that supports distributed model execution across multiple nodes and GPUs, providing the core partitioning and orchestration capabilities required for distributed inference.
C++ · Local Inference Engines · Sparse Model Architectures · Apple Hardware Acceleration
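The activation-sparsity idea in this entry can be shown with a small pure-Python sketch. It is an illustration only and not PowerInfer's implementation: the function names and the tiny ReLU feed-forward layer are assumptions made for the example. After the ReLU, hidden units whose activation is zero contribute nothing to the output, so their rows of the down-projection can be skipped.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def hidden_activations(x: Vector, w_up: Matrix) -> Vector:
    """ReLU(w_up @ x): one activation per hidden neuron."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]


def ffn_dense(x: Vector, w_up: Matrix, w_down: Matrix) -> Vector:
    """Reference path: accumulate every hidden neuron's contribution."""
    out = [0.0] * len(w_down[0])
    for h, row in zip(hidden_activations(x, w_up), w_down):
        for k, w in enumerate(row):
            out[k] += h * w
    return out


def ffn_sparse(x: Vector, w_up: Matrix, w_down: Matrix) -> Tuple[Vector, int]:
    """Sparse path: only neurons with a nonzero activation touch w_down."""
    hidden = hidden_activations(x, w_up)
    active = [j for j, h in enumerate(hidden) if h > 0.0]
    out = [0.0] * len(w_down[0])
    for j in active:
        for k, w in enumerate(w_down[j]):
            out[k] += hidden[j] * w
    return out, len(active)


if __name__ == "__main__":
    x = [1.0, -1.0]
    w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w_down = [[2.0, 3.0], [5.0, 7.0], [11.0, 13.0]]
    print(ffn_sparse(x, w_up, w_down), ffn_dense(x, w_up, w_down))
```

In this toy case only one of three hidden neurons is active, so the sparse path reads a third of the down-projection rows while producing the same output as the dense path.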
huggingface/transformers
GitHub stars: 161,630
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and performance, including techniques like quantization, speculative decoding, and paged memory management for key-value caches. It provides native integration for distributed training across multi-node clusters, as well as flexible APIs for serving models via compatible inference servers. Developers can also utilize built-in utilities for model patching, custom kernel execution, and automated documentation generation to streamline development workflows.
This library provides the foundational tools and APIs for model partitioning and distributed inference, though it typically relies on external serving engines or orchestration layers to handle multi-node cluster management.
Python · API Frameworks · Byte Pair Encodings · Hybrid
Less-relevant matches (scored below the primary cut)
horovod/horovod
GitHub stars: 14,686
Horovod is a distributed deep learning framework and gradient synchronizer designed to scale model training across multiple GPUs and compute nodes. It functions as a distributed training orchestrator and an elastic training engine, utilizing an MPI collective communication library to synchronize weights and gradients across TensorFlow, PyTorch, Keras, and MXNet models. The system distinguishes itself through dynamic elastic scaling, which allows it to adjust the number of active workers at runtime and recover from node failures. It optimizes communication efficiency using tensor fusion batching and half-precision gradient compression to reduce network bandwidth requirements. The framework covers a broad set of capabilities including cluster orchestration across Kubernetes, Spark, and Ray, as well as hardware-aware resource mapping for CPUs and GPUs. It provides tools for distributed data management, such as parallel loading from Parquet files and offloaded preprocessing. Performance is further supported by RDMA network optimization, execution tracing, and Bayesian training optimization to maximize throughput. Deployment is supported through containerized training images and orchestrated environments for high-performance compute clusters.
This is a distributed training framework designed for scaling model weight synchronization during the learning process, rather than a tool for partitioning and serving large models for inference.
Python · Multi-node Orchestration · RDMA Networking
ml-explore/mlx
GitHub stars: 27,047
This project is a machine learning array framework and tensor computation library designed for high-performance numerical computing. It provides a comprehensive suite of tools for constructing and training neural networks, featuring an automatic differentiation engine that facilitates gradient-based optimization and complex mathematical modeling. The library distinguishes itself through a unified memory architecture that allows data to be shared across CPU and GPU devices without explicit copies, significantly reducing data movement overhead. Its execution model relies on a lazy evaluation engine and graph-based operation recording, which enables kernel fusion compilation to merge multiple operations into optimized execution units. These capabilities are complemented by stream-based execution control, which manages hardware-level concurrency to maximize throughput during intensive tensor processing. Beyond its core execution model, the framework supports a broad range of capabilities including distributed sharding infrastructure for scaling workloads across multiple devices, and extensive utilities for model weight management and serialization. It provides a deep library of mathematical and statistical operations, alongside specialized functions for quantized matrix multiplication and autoregressive text generation. The project is implemented in C++ and includes build-time configuration options to tailor hardware backends and compilation settings for specific deployment environments.
This is a high-performance array and tensor computation library designed for local hardware acceleration rather than a distributed orchestration framework for partitioning models across multiple physical nodes.
C++ · Distributed Parameter Sharding · Tensor Parallelism · Distributed Execution Runtimes
cloudflare/quiche
GitHub stars: 11,563
This project is a memory-safe implementation of the QUIC transport protocol and HTTP/3, designed for high-throughput and efficient network communication. It provides a comprehensive toolkit for building secure, low-latency network applications by managing the full lifecycle of transport connections, including protocol negotiation, stream data exchange, and connection state management. The library distinguishes itself through a focus on performance and protocol integrity. It utilizes a formal state machine to enforce strict adherence to transport rules and employs zero-copy buffer management to minimize CPU overhead by mapping application memory directly to network buffers. To ensure resilience, it features modular congestion control, allowing for pluggable strategies, and stateless handshake validation to verify peer addresses before allocating server resources. The project covers a broad capability surface, including advanced traffic management, path discovery, and detailed observability tools for monitoring connection health and performance metrics. It provides granular control over security primitives, such as TLS certificate management and session resumption, while supporting specialized features like unreliable datagram delivery and multi-path routing. The implementation is written in Rust, providing a robust foundation for developers building high-performance web servers, clients, or experimental transport layer features.
This is a low-level network transport library for the QUIC protocol, which could be used as a communication building block for distributed systems but does not provide model partitioning or inference orchestration.
Rust · Low-Latency Data Transmission · Zero-Copy Networking
axolotl-ai-cloud/axolotl
GitHub stars: 12,059
Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies. The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation, and reinforcement learning alignment. It provides specialized capabilities for multimodal model training, allowing for the integration of text, image, and media inputs. Furthermore, the framework includes advanced optimization tools such as quantization-aware training, which simulates precision loss to maintain model accuracy, and dynamic reward signal integration for aligning model behavior with human preferences. The framework covers a broad capability surface, including data management, performance optimization, and model lifecycle management. It handles data ingestion, preprocessing, and streaming, while offering advanced techniques like sequence packing and replay buffers to improve training efficiency. Performance is managed through distributed parallelism strategies, memory-efficient training pipelines, and custom kernel implementations. The project provides pre-configured container images to ensure consistent deployment across local and cloud-based compute environments. Users can manage the entire model lifecycle, from initial configuration and training to adapter merging and final inference execution.
This framework is designed for distributed model training and fine-tuning rather than serving as a dedicated engine for distributed model inference.
Python · Model Parallelism Strategies · Distributed Training Sharding
zai-org/ChatGLM-6B
GitHub stars: 41,039
ChatGLM-6B is a generative AI inference engine designed for local execution of transformer-based language models. It provides a comprehensive runtime environment that allows users to load and run pre-trained neural network weights directly on their own hardware, ensuring data privacy and independence from external cloud services. The project distinguishes itself through a hardware-agnostic execution backend that supports deployment across diverse environments, including standard processors, Apple Silicon, and multi-GPU configurations. It incorporates advanced optimization techniques such as weight quantization and parameter-efficient fine-tuning via low-rank adaptation, which significantly reduce memory requirements and computational overhead. These features enable the deployment of large models on consumer-grade hardware while maintaining high throughput and performance. Beyond core inference, the toolkit includes a suite of utilities for programmatic integration, allowing developers to embed model capabilities into custom software workflows via standard interfaces. It also provides multiple interactive interfaces, including web-based graphical environments for text and vision tasks and a command-line interface for rapid prototyping and evaluation. The software is distributed as a Python-based package, requiring standard environment configuration to manage dependencies and hardware resource allocation.
This is a local inference engine designed for running models on single-machine hardware configurations rather than a distributed framework for orchestrating model partitioning across multiple physical nodes.
Python · Tensor Parallelism
apache/brpc
GitHub stars: 17,545
brpc is a high-performance C++ RPC framework and network programming library designed for building distributed systems. It functions as a multi-protocol RPC server capable of hosting and detecting multiple communication protocols, including gRPC, Thrift, HTTP, Redis, and Memcached, on a single TCP port. The project distinguishes itself through high-throughput data transport and memory efficiency, utilizing RDMA-based transport to bypass the kernel TCP stack and zero-copy memory management to eliminate data duplication. It also implements the Raft algorithm for consensus-based state replication to maintain consistency and high availability across distributed nodes. The framework provides a broad suite of capabilities for distributed system management, including dynamic service discovery via Consul or DNS, advanced traffic management with latency-based routing and circuit breaking, and comprehensive observability through Prometheus integration and built-in performance profiling. It also supports various communication patterns such as bi-directional streaming, asynchronous execution, and RESTful traffic serving.
This is a high-performance RPC framework used for building distributed systems, but it lacks the specific model partitioning and inference orchestration logic required for a distributed machine learning framework.
C++ · RDMA Networking · RPC Frameworks
facebookresearch/fairseq
GitHub stars: 32,228
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specialized tools for data engineering, such as parallel data mining for unsupervised learning and back-translation for expanding training corpora. Its capability surface extends to comprehensive inference and generation tools, including beam search and lexical constraint enforcement, as well as model compression techniques like layer pruning and product quantization. The toolkit also provides utilities for feature extraction, model evaluation via metrics like perplexity and BLEU scores, and a registry-based system for extending models and tasks. Training and evaluation workflows are managed through a command-line interface that orchestrates hyperparameter configuration and model execution.
Fairseq is a comprehensive toolkit for training and evaluating sequence-to-sequence models, but it focuses on model training and research rather than providing a dedicated framework for distributed model inference across multiple nodes.
Python · Distributed Parameter Sharding · Distributed Training Sharding
OpenRLHF/OpenRLHF
GitHub stars: 9,675
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project covers a broad range of capabilities, including supervised fine-tuning, reward model development, and the training of multi-turn agents. It incorporates memory optimization techniques such as low-rank adaptation, optimizer state offloading, and sample packing to reduce compute overhead.
This is a reinforcement learning and model training framework designed for alignment, rather than a dedicated distributed inference engine for serving large models to end-users.
Python · Distributed Training Sharding
