High-Throughput LLM Inference Servers

Production-grade software frameworks designed to serve large language models with optimized latency and high throughput.

Find the best repos with AI.We'll search the best matching repositories with AI.

qwenlm/qwen
QwenLM/Qwen
21,294View on GitHub
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-performance inference engine that exposes OpenAI-compatible HTTP endpoints, allowing for integration into existing application architectures. To support complex workflows, it includes native capabilities for agentic tool use and function calling, which can be further refined through dedicated fine-tuning processes. The platform covers a broad range of operational requirements, including model quantization, multi-device tensor parallelism, and memory-efficient key-value caching to optimize throughput and resource usage. It also provides robust utilities for benchmarking performance, managing system-level behaviors, and securing model endpoints through authentication and safety-aligned configurations. The repository includes extensive documentation and scripts for model weight conversion, vocabulary expansion, and deployment across both CPU and GPU hardware.
Qwen provides a high-performance inference engine with OpenAI-compatible endpoints, batching, and multi-GPU support, making it a capable tool for serving LLMs in production environments.
PythonModel QuantizationModel Quantization Utilities
View on GitHub21,294
ggml-org/llama.cpp
ggml-org/llama.cpp
116,799View on GitHub
Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU architectures. The project distinguishes itself by offering a lightweight HTTP server that adheres to standard API specifications, enabling chat completion, embeddings, and reranking services. It includes a suite of tools for model quantization and conversion, which reduces memory usage and improves performance, alongside a command-line interface for managing chat templates and inference parameters. The ecosystem further supports structured data generation through grammar-based output constraints and provides diagnostic utilities for visualizing computational graphs. Comprehensive documentation is available, including a reference matrix that details the compatibility of computational operations across supported hardware backends.
This is a high-performance inference engine that provides a lightweight HTTP server and quantization tools, though it is primarily optimized for local execution on consumer hardware rather than the high-throughput, multi-model distributed serving typically required for large-scale production environments.
C++Model Quantization Tools
View on GitHub116,799
microsoft/bitnet
microsoft/BitNet
39,327View on GitHub
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weight permutation, the engine improves cache locality and computational density. These capabilities are specifically tuned to accelerate autoregressive decoding, minimizing latency during the sequential token generation process to support real-time text generation requirements. The toolkit includes a comprehensive suite for hardware-accelerated neural computation, allowing users to benchmark inference kernels and measure generation latency against baseline implementations. These tools ensure that the inference pipeline maintains high throughput and efficiency when processing compressed models on supported graphics hardware.
This repository provides specialized kernels and optimization tools for quantized inference rather than serving as a full-featured production API server capable of multi-model management and distributed inference.
PythonModel QuantizationModel Quantization ToolsModel Quantization Utilities
View on GitHub39,327
thu-pacman/chitu
thu-pacman/chitu
3,265View on GitHub
Chitu is a distributed serving platform and orchestrator for large language model inference. It functions as a compute manager designed to deploy and scale model workloads across diverse hardware architectures, including GPUs, CPUs, and heterogeneous hardware clusters. The platform enables model deployment across a wide range of targets, including NVIDIA GPUs, regional chipsets, and legacy hardware. It manages the execution of models across these varying environments to increase available computing capacity and optimize resource utilization. The system includes capabilities for distributed inference orchestration and heterogeneous hardware scaling, allowing models to run on configurations ranging from single devices to large production clusters. It also incorporates concurrent traffic management and request queueing to maintain stability during high-demand workloads.
Chitu is a distributed serving platform designed to orchestrate and scale LLM inference across heterogeneous hardware clusters, providing the high-throughput management and distributed execution required for production environments.
PythonLLM Serving ArchitecturesCross-Hardware Model InferenceDistributed Inference Orchestrators
View on GitHub3,265
deepseek-ai/deepseek-coder
deepseek-ai/DeepSeek-Coder
22,804View on GitHub
DeepSeek-Coder is a large language model and foundational neural network architecture designed specifically for software development tasks. It functions as an artificial intelligence assistant capable of interpreting complex programming instructions to generate, transpile, and structure source code. The system distinguishes itself through its ability to perform project-level code generation, analyzing broader context and patterns across entire software projects rather than isolated files. It supports multimodal input processing, allowing for the integration of text and visual data to inform its code generation and analysis workflows. The platform covers a comprehensive range of development capabilities, including automated code refactoring, conversational assistance, and high-performance model serving. It provides utilities for training custom models, fine-tuning on specialized datasets, and managing inference at scale through distributed tensor parallelism and mixed-precision operations.
This repository provides a specialized large language model for code generation that includes built-in utilities for distributed inference, tensor parallelism, and high-throughput serving, making it a functional tool for deploying this specific model in production environments.
PythonAI Coding AssistantsAI-Assisted DevelopmentGenerative Code Assistants
View on GitHub22,804
mudler/localai
mudler/LocalAI
46,889View on GitHub
LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services. The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and dependencies, ensuring consistent deployment across diverse hardware configurations. To optimize system performance, the server employs an on-demand orchestration layer that dynamically loads and unloads models based on active requests, minimizing memory usage during periods of inactivity. The system supports a wide range of model architectures through a flexible backend abstraction that allows for driver switching at runtime. Users can manage their models and interact with the service through a web interface or via standard web requests, which the proxy translates into model-specific execution commands. The software is distributed as a containerized application to facilitate deployment across various server and cloud environments.
LocalAI is a self-hosted inference server that provides a unified API for running various machine learning models, making it a capable tool for local and production-adjacent model serving despite its primary focus on ease of use over extreme high-throughput optimization.
GoInference ServersLocal Inference EnginesLocal Model Serving
View on GitHub46,889
unslothai/unsloth
unslothai/unsloth
66,628View on GitHub
Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware. The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fine-tuning, while offering a unified web-based interface for no-code model training, data preparation, and real-time performance monitoring. Beyond its core training capabilities, the project includes a local inference runtime that supports API-based deployment, tool-calling, and automated output verification. It manages the entire model development process, from dataset generation and hyperparameter configuration to model exporting and performance benchmarking across diverse hardware configurations. The software provides setup utilities for local development environments and includes diagnostic tools to assist with installation and hardware compatibility.
While primarily focused on model fine-tuning and local development, this platform includes a capable inference runtime with API support that serves as a viable tool for executing models in production-adjacent environments.
PythonLanguage Model TrainingCustom Kernel AcceleratorsEfficient Training Pipelines
View on GitHub66,628
Less-relevant matchesScored below the primary cut
intel/neural-compressor
intel/neural-compressor
2,585View on GitHub
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model execution. The toolkit covers a broad range of optimization capabilities, including post-training quantization, mixed-precision layer mapping, and graph operation fusion. It also includes automated performance tuning to discover optimal configuration settings for specific hardware targets.
This is a model optimization and compression toolkit used to prepare models for deployment, rather than a production-ready inference server that handles API requests, batching, and multi-model serving.
PythonModel QuantizationGPU AccelerationModel Quantization Tools
View on GitHub2,585
qwenlm/qwen3
QwenLM/Qwen3
27,324View on GitHub
Qwen3 is a transformer-based large language model designed as a generative AI foundation for understanding, reasoning, and generating human language. It functions as a comprehensive ecosystem for model training, fine-tuning, and production-ready inference, providing the underlying architecture and weights necessary to build diverse artificial intelligence applications. The project distinguishes itself through extensive support for model quantization and distributed inference, enabling efficient execution across a wide range of hardware from consumer-grade devices to scalable cloud infrastructure. It includes a specialized toolkit for weight compression and memory optimization, such as key-value cache management, which reduces computational requirements while maintaining performance. Furthermore, the model integrates with agentic frameworks, allowing for the development of autonomous systems capable of executing complex workflows and interacting with external tools. The ecosystem covers a broad surface of deployment and training methodologies, including standardized interfaces for modular plugin integration and function calling. It provides extensive documentation for various training, fine-tuning, and serving environments to facilitate integration into existing software stacks.
This repository provides the model weights and training ecosystem for a specific large language model rather than serving as a standalone, model-agnostic inference server designed for high-throughput production deployment.
PythonModel QuantizationModel Quantization Tools
View on GitHub27,324
bigscience-workshop/petals
bigscience-workshop/petals
10,208View on GitHub
Petals is a decentralized framework and inference engine for running large language models across a peer-to-peer network. It enables the execution of models that exceed the memory of any single machine by splitting computations and model layers across a collaborative swarm of GPUs. The system functions as a collaborative compute network where participants share local GPU resources and host model weights. It supports distributed prompt-tuning to adapt massive models to specific tasks and allows for the establishment of private compute swarms to process sensitive data within restricted, trusted networks. The platform manages distributed layer execution and pipeline-parallel inference, utilizing distributed hash tables for peer discovery and circuit relays to bypass firewalls. It includes mechanisms for dynamic block hosting and remote weight streaming to optimize how model parameters are loaded and distributed across the swarm. The software is implemented in Python.
Petals is a decentralized, peer-to-peer framework for running massive models across distributed hardware, which differs from the centralized, high-throughput production inference servers typically used for low-latency API serving.
PythonDistributed Inference Engines
View on GitHub10,208
vllm-project/llm-compressor
vllm-project/llm-compressor
2,764View on GitHub
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language, and Audio-Language models. The toolkit covers a broad range of optimization capabilities, including calibration-based and data-free quantization, checkpoint format conversion, and the reduction of precision for attention mechanisms and key-value caches. It manages these processes through structured compression recipes and orchestration pipelines to standardize model preparation and optimization.
This is a model compression and quantization toolkit used to prepare models for deployment, rather than an inference server designed to handle live requests, batching, or API serving.
PythonModel Quantization
View on GitHub2,764
nomic-ai/gpt4all
nomic-ai/gpt4all
77,375View on GitHub
GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a comprehensive ecosystem for managing the entire model lifecycle, including discovery, downloading, and configuration of local weights. What distinguishes the platform is its integrated retrieval-augmented generation engine, which allows users to index local documents into semantic vector spaces. This capability enables context-aware chat sessions where the model can reference private files, notes, and spreadsheets to provide grounded, relevant responses. The system also features a local HTTP server that exposes an OpenAI-compatible API, allowing developers to integrate these private, self-hosted models into existing applications and workflows. Beyond its core inference and retrieval capabilities, the project includes a graphical desktop interface for end-user interaction and a Python software development kit for programmatic access. These tools support advanced configuration of model parameters, performance monitoring, and the management of local embedding pipelines for custom semantic search tasks. The software is distributed as a unified application package, with documentation available to guide users through installation and local environment setup.
This project is a consumer-focused desktop application designed for local, offline model execution rather than a high-performance server architecture built for production-grade throughput and distributed inference.
C++OpenAI-CompatibleLocal API Servers
View on GitHub77,375
karpathy/nanochat
karpathy/nanochat
55,103View on GitHub
Nanochat is a lightweight execution environment designed for training and running language models on standard consumer hardware. It functions as both a neural network training framework and an inference engine, enabling users to perform backpropagation-based training and model execution directly on general-purpose processors without the need for dedicated graphics hardware. The project distinguishes itself through a suite of optimization tools that prioritize efficiency on local machines. By utilizing memory-mapped weight loading and CPU-optimized vector math, it maximizes throughput for interactive sessions. Furthermore, the framework includes a quantization toolkit that allows users to adjust the numerical precision of weights and activations, effectively balancing memory consumption against computational speed. The platform supports a range of capabilities for transformer architecture experimentation, including the configuration of training parameters and the management of local data pipelines. It employs a stateless generation loop to process tokens through self-contained execution cycles, facilitating the development and fine-tuning of custom models in a private, local environment.
This project is a local training and experimentation framework designed for consumer hardware rather than a high-performance production inference server capable of multi-model serving or distributed inference.
PythonQuantization Tools
View on GitHub55,103
h2oai/h2o-3
h2oai/h2o-3
7,493View on GitHub
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including image classification, anomaly detection, and image reconstruction and clustering. The system covers a broad range of capabilities including large-scale data processing via map-reduce and distributed key-value stores, and model explainability analysis to interpret predictions. Its model management suite supports the serialization of trained models into standalone artifacts for high-performance production scoring, alongside a registry for model logging and lifecycle orchestration. Deployment and orchestration are supported via Kubernetes stateful sets, Hadoop integration, and a web-based management interface.
This is a distributed machine learning and AutoML platform designed for training and scoring traditional predictive models, rather than a specialized inference server optimized for serving Large Language Models.
Jupyter NotebookDistributed Inference EnginesModel Orchestrators
View on GitHub7,493
blockrunai/clawrouter
BlockRunAI/ClawRouter
3,020View on GitHub
ClawRouter is an AI model router and API gateway designed to classify query complexity and assign prompts to the most efficient model tier. It operates as a multi-model AI proxy that orchestrates traffic between various large language models and AI media generators through a unified interface. The project distinguishes itself by integrating a non-custodial micropayment processor using the x402 protocol. This allows for per-request API access and USDC settlement on Base and Solana chains, replacing static API keys with wallet-based authentication and real-time budget enforcement. The system covers broad capability areas including automated model failover management, response caching for cost control, and multi-modal proxying for text, image, and video generation. It further integrates blockchain data services, enabling on-chain SQL querying, financial market data retrieval, and wallet intelligence analysis. The project is implemented as a local AI proxy server that intercepts and routes standardized AI API requests to various backends.
This is an AI model router and API proxy designed for traffic orchestration and cost management rather than the high-performance GPU-accelerated inference serving required for hosting LLMs.
TypeScriptModel Routers
View on GitHub3,020
huggingface/transformers
huggingface/transformers
161,630View on GitHub
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and performance, including techniques like quantization, speculative decoding, and paged memory management for key-value caches. It provides native integration for distributed training across multi-node clusters, as well as flexible APIs for serving models via compatible inference servers. Developers can also utilize built-in utilities for model patching, custom kernel execution, and automated documentation generation to streamline development workflows.
This is a comprehensive machine learning library for training and model management rather than a dedicated, production-ready inference server designed for high-throughput model serving.
PythonModel Quantization
View on GitHub161,630
openbmb/voxcpm
OpenBMB/VoxCPM
29,985View on GitHub
VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It functions as an AI voice cloning tool and a synthetic voice designer, capable of generating natural speech across global languages and regional dialects using a GPU-accelerated audio generator. The project features a speech model fine-tuning framework that supports both full parameter updates and low-rank adaptation for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and acoustic environment mimicry, as well as the creation of unique vocal identities through text-based voice design. The system provides broad capabilities for speech generation, including context-aware prosody, non-verbal cue insertion, and multi-speaker dialogue. It includes professional audio processing utilities for denoising and upsampling reference clips, as well as a high-throughput API server with streaming output and an OpenAI-compatible interface. The software supports deployment across various hardware backends, including CUDA, MPS, and CPU, and can be deployed via containers.
This is a specialized text-to-speech and voice synthesis server rather than a general-purpose LLM inference engine for text-based language models.
PythonContinuous Batching Strategies
View on GitHub29,985

High-Throughput LLM Inference Servers

QwenLM/Qwen

ggml-org/llama.cpp

microsoft/BitNet

thu-pacman/chitu

deepseek-ai/DeepSeek-Coder

mudler/LocalAI

unslothai/unsloth

intel/neural-compressor

QwenLM/Qwen3

bigscience-workshop/petals

vllm-project/llm-compressor

nomic-ai/gpt4all

karpathy/nanochat

h2oai/h2o-3

BlockRunAI/ClawRouter

huggingface/transformers

OpenBMB/VoxCPM