Lmdeploy

Features

LLM Deployment Frameworks - Provides a framework for distributing large model services across multiple machines using request distribution and tensor parallelism.
Large Language Model Deployments - Provides the infrastructure and framework necessary to deploy and serve large language models in production environments.
High Throughput Inference - Maximizes requests processed per second using continuous batching, tensor parallelism, and optimized kernels.
Inference Optimization Kernels - Uses advanced execution kernels to increase requests per second and process model data more efficiently.
Kernel Optimizations - Provides specialized low-level CUDA and C++ kernels to accelerate matrix multiplications and attention mechanisms.
Large Language Model Serving - Hosts and exposes large language and vision models via high-performance inference engines.
High-Throughput Model Serving - Implements a high-performance runtime designed to handle large volumes of concurrent inference requests with low latency.
Continuous Batching Strategies - Implements continuous batching strategies to dynamically insert new requests into active inference batches for high hardware utilization.
Model Quantization - Implements techniques and tools for reducing model memory footprint and computational requirements to improve inference performance.
Model Servers - Provides a deployment environment capable of running vision language models that process combined image and text inputs.
Model Serving Interfaces - Provides unified interfaces for deploying and serving multimodal model architectures across diverse hardware environments.
Vision-Language Models - Runs multimodal vision-language models that process combined image and text inputs across multiple accelerators.
KV Cache Quantizers - Reduces memory footprint by converting both high-precision floating-point weights and KV caches to lower-bit formats.
Tensor Parallelism - Splits model weights across multiple GPUs to handle larger models and increase throughput via tensor parallelism.
LLM Inference Optimization - Increases request throughput and reduces latency using tensor parallelism and continuous batching for high-performance serving.
LLM Production Infrastructure - Provides infrastructure for managing the deployment, scaling, and reliability of large language models in production.
Batch Inference Pipelines - Processes large datasets through a local model and returns the results without requiring a running API server.
Distributed Model Orchestration - Manages and balances workloads across multiple machines to serve several models simultaneously to a large user base.
Model Compression Suites - Ships a comprehensive toolkit for reducing model size through weight and cache quantization.
Distributed Deployment Utilities - Deploys multi-model services across multiple machines using a request distribution system to balance workloads.
Multimodal Pipeline Coordinators - Coordinates the flow of image and text data through distinct encoders before processing them in a unified transformer.
Model Compression - Reduces model size and memory requirements using weight and cache quantization to fit models into smaller hardware.
Weight Quantization - Decreases memory usage and increases speed by applying quantization to model weights and caches.
Inference Load Balancers - Distributes inference tasks across a cluster of machines to prevent bottlenecks and maximize resource utilization.
Inference and Serving - Toolkit for compressing and serving large models.
Inference Engines - Toolkit for compressing and serving models.
Inference Frameworks - Distributed inference toolkit supporting quantization and multiple API interfaces.
Model Serving & Deployment - Compresses and deploys LLMs for production.
Inference Frameworks - Framework for quantization, inference, and serving of LLMs and VLMs.
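
The continuous batching feature listed above (inserting new requests into an active batch as slots free up) can be illustrated with a toy scheduler. This is a minimal sketch and not Lmdeploy's implementation: the function names, the `(request_id, tokens_to_generate)` request format, and the one-token-per-step cost model are all assumptions made for illustration.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Toy simulation: return {request_id: decode step at which it finished}.

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Hypothetical model: every active request emits one token per step.
    """
    if any(n < 1 for _, n in requests):
        raise ValueError("each request must generate at least one token")
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Refill free slots before every step instead of waiting
        # for the whole batch to drain.
        while waiting and len(active) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            active[req_id] = n_tokens
        step += 1
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:
                del active[req_id]
                finished_at[req_id] = step
    return finished_at


def static_batching(requests, max_batch_size):
    """Baseline for comparison: a batch runs until its longest request ends."""
    finished_at = {}
    step = 0
    for i in range(0, len(requests), max_batch_size):
        batch = requests[i:i + max_batch_size]
        start = step
        step += max(n for _, n in batch)
        for req_id, n_tokens in batch:
            finished_at[req_id] = start + n_tokens
    return finished_at
```

With `requests = [("a", 2), ("b", 6), ("c", 3), ("d", 1)]` and `max_batch_size=2`, the continuous scheduler finishes all four requests in 6 steps, while the static baseline needs 9, because request "c" has to wait for "b" to finish before its batch can start.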

Open-source alternatives to Lmdeploy

Similar open-source projects, ranked by how many features they share with Lmdeploy.

sgl-project/sglang
GitHub stars: 29,079
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Tags: Python, attention, blackwell, cuda
vllm-project/vllm
GitHub stars: 83,048
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cach
Tags: Python, amd, blackwell, cuda
huggingface/text-generation-inference
GitHub stars: 10,775
Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments. The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
Tags: Python, bloom, deep-learning, falcon
ModelTC/LightLLM
GitHub stars: 3,901
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
Tags: Python, deep-learning, gpt, llama

See all 30 alternatives to Lmdeploy

InternLM/lmdeploy
