What are the best open-source alternatives to FlexLLMGen?

30 open-source projects similar to fminference/flexllmgen, ranked by shared features. Top picks: fminference/flexgen, llm-d/llm-d, modeltc/lightllm, hanxiao/bert-as-service, huggingface/text-embeddings-inference, pytorch/serve, infrasys-ai/aisystem, internlm/lmdeploy, lightning-ai/litserve, lmstudio-ai/lms.

Is fminference/flexgen a good alternative to FlexLLMGen?

FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the ex…

Is llm-d/llm-d a good alternative to FlexLLMGen?

llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager…

Is modeltc/lightllm a good alternative to FlexLLMGen?

LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is di…

Is hanxiao/bert-as-service a good alternative to FlexLLMGen?

This project is a high-performance BERT embedding service and inference server designed to map text sequences into fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system utilize…

Is huggingface/text-embeddings-inference a good alternative to FlexLLMGen?

Text Embeddings Inference is a high-performance inference server designed to host text embedding and sequence classification models as scalable API endpoints. It provides a vector embedding API to convert text into dense representations and a cross-encoder reranking server for scoring the relevance…

Is pytorch/serve a good alternative to FlexLLMGen?

This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production via scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware…

Is infrasys-ai/aisystem a good alternative to FlexLLMGen?

AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workl…

Is internlm/lmdeploy a good alternative to FlexLLMGen?

lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model servi…

Is lightning-ai/litserve a good alternative to FlexLLMGen?

LitServe is a Python AI inference server framework and LLM serving framework designed for high-concurrency inference. It functions as a distributed AI model server and dynamic batching inference engine, providing the tools to build and host custom servers that run AI models. The framework distingu…

Is lmstudio-ai/lms a good alternative to FlexLLMGen?

This project is a headless large language model inference engine and server manager designed for local deployments. It provides a developer toolkit and API gateway that allows for the management of model lifecycles and inference tasks without a graphical user interface. The system enables the depl…


Open-source alternatives to FlexLLMGen

30 open-source projects similar to fminference/flexllmgen, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best FlexLLMGen alternative.

FMInference/FlexGen
Python · 9,366 stars
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed…
llm-d/llm-d
Shell · 2,514 stars
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov…
ModelTC/LightLLM
Python · 3,901 stars · Tags: deep-learning, gpt, llama
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula…

Open-source alternatives to FlexLLMGen

FMInference/FlexGen

llm-d/llm-d

ModelTC/LightLLM

hanxiao/bert-as-service

huggingface/text-embeddings-inference

pytorch/serve

Infrasys-AI/AISystem

InternLM/lmdeploy

Lightning-AI/LitServe

lmstudio-ai/lms

NVIDIA/triton-inference-server

predibase/lorax

RLinf/RLinf

skyzh/tiny-llm

Tiiny-AI/PowerInfer

turboderp/exllamav2

turboderp-org/exllamav2

QwenLM/Qwen-Image

kubeflow/kfserving

Lightning-AI/lit-llama

QwenLM/Qwen2.5-Omni

bigscience-workshop/petals

gpustack/gpustack

cumulo-autumn/StreamDiffusion

collabora/WhisperLive

alirezadir/Production-Level-Deep-Learning

kserve/kserve

microsoft/DeepSpeedExamples

baichuan-inc/Baichuan-7B

sgl-project/mini-sglang