What are the best open-source alternatives to FlexGen?

30 open-source projects similar to fminference/flexgen, ranked by shared features. Top picks: tiiny-ai/powerinfer, sjtu-ipads/powerinfer, microsoft/deepspeed, fminference/flexllmgen, openbmb/minicpm, vllm-project/llm-compressor, lostruins/koboldcpp, qwenlm/qwen-image, internlm/lmdeploy, openvinotoolkit/openvino.

Is tiiny-ai/powerinfer a good alternative to FlexGen?

PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes i…

Is sjtu-ipads/powerinfer a good alternative to FlexGen?

PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices. The system distingu…

Is microsoft/deepspeed a good alternative to FlexGen?

DeepSpeed is a distributed deep learning optimization library and framework designed for the training and inference of massive AI models. It serves as a model parallelism orchestrator and a toolkit for scaling large language models across multiple GPUs and compute nodes. The project distinguishes…

Is fminference/flexllmgen a good alternative to FlexGen?

FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden…
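As a rough sanity check on the "approximately 70%" figure, the sketch below estimates the saving from group-wise 4-bit quantization relative to 16-bit weights. The group size of 64 and the two 16-bit metadata values per group are assumptions made for illustration; this page does not state them, and `compressed_fraction` is a hypothetical helper, not part of FlexLLMGen.

```python
def compressed_fraction(bits: int = 4,
                        group_size: int = 64,
                        meta_bits: int = 32,
                        baseline_bits: int = 16) -> float:
    """Fraction of the baseline size left after group-wise quantization.

    Assumed layout (illustrative only): each group of `group_size` weights
    stores `bits` per weight plus `meta_bits` of shared metadata
    (for example, a 16-bit minimum and a 16-bit scale).
    """
    bits_per_weight = bits + meta_bits / group_size
    return bits_per_weight / baseline_bits


saving = 1 - compressed_fraction()
print(f"Estimated weight-memory saving: {saving:.1%}")
```

Under these assumptions each weight costs 4.5 bits instead of 16, which is a saving of a little over 70%. A different group size or metadata format would shift the number slightly.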

Is openbmb/minicpm a good alternative to FlexGen?

MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project di…

Is vllm-project/llm-compressor a good alternative to FlexGen?

llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itsel…

Is lostruins/koboldcpp a good alternative to FlexGen?

KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted mo…

Is qwenlm/qwen-image a good alternative to FlexGen?

Qwen-Image is a text-to-image model and large language model image generation framework. It functions as an AI image editing suite and a personalized image trainer, capable of producing high-fidelity visuals and accurate typography from natural language descriptions. The system is distinguished by…

Is internlm/lmdeploy a good alternative to FlexGen?

lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model servi…

Is openvinotoolkit/openvino a good alternative to FlexGen?

OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specia…


Open-source alternatives to FlexGen

30 open-source projects similar to fminference/flexgen, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best FlexGen alternative.

Tiiny-AI/PowerInfer
8,714 stars
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
C++ · large-language-models, llama, llm
SJTU-IPADS/PowerInfer
9,568 stars
PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices. The system distinguishes itself through neuron-activation-based offloading, using a predictor model to preload frequent neurons into VRAM while keeping rare neurons in system memory. This hybrid execution model balances workloads between the GPU and CPU based on input patterns to optimize memory access and increase tok
C++
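The hot/cold neuron placement described above can be illustrated with a small sketch: given an activation frequency per neuron and a limit on how many neurons fit in VRAM, the most frequently activated neurons go to the GPU and the rest stay in system memory. This is a simplified greedy split written for illustration. The function name, the frequency table, and the slot budget are all hypothetical, and PowerInfer's actual placement and runtime predictor are more involved.

```python
from typing import Dict, List, Tuple


def split_hot_cold(activation_freq: Dict[int, float],
                   gpu_slots: int) -> Tuple[List[int], List[int]]:
    """Greedy split: the `gpu_slots` most frequently activated neurons
    are marked hot (GPU), all others cold (CPU). Illustrative only."""
    # Rank neuron ids from most to least frequently activated.
    ranked = sorted(activation_freq, key=lambda n: activation_freq[n], reverse=True)
    hot = sorted(ranked[:gpu_slots])
    cold = sorted(ranked[gpu_slots:])
    return hot, cold


# Hypothetical profiling result: neuron id -> fraction of inputs that activate it.
freq = {0: 0.91, 1: 0.02, 2: 0.45, 3: 0.88, 4: 0.05}
hot, cold = split_hot_cold(freq, gpu_slots=2)
print("GPU (hot):", hot)
print("CPU (cold):", cold)
```

With the sample table, neurons 0 and 3 are placed on the GPU and the remaining three stay on the CPU.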
microsoft/DeepSpeed
42,533 stars
DeepSpeed is a distributed deep learning optimization library and framework designed for the training and inference of massive AI models. It serves as a model parallelism orchestrator and a toolkit for scaling large language models across multiple GPUs and compute nodes. The project distinguishes itself through 3D parallelism orchestration, which combines data, pipeline, and tensor parallelism. It utilizes ZeRO-based memory partitioning to eliminate redundant storage and employs CPU-offload memory management to move weights and optimizer states to system RAM. Additionally, it provides special
Python

Open-source alternatives to FlexGen

Tiiny-AI/PowerInfer

SJTU-IPADS/PowerInfer

microsoft/DeepSpeed

FMInference/FlexLLMGen

OpenBMB/MiniCPM

vllm-project/llm-compressor

LostRuins/koboldcpp

QwenLM/Qwen-Image

InternLM/lmdeploy

openvinotoolkit/openvino

NVIDIA/TensorRT-LLM

ggerganov/llama.cpp

sgl-project/sglang

vllm-project/vllm

intel/ipex-llm

intel/neural-compressor

comfyanonymous/ComfyUI

afshinea/stanford-cme-295-transformers-large-language-models

Infrasys-AI/AISystem

zhaochenyang20/Awesome-ML-SYS-Tutorial

huggingface/peft

meta-llama/llama3

NVIDIA/FasterTransformer

pytorch-labs/gpt-fast

Infrasys-AI/AIInfra

huggingface/text-embeddings-inference

NVlabs/Sana

NVIDIA/personaplex

nunchaku-ai/ComfyUI-nunchaku

philschmid/deep-learning-pytorch-huggingface