30 open-source projects similar to fminference/flexllmgen, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best FlexLLMGen alternative.
FlexGen is an inference engine for large language models designed for high-throughput execution on single or multiple GPUs. It functions as a framework for managing model execution through a combination of memory offloading, weight compression, and pipeline orchestration. The system enables the execution of models that exceed available GPU memory by moving tensors and caches between GPU memory, system RAM, and disk storage. It utilizes 4-bit weight quantization to reduce the memory footprint of model parameters, allowing for increased batch processing capacity. The project covers distributed
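The offloading idea in the FlexGen entry above (spilling tensors from GPU memory to system RAM and then to disk) can be illustrated with a small greedy sketch. This is not the project's actual placement policy: the `Tier` class, the `place_tensors` function, and all sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """A hypothetical memory tier with a fixed capacity in gigabytes."""
    name: str
    capacity_gb: float
    used_gb: float = 0.0


def place_tensors(tensor_sizes_gb, tiers):
    """Assign each tensor to the first (fastest) tier that still has room."""
    placement = {}
    for tensor_name, size in tensor_sizes_gb.items():
        for tier in tiers:
            if tier.used_gb + size <= tier.capacity_gb:
                tier.used_gb += size
                placement[tensor_name] = tier.name
                break
        else:
            # No tier had enough free space for this tensor.
            raise MemoryError(f"no tier can hold {tensor_name} ({size} GB)")
    return placement


# Illustrative numbers only: tiers are ordered fastest to slowest.
tiers = [Tier("gpu", 16.0), Tier("cpu", 64.0), Tier("disk", 1000.0)]
sizes = {"layer0": 10.0, "layer1": 10.0, "kv_cache": 60.0, "layer2": 5.0}
placement = place_tensors(sizes, tiers)
print(placement)
```

With these made-up sizes, the second layer spills to system RAM and the cache spills to disk, while the smaller third layer still fits on the GPU.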
llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization. The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
This project is a high-performance BERT embedding service and inference server designed to map text sequences into fixed-length numerical vectors. It functions as a machine learning microservice and distributed model server that decouples request handling from heavy computation. The system utilizes a ZeroMQ messaging infrastructure to provide low-latency communication between distributed clients and the inference server. It incorporates server-side batch processing and GPU workload scaling to maximize hardware utilization and manage high request volumes. The platform supports semantic search
Text Embeddings Inference is a high-performance inference server designed to host text embedding and sequence classification models as scalable API endpoints. It provides a vector embedding API to convert text into dense representations and a cross-encoder reranking server for scoring the relevance of document sequences against a query. The project features a GPU-accelerated inference engine that utilizes dynamic batching and specialized kernels to maximize throughput. It offers a high-performance binary interface via gRPC as an alternative to standard HTTP to reduce network latency and seria
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production via scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system distinguishes itself through advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and employing dynamic batching to improve throughput and latency. It provides specialized support for generat
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
lmdeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
LitServe is a Python AI inference server framework and LLM serving framework designed for high-concurrency inference. It functions as a distributed AI model server and dynamic batching inference engine, providing the tools to build and host custom servers that run AI models. The framework distinguishes itself through a dynamic-batching request queue that groups individual inference requests into single tensors to maximize GPU throughput. It supports distributed GPU scaling, allowing model workloads to be spread across multiple hardware accelerators to balance compute loads and increase total
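As a rough illustration of the dynamic-batching request queue described in the LitServe entry above, the sketch below collects queued requests until either a batch-size limit or a timeout is reached. It uses only the Python standard library; `collect_batch` and its parameters are hypothetical names, not the framework's API.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, timeout_s):
    """Pull requests until the batch is full or the timeout expires."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            # Nothing else arrived in time; run with what we have.
            break
    return batch


# Five pending requests with a batch limit of four.
pending = queue.Queue()
for i in range(5):
    pending.put({"request_id": i})

first = collect_batch(pending, max_batch_size=4, timeout_s=0.05)
second = collect_batch(pending, max_batch_size=4, timeout_s=0.05)
print(len(first), len(second))
```

The timeout bounds the latency a single request can pay while waiting for a batch to fill, which is the usual trade-off in this kind of scheme.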
This project is a headless large language model inference engine and server manager designed for local deployments. It provides a developer toolkit and API gateway that allows for the management of model lifecycles and inference tasks without a graphical user interface. The system enables the deployment of model engines across different operating systems, cloud environments, or CI pipelines. It includes a command-line interface for bootstrapping development projects and automating the orchestration of loading and unloading model binaries based on specific workflow needs. The toolset covers i
Triton Inference Server is a high-performance AI model inference server and multi-framework model runtime designed for deploying machine learning models across cloud, data center, and embedded edge infrastructure. It serves as an execution engine that allows for the concurrent running of models from various frameworks to optimize hardware utilization. The project features a dynamic batching inference engine that groups individual requests into larger batches to increase total processing throughput. It also provides a model ensemble pipeline, which enables the chaining of multiple models toget
Lorax is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request. The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
tiny-llm is a large language model inference engine and transformer model implementation. It serves as a quantized model runtime and paged key-value cache manager, providing a specialized inference stack optimized for Apple Silicon. The system distinguishes itself through high-throughput execution techniques, including continuous batching and paged attention. It utilizes a paged memory system to eliminate fragmentation during token generation and employs on-the-fly dequantization of compressed weights to reduce the memory footprint during matrix multiplication. The project covers a broad ran
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by exploiting activation sparsity to skip computations for inactive neurons. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API-compatible server for inte
exllamav2 is a high-performance inference library designed for running large language models locally on consumer-grade GPUs. It provides a GPU-accelerated runner and quantization tools to enable model execution without reliance on cloud-based computing services. The project features a quantization utility that compresses models into mixed bitrates between two and eight bits to reduce video RAM requirements. It distinguishes itself through a batched text generator that handles grouped requests and deduplicates cache data to increase throughput. The library covers a broad capability surface in
exllamav2 is a high-performance inference engine and framework for executing large language models locally on consumer-class GPUs. It provides a complete system for local model deployment, including a specialized inference engine and tools for model quantization. The project features a multi-GPU inference framework that distributes workloads across multiple graphics cards to run models that exceed the memory capacity of a single device. It includes a GPU model quantizer capable of converting models into mixed-precision formats between 2 and 8 bits to balance memory usage and accuracy. The en
Qwen-Image is a text-to-image model and large language model image generation framework. It functions as an AI image editing suite and a personalized image trainer, capable of producing high-fidelity visuals and accurate typography from natural language descriptions. The system is distinguished by its precision text rendering engine, which integrates multi-script calligraphy and layout-coherent alphabetic text into images. It provides specialized capabilities for subject identity preservation and consistent subject generation across different poses and viewpoints, alongside a training pipelin
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Lit-llama is a PyTorch-based implementation framework for the LLaMA language model, providing a system for pre-training, fine-tuning, and high-performance inference. It includes a pre-training pipeline for creating foundational language models from scratch and tools for running pretrained weights to generate natural text and predict sequences. The project provides specialized toolkits for parameter-efficient fine-tuning using low-rank adaptation and lightweight adapters. It also includes a quantization library that reduces model memory footprints through four-bit and eight-bit precision to en
Qwen2.5-Omni is a multimodal large language model designed to process and generate content across text, audio, vision, and video. It functions as a real-time speech AI, utilizing an end-to-end architecture to maintain synchronous voice conversations with low-latency responses. The project emphasizes efficiency through quantized edge models, allowing for local inference on mobile hardware and resource-constrained devices. It employs 4-bit weight quantization, CPU-based process offloading, and on-demand weight loading to reduce GPU memory requirements. The system integrates specia
Petals is a decentralized framework and inference engine for running large language models across a peer-to-peer network. It enables the execution of models that exceed the memory of any single machine by splitting computations and model layers across a collaborative swarm of GPUs. The system functions as a collaborative compute network where participants share local GPU resources and host model weights. It supports distributed prompt-tuning to adapt massive models to specific tasks and allows for the establishment of private compute swarms to process sensitive data within restricted, trusted
gpustack is a GPU cluster management platform and LLM inference orchestrator. It functions as a centralized system for pooling and orchestrating graphics processing units across local servers and cloud environments, serving as a heterogeneous compute manager for diverse hardware and software configurations. The system provides a secure AI model deployment gateway that serves models as scalable services using key-based authentication. It includes a GPU resource scheduler that balances workloads across accelerators and coordinates multiple inference engines to map specific AI models to compatib
StreamDiffusion is an interactive generative AI framework and inference engine designed for the low-latency delivery of image and video streams. It provides a real-time Stable Diffusion pipeline for text-to-image and image-to-image generation, enabling the creation of continuous generative image streams with minimized computational delay. The framework optimizes throughput using a pre-computed cache engine and residual-based guidance approximation to reduce the number of required model passes. It further manages GPU load through similarity-based frame skipping, which avoids redundant computat
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
This project is an MLOps architectural guide and framework for designing and deploying deep learning systems into production environments. It provides a structured approach to model inference deployment, ML pipeline orchestration, and the creation of production-level machine learning architectures. The project distinguishes itself through a focus on distributed deep learning and edge AI optimization. It covers methodologies for parallelizing model training across multiple GPUs to handle large datasets and applies techniques like quantization and distillation to reduce model size for embedded
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
DeepSpeedExamples is a collection of reference implementations for training and deploying large scale AI models using the DeepSpeed optimization library. It provides Python code examples for training massive models across multiple GPUs through distributed optimization techniques. The repository includes optimized patterns for deploying and running large language model predictions in production environments. It also serves as a guide for model compression to reduce memory footprints and as a source for performance benchmarks to measure execution speed and resource utilization. The project cov
Baichuan-7B is an open-source 7 billion parameter bilingual Transformer model designed for text generation and few-shot learning across Chinese and English. It is built on a large Transformer architecture trained on a bilingual corpus, enabling it to produce coherent text in both languages from a single model. The model incorporates several optimization techniques that distinguish it from standard large language models. It uses rotary position embeddings that can extrapolate to longer sequences than seen during training, allowing context extension beyond the original 4096-token training lengt
mini-sglang is a collection of tools for large language model inference, serving as an OpenAI-compatible inference server, a memory-efficient prefill engine, and a tensor parallelism runtime. It also functions as a local batch processing engine for offline benchmarking and ablation studies. The project focuses on acceleration and memory management through a KV cache manager that reuses precomputed caches for shared request prefixes. It handles large model workloads by distributing tasks across multiple GPUs and manages peak memory consumption by splitting long input sequences into smaller chu
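The two memory techniques named in the mini-sglang entry above, reusing a cache for a shared request prefix and splitting a long input into smaller chunks, can be sketched together in a few lines. This is a simplified illustration over token-ID lists, not the project's implementation; `shared_prefix_len` and `prefill_chunks` are hypothetical helpers.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count the leading tokens that the new request shares with the cache."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def prefill_chunks(tokens, cached_tokens, chunk_size):
    """Skip the cached prefix, then split the remaining tokens into chunks."""
    start = shared_prefix_len(cached_tokens, tokens)
    return [tokens[i:i + chunk_size] for i in range(start, len(tokens), chunk_size)]


cached = [1, 2, 3, 4]               # tokens whose cache entries already exist
request = [1, 2, 3, 9, 8, 7, 6, 5]  # new request sharing a 3-token prefix
chunks = prefill_chunks(request, cached, chunk_size=2)
print(chunks)  # [[9, 8], [7, 6], [5]]
```

Only the tokens after the shared prefix need to be processed, and processing them in fixed-size chunks caps the peak memory of any single prefill step.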