30 open-source projects similar to kvcache-ai/ktransformers, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best KTransformers alternative.
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments. The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
ChatGLM3 is a comprehensive framework for deploying, fine-tuning, and serving large language models. It functions as a high-performance inference engine designed to support conversational AI, enabling developers to build interactive agents capable of multi-turn dialogue, autonomous code execution, and structured tool invocation. The project distinguishes itself through its focus on hardware-agnostic deployment and resource optimization. It supports distributed model parallelism across multiple graphics cards, paged key-value caching for concurrent request processing, and weight quantization t
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
LMDeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multimodal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
AirLLM is a framework designed to execute and fine-tune large language models on consumer-grade hardware. By employing layer-wise model decomposition and memory-efficient loading techniques, the engine enables the operation of massive models that would otherwise exceed available system or video memory. The project distinguishes itself through a suite of optimization strategies that balance memory footprint with performance. It utilizes block-wise weight quantization and asynchronous layer prefetching to reduce resource consumption and hide data transfer latency. Additionally, the framework su
FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications. The platform distinguishes itself through a distributed model controller that manages worker nodes and routes requests across a hardware-agnostic inference layer supporting various accelerators. It includes a dedicated evaluation framework for assessing model quality using automated judges, multi-turn di
This project is a comprehensive framework for the training, fine-tuning, and deployment of large language models. It functions as a distributed deep learning platform that enables users to scale model workflows across multiple hardware nodes while providing tools for model evaluation and performance benchmarking. The platform distinguishes itself by offering specialized utilities for model compression and weight transformation, allowing users to reduce memory footprints and latency through quantization and pruning. It supports the adaptation of large models for consumer-grade hardware, facili
BitNet is a quantized inference engine designed to execute highly compressed language models by performing arithmetic on low-precision, bit-level weight data. It functions as a model optimization toolkit and a high-performance kernel library, enabling the execution of large language models on consumer hardware by reducing memory footprints and increasing processing speeds. The project distinguishes itself through hardware-specific kernel optimizations that leverage native processor instructions to accelerate matrix multiplication. By utilizing packed integer arithmetic and memory-aligned weig
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and fine-tuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient fine-tuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images. The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
OpenLLM is a framework for deploying, managing, and scaling open-source large language models.
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-perfor
This project is a platform for the deployment of open source large language and multimodal models. It provides a unified interface to serve text, image, and speech models across local or cloud hardware. The system enables distributed AI inference by orchestrating model workloads across multiple nodes and devices. It includes a unified API adapter layer to standardize inputs and outputs, as well as tools for multimodal chat and structural image generation. The platform covers a broad capability surface including request batching for throughput optimization, dynamic model loading, and integrat
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
MLC LLM is a machine learning compiler and inference engine designed to execute large language models locally across diverse hardware platforms, including desktop, mobile, and web environments. By utilizing machine learning compilation, the project transforms high-level model definitions into specialized, hardware-specific binary libraries. This process optimizes model weights and generates compute kernels tailored to the unique memory and processing characteristics of target graphics and mobile hardware. The engine distinguishes itself by providing a unified runtime abstraction that enables
This project is a comprehensive toolkit designed for the full lifecycle management of large language and multimodal models. It functions as a unified orchestrator that handles the entire development process, ranging from dataset preparation and supervised fine-tuning to advanced reinforcement learning alignment and production-ready inference deployment. The platform distinguishes itself through a specialized reinforcement learning library that supports complex optimization algorithms, including group relative policy optimization and leave-one-out techniques, to improve model instruction-follo
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
WeClone is an end-to-end framework designed for the creation, training, and deployment of personalized conversational AI digital twins. By fine-tuning large language models on individual chat history, the platform enables the replication of unique communication styles, speech patterns, and conversational habits. The system manages the entire lifecycle of these digital avatars, from initial data preparation to final integration into messaging platforms for real-time interaction. The platform distinguishes itself through a comprehensive suite of data processing utilities that prepare raw messag
Metaseq is a transformer sequence modeling toolkit designed for training, fine-tuning, and deploying sequence-to-sequence models using open pre-trained weights. It provides a comprehensive framework for large language model training, including dedicated tools for sequence dataset processing and a standalone inference server for generating text via API requests. The project features specialized utilities for model quantization to reduce parameter precision to eight bits, which lowers memory usage and increases inference speed. It also includes a checkpoint conversion pipeline to transform mode
Llama 3 is a collection of pretrained, autoregressive transformer-based models designed for natural language generation, reasoning, and complex instruction following. It functions as a generative AI framework that provides the infrastructure for managing model weights, executing neural network inference, and handling computational workloads across diverse knowledge domains. The project distinguishes itself through an integrated AI safety toolkit that employs secondary classification filtering to inspect inputs and outputs, ensuring adherence to usage compliance and safety standards. It suppor
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
This repository serves as a comprehensive collection of reference implementations for the PyTorch machine learning library. It provides practical examples for building, training, and deploying deep learning models, functioning as a toolkit for developers to explore neural network architectures and training workflows. The project distinguishes itself by offering concrete demonstrations of complex machine learning operations, ranging from computer vision tasks like object detection and depth estimation to the training of large-scale transformer models. These examples illustrate how to implement
vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token generation speed and memory efficiency, enabling both large-scale cloud deployments and local execution on personal hardware. The project distinguishes itself through advanced memory management and request scheduling techniques, most notably its use of non-contiguous key-value cach
LiteRT is a runtime and API for executing machine learning and generative AI models on mobile, desktop, and IoT hardware. It consists of an inference engine and a specialized environment for running quantized large language and diffusion models locally on edge hardware. The system includes an ahead-of-time model compiler that translates models into hardware-specific bytecode to reduce startup latency and memory overhead. It provides a unified interface for Neural Processing Units with automatic fallback routing to CPUs or GPUs when specific subgraph support is unavailable. An edge model conve
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The project distinguishes itself by offering a low-code visual dashboard that enables users to configure experiments and monitor performance metrics in real time without writing extensive custom scripts. It also features a configuration-driven orchestration system that decouples experim