MiniCPM

MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks.

The project is heavily optimized for edge hardware: it uses quantized weights in GGUF and MLX formats to reduce memory overhead, and inference techniques such as speculative sampling and radix-tree prefix caching to increase generation speed and throughput.
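
To make the prefix-caching idea concrete, here is a minimal sketch of how a cache keyed by token IDs can report how much of a new prompt has already been processed. It is an illustration only: it uses a plain, uncompressed token trie rather than a true radix tree (which would also collapse single-child chains into one edge), and the names `TrieNode`, `PrefixCache`, `insert`, and `longest_prefix` are hypothetical, not taken from the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    """One node per token; `cached` marks that state for this prefix is stored."""
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    cached: bool = False


class PrefixCache:
    """Hypothetical prefix cache keyed by token IDs (uncompressed trie)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        """Record that every prefix of `tokens` has been computed."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.cached = True

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None or not child.cached:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # made-up token IDs
cache.insert(system_prompt + [5, 6])     # first request: prompt plus a user turn

# A second request shares the system prompt but continues differently,
# so only the shared prefix can be reused.
reused = cache.longest_prefix(system_prompt + [9, 9, 9])
print(f"tokens reusable from cache: {reused}")
```

In a real serving engine each cached node would point at stored attention key/value state, so only the unmatched suffix of the prompt needs a forward pass.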

Capabilities span the full model lifecycle, including supervised fine-tuning and preference optimization via parameter-efficient LoRA adapters. The system supports structured tool calling for external agent integration and offers several serving options, including OpenAI-compatible APIs, REST endpoints, and a command-line interface.
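
As a rough illustration of what a LoRA adapter contains and how it can be folded into the base weights for serving, the sketch below uses the standard LoRA formulation, W' = W + (alpha / r) * B * A, on tiny hand-written matrices. It is written with plain Python lists so it runs without dependencies. The helper names (`matmul`, `matvec`, `merge_lora`) and the example values are assumptions for illustration; the project's own tooling may scale or store adapters differently.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a column vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)                      # A has shape (r x in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)          # (out x r) @ (r x in) -> (out x in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
base_w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]                     # out_features x rank
lora_a = [[0.5, 0.0]]                       # rank x in_features
alpha = 2.0

merged = merge_lora(base_w, lora_b, lora_a, alpha)

# Unmerged path: base output plus the scaled low-rank correction.
x = [3.0, 4.0]
low_rank = matvec(lora_b, matvec(lora_a, x))
unmerged_out = [b + (alpha / len(lora_a)) * d for b, d in zip(matvec(base_w, x), low_rank)]
merged_out = matvec(merged, x)
print(merged, unmerged_out, merged_out)
```

The merged matrix gives the same output as the base-plus-adapter path with a single matrix multiply, which is why merging suits standalone deployment, while keeping the adapter separate allows swapping adapters at runtime.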

The implementation includes tools for converting model checkpoints between formats and distributing training workloads across multiple GPUs.

Features

Local and On-Device Inference - Provides a collection of models and optimizations specifically for local and on-device inference across GPUs, CPUs, and Apple Silicon.

Quantized Model Deployments - Deploys quantized models on laptops using a single binary without requiring Python.

Weight Quantization - Reduces weight precision using GGUF and MLX formats to enable execution on resource-constrained edge devices.

Hybrid Global-Local Attention - Combines sparse and linear attention to process million-token windows efficiently.

Generative Text Inference - Provides systems for producing text outputs from language models using sampling parameters and prompt inputs.

Inference Acceleration - Increases decoding speed using speculative sampling and specialized kernels to reduce latency.

Inference Optimizations - Accelerates token generation using speculative sampling with draft models.

Language Model Fine-Tuning - Provides frameworks for adjusting pre-trained language models using supervised fine-tuning and preference optimization.

Large Language Model Fine-Tuning - Implements full-parameter and parameter-efficient methods to adapt large language models to specific tasks.

Native Tool Call Parsers - Translates raw model output into structured function calls using a native parser.

Edge AI Model Deployment - Allows deployment of dense transformer models to resource-constrained edge devices using GGUF artifacts.

Model Inference - Processes natural language and vision tasks using dedicated inference engines.

Inference Optimizations - Increases generation speed and throughput via speculative sampling, prefix caching, and hardware-specific runtimes.

Model Inference Optimizations - Optimizes text generation speeds and manages sequence lengths for efficient deployment on constrained hardware.

Model Fine-Tuning - Supports supervised fine-tuning and preference optimization to adapt models to specific tasks.

Fine-tuned Model Deployment - Supports loading lightweight LoRA adapters onto base models or merging them for efficient serving.

Model Format Converters - Translates standard model checkpoints into optimized formats for compatibility with local runtimes.

GGUF Format Conversions - Converts custom model weights into quantized GGUF files for efficient local execution.

Compressed Model Formats - Supports loading models in compressed GGUF and MLX formats to reduce memory overhead.

GGUF Weight Quantization - Transforms trained weights into quantized GGUF formats optimized for local inference.

LoRA Adapter Loaders - Provides mechanisms to load parameter-efficient LoRA adapters at runtime to apply specialized training.

Model Weight Utilities - Transforms proprietary framework checkpoints into standard adapters for common library interoperability.

Parameter-Efficient Adaptation - Utilizes parameter-efficient LoRA adapters to customize model behavior without modifying the core architecture.

Tool Calling - Emits structured tool calls that allow external agents to execute functions.

MLX Format Conversions - Transforms model checkpoints into optimized 4-bit quantized MLX formats for Apple Silicon.

Hardware Optimized Inference - Supports a variety of runtimes and GPU offloading strategies to balance performance and memory across consumer hardware.

Local LLM Execution - Executes causal language model generation on local GPU or CPU environments using standard architectures.

Local Model Deployment - Enables running dense language models on local hardware using optimized formats like GGUF, MLX, and Safetensors.

Large Language Model Deployments - Runs dense Transformer models on resource-constrained hardware using standard architectures.

Small Language Models - Focuses on the deployment of small language models specifically designed for local, resource-constrained environments.

Model Tool Calls - Maps model outputs to executable fields to trigger automated external workflows.

Multi-turn Interaction Managers - Manages stateful multi-turn conversational sessions where the model maintains context across prompts.

OpenAI-Compatible APIs - Provides standard HTTP endpoints for interacting with the model in an OpenAI-compatible format.

Batch Inference Engines - Provides capabilities to process multiple prompts simultaneously to achieve high throughput for offline evaluation.

Context Window Management - Allows the context length, in tokens, to be configured by adjusting embedding and position settings to fit available memory.

Distributed Training - Scales training across multiple GPUs by launching distributed processes and integrating with DeepSpeed.

Distributed Training Scaling Utilities - Distributes training workloads across multiple graphics cards using DeepSpeed ZeRO to handle larger models.

High Throughput Inference - Uses specialized serving engines and techniques to maximize the processing speed of model queries at scale.

Chat Model Interfaces - Implements an interactive chat interface allowing back-and-forth conversations with response interruption via CLI.

LLM Tool Calling - Emits structured tool calls that can be parsed and executed by external agents for automated workflows.

Local Model Servers - Provides a local server that exposes the model via network endpoints for chat interfaces.

Long Context Processing - Combines sparse and linear attention with hybrid embeddings to handle million-token windows.

Weight Merging Utilities - Integrates trained adapter matrices directly into base model weights so the adapter adds no extra latency at inference time.

Speculative Decoding Strategies - Employs speculative decoding with a smaller draft model to accelerate token generation.

Inference Configuration Parameters - Provides adjustable sampling parameters and prompt formatting to control model output behavior.

Quantized Fine-Tuning - Optimizes training on consumer GPUs using 4-bit quantization to handle long contexts within limited VRAM.

Model Adapters - Includes utilities to transform proprietary checkpoints into standard model adapters for inference compatibility.

Backend Runtimes - Provides pluggable backend runtimes to adapt model execution across different hardware architectures.

Hardware-Agnostic Deployment - Migrates and adapts models to diverse AI chip architectures using a unified stack.

Sampling Parameter Tuning - Adjusts temperature and top-p parameters to balance concise responses and expanded reasoning.

Model Serving APIs - Exposes local machine learning models as network-accessible services via REST APIs.

On-Device Models - Powers local applications such as desktop companions using lightweight models optimized for edge devices.

Prefix Caching - Uses a radix-tree prefix cache to store common prompt prefixes and avoid redundant computations.

Reasoning Mode Controllers - Allows users to switch between fast and deliberate reasoning styles via specific query tokens.

Tensor Parallelism - Implements tensor parallelism to split model computations across multiple GPUs or CPU nodes.

Dialogue Adaptation - Supports training models for multi-turn dialogue using role-based masks to calculate loss across interactions.

Command Line Interfaces - Provides a single binary CLI for running language models without needing Python or CUDA installations.

Graphical User Interfaces - Ships a desktop GUI for interacting with local models through a graphical interface.

Model Adaptation and Merging - Combines trained LoRA adapters into the base model to create a single standalone model.

Adapter Merging - Combines learned LoRA weights back into the base model to create a standalone deployment.

Apple Silicon Inference - Optimizes model execution specifically for Apple M-series hardware using specialized tensor frameworks.

OpenAI-Compatible Servers - Implements a local server following the OpenAI API specification for broad client compatibility.

Large Language Models - Offers a compact conversational model intended for efficient deployment.

Text LLM Models - Provides an efficient 2.4B-parameter model designed for edge device deployment.
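
Several entries above refer to temperature and top-p sampling parameters. The following sketch shows, under simplifying assumptions, how these two settings interact: temperature rescales the logits before the softmax, and top-p (nucleus) filtering then keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold. The function names and the toy logits are hypothetical and do not come from the project; real runtimes operate on full vocabulary tensors.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())             # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}

# Same top-p, two temperatures: the lower temperature sharpens the
# distribution, so fewer candidate tokens survive the filter.
kept_warm = top_p_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.8)
kept_cool = top_p_filter(softmax_with_temperature(toy_logits, 0.5), top_p=0.8)
print(sorted(kept_warm), sorted(kept_cool))
```

A sampler would then draw the next token from the filtered distribution, for example with `random.choices(list(kept), weights=list(kept.values()))`.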

