Koboldcpp

KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models.

The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, images, and audio using vision projectors and speech synthesis.

The system includes broad support for hardware acceleration via GPU-layer offloading and multi-GPU tensor splitting to handle large models. It incorporates advanced output control through grammar constraints and phrase banning, as well as grounded retrieval capabilities that connect models to local documents and web search.

The core runtime is implemented in C++ for high-performance memory management and hardware-level optimization.

Features

Narrative Writing Assistants - Provides specialized tools for writing long-form narratives, managing persistent world lore and character memories.

OpenAI-Compatible APIs - Provides a local server with OpenAI-compatible endpoints for integration with third-party applications.

Hardware Acceleration - Offloads model layers to the GPU and optimizes CPU instructions to increase token generation speed.

Local Model Runners - Provides a runtime environment optimized to load and execute quantized GGUF models on local hardware.

Local Inference Engines - Serves as an optimized local inference engine for LLMs with GPU acceleration and OpenAI-compatible API endpoints.

Model API Gateways - Provides translation layers that expose local model capabilities through standardized OpenAI-compatible API endpoints.

Multimodal AI Orchestrators - Functions as a backend that coordinates vision, speech, and language models for unified multimodal interactions.

Narrative State Management - Organizes characters and world information to maintain consistency across long-form storytelling narratives.

Weight Quantization - Uses GGML-based weight quantization to fit large language models into limited system RAM and GPU memory.

Storytelling Backends - Acts as a backend for managing long-form narratives with persistent memory and character state tracking.

Text Generation - Produces text responses using various model formats and sampling techniques to improve output quality.

Grammar-Constrained Samplers - Forces model responses to follow precise syntax or structures using formal grammar-constrained sampling.

Local LLM Tools - Provides the engine and tools for running large language models on personal hardware with GPU acceleration.

C++ Inference Runtimes - Implements the core model execution engine in C++ for high-performance memory and hardware optimization.

Persistent Context Managers - Injects permanent context or keyword-triggered information into prompts to maintain coherence across sessions.

Model Inference APIs - Exposes compatible HTTP endpoints for web services to interact with locally loaded models.

Document Grounding - Anchors AI responses to local documents or real-time web search results for grounded retrieval.

Context Window Extrapolation - Implements scaling techniques to enable language models to process sequences longer than their original training length.

RAG Document Retrieval - Retrieves relevant snippets from uploaded documents using a search engine to provide grounded context for responses.

External Tool Integration - Connects to external servers allowing the AI to interact with system files, databases, and internet search.

Chat Template Configurations - Allows overriding default tokenizer formats using built-in templates or custom definitions to force specific instruct tags.

Speculative Decoding Strategies - Uses a small draft model to predict future tokens that a larger primary model validates for faster generation.

Model Fine-Tuning - Supports steering model behavior and ensuring structured output through LoRA adapters and grammar constraints.

Multi-GPU Distribution - Partitions model tensors across multiple graphics cards to execute models that exceed a single GPU's memory.

Dynamic Model Reloading - Enables changing models and configurations at runtime via an admin interface without requiring a restart.

LoRA Adapter Loaders - Implements mechanisms for applying low-rank adaptation weights to base models during runtime.

KV Cache Management - Stores processed prompt states in memory to avoid redundant computations when switching conversation contexts.

Distributed Tensor Sharding - Partitions model tensors across multiple graphics cards to enable the execution of very large models.

Model Layer Offloading - Distributes specific model layers between the CPU and GPU to accelerate inference based on hardware availability.

Streaming APIs - Sends output to clients incrementally using polled-streaming or server-sent events for real-time visualization.

Token Blacklists - Prevents the generation of specific words or symbols by triggering text regeneration when banned phrases appear.

KV Cache Snapshotting - Saves cache snapshots to memory to avoid redundant computations when switching between different conversation contexts.

Inference and Serving - Single-file runner for GGUF models.

Inference Engines - Easy-to-use GGUF model runner with integrated UI.

Large Language Models - Easy-to-use text generation software for GGML-based models.

Model Deployment - Listed in the “Model Deployment” section of the Llm Course awesome list.

LostRuinskoboldcppFork

Features

Star history