SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs.
The project distinguishes itself by supporting multiple high-performance inference backends including llama.cpp, vLLM, and Ollama, enabling flexible deployment across consumer CPU and GPU hardware. It features a format-preserving translation pipeline that extracts, translates, and reassembles text from structured formats like ebooks and subtitles while retaining timestamps, line breaks, and markup. The system also supports CPU-GPU hybrid inference for memory-constrained setups, tensor parallel multi-GPU distribution for larger models, and token probability filtering to refine translation precision.
SakuraLLM provides translation capabilities for ebooks, subtitles, visual novels, galgames, RPG Maker games, manga, and plain-text novels. It processes documents by dividing long texts into manageable segments, translating each segment through the language model, and reassembling the output with original formatting intact. The system includes glossary management for maintaining terminology consistency, degeneration detection that monitors token generation and retries with adjusted parameters when output quality degrades, and multi-threaded inference for improved throughput.
The project offers a Docker-based deployment with API authentication and supports running on consumer NVIDIA and AMD GPUs.