Moshi is a real-time voice foundation model and speech-to-speech framework designed for bidirectional, low-latency conversations. It functions as a full-duplex voice interface that processes audio and text concurrently in a single stream, enabling natural human-machine dialogue without sequential processing delays.
The system utilizes a neural audio codec to compress high-fidelity audio into low-bitrate tokens for efficient transmission. To manage complex responses and reasoning, it employs internal monologue modeling, which generates a hidden stream of thought tokens alongside audible speech.
The project includes a quantized inference server and a hardware-agnostic backend that supports various environments, including Apple silicon and production GPUs. Operational capabilities cover multi-modal tokenization, asynchronous batch processing, and deployment options such as containerization, secure local tunneling, and a web-based interaction interface.
A command-line interaction client is provided for sending and receiving data from an active inference server.