Moshi

Moshi is a real-time voice foundation model and speech-to-speech framework designed for bidirectional, low-latency conversations. It functions as a full-duplex voice interface that processes audio and text concurrently in a single stream, enabling natural human-machine dialogue without sequential processing delays.

The system uses a neural audio codec to compress high-fidelity audio into low-bitrate tokens for efficient transmission. To handle complex responses and reasoning, it uses internal monologue modeling, which generates a hidden stream of thought tokens alongside the audible speech.
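To make "low-bitrate tokens" concrete, the following is a minimal arithmetic sketch of how the bitrate of a codec token stream can be estimated and compared with uncompressed PCM audio. The function names are hypothetical, and the numeric values (12.5 frames per second, 8 codebooks of 2048 entries, 24 kHz 16-bit mono input) are illustrative assumptions that do not come from this page.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a token stream: frames/s * tokens/frame * bits/token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# Illustrative values only (assumptions, not taken from this page).
token_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, codebook_size=2048)
raw_bps = pcm_bitrate_bps(sample_rate_hz=24_000, bit_depth=16)

print(f"token stream: {token_bps:.0f} bit/s")
print(f"raw PCM:      {raw_bps} bit/s")
print(f"ratio:        {raw_bps / token_bps:.0f}x")
```

Under these assumed values the token stream needs roughly 1.1 kbit/s against 384 kbit/s for the raw signal, which is why tokenized audio is practical to stream in real time.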

The project includes a quantized inference server and a hardware-agnostic backend that supports various environments, including Apple silicon and production GPUs. Operational capabilities cover multimodal tokenization, asynchronous batch processing, and deployment options such as containerization, secure local tunneling, and a web-based interaction interface.

A command-line interaction client is provided for sending and receiving data from an active inference server.
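The full-duplex pattern such a client relies on, sending and receiving at the same time instead of taking turns, can be sketched as follows. This is not the project's client: it assumes in-memory asyncio queues in place of a network connection and an echo task in place of the inference server, and every name in it (send_frames, receive_frames, echo_server, run_client) is hypothetical.

```python
import asyncio


async def send_frames(outbound: asyncio.Queue, frames: list) -> None:
    """Push outgoing frames without waiting for replies."""
    for frame in frames:
        await outbound.put(frame)
        await asyncio.sleep(0)  # yield so the receiver can run in between
    await outbound.put(None)  # sentinel: end of stream


async def receive_frames(inbound: asyncio.Queue, received: list) -> None:
    """Collect incoming frames while sending is still in progress."""
    while True:
        frame = await inbound.get()
        if frame is None:
            break
        received.append(frame)


async def echo_server(outbound: asyncio.Queue, inbound: asyncio.Queue) -> None:
    """Stand-in for the inference server (assumption): echoes each frame back."""
    while True:
        frame = await outbound.get()
        await inbound.put(frame)
        if frame is None:
            break


async def run_client(frames: list) -> list:
    outbound: asyncio.Queue = asyncio.Queue()
    inbound: asyncio.Queue = asyncio.Queue()
    received: list = []
    # All three tasks run concurrently: this is the full-duplex part.
    await asyncio.gather(
        send_frames(outbound, frames),
        receive_frames(inbound, received),
        echo_server(outbound, inbound),
    )
    return received


if __name__ == "__main__":
    replies = asyncio.run(run_client([b"frame-0", b"frame-1", b"frame-2"]))
    print(replies)
```

Running the send and receive loops as separate tasks is what lets replies arrive before the outgoing stream has finished, as opposed to a request-response loop that blocks on each exchange.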

Features

Full-Duplex Multimodal Interaction - Processes simultaneous audio and text streams in a full-duplex architecture for real-time conversation.
Speech-to-Speech Frameworks - Supports bidirectional, low-latency speech-to-speech and text-to-speech conversations.
Neural Audio Compression - Uses neural codecs to compress high-fidelity audio into low-bitrate tokens for efficient real-time transmission.
