mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure.
The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings.
Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.