Mlx Audio | Awesome Repository

mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and a web interface for running audio generation and transcription tasks, so existing tools built against OpenAI's endpoint structure can use it as a drop-in replacement.
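
Because the server follows the OpenAI endpoint structure, a client written for that API should in principle only need a different base URL. The following is a minimal sketch, assuming the server listens on http://localhost:8000 and mirrors OpenAI's /v1/audio/speech route. The host, port, model name, and voice are placeholders for illustration, not values confirmed by mlx-audio.

```python
import json
import urllib.request

# Assumption: address of a locally running server; adjust to your setup.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, model="example-tts-model",
                         voice="example-voice", base_url=BASE_URL):
    """Build (but do not send) a POST request shaped like OpenAI's
    /v1/audio/speech endpoint. Model and voice names are hypothetical."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("Hello from a local model.")

# To send it against a running server, something like:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

The request is only constructed here, so the sketch runs without a server. Check the project's own documentation for the actual port, routes, and model identifiers before sending it.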

The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings.

Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.
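
To get a feel for why quantization reduces memory footprint, the arithmetic can be sketched as parameter count times bits per weight. This is a rough illustration only: the parameter count is hypothetical, and real quantized formats add some overhead for scales and biases that the estimate ignores.

```python
def weight_memory_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate storage for the weights alone, ignoring the
    per-group scales and biases that quantized formats also store."""
    return num_params * bits_per_weight // 8


params = 1_000_000_000  # hypothetical 1B-parameter model

fp16_bytes = weight_memory_bytes(params, 16)
int4_bytes = weight_memory_bytes(params, 4)

print(f"16-bit: {fp16_bytes / 1e9:.1f} GB")
print(f" 4-bit: {int4_bytes / 1e9:.1f} GB")
print(f"reduction: {fp16_bytes / int4_bytes:.0f}x")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage by a factor of four. Any speedup in inference depends on the hardware and kernels, so it is not modelled here.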

Features

Speech Processing Toolkits - An audio toolkit built on Apple MLX for speech transcription, text-to-speech, voice cloning, and source separation.
OpenAI-Compatible APIs - Exposes audio processing capabilities through an OpenAI-compatible REST API for drop-in integration.
Audio Source Separation Models - Isolates specific sounds from mixed audio files using natural language text prompts.
Text-Prompted Separators - Isolates specific sounds from mixed audio files using natural language text prompts.
