Open-source tools that utilize machine learning models to isolate vocals and individual instrument tracks from audio.
Spleeter is an AI audio source separation library and deep learning toolkit designed to split mixed music files into individual audio stems, such as vocals and drums. It provides a suite of pretrained models for isolating different instruments and voices from a recording. The toolkit includes capabilities for training and evaluating custom audio separation models using labeled datasets and configuration files. It also features utilities for measuring model performance by comparing separation outputs against reference datasets. The system manages audio processing through spectral representations and uses a custom interface for loading and saving audio data across different storage formats. Exporting separated stems is handled via asynchronous processing.
Spleeter is a Python-based deep learning toolkit that uses pretrained models to isolate vocals, drums, and other instruments from mixed audio files.
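As a rough illustration of the pretrained-model workflow, the sketch below assembles a Spleeter command line without running it. The `spleeter separate` subcommand and its `-p` and `-o` options follow Spleeter's documented CLI; the helper name `build_spleeter_command` and the example paths are hypothetical.

```python
from typing import List

# Pretrained configurations shipped with Spleeter: 2, 4, or 5 stems.
VALID_STEM_COUNTS = (2, 4, 5)


def build_spleeter_command(input_path: str, output_dir: str, stems: int = 2) -> List[str]:
    """Return the argv list for a `spleeter separate` call (hypothetical helper)."""
    if stems not in VALID_STEM_COUNTS:
        raise ValueError(f"stems must be one of {VALID_STEM_COUNTS}, got {stems}")
    return [
        "spleeter", "separate",
        "-p", f"spleeter:{stems}stems",  # pretrained model configuration
        "-o", output_dir,                # directory that receives the stems
        input_path,
    ]


if __name__ == "__main__":
    # Placeholder paths; pass the list to subprocess.run to actually execute it.
    print(" ".join(build_spleeter_command("song.mp3", "output", stems=4)))
```

Returning a list rather than a single string keeps file names with spaces intact when the command is later passed to `subprocess.run`.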
Ultimate Vocal Remover is a desktop application designed for AI-driven audio source separation. It utilizes deep learning models to isolate vocals, drums, and other individual instruments from mixed audio files, providing a utility for professional production and creative editing workflows. The software distinguishes itself by leveraging GPU-accelerated tensor computation to perform complex signal processing tasks, significantly reducing the time required for high-fidelity audio extraction. It incorporates a modular plugin architecture that integrates external utilities to support a wide range of audio file formats, ensuring compatibility across diverse media libraries. Beyond core separation capabilities, the toolkit includes features for modifying audio pitch and tempo to meet specific project requirements. It also supports automated batch processing, allowing users to queue multiple files for sequential handling without manual intervention. The application is distributed as a desktop utility with documentation available for installation and configuration.
This is a comprehensive AI-driven audio source separation tool that supports multi-track stem export, vocal removal, GPU acceleration, and batch processing, all built on a Python-based architecture.
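The batch processing mentioned above can be pictured as a simple sequential queue. This is a minimal sketch and not Ultimate Vocal Remover's actual code: `separate_fn` is a hypothetical stand-in for whatever separation routine is used, and the extension set is an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumed extensions; the real application's supported formats may differ.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def collect_audio_files(folder: Path, extensions: Iterable[str] = AUDIO_EXTENSIONS) -> List[Path]:
    """Return audio files in `folder`, sorted by name for a stable queue order."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )


def run_batch(files: Iterable[Path], separate_fn: Callable[[Path], None]) -> Dict[Path, str]:
    """Process files one at a time; a failure is recorded and the queue continues."""
    results: Dict[Path, str] = {}
    for path in files:
        try:
            separate_fn(path)
            results[path] = "ok"
        except Exception as exc:  # one bad file should not stop the whole batch
            results[path] = f"failed: {exc}"
    return results
```

Recording a status per file, instead of raising on the first error, is one way to let a long queue finish unattended.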
SpleeterGui is a graphical interface for the Spleeter machine learning library. It lets users separate mixed audio files into individual source tracks, such as vocals, drums, and bass, from a desktop application. The project wraps the Spleeter engine, so stem isolation no longer requires command-line tools. The interface includes output directory management to define where processed audio files are saved, and it handles file routing and execution of the underlying separation engine through a desktop window.
This is a graphical interface for the Spleeter engine that provides a user-friendly way to perform AI-based audio source separation and stem extraction, though it lacks the native command-line interface and Python-centric workflow requested.
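The output directory handling can be sketched as a path computation. This assumes Spleeter's default layout, in which each input gets a subfolder named after the file and one audio file per stem; the helper `expected_stem_paths` is hypothetical and is not part of SpleeterGui.

```python
from pathlib import Path
from typing import Dict, List

# Stem names produced by each pretrained Spleeter configuration.
STEMS: Dict[int, List[str]] = {
    2: ["vocals", "accompaniment"],
    4: ["vocals", "drums", "bass", "other"],
    5: ["vocals", "drums", "bass", "piano", "other"],
}


def expected_stem_paths(input_file: str, output_dir: str,
                        stems: int = 2, codec: str = "wav") -> List[Path]:
    """Predict where stems land, assuming the `<output>/<track name>/<stem>.<codec>` layout."""
    if stems not in STEMS:
        raise ValueError(f"unsupported stem count: {stems}")
    track_dir = Path(output_dir) / Path(input_file).stem  # subfolder per input track
    return [track_dir / f"{name}.{codec}" for name in STEMS[stems]]
```

For example, `expected_stem_paths("music/song.mp3", "out")` yields paths ending in `out/song/vocals.wav` and `out/song/accompaniment.wav`.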
mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure. The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings. Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.
This toolkit provides AI-based audio source separation using natural language prompts, making it a functional tool for isolating specific sounds from mixed audio files despite its broader focus on speech processing.
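To show what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) a text-to-speech request in the shape of OpenAI's `/v1/audio/speech` endpoint. It assumes the mlx-audio server exposes that same path; the base URL, model name, and voice name are placeholders, and `build_speech_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_speech_request(base_url: str, text: str, model: str, voice: str) -> Request:
    """Build a POST request following OpenAI's speech endpoint structure."""
    payload = {"model": model, "input": text, "voice": voice}
    return Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",  # assumed path, mirrors OpenAI
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host, model, and voice; send with urllib.request.urlopen(req).
    req = build_speech_request("http://localhost:8000", "Hello", "example-model", "example-voice")
    print(req.get_method(), req.full_url)
```

Because only the base URL changes, a client written against the OpenAI endpoint structure could in principle be pointed at a local server without other modifications.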
ace-step-ui is an AI music production workspace and interface for generating, editing, and organizing synthetic audio tracks and vocals. It provides a technical control panel for managing prompts, seeds, and style parameters to produce high-quality audio. The project includes a digital audio workstation interface for trimming and fading files, alongside an audio stem separation tool that splits mixed tracks into individual components such as drums, bass, and vocals. It also features a music video creator for generating visual content and procedural album art to accompany generated music. The software covers the full production lifecycle, including lyric composition tools and prompt optimization to transform genre tags into technical specifications. Workflow management is supported through batch track generation and a searchable audio library for organizing assets into playlists and favorites.
This is an AI-powered music production workspace that includes a built-in audio stem separation tool for isolating vocals, drums, and bass from mixed files.
ACE Step 1.5 is a local text-to-music generation and audio editing system that runs on consumer hardware. It transforms plain-language descriptions into full-length songs with lyrics, and can edit existing audio through cover generation, vocal removal, track separation, and selective repainting. The system supports multilingual prompts and lyrics in over 50 languages, and provides precise control over musical structure including duration, BPM, key, and time signature. The project distinguishes itself through a dual-stream diffusion architecture that processes separate latent streams for vocals and instruments, synchronized through cross-attention layers during denoising. It enables style personalization through lightweight LoRA adapters that can be trained from a few songs in about one hour, and supports batch generation of up to eight songs simultaneously. The system can generate complete songs in under ten seconds on a standard consumer GPU while using less than four gigabytes of video memory. The software is accessible through multiple interfaces including a Gradio web UI, a REST API, a CLI wizard, and a VST3 plugin for direct integration into digital audio workstations. It also includes a pre-trained source separation pipeline for isolating vocal and instrumental stems from mixed audio.
This tool includes a pretrained source separation pipeline for isolating vocal and instrumental stems alongside its primary music generation features.