ACE-Step 1.5 is a local text-to-music generation and audio editing system that runs on consumer hardware. It turns plain-language descriptions into full-length songs with lyrics, and it can edit existing audio through cover generation, vocal removal, track separation, and selective repainting. Prompts and lyrics are supported in over 50 languages, and generation can be controlled precisely by duration, BPM, key, and time signature.
The project distinguishes itself through a dual-stream diffusion architecture: vocals and instruments are processed as separate latent streams that are synchronized through cross-attention layers during denoising. Style personalization relies on lightweight LoRA adapters that can be trained from a few songs in about one hour, and batch generation handles up to eight songs at once. A complete song can be generated in under ten seconds on a consumer GPU while using less than four gigabytes of video memory.
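To make the LoRA idea concrete, here is a minimal sketch of the generic low-rank update that such adapters apply to a frozen weight matrix. It is not taken from the ACE-Step codebase: the function names, matrix sizes, rank, and `alpha` value are all illustrative assumptions, and plain Python lists stand in for tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, the standard LoRA formulation.

    W: frozen base weight, shape d_out x d_in
    A: trainable factor, shape r x d_in
    B: trainable factor, shape d_out x r
    """
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Hypothetical sizes chosen only for illustration.
d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: the adapter starts as a no-op

W_adapted = apply_lora(W, A, B, alpha=4.0)

full_params = d_out * d_in           # parameters in the frozen layer
lora_params = rank * (d_in + d_out)  # parameters actually trained
print(f"trainable: {lora_params} of {full_params}")
```

Only `A` and `B` are trained, so the number of trainable parameters grows with the rank rather than with the full layer size. That is the general reason adapters of this kind can be fitted from a small number of examples.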
The software is accessible through a Gradio web UI, a REST API, a CLI wizard, and a VST3 plugin for direct integration into digital audio workstations. It also includes a pre-trained source separation pipeline for isolating vocal and instrumental stems from mixed audio.