VoiceCraft is a neural speech generation and manipulation system with three components: a text-to-speech system, a voice cloning tool, and an audio inpainting engine. It applies a large language model approach to synthesize high-fidelity audio from text and to replicate speaker identities.
The system supports zero-shot voice cloning and speech editing, so users can modify spoken content within existing recordings. For editing, the audio inpainting engine replaces selected sections of audio with new speech while preserving the original acoustic characteristics and speaker identity.
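The inpainting idea can be illustrated with a minimal sketch. It assumes the recording is already represented as a flat sequence of discrete tokens, which is a simplification made for this example only. The names `inpaint_span` and `generate` are hypothetical and not part of VoiceCraft's API; `generate` stands in for whatever model produces the new speech from the surrounding context.

```python
from typing import Callable, List, Sequence

# Hypothetical signature for the generator: it sees the untouched context on
# both sides of the gap and returns the tokens that should fill it.
Generator = Callable[[List[int], List[int]], Sequence[int]]


def inpaint_span(
    tokens: Sequence[int], start: int, end: int, generate: Generator
) -> List[int]:
    """Replace tokens[start:end] with newly generated tokens.

    Everything outside the span is copied unchanged, which is what lets the
    edited result keep the original recording's characteristics there.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start <= end <= len(tokens)")

    prefix = list(tokens[:start])  # audio before the edit, kept as-is
    suffix = list(tokens[end:])    # audio after the edit, kept as-is

    # The replacement may be longer or shorter than the span it replaces.
    new_span = list(generate(prefix, suffix))
    return prefix + new_span + suffix
```

Passing both the prefix and the suffix to the generator reflects the point made above: the new speech has to fit the context on either side of the edit, not only what comes before it.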
At a high level, the project covers text-to-speech synthesis, custom voice model training through phoneme-based tokenization, and acoustic speech refinement. It uses autoregressive synthesis and latent space representations to decouple speaker identity from linguistic content.
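As a rough illustration of two of the terms above, the sketch below shows phoneme-based tokenization and an autoregressive generation loop in their simplest form. Everything in it is an assumption made for the example: the two-word `TOY_LEXICON` (with ARPAbet-style symbols), the function names, and the `next_token` callback that stands in for a trained model. It is not VoiceCraft's implementation.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy lexicon. A real system would rely on a full pronunciation
# dictionary or a grapheme-to-phoneme model instead.
TOY_LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def build_vocab(lexicon: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign a stable integer ID to every phoneme found in the lexicon."""
    phonemes = sorted({p for phones in lexicon.values() for p in phones})
    return {p: i for i, p in enumerate(phonemes)}


def text_to_phoneme_ids(
    text: str, lexicon: Dict[str, List[str]], vocab: Dict[str, int]
) -> List[int]:
    """Convert whitespace-separated words into a flat list of phoneme IDs."""
    ids: List[int] = []
    for word in text.lower().split():
        if word not in lexicon:
            raise KeyError(f"word not in toy lexicon: {word!r}")
        ids.extend(vocab[p] for p in lexicon[word])
    return ids


def generate_autoregressive(
    prompt_ids: Sequence[int],
    next_token: Callable[[Sequence[int], Sequence[int]], int],
    end_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Generate tokens one at a time, each conditioned on all earlier ones.

    `next_token` is a placeholder for a model: it receives the prompt and the
    tokens generated so far and returns the next token ID.
    """
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt_ids, generated)
        if token == end_id:
            break  # the model signalled that the utterance is complete
        generated.append(token)
    return generated
```

The loop makes the autoregressive property explicit: each call to `next_token` sees everything produced so far, so generation stops either at the end token or at the `max_new_tokens` cap.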