Amphion is an audio generation toolkit designed for the research and development of models that synthesize speech, music, and environmental sound effects. It provides a standardized framework for reproducible audio synthesis, incorporating a text-to-speech engine and a voice conversion framework.
The project focuses on voice transformation: it can modify a speaker's accent or voice identity while preserving the original rhythm and style. It also supports singing voice synthesis and uses diffusion models to generate environmental soundscapes from text descriptions.
The toolkit also covers neural vocoding for waveform reconstruction, discrete token encoding, and zero-shot voice cloning. In addition, it provides preprocessing utilities that unify diverse open-source audio datasets, along with tools for evaluating audio quality and visualizing model mechanisms.