# jasonppy/voicecraft

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/jasonppy-voicecraft).**

8,500 stars · 796 forks · Jupyter Notebook · NOASSERTION

## Links

- GitHub: https://github.com/jasonppy/VoiceCraft
- awesome-repositories: https://awesome-repositories.com/repository/jasonppy-voicecraft.md

## Description

VoiceCraft is a neural speech generation and editing system that combines text-to-speech, voice cloning, and audio inpainting. It takes a language-model approach, generating audio tokens autoregressively from text to synthesize high-fidelity speech and replicate speaker identities.

The system provides zero-shot voice cloning and speech editing capabilities, allowing users to modify spoken content within existing recordings. This includes an audio inpainting engine that replaces specific sections of audio with new speech while preserving the original acoustic characteristics and speaker identity.
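The inpainting contract described above can be illustrated with a minimal sketch. It is not VoiceCraft's API: `inpaint_span`, `dummy_generate`, and the `generate` callback are hypothetical names, and audio is modelled as a plain list of integer tokens. The sketch only shows that tokens outside the edited span stay untouched while the replacement is produced with context from both sides of the gap.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (left context, right context) -> new tokens for the gap.
Generator = Callable[[Sequence[int], Sequence[int]], List[int]]


def inpaint_span(tokens: Sequence[int], start: int, end: int,
                 generate: Generator) -> List[int]:
    """Replace tokens[start:end] with generated tokens, keeping the rest unchanged."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start <= end <= len(tokens)")
    left, right = list(tokens[:start]), list(tokens[end:])
    # The generator sees context on both sides of the gap; a real model would
    # condition on it (and on the target transcript) to keep the speaker's voice.
    filled = generate(left, right)
    return left + filled + right


def dummy_generate(left: Sequence[int], right: Sequence[int]) -> List[int]:
    # Stand-in for a neural model: emits a fixed placeholder token three times.
    return [-1, -1, -1]


original = [10, 11, 12, 13, 14, 15]
edited = inpaint_span(original, 2, 4, dummy_generate)
print(edited)  # [10, 11, -1, -1, -1, 14, 15]
```

The replacement may be longer or shorter than the span it replaces, which is why the function concatenates the three pieces instead of overwriting in place.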

The project covers text-to-speech synthesis, custom voice model training with phoneme-based tokenization, and acoustic speech refinement. It uses autoregressive synthesis and latent-space representations to decouple speaker identity from linguistic content.
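As a rough illustration of phoneme-based tokenization, the sketch below maps text to phoneme symbols and then to integer IDs. Everything in it is an assumption made for illustration: `TOY_LEXICON`, `text_to_phonemes`, and `build_vocab` are hypothetical names, the two-word lexicon stands in for a real grapheme-to-phoneme tool, and none of it reflects VoiceCraft's actual preprocessing code.

```python
from typing import Dict, List

# Toy lexicon for illustration only; real pipelines use a full G2P tool.
TOY_LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text: str) -> List[str]:
    """Look up each whitespace-separated word in the toy lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        # Unknown words map to a placeholder instead of being guessed.
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


def build_vocab(phonemes: List[str]) -> Dict[str, int]:
    """Assign IDs in order of first appearance so the mapping is deterministic."""
    vocab: Dict[str, int] = {}
    for p in phonemes:
        vocab.setdefault(p, len(vocab))
    return vocab


phones = text_to_phonemes("Hello world")
vocab = build_vocab(phones)
ids = [vocab[p] for p in phones]
print(phones)  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(ids)     # [0, 1, 2, 3, 4, 5, 2, 6]
```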

## Tags

### Artificial Intelligence & ML

- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Generates high-fidelity speech using short reference audio samples to replicate speaker identity without retraining.
- [Voice Cloning Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/voice-cloning-tools.md) — Provides a pipeline to generate high-quality synthetic speech by processing custom audio recordings and transcripts.
- [Voice Model Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-trainers/voice-model-trainers.md) — Converts audio recordings and transcripts into phoneme sequences to train and refine neural speech models. ([source](https://github.com/jasonppy/voicecraft#readme))
- [Phoneme-Based Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-alignment-models/phoneme-based-alignment/phoneme-based-pipelines.md) — Converts text and audio transcripts into discrete phonetic units to standardize speech generation.
- [Audio Inpainting And Editing](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-audio-synthesis/audio-inpainting-and-editing.md) — Provides tools for modifying and regenerating specific segments of existing audio using text-based guidance.
- [Autoregressive Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-audio-synthesis/autoregressive-synthesis.md) — Implements autoregressive audio synthesis to produce natural speech rhythms and prosody from text input.
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Synthesizes natural human speech from text input using high-fidelity neural generative models.
- [Surgical Audio Editing](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/surgical-audio-editing.md) — Allows for the modification of spoken content within existing recordings while preserving original voice identity.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Replicates specific human vocal characteristics from audio samples to create high-fidelity synthetic voice models.
- [Latent Space Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/diffusion-visual-models/generative-ai-models/latent-space-generative-models/latent-space-projections/latent-space-encoders.md) — Utilizes latent space encoders to decouple speaker identity from linguistic content for synthetic generation.
- [Acoustic Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/acoustic-models.md) — Uses neural acoustic models to convert linguistic representations into high-fidelity audio features.
- [Zero-Shot Speech Editors](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-speech-editors.md) — Modifies spoken content and infills audio tokens while preserving original voice identity without retraining.
- [Speech-to-Speech with Video Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-to-speech-frameworks/speech-to-speech-with-video-streams.md) — Replaces sections of existing audio with new speech while maintaining original acoustic characteristics. ([source](https://github.com/jasonppy/voicecraft#readme))

### Graphics & Multimedia

- [Audio Content Refinement](https://awesome-repositories.com/f/graphics-multimedia/audio-content-refinement.md) — Replaces or corrects specific words in a recording without re-recording the entire session.
- [Audio Gap Infilling](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-processing/audio-gap-infilling.md) — Provides a neural engine for predicting and restoring missing audio segments to modify spoken content.