# sparkaudio/spark-tts

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/sparkaudio-spark-tts).**

10,930 stars · 1,170 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/SparkAudio/Spark-TTS
- awesome-repositories: https://awesome-repositories.com/repository/sparkaudio-spark-tts.md

## Description

Spark-TTS is a deep learning text-to-speech engine that converts written text into high-fidelity audio. An autoregressive transformer predicts audio tokens from the input text and the preceding context, and a neural speech codec decodes those tokens into a natural-sounding waveform.
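The autoregressive generation described above can be pictured as a loop that repeatedly asks a model for the next audio token and stops at an end-of-sequence marker. The following is a minimal sketch of that loop, not Spark-TTS code: `toy_next_token`, `generate`, and the `EOS` id are hypothetical stand-ins, and the "model" is a small deterministic function rather than a transformer.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(context: List[int]) -> int:
    """Stand-in for a transformer: a deterministic function of the context."""
    # Stop once the context holds six tokens; a real model would learn when to stop.
    if len(context) >= 6:
        return EOS
    # Derive the next id (1..5, never EOS) from everything generated so far.
    return (sum(context) * 3 + len(context)) % 5 + 1


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy autoregressive decoding: each new token is conditioned on all previous ones."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    # Return only the newly generated tokens, without the prompt.
    return tokens[len(prompt):]


if __name__ == "__main__":
    # In a real system the prompt would encode the text (and any reference audio),
    # and the returned ids would be passed to a codec decoder to produce a waveform.
    print(generate([1, 2], toy_next_token))
```

In a real pipeline the sampling step would usually be stochastic (temperature, top-k) rather than greedy; the greedy loop is used here only to keep the sketch deterministic.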

Its distinguishing feature is zero-shot voice cloning: given only a short reference audio sample, it reproduces a target speaker’s vocal identity without any additional model training. It also supports cross-lingual synthesis, generating speech in multiple languages while keeping speaker characteristics consistent across them.

The system also exposes control over vocal output: pitch, speaking rate, and other prosodic attributes can be adjusted at generation time, through either coarse or fine settings. These controls work by manipulating latent representations during inference. The project is available as a Python framework for audio generation.
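One way to picture attribute control with coarse and fine settings is as a set of control tokens prepended to the model input. The sketch below assumes that design for illustration only: the label names in `COARSE_LEVELS`, the token formats, the 0.5 to 2.0 range, and both function names are hypothetical and are not taken from the Spark-TTS code base.

```python
from typing import List, Union

# Hypothetical coarse labels; the real project's label set may differ.
COARSE_LEVELS = ["very_low", "low", "moderate", "high", "very_high"]

Setting = Union[str, float]


def attribute_token(name: str, setting: Setting) -> str:
    """Return one hypothetical control token for a pitch or speed setting."""
    if isinstance(setting, str):
        # Coarse control: a named level mapped to its index in COARSE_LEVELS.
        level = COARSE_LEVELS.index(setting)  # raises ValueError for unknown labels
        return f"<{name}_level_{level}>"
    # Fine control: clamp a multiplier to [0.5, 2.0] and quantize to steps of 0.1.
    clamped = min(2.0, max(0.5, float(setting)))
    return f"<{name}_value_{round(clamped * 10)}>"


def build_control_prefix(pitch: Setting, speed: Setting) -> List[str]:
    """Build the token prefix that would condition generation on pitch and speed."""
    return [attribute_token("pitch", pitch), attribute_token("speed", speed)]


if __name__ == "__main__":
    # Coarse pitch, fine speed.
    print(build_control_prefix("high", 1.2))
```

Quantizing the fine values keeps the control vocabulary finite, which is what lets such settings be expressed as tokens at all; a model that conditions on continuous vectors instead would not need this step.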

## Tags

### Artificial Intelligence & ML

- [Autoregressive Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/autoregressive-transformers.md) — Implements an autoregressive transformer architecture to generate high-fidelity speech by predicting audio tokens based on preceding context.
- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Enables zero-shot voice cloning to mimic target speaker identities using only short reference audio samples without requiring additional model training.
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Converts written text into high-fidelity audio using advanced neural speech synthesis models. ([source](https://sparkaudio.github.io/spark-tts/))
- [Cross-Lingual Speech Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/cross-lingual-speech-generators.md) — Produces high-quality spoken output in multiple languages while maintaining consistent speaker identity.
- [Generative Audio Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-audio-engines.md) — Provides a framework for synthesizing speech with adjustable parameters for pitch, speed, and vocal characteristics.
- [Multilingual Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-speech-models.md) — Generates natural-sounding audio across multiple languages while preserving unique speaker identity. ([source](https://sparkaudio.github.io/spark-tts/))
- [Speaker Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-embeddings.md) — Extracts acoustic features from short audio samples to condition synthesis models without additional training.
- [Speech Synthesis Models](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-models.md) — Decodes compressed latent representations into high-fidelity audio waveforms using neural speech codec synthesis.
- [Autoregressive Models](https://awesome-repositories.com/f/artificial-intelligence-ml/autoregressive-models.md) — Generates coherent speech by predicting successive audio tokens based on preceding context.
- [Prosody Control Tokens](https://awesome-repositories.com/f/artificial-intelligence-ml/latent-conditioning-mechanisms/prosody-control-tokens.md) — Modulates speech pitch, speed, and prosody by manipulating latent vector dimensions during inference.
- [Speech Synthesis Markup](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-emphasis-controls/speech-synthesis-markup.md) — Provides controls for adjusting pitch and speed to refine vocal output during synthesis. ([source](https://sparkaudio.github.io/spark-tts/))

### Security & Cryptography

- [Speech Attribute Controls](https://awesome-repositories.com/f/security-cryptography/identity-access-management/access-control/access-control-models/attribute-based-access-controls/speech-attribute-controls.md) — Adjusts vocal characteristics like pitch and speaking rate through coarse or fine settings. ([source](https://sparkaudio.github.io/spark-tts/))
