# paddlepaddle/paddlespeech

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/paddlepaddle-paddlespeech).**

12,616 stars · 1,959 forks · Python · Apache-2.0

## Links

- GitHub: https://github.com/PaddlePaddle/PaddleSpeech
- Homepage: https://paddlespeech.readthedocs.io
- awesome-repositories: https://awesome-repositories.com/repository/paddlepaddle-paddlespeech.md

## Topics

`asr` `code-switch` `conformer` `kws` `punctuation-restoration` `self-supervised-learning` `sound-classification` `speech-alignment` `speech-recognition` `speech-synthesis` `speech-translation` `streaming-asr` `streaming-tts` `transformer` `tts` `vocoder` `voice-cloning` `voice-recognition` `wav2vec2` `whisper`

## Description

PaddleSpeech is a toolkit of neural models for speech recognition, synthesis, and translation, built on the PaddlePaddle deep learning framework. It provides models and tools for converting spoken audio into written text, synthesizing natural-sounding speech from text, and translating speech directly.

The toolkit also supports keyword spotting, which detects trigger words, and speaker verification, which extracts voiceprints to identify and distinguish between individuals. Its end-to-end translation models map audio features directly to target-language text without intermediate transcription.
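To make the voiceprint idea concrete, here is a minimal sketch of how two speaker embeddings might be compared. It is not PaddleSpeech code: the function names, the use of cosine similarity as the score, and the `0.7` threshold are all assumptions chosen for illustration, and a real system would tune the threshold on held-out data.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled: Sequence[float],
                 probe: Sequence[float],
                 threshold: float = 0.7) -> bool:
    """Accept the probe if it is close enough to the enrolled voiceprint.

    The threshold is a hypothetical value, not one taken from PaddleSpeech.
    """
    return cosine_similarity(enrolled, probe) >= threshold
```

In this sketch, `enrolled` would be the embedding stored when a speaker registers and `probe` the embedding extracted from a new recording.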

The system covers a broad range of speech processing tasks, including automatic speech recognition with punctuation restoration, speaker diarization, and sound classification. Its synthesis pipeline generates mel spectrograms and raw audio waveforms, while a streaming inference engine enables real-time processing with low latency.
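Streaming inference generally works by feeding a model small pieces of audio as they arrive rather than waiting for the whole recording. The sketch below illustrates only that chunking step, under stated assumptions: `stream_chunks` and `chunk_duration_ms` are hypothetical helper names, not part of PaddleSpeech, and the chunk size and sample rate in the usage note are example values.

```python
from typing import Iterator, List, Sequence


def stream_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping chunks of audio samples.

    The last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


def chunk_duration_ms(chunk_size: int, sample_rate: int) -> float:
    """Audio duration covered by one chunk, in milliseconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return 1000 * chunk_size / sample_rate
```

For example, with 16 kHz audio and chunks of 1,600 samples, each chunk covers 100 ms, so a consumer can start producing partial results after roughly that much audio has been buffered. Smaller chunks lower that buffering delay at the cost of more model invocations.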

## Tags

### Artificial Intelligence & ML

- [Automatic Speech Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/automatic-speech-recognition.md) — Provides a comprehensive system for converting spoken audio into written text with streaming and punctuation support.
- [Speech-to-Text Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/end-to-end-pipelines/speech-to-text-translation.md) — Maps source audio features directly to target language text without using an intermediate transcription step.
- [Acoustic Models](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/acoustic-models.md) — Provides neural network architectures that convert linguistic representations into audio features like mel-spectrograms.
- [Keyword Spotting](https://awesome-repositories.com/f/artificial-intelligence-ml/keyword-spotting.md) — Detects specific predefined trigger words or phrases within continuous audio streams to initiate actions. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [Multilingual Speech Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/multilingual-speech-translation.md) — Translates spoken audio from one language into text in another via an end-to-end process. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [End-to-End Speech Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-datasets/english/speech-to-text-translation/end-to-end-speech-translation.md) — Provides models for translating spoken audio from one language directly into another without intermediate text.
- [Speaker Diarization](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-diarization.md) — Partitions audio recordings into distinct segments to determine which individual is speaking at any given time. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [Speaker Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-embeddings.md) — Generates fixed-dimensional numerical representations of voices to identify and verify individual speaker identities.
- [Speaker Identification Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-identification-frameworks.md) — Extracts voice embeddings and partitions audio to identify and distinguish between different individual speakers.
- [Speech Processing Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-processing-toolkits.md) — Offers a comprehensive toolkit of neural models for speech recognition, synthesis, and translation.
- [Speech Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-transcription.md) — Converts spoken audio into written text using standard and streaming automatic speech recognition methods. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Synthesizes natural human speech from text input with support for voice cloning and streaming output. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [Speech-to-Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models.md) — Translates spoken audio directly into spoken or written target languages without intermediate transcription.
- [Prosody Control](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/prosody-control.md) — Modifies the timing, pitch, and energy of synthesized speech using duration predictors and conditional inputs. ([source](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md))
- [Self-Supervised Speech Representations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/self-supervised-speech-representations.md) — Learns general linguistic features from large unlabeled audio datasets via self-supervised representation learning.
- [Prosodic Duration Predictors](https://awesome-repositories.com/f/artificial-intelligence-ml/prosodic-duration-predictors.md) — Implements conditional duration prediction to control the timing and rhythm of synthesized speech.
- [Punctuation Restoration](https://awesome-repositories.com/f/artificial-intelligence-ml/punctuation-restoration.md) — Automatically inserts punctuation marks into raw speech-to-text transcripts to improve readability. ([source](https://github.com/paddlepaddle/paddlespeech#readme))
- [Streaming Transcription Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/conversational-audio-streams/streaming-transcription-inference.md) — Processes audio data in small chunks for real-time speech recognition and synthesis with low latency.

### Graphics & Multimedia

- [Text-to-Speech Engines](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-processing/text-to-speech-engines.md) — Provides a complete synthesis pipeline that transforms written text into natural audio waveforms using acoustic models and neural vocoders.
- [Neural Vocoders](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/audio-processing-systems/audio-synthesis/neural-vocoders.md) — Transforms generated spectral data into high-fidelity time-domain audio waveforms using neural vocoders.
- [Voiceprint Extraction](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/audio-analysis-synthesis/audio-feature-extraction/voiceprint-extraction.md) — Creates unique digital signatures from voice samples to identify and verify specific speakers. ([source](https://github.com/paddlepaddle/paddlespeech#readme))