Piper1 GPL

Piper1 GPL is a neural text-to-speech system and voice trainer that converts written text into spoken audio in many languages and regional dialects. Its engine runs ONNX models for fast offline inference and converts text to phonemes before synthesis, which gives precise control over pronunciation.

The project also includes a toolkit for training custom single-speaker or multi-speaker voice models. Trained models can be exported to the ONNX format, and inference can run on a GPU to speed up audio generation.
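To illustrate how a multi-speaker dataset can link recordings to speaker identifiers, here is a minimal sketch. The row layout (audio file, speaker name, transcript) and the function name assign_speaker_ids are assumptions made for this example, not the project's actual dataset format or API.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (audio_file, speaker_name, transcript).
Row = Tuple[str, str, str]


def assign_speaker_ids(
    rows: Iterable[Row],
) -> Tuple[Dict[str, int], List[Tuple[str, int, str]]]:
    """Give each distinct speaker name a stable integer ID, in order of first appearance."""
    speaker_ids: Dict[str, int] = {}
    labeled: List[Tuple[str, int, str]] = []
    for audio_file, speaker, text in rows:
        # setdefault only stores the new ID when the speaker has not been seen yet.
        speaker_id = speaker_ids.setdefault(speaker, len(speaker_ids))
        labeled.append((audio_file, speaker_id, text))
    return speaker_ids, labeled


rows = [
    ("a.wav", "alice", "Hello."),
    ("b.wav", "bob", "Good morning."),
    ("c.wav", "alice", "Goodbye."),
]
speaker_ids, labeled = assign_speaker_ids(rows)
```

A single model trained on such data can then be asked for a particular voice by passing the matching integer ID at synthesis time.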

Synthesized audio can be streamed in chunks in real time or written to audio files. Vocal delivery can be controlled through raw phoneme injection, punctuation-based prosody, and adjustments to speaking speed and volume.
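The idea of chunked streaming with adjustable speed and volume can be sketched with the Python standard library alone. The function fake_synthesize below is a hypothetical stand-in that emits a sine tone per sentence; it is not the engine's API, and the sample rate, the length_scale parameter name, and the per-character duration are assumptions chosen for illustration.

```python
import io
import math
import struct
import wave
from typing import Iterable, Iterator

SAMPLE_RATE = 22050  # assumed output rate for this sketch
SAMPLES_PER_CHAR = SAMPLE_RATE // 20  # hypothetical: about 50 ms of audio per character


def fake_synthesize(
    text: str, length_scale: float = 1.0, volume: float = 1.0
) -> Iterator[bytes]:
    """Stand-in synthesizer: yields one 16-bit mono PCM chunk per sentence."""
    for sentence in (s.strip() for s in text.split(".")):
        if not sentence:
            continue
        # A larger length_scale produces more samples, i.e. slower speech.
        num_samples = int(len(sentence) * SAMPLES_PER_CHAR * length_scale)
        amplitude = 16000 * min(max(volume, 0.0), 1.0)
        samples = [
            int(amplitude * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE))
            for n in range(num_samples)
        ]
        yield struct.pack(f"<{num_samples}h", *samples)


def stream_to_wav(chunks: Iterable[bytes], wav_io) -> int:
    """Write chunks to a WAV stream as they arrive; return the number of frames written."""
    total_frames = 0
    with wave.open(wav_io, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav_file.writeframes(chunk)
            total_frames += len(chunk) // 2
    return total_frames


buffer = io.BytesIO()
normal_frames = stream_to_wav(fake_synthesize("Hello there. Goodbye."), buffer)
slow_frames = stream_to_wav(
    fake_synthesize("Hello there. Goodbye.", length_scale=2.0), io.BytesIO()
)
```

Because the synthesizer is a generator, a player could hand each chunk to the audio device as soon as it is produced instead of waiting for the whole text; writing to a file is simply another consumer of the same chunk stream.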

Features

Neural Text-to-Speech Engines - Implements a deep learning pipeline that translates characters into phonemes and generates raw audio waveforms.

Multi-Language Speech Generators - Supports a wide variety of global languages and regional dialects through language-specific neural pipelines.

Voice Model Trainers - Provides a toolkit for training custom single-speaker or multi-speaker voice models from recordings and transcripts.

Voice Synthesizer Training - Implements a toolkit for training neural text-to-speech models to mimic specific target speakers.

Multi-Speaker Training - Supports training a single model to synthesize multiple distinct voices by linking recordings to unique speaker identifiers.

Multi-Speaker Synthesis - Enables the development of a single neural model capable of synthesizing multiple distinct voices and regional dialects.

Speaker Embeddings - Uses unique speaker identifiers to link multiple distinct voices within a single neural model.

Text-to-Speech Synthesis - Provides an HTTP interface for converting written text into spoken audio with adjustable speed and variability.

ONNX-Based Engines - Implements a neural speech synthesis system using ONNX models for high-performance offline inference.

Phoneme-Based Speech Processors - Converts text into a sequence of specific speech sounds to ensure precise pronunciation and intonation.

Multi-Language Speech Generators - Synthesizes spoken audio across a wide variety of global languages and regional dialects.

GPU Acceleration - Provides hardware acceleration via graphics processors to increase the speed of neural audio generation.

Hardware-Accelerated Inference - Offloads neural network computations to GPUs to increase the speed of audio generation.

ONNX Model Exporters - Converts trained neural network checkpoints into the standardized ONNX format for cross-platform acceleration.

Vocal Characteristic Adjustments - Provides tools to modify the volume, speaking speed, and audio variation of generated speech.

Audio File Exports - Converts text into spoken audio and writes the resulting sound directly to waveform files.

Local Speech Synthesis - Supports local speech generation by using pre-trained models exported to formats like ONNX for standalone use.

Raw Phoneme Injection - Allows for precise pronunciation control by inserting specific phonemes into text blocks to override automatic conversion.

Phonetic Pronunciation Overrides - Allows manual overriding of text-to-phoneme conversion using raw phoneme IDs for exact pronunciation.

Weight-Based Initializations - Supports starting new voice model training from existing model weights to reduce compute and convergence time.

Generative Audio Chunking - Produces synthesized audio in incremental chunks to allow playback to begin before processing is complete.

Streaming Audio Generators - Generates spoken audio incrementally to enable immediate playback while the remaining text is being processed.

Live Synthesis Streaming - Streams synthesized speech in incremental chunks to allow playback to begin before the full text is processed.

GPU-Accelerated TTS - Uses graphics processors to accelerate the neural speech synthesis process.
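As an illustration of raw phoneme injection, the sketch below splits input text into plain-text segments and phoneme segments. It assumes phoneme blocks are wrapped in double square brackets; the delimiter, the function name split_phoneme_blocks, and the segment representation are assumptions for this example and may differ from the project's actual parser.

```python
import re
from typing import List, Tuple

# Assumed delimiter: raw phonemes are wrapped in double square brackets.
PHONEME_BLOCK = re.compile(r"\[\[(.*?)\]\]")


def split_phoneme_blocks(text: str) -> List[Tuple[str, bool]]:
    """Split text into (segment, is_phonemes) pairs, preserving order."""
    segments: List[Tuple[str, bool]] = []
    last_end = 0
    for match in PHONEME_BLOCK.finditer(text):
        before = text[last_end:match.start()].strip()
        if before:
            segments.append((before, False))  # ordinary text, still to be phonemized
        phonemes = match.group(1).strip()
        if phonemes:
            segments.append((phonemes, True))  # raw phonemes, bypass conversion
        last_end = match.end()
    rest = text[last_end:].strip()
    if rest:
        segments.append((rest, False))
    return segments


segments = split_phoneme_blocks("I am the [[ bˈætmæn ]] not a bat.")
```

Segments flagged as phonemes would skip automatic text-to-phoneme conversion, while the remaining segments would go through it as usual.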

OHF-Voice/piper1-gpl