Neutts

Neutts is a neural text-to-speech engine designed for real-time streaming output on edge devices such as phones and laptops. It supports voice cloning from short audio references, enabling zero-shot reproduction of a target speaker's voice, and can be fine-tuned or retrained from scratch for custom voices and styles.

Neutts uses a decoder-only architecture that halves memory use and speeds up generation on constrained hardware, and it runs quantized models to shrink the memory footprint further. Its streaming decoder loop interleaves synthesis with playback, so audio starts playing before the full utterance is synthesized. Each generated utterance can also carry an inaudible or perceptible audio watermark that marks it as synthetic and makes it traceable.
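The idea of interleaving synthesis with playback can be sketched as a producer/consumer loop. This is a minimal illustration, not the project's implementation: `synthesize_chunks`, `stream`, and the fake character-based "samples" are hypothetical stand-ins for a real autoregressive decoder and audio device.

```python
import queue
import threading

def synthesize_chunks(text, chunk_size=4):
    """Hypothetical stand-in for a decoder: yields fake audio in small chunks."""
    samples = [float(ord(c)) for c in text]  # one fake sample per character
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def stream(text, play, max_buffered=2):
    """Synthesize in a background thread while `play` consumes chunks as they arrive."""
    buffer = queue.Queue(maxsize=max_buffered)  # bounded, so synthesis cannot run far ahead
    done = object()  # sentinel marking the end of the utterance

    def producer():
        try:
            for chunk in synthesize_chunks(text):
                buffer.put(chunk)
        finally:
            buffer.put(done)  # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        chunk = buffer.get()
        if chunk is done:
            break
        play(chunk)  # playback begins after the first chunk, not the full utterance

# Usage: collect the "played" chunks in a list instead of sending them to a speaker.
played = []
stream("hello world", played.append)
```

The bounded queue is the design choice that matters here: it keeps the synthesizer only a few chunks ahead of playback, which caps buffered memory on a constrained device.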

Beyond core synthesis, Neutts can pre-encode reference audio so that repeated runs skip the encoding step, and it supports full model customization through fine-tuning on paired text-audio data. The project also provides tools for adapting the model to edge deployment and on-device real-time speech generation.
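Pre-encoding reference audio amounts to caching the encoder output so later runs can reuse it. The sketch below shows one way to do that, assuming the encoded reference is a list of integer codes; `cached_reference` and `fake_encode` are hypothetical names for illustration and are not part of the project's API.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_reference(audio_bytes, encode, cache_dir):
    """Return encoded reference codes, running `encode` only on the first call per clip."""
    key = hashlib.sha256(audio_bytes).hexdigest()  # cache key derived from the clip contents
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: skip encoding
    codes = encode(audio_bytes)
    path.write_text(json.dumps(codes))
    return codes

calls = []

def fake_encode(audio_bytes):
    # Hypothetical encoder stand-in: records the call and returns dummy integer codes.
    calls.append(1)
    return [b % 16 for b in audio_bytes]

cache_dir = tempfile.mkdtemp()
first = cached_reference(b"reference clip", fake_encode, cache_dir)
second = cached_reference(b"reference clip", fake_encode, cache_dir)
```

The second call returns the same codes without invoking the encoder, which is the saving that pre-encoding provides on repeated runs.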

Features

Neural Text-to-Speech Engines - Provides a compact neural text-to-speech engine that generates natural-sounding audio from written text.

Text-to-Speech - Converts written text into natural-sounding speech using compact neural models optimized for real-time edge deployment.

Decoder Architectures - Uses a decoder-only transformer architecture to reduce memory and accelerate inference on edge hardware.

Voice Conditioning Encoders - Encodes a short reference audio sample into a voice embedding to condition the decoder for zero-shot voice cloning.

Custom Data Fine-Tunings - Supports fine-tuning the neural TTS model on custom text-audio datasets for personalized voices.

Speech Model Fine-Tuning - Provides fine-tuning capabilities to adapt the speech model on custom text-audio data for new voices or domain-specific styles.

Edge AI Model Deployment - Optimizes the neural TTS model for on-device inference on phones, laptops, and edge hardware.

From-Scratch Training - Builds a speech model from a base language model and custom speech tokens to create novel voices and speaking styles.

Model Quantization - Applies quantization to model weights and components to shrink memory and speed up inference on edge devices.

Weight Quantization - Applies weight quantization to reduce memory footprint and accelerate inference on constrained devices.

Decoder-Only Inference Modes - Loads only the decoder portion of the speech model during inference to minimize memory and computation.

Voice Cloning Engines - Reproduces a target speaker's voice from a short audio reference and generates natural-sounding speech from any text input without retraining.

On-Device Text-to-Speech Synthesizers - Runs the neural TTS engine entirely on-device for real-time, private speech generation.

Voice Cloning - Reproduces a target speaker's voice from a short audio reference for zero-shot voice cloning.

Streaming Audio Generators - Streams generated audio in chunks for immediate playback before synthesis finishes, enabling low-latency speech output.

Live Synthesis Streaming - Streams generated speech in chunks so playback begins before the full utterance is synthesized.

Real-time Synthesis Streaming - Streams synthesized audio in real-time chunks so playback begins before synthesis completes, minimizing delay.

Streaming Decoders - Decodes synthetic audio autoregressively in small chunks to enable streaming playback with minimal latency.

Audio Watermarking - Embeds inaudible watermarks in generated speech to verify synthetic origin and enable traceability.

Audible Watermarks - Adds a perceptible tonal watermark to generated speech for traceability.

Additional AI Tools - On-device TTS model with instant voice cloning from audio samples.
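The weight quantization entries in the list above can be illustrated with a generic symmetric 8-bit scheme. This sketch only shows the general technique; the source does not state which quantization format the project uses, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one float scale shared by the tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers and scale."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored as one small integer plus a shared scale instead of a 32-bit float, and the reconstruction error per weight stays within about half a quantization step.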

neuphonic/neutts
