Open-source libraries and pre-trained models for converting spoken audio to text and generating synthetic human speech.
NeMo is a multimodal AI framework and toolkit for developing, training, and scaling large language models, generative AI systems, and speech models. It functions as an automatic speech recognition toolkit, a text-to-speech engine, and a framework for building models that process and generate combinations of text, image, and audio data. The project also serves as a conversational AI orchestrator capable of managing real-time, interruptible voice interactions, and it provides specialized workflows for speech translation, converting spoken audio from one language into text or speech in another.
NeMo is a comprehensive framework for building and deploying speech-based models that natively supports automatic speech recognition, text-to-speech synthesis, multi-language workflows, and real-time inference.
SenseVoice is a multilingual speech large language model designed for audio transcription, speaker diarization, and emotion recognition. It functions as an automatic speech recognition system that converts spoken audio into text across multiple languages. The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording. The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.
This toolkit provides a comprehensive suite for both automatic speech recognition and expressive text-to-speech, including advanced features like zero-shot voice cloning and multi-language support.
Dia is a generative AI audio tool and text-to-speech synthesis engine designed for the production-ready deployment of machine learning models. It provides a framework for creating lifelike synthetic speech by conditioning generation on reference audio samples to replicate specific vocal characteristics, emotional tones, and delivery styles. The system distinguishes itself through its ability to perform custom voice cloning and precise control over audio output. Users can adjust generation parameters such as temperature and guidance scale to modify the pacing, creativity, and style of the synthesized speech. Additionally, the platform supports the injection of nonverbal vocal expressions, such as laughter or gasps, through the use of specialized text markers. The framework integrates with standard machine learning ecosystems to facilitate the management and scaling of generative services. It supports modular model orchestration, ensuring that complex audio synthesis tasks remain consistent and performant within production environments.
Dia is a specialized text-to-speech and voice cloning engine that provides robust tools for generative audio synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.
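The effect of the temperature and guidance scale parameters mentioned above can be illustrated with a generic sampling sketch. It assumes that "guidance scale" refers to classifier-free guidance, which the description does not confirm, and the function names are hypothetical rather than part of Dia's API.

```python
import math
import random


def guided_logits(cond, uncond, guidance_scale):
    """Classifier-free guidance (assumed technique): move the unconditional
    logits toward the conditional ones, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the
    distribution, a higher one flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(cond, uncond, guidance_scale=3.0, temperature=1.0, rng=None):
    """Sample one token index from the guided, temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(
        guided_logits(cond, uncond, guidance_scale), temperature
    )
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch a guidance scale of 1.0 reproduces the conditional logits unchanged, while larger values exaggerate the influence of the conditioning input; the default values are illustrative only.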
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audio features directly to a target language without intermediate transcription. The system covers a broad range of speech processing tasks, including automatic speech recognition with punctuation restoration, speaker diarization, and audio sound classification. Its synthesis pipeline manages the generation of mel spectrograms and raw audio waveforms, while a streaming inference engine enables real-time processing with low latency.
This toolkit provides a comprehensive suite of neural models for both automatic speech recognition and text-to-speech synthesis, including support for real-time streaming inference, voice cloning, and multiple languages.
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech. Capabilities cover audio processing pipelines for effects like pitch shifts and reverb, as well as real-time and file-based transcription with filler word removal. The system supports persona-based dialogue generation, batch synthesis with prompt caching, and global text dictation for inserting transcripts directly into the operating system clipboard. The processing engine can be hosted on local hardware or remote GPU servers.
Voicebox is a comprehensive speech processing toolkit that provides both automatic speech recognition and text-to-speech synthesis, including built-in support for voice cloning and real-time local inference.
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to maximize hardware utilization and reduce latency. It includes a comprehensive voice cloning toolkit that replicates unique vocal characteristics from short reference audio samples without requiring additional model training. Users can further customize output through low-rank adaptation fine-tuning, which allows for efficient style adjustments, and speaker-specific token embeddings that manage distinct voice characteristics during multi-speaker generation. Beyond core synthesis, the project provides a full suite of utilities for training and alignment, including reinforcement learning techniques to optimize for semantic accuracy and instruction adherence. It supports a variety of operational interfaces, including a command-line tool, a web-based dashboard, and an authenticated HTTP server for remote generation workloads. The system also includes data preparation and serialization tools to streamline the process of organizing and normalizing audio datasets for model training.
This project is a specialized text-to-speech synthesis engine that provides advanced voice cloning and multilingual support, though it lacks the automatic speech recognition capabilities requested by the visitor.
OpenVoice is a multilingual text-to-speech framework and voice cloning AI model designed for high-fidelity voice replication and low-latency audio generation. It functions as an instant speech synthesis engine that converts text to audio while replicating a specific speaker's tone and color. The system is distinguished by its ability to perform cross-lingual cloning, allowing the vocal characteristics of a reference speaker to be applied to speech in different languages regardless of the original training data. It utilizes a decoupled representation to separate the physical identity of a voice from its emotional and rhythmic delivery. This tool provides granular speech control over audio generation, enabling adjustments to parameters such as emotion, accent, rhythm, and intonation. These capabilities allow for the creation of digital replicas using short audio samples to synthesize expressive speech.
This is a specialized text-to-speech and voice cloning framework that provides high-fidelity synthesis and cross-lingual capabilities, though it lacks the automatic speech recognition functionality requested by the visitor.
mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure. The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings. Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.
This toolkit provides a comprehensive suite for speech processing, including automatic speech recognition, text-to-speech synthesis, and voice cloning, all optimized for local execution on Apple silicon.
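An OpenAI-compatible endpoint structure means a client can send the same request shape that OpenAI's speech API expects. The sketch below only builds such a request and does not send it. The base URL, model name, and voice name are placeholders, and the assumption that the server exposes the `/v1/audio/speech` path should be checked against the project's own documentation.

```python
import json
import urllib.request


def build_speech_request(
    text,
    base_url="http://localhost:8000",  # hypothetical local server address
    model="example-tts-model",         # placeholder model identifier
    voice="example-voice",             # placeholder voice name
):
    """Build (but do not send) a POST request in the OpenAI speech API shape."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",  # assumed path, following OpenAI's API
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response body would then be expected to contain audio bytes if the server follows the same convention.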
ESPnet is a comprehensive speech processing toolkit and PyTorch-based trainer designed for building end-to-end speech recognition, synthesis, and translation models. It provides a structured framework for developing automatic speech recognition systems using transducer and encoder-decoder architectures, alongside engines for text-to-speech synthesis and speech translation pipelines. The project distinguishes itself through a recipe-based workflow execution system that ensures experimental reproducibility by running standardized sequences of scripts for data preparation and model training. It leverages containerized environments to provide consistent execution across platforms and supports large-scale distributed training across multiple GPUs and nodes. The toolkit covers a broad range of capabilities, including spoken language understanding for intent and sentiment classification, audio enhancement and separation, and singing voice synthesis. It also incorporates advanced training techniques such as self-supervised learning, parameter-efficient fine-tuning, and transfer learning. Model development is supported by utilities for audio data formatting, spectral augmentation, and the integration of pretrained encoders, while inference is optimized through blockwise beam search for real-time streaming execution.
ESPnet is a comprehensive, research-grade toolkit that natively supports both automatic speech recognition and text-to-speech synthesis, offering a wide array of pre-trained models, multi-language capabilities, and support for real-time inference and voice conversion.
Tortoise-tts is a neural text-to-speech engine and voice cloning toolkit designed for high-quality audio generation. It functions as a zero-shot synthesis system, meaning it can generate speech for unseen speakers without requiring additional training or fine-tuning for each new voice. The system specializes in replicating human vocal characteristics using small sets of reference audio clips. It allows for the extraction of voice latents to mimic specific speakers, the generation of random synthetic identities, and the blending of multiple voice profiles to create hybrid vocal identities. The project covers a broad range of synthesis capabilities, including long-form audio processing via sentence-level text chunking and multi-voice synthesis. It provides tools for emotional speech control through instructional embeddings and supports non-English text processing via specialized tokenizers. Additional utilities include synthetic speech detection and inference acceleration.
This is a specialized text-to-speech and voice cloning engine that provides high-quality synthesis and zero-shot capabilities, though it lacks the automatic speech recognition functionality requested by the visitor.
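Sentence-level text chunking for long-form synthesis can be sketched as follows. This is a minimal illustration of the general idea, not Tortoise-tts's own splitter; the function name and the `max_chars` limit are assumptions.

```python
import re


def chunk_sentences(text, max_chars=200):
    """Split text at sentence boundaries and pack whole sentences into
    chunks of at most max_chars characters. A single sentence longer than
    max_chars is kept whole rather than cut mid-sentence."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio concatenated, which keeps every generation call within a length the model handles well.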
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services. The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation. Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.
Sherpa-ONNX is a comprehensive speech processing toolkit that natively supports both automatic speech recognition and text-to-speech synthesis, including advanced features like zero-shot voice cloning and multi-language support for local, real-time inference.
Spark-TTS is a deep learning text-to-speech synthesis engine designed to convert written text into high-fidelity audio. It utilizes a transformer-based architecture and autoregressive sequence modeling to generate coherent speech, transforming linguistic input into natural-sounding waveforms through neural speech codec synthesis. The platform distinguishes itself through zero-shot voice cloning, which allows users to mimic a target speaker’s unique vocal identity using only a short reference audio sample without requiring additional model training. It also features cross-lingual phonetic mapping, enabling the synthesis of multilingual speech while maintaining consistent speaker characteristics across different languages. The system provides extensive control over vocal output, allowing for the adjustment of pitch, speed, and other prosodic attributes during the generation process. By manipulating latent space representations, users can refine speech parameters to achieve specific vocal characteristics for various applications. The project is available as a Python-based framework for audio generation.
This is a specialized text-to-speech synthesis engine that supports zero-shot voice cloning and multilingual output, though it lacks the automatic speech recognition capabilities requested by the visitor.
Linly-Dubbing is an automated video dubbing pipeline designed for multilingual video localization. It converts spoken content in videos into another language by coordinating speech-to-text transcription, text translation, and text-to-speech synthesis. The system distinguishes itself through AI-driven lip synchronization and animation, which aligns facial expressions and mouth movements to the synthesized voiceover. It also utilizes audio source separation to isolate vocals from background music and noise, allowing for clean voice replacement while preserving original background audio. The broader capability surface includes tools for web video downloading, timestamped speech transcription, and voice cloning. A graphical configuration interface is provided to manage the processing pipeline, select audio files, and adjust numeric parameters.
This is an automated video dubbing pipeline that integrates speech-to-text, text-to-speech, and voice cloning, making it a comprehensive tool for speech processing tasks despite its primary focus on video localization.
Voice Pro is a comprehensive speech and audio processing toolkit that combines text-to-speech synthesis, voice cloning, speech recognition, and translation capabilities into a single application. At its core, the project enables users to generate natural-sounding speech from text, clone voices from short audio samples without requiring prior training data, and perform real-time speech translation across more than 100 languages. The platform distinguishes itself through its integrated multimedia workflow, allowing users to download YouTube videos, extract audio, separate voice tracks, generate word-timed subtitles, and produce dubbed content in those languages through a unified pipeline. It supports multiple speech synthesis engines including Edge-TTS, F5-TTS, E2-TTS, CosyVoice, and kokoro, while also providing the ability to train custom TTS models on user-provided datasets and export trained models to ONNX format for deployment. Beyond core speech generation, the application offers extensive audio processing features such as transcribing speech to text with word-level subtitle generation, translating subtitle files while preserving formatting, and performing real-time speech recognition and translation with customizable audio inputs. The system also includes capabilities for extracting audio from video, removing noise, and managing the application's installation and dependencies through built-in cleanup utilities.
This toolkit provides a comprehensive suite for both automatic speech recognition and text-to-speech synthesis, featuring support for zero-shot voice cloning, multilingual translation, and pre-trained model integration within a unified pipeline.
Neutts is a neural text-to-speech engine designed for real-time streaming output on edge devices such as phones and laptops. It supports voice cloning from short audio references, enabling zero-shot reproduction of a target speaker's voice, and can be fine-tuned or retrained from scratch for custom voices and styles. The system distinguishes itself through a decoder-only architecture that halves memory and accelerates generation on constrained hardware, combined with quantized model inference for reduced memory footprint. Its streaming decoder loop interleaves synthesis with playback, delivering minimal latency. Additionally, each generated utterance can embed an inaudible or perceptible audio watermark to verify synthetic origin and traceability. Beyond core synthesis, neutts offers capabilities such as pre-encoding reference audio to skip encoding on repeated runs, and full model customization through fine-tuning on paired text-audio data. The project provides tools for adapting the model to edge deployment and supporting on-device real-time speech generation.
This is a specialized neural text-to-speech engine that excels at real-time synthesis and voice cloning on edge devices, though it lacks the automatic speech recognition capabilities requested by the visitor.
Pocket-tts is a text-to-speech server and neural speech synthesizer that converts written text into audible speech. It includes a CPU-optimized inference engine and a voice cloning tool capable of analyzing audio samples to reproduce specific speaker characteristics. The system differentiates itself through the use of dynamic int8 quantization to reduce memory usage and increase generation speed on CPUs. It supports real-time speech synthesis by streaming audio chunks incrementally and utilizes voice state caching to store processed embeddings as portable files, bypassing redundant processing during speaker cloning. The project covers a broad range of capabilities, including local model hosting and self-hosted API services for remote audio generation. It provides utilities for model initialization across multiple languages and a native backend to handle computationally intensive synthesis operations.
This is a dedicated text-to-speech engine that supports voice cloning and real-time synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.
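The core arithmetic behind dynamic int8 quantization can be shown in a few lines. The sketch below uses symmetric per-tensor quantization with a scale computed from the data at call time; it is a simplified illustration under that assumption, and both function names are hypothetical rather than taken from Pocket-tts.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a scale derived from the
    largest absolute value seen at call time (the 'dynamic' part)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]
```

Storing one byte per value instead of four is where the memory saving comes from; the price is a rounding error of at most half the scale per value. Production engines typically run the integer arithmetic in optimized kernels rather than in Python lists.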
GPT-SoVITS is a text-to-speech synthesis engine and voice cloning toolkit designed for generating natural-sounding human speech. It functions as a neural audio processing pipeline that maps input text to high-fidelity audio waveforms, utilizing conditional variational autoencoders and flow-based decoders to ensure expressive output. The platform distinguishes itself through its ability to perform few-shot voice cloning and cross-lingual speech generation, allowing users to maintain a specific speaker's vocal identity and emotional delivery across multiple languages. By employing cross-modal latent alignment, the system effectively bridges text-based linguistic features with speaker-specific embeddings, while a generative adversarial network-based vocoder ensures the final audio maintains high time-domain quality. The software provides a modular pipeline that supports the entire lifecycle of custom voice model development, including data preprocessing, fine-tuning on small datasets, and inference. It incorporates self-supervised speech representation models to extract discrete linguistic units, facilitating robust voice conversion and automated audio content creation. The project includes documentation for model training, inference procedures, and command-line execution.
This is a specialized text-to-speech and voice cloning engine that provides robust support for multi-language synthesis and fine-tuning, though it lacks the automatic speech recognition capabilities requested by the visitor.
Qwen3-TTS is a large language model text-to-speech engine designed to convert written text into natural-sounding human speech. It functions as an audio tokenizer and a generative system for speech synthesis. The project features a promptable voice designer for creating synthetic vocal personas based on natural language descriptions. It also includes a zero-shot voice cloning tool that mimics a target speaker using a short reference audio clip and a transcript. The system provides a framework for speech model fine-tuning to improve speaker likeness and quality through supervised training. Additional capabilities include custom voice synthesis across different languages and a web interface launcher for interacting with the models.
This repository provides a specialized generative engine for text-to-speech synthesis and voice cloning, though it lacks the automatic speech recognition capabilities required for a comprehensive speech processing toolkit.
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, weakly labelled audio-transcript data to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies without language-specific rules. By employing byte-level tokenization and sliding window audio segmentation, the engine maintains memory efficiency and temporal consistency when processing long-form audio or varied acoustic environments. The toolkit provides both command-line and programmatic interfaces, enabling developers to integrate speech-to-text capabilities directly into custom software applications or automate high-volume batch processing of media libraries. It includes utilities for accessing multilingual and English-only speech corpora to support model validation and domain-specific performance tuning.
This is a robust automatic speech recognition and translation engine that provides pre-trained models and multi-language support, though it focuses exclusively on speech-to-text rather than text-to-speech synthesis.
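Sliding window audio segmentation, as mentioned in the description above, can be sketched as a generator over sample indices. The window length, hop size, and sample rate defaults here are illustrative assumptions, and the function name is hypothetical rather than part of the project's interface.

```python
def sliding_windows(num_samples, sample_rate=16000, window_s=30.0, hop_s=30.0):
    """Yield (start, end) sample indices that cover a recording in
    fixed-length windows. A hop smaller than the window gives overlap;
    the final window is truncated at the end of the audio."""
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    start = 0
    while start < num_samples:
        yield start, min(start + window, num_samples)
        start += hop
```

Because only one window is held in memory at a time, memory use stays bounded regardless of how long the recording is.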
EmotiVoice is an emotional text-to-speech engine and bilingual speech synthesizer designed to generate synthetic audio in English and Chinese. It utilizes a deep learning architecture to produce high-fidelity speech with controllable emotional states and timbres. The project includes a voice cloning framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets. It employs a jointly-trained acoustic-vocoder pipeline and style-embedding-based synthesis to manage expression and reduce audio artifacts. The system covers a broad range of speech processing capabilities, including grapheme-to-phoneme conversion for bilingual text, voice model fine-tuning, and mel spectrogram visualization for quality monitoring. Users can generate audio through a web-based synthesis dashboard, a command line interface, or a self-hosted HTTP API. The environment can be deployed as a containerized service using Docker for consistent execution across different systems.
This is a specialized text-to-speech engine that supports voice cloning and emotional synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.