The visitor is looking for open-source machine learning models and frameworks capable of performing automatic speech recognition (ASR) and text-to-speech (TTS) synthesis.

nvidia/nemo is the closest match: NeMo is a comprehensive framework for building and deploying speech-based models that natively supports automatic speech recognition, text-to-speech synthesis, multi-language workflows, and real-time inference. Other strong matches: funaudiollm/sensevoice, nari-labs/dia, paddlepaddle/paddlespeech, jamiepine/voicebox.

Speech Synthesis and Recognition Models

Open-source libraries and pre-trained models for converting spoken audio to text and generating synthetic human speech.

NVIDIA/NeMo
17,394 stars
NeMo is a multimodal AI framework and toolkit designed for the development, training, and scaling of large language models, generative AI systems, and speech-based models. It functions as an automatic speech recognition toolkit, a text-to-speech engine, and a framework for building models that process and generate combinations of text, image, and audio data. The project serves as a conversational AI orchestrator capable of managing real-time, interruptible voice interactions. It provides specialized workflows for speech translation, converting spoken audio from one language into text or speech in another. The platform covers a broad range of AI model development capabilities, including the training of generative and speech models. Its operational surface includes automatic speech recognition, text-to-speech synthesis, and the creation of multimodal pipelines.
NeMo is a comprehensive framework for building and deploying speech-based models that natively supports automatic speech recognition, text-to-speech synthesis, multi-language workflows, and real-time inference.
Python, Automatic Speech Recognition, Text-to-Speech

FunAudioLLM/SenseVoice
7,536 stars
SenseVoice is a multilingual speech large language model designed for audio transcription, speaker diarization, and emotion recognition. It functions as an automatic speech recognition system that converts spoken audio into text across multiple languages. The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording. The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.
This toolkit provides a comprehensive suite for both automatic speech recognition and expressive text-to-speech, including advanced features like zero-shot voice cloning and multi-language support.
Python, Automatic Speech Recognition, Voice Cloning, Speech Synthesis

nari-labs/dia
19,324 stars
Dia is a generative AI audio tool and text-to-speech synthesis engine designed for the production-ready deployment of machine learning models. It provides a framework for creating lifelike synthetic speech by conditioning generation on reference audio samples to replicate specific vocal characteristics, emotional tones, and delivery styles. The system distinguishes itself through its ability to perform custom voice cloning and precise control over audio output. Users can adjust generation parameters such as temperature and guidance scale to modify the pacing, creativity, and style of the synthesized speech. Additionally, the platform supports the injection of nonverbal vocal expressions, such as laughter or gasps, through the use of specialized text markers. The framework integrates with standard machine learning ecosystems to facilitate the management and scaling of generative services. It supports modular model orchestration, ensuring that complex audio synthesis tasks remain consistent and performant within production environments.
Dia is a specialized text-to-speech and voice cloning engine that provides robust tools for generative audio synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Voice Cloning Engines

PaddlePaddle/PaddleSpeech
12,626 stars
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audio features directly to a target language without intermediate transcription. The system covers a broad range of speech processing tasks, including automatic speech recognition with punctuation restoration, speaker diarization, and audio sound classification. Its synthesis pipeline manages the generation of mel spectrograms and raw audio waveforms, while a streaming inference engine enables real-time processing with low latency.
This toolkit provides a comprehensive suite of neural models for both automatic speech recognition and text-to-speech synthesis, including support for real-time streaming inference, voice cloning, and multiple languages.
Python, Automatic Speech Recognition, Text-to-Speech, Text-to-Speech Engines

jamiepine/voicebox
30,041 stars
Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls. The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech. Capabilities cover audio processing pipelines for effects like pitch shifts and reverb, as well as real-time and file-based transcription with filler word removal. The system supports persona-based dialogue generation, batch synthesis with prompt caching, and global text dictation for inserting transcripts directly into the operating system clipboard. The processing engine can be hosted on local hardware or remote GPU servers.
Voicebox is a comprehensive speech processing toolkit that provides both automatic speech recognition and text-to-speech synthesis, including built-in support for voice cloning and real-time local inference.
TypeScript, Text-to-Speech, Voice Cloning, Voice Cloning Toolkits

fishaudio/fish-speech
24,928 stars
This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns. The platform distinguishes itself through a production-ready inference server that employs continuous batching to maximize hardware utilization and reduce latency. It includes a comprehensive voice cloning toolkit that replicates unique vocal characteristics from short reference audio samples without requiring additional model training. Users can further customize output through low-rank adaptation fine-tuning, which allows for efficient style adjustments, and speaker-specific token embeddings that manage distinct voice characteristics during multi-speaker generation. Beyond core synthesis, the project provides a full suite of utilities for training and alignment, including reinforcement learning techniques to optimize for semantic accuracy and instruction adherence. It supports a variety of operational interfaces, including a command-line tool, a web-based dashboard, and an authenticated HTTP server for remote generation workloads. The system also includes data preparation and serialization tools to streamline the process of organizing and normalizing audio datasets for model training.
This project is a specialized text-to-speech synthesis engine that provides advanced voice cloning and multilingual support, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Speech Synthesis Engines

myshell-ai/OpenVoice
36,720 stars
OpenVoice is a multilingual text-to-speech framework and voice cloning AI model designed for high-fidelity voice replication and low-latency audio generation. It functions as an instant speech synthesis engine that converts text to audio while replicating a specific speaker's tone and color. The system is distinguished by its ability to perform cross-lingual cloning, allowing the vocal characteristics of a reference speaker to be applied to speech in different languages regardless of the original training data. It utilizes a decoupled representation to separate the physical identity of a voice from its emotional and rhythmic delivery. This tool provides granular speech control over audio generation, enabling adjustments to parameters such as emotion, accent, rhythm, and intonation. These capabilities allow for the creation of digital replicas using short audio samples to synthesize expressive speech.
This is a specialized text-to-speech and voice cloning framework that provides high-fidelity synthesis and cross-lingual capabilities, though it lacks the automatic speech recognition functionality requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Speech Synthesis Engines

Blaizzy/mlx-audio
5,994 stars
mlx-audio is an audio processing toolkit built on Apple MLX that provides speech transcription, text-to-speech synthesis, voice cloning, and audio source separation using local models. It offers an OpenAI-compatible REST API and web interface for running audio generation and transcription tasks, enabling drop-in integration with existing tools that follow that endpoint structure. The toolkit supports text-prompted audio source separation, allowing specific sounds to be isolated from mixed recordings based on natural language descriptions. It also provides voice cloning from a short reference audio sample, speech enhancement through noise reduction, and voice activity detection with speaker diarization to distinguish between different speakers in recordings. Additional capabilities include speech-to-text transcription with word-level timestamp alignment, streaming audio generation that outputs results incrementally, and model weight quantization to reduce memory footprint and accelerate inference. The system manages multiple models through a unified interface and supports WebSocket audio transport for low-latency communication.
This toolkit provides a comprehensive suite for speech processing, including automatic speech recognition, text-to-speech synthesis, and voice cloning, all optimized for local execution on Apple silicon.
Python, Text-to-Speech, Voice Cloning

espnet/espnet
9,861 stars
ESPnet is a comprehensive speech processing toolkit and PyTorch-based trainer designed for building end-to-end speech recognition, synthesis, and translation models. It provides a structured framework for developing automatic speech recognition systems using transducer and encoder-decoder architectures, alongside engines for text-to-speech synthesis and speech translation pipelines. The project distinguishes itself through a recipe-based workflow execution system that ensures experimental reproducibility by running standardized sequences of scripts for data preparation and model training. It leverages containerized environments to provide consistent execution across platforms and supports large-scale distributed training across multiple GPUs and nodes. The toolkit covers a broad range of capabilities, including spoken language understanding for intent and sentiment classification, audio enhancement and separation, and singing voice synthesis. It also incorporates advanced training techniques such as self-supervised learning, parameter-efficient fine-tuning, and transfer learning. Model development is supported by utilities for audio data formatting, spectral augmentation, and the integration of pretrained encoders, while inference is optimized through blockwise beam search for real-time streaming execution.
ESPnet is a comprehensive, research-grade toolkit that natively supports both automatic speech recognition and text-to-speech synthesis, offering a wide array of pre-trained models, multi-language capabilities, and support for real-time inference and voice conversion.
Python, Automatic Speech Recognition, Text-to-Speech, Speech Recognition Models

neonbjb/tortoise-tts
14,864 stars
Tortoise-tts is a neural text-to-speech engine and voice cloning toolkit designed for high-quality audio generation. It functions as a zero-shot synthesis system, meaning it can generate speech for unseen speakers without requiring additional training or fine-tuning for each new voice. The system specializes in replicating human vocal characteristics using small sets of reference audio clips. It allows for the extraction of voice latents to mimic specific speakers, the generation of random synthetic identities, and the blending of multiple voice profiles to create hybrid vocal identities. The project covers a broad range of synthesis capabilities, including long-form audio processing via sentence-level text chunking and multi-voice synthesis. It provides tools for emotional speech control through instructional embeddings and supports non-English text processing via specialized tokenizers. Additional utilities include synthetic speech detection and inference acceleration.
This is a specialized text-to-speech and voice cloning engine that provides high-quality synthesis and zero-shot capabilities, though it lacks the automatic speech recognition functionality requested by the visitor.
Jupyter Notebook, Text-to-Speech, Voice Cloning, Voice Cloning Toolkits
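The sentence-level text chunking mentioned in the Tortoise-tts description can be illustrated with a minimal Python sketch. This is not the project's own code: the function name `chunk_sentences`, the `max_chars` limit, and the punctuation-based splitting rule are all assumptions made for illustration.

```python
import re


def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: splitting on ., ! or ? followed by whitespace is a
    simplification and will mis-split abbreviations such as "e.g. this".
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        # Start a new chunk once adding the sentence would exceed the limit.
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to a synthesis call separately and the resulting audio concatenated. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.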

k2-fsa/sherpa-onnx
13,017 stars
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a WebSocket server for hosting remote streaming and batch transcription services. The toolkit covers a broad surface of audio capabilities, including multilingual speech recognition and translation, sound event classification, wake word detection, and voice activity detection. It also includes text processing utilities for automatic punctuation and subtitle generation, as well as audio signal processing for noise removal and source separation. Native interfaces are available for Java, Kotlin, Swift, and Object Pascal, with support for WebAssembly to enable browser-based recognition.
Sherpa-ONNX is a comprehensive speech processing toolkit that natively supports both automatic speech recognition and text-to-speech synthesis, including advanced features like zero-shot voice cloning and multi-language support for local, real-time inference.
C++, Text-to-Speech, Voice Cloning, Speech Recognition Systems

SparkAudio/Spark-TTS
10,930 stars
Spark-TTS is a deep learning text-to-speech synthesis engine designed to convert written text into high-fidelity audio. It utilizes a transformer-based architecture and autoregressive sequence modeling to generate coherent speech, transforming linguistic input into natural-sounding waveforms through neural speech codec synthesis. The platform distinguishes itself through zero-shot voice cloning, which allows users to mimic a target speaker’s unique vocal identity using only a short reference audio sample without requiring additional model training. It also features cross-lingual phonetic mapping, enabling the synthesis of multilingual speech while maintaining consistent speaker characteristics across different languages. The system provides extensive control over vocal output, allowing for the adjustment of pitch, speed, and other prosodic attributes during the generation process. By manipulating latent space representations, users can refine speech parameters to achieve specific vocal characteristics for various applications. The project is available as a Python-based framework for audio generation.
This is a specialized text-to-speech synthesis engine that supports zero-shot voice cloning and multilingual output, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Cross-Lingual Speech Generators, Text-to-Speech

Kedreamix/Linly-Dubbing
3,048 stars
Linly-Dubbing is an automated video dubbing pipeline designed for multilingual video localization. It converts spoken content in videos into another language by coordinating speech-to-text transcription, text translation, and text-to-speech synthesis. The system distinguishes itself through AI-driven lip synchronization and animation, which aligns facial expressions and mouth movements to the synthesized voiceover. It also utilizes audio source separation to isolate vocals from background music and noise, allowing for clean voice replacement while preserving original background audio. The broader capability surface includes tools for web video downloading, timestamped speech transcription, and voice cloning. A graphical configuration interface is provided to manage the processing pipeline, select audio files, and adjust numeric parameters.
This is an automated video dubbing pipeline that integrates speech-to-text, text-to-speech, and voice cloning, making it a comprehensive tool for speech processing tasks despite its primary focus on video localization.
Jupyter Notebook, Automatic Speech Recognition, Text-to-Speech, Speech Synthesis
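The three-stage coordination described for Linly-Dubbing (transcription, then translation, then synthesis) can be sketched as below. This is an illustrative outline only, not the project's API: `Segment`, `dub_segments`, and the three callables are hypothetical names, and real stages would wrap actual speech and translation models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """A timestamped span of speech (hypothetical structure)."""
    start: float
    end: float
    text: str


def dub_segments(
    audio: bytes,
    transcribe: Callable[[bytes], List[Segment]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> List[Tuple[Segment, bytes]]:
    """Run speech-to-text, translation, and text-to-speech in sequence.

    Timestamps from transcription are carried through so the synthesized
    audio can later be aligned with the original video.
    """
    dubbed: List[Tuple[Segment, bytes]] = []
    for segment in transcribe(audio):
        translated = Segment(segment.start, segment.end, translate(segment.text))
        dubbed.append((translated, synthesize(translated.text)))
    return dubbed
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so each stage can be swapped or stubbed out in tests.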

abus-aikorea/voice-pro
6,255 stars
Voice Pro is a comprehensive speech and audio processing toolkit that combines text-to-speech synthesis, voice cloning, speech recognition, and translation capabilities into a single application. At its core, the project enables users to generate natural-sounding speech from text, clone voices from short audio samples without requiring prior training data, and perform real-time speech translation across over 100 languages. The platform distinguishes itself through its integrated multimedia workflow, allowing users to download YouTube videos, extract audio, separate voice tracks, generate word-timed subtitles, and produce dubbed content in over 100 languages through a unified pipeline. It supports multiple speech synthesis engines including Edge-TTS, F5-TTS, E2-TTS, CosyVoice, and kokoro, while also providing the ability to train custom TTS models on user-provided datasets and export trained models to ONNX format for deployment. Beyond core speech generation, the application offers extensive audio processing features such as transcribing speech to text with word-level subtitle generation, translating subtitle files while preserving formatting, and performing real-time speech recognition and translation with customizable audio inputs. The system also includes capabilities for extracting audio from video, removing noise, and managing the application's installation and dependencies through built-in cleanup utilities.
This toolkit provides a comprehensive suite for both automatic speech recognition and text-to-speech synthesis, featuring support for zero-shot voice cloning, multilingual translation, and pre-trained model integration within a unified pipeline.
Python, Text-to-Speech

neuphonic/neutts
6,007 stars
Neutts is a neural text-to-speech engine designed for real-time streaming output on edge devices such as phones and laptops. It supports voice cloning from short audio references, enabling zero-shot reproduction of a target speaker's voice, and can be fine-tuned or retrained from scratch for custom voices and styles. The system distinguishes itself through a decoder-only architecture that halves memory and accelerates generation on constrained hardware, combined with quantized model inference for reduced memory footprint. Its streaming decoder loop interleaves synthesis with playback, delivering minimal latency. Additionally, each generated utterance can embed an inaudible or perceptible audio watermark to verify synthetic origin and traceability. Beyond core synthesis, neutts offers capabilities such as pre-encoding reference audio to skip encoding on repeated runs, and full model customization through fine-tuning on paired text-audio data. The project provides tools for adapting the model to edge deployment and supporting on-device real-time speech generation.
This is a specialized neural text-to-speech engine that excels at real-time synthesis and voice cloning on edge devices, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Voice Cloning Engines

kyutai-labs/pocket-tts
3,301 stars
Pocket-tts is a text-to-speech server and neural speech synthesizer that converts written text into audible speech. It includes a CPU-optimized inference engine and a voice cloning tool capable of analyzing audio samples to reproduce specific speaker characteristics. The system differentiates itself through the use of dynamic int8 quantization to reduce memory usage and increase generation speed on processors. It supports real-time speech synthesis by streaming audio chunks incrementally and utilizes voice state caching to store processed embeddings as portable files, bypassing redundant processing during speaker cloning. The project covers a broad range of capabilities, including local model hosting and self-hosted API services for remote audio generation. It provides utilities for model initialization across multiple languages and a native backend to handle computationally intensive synthesis operations.
This is a dedicated text-to-speech engine that supports voice cloning and real-time synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Text-to-Speech Engines
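To give a rough idea of what int8 quantization involves, the sketch below maps floating-point values to 8-bit integers using a scale derived at run time from the values themselves. It is a pure-Python illustration under stated assumptions, not Pocket-tts's implementation: the function names and the symmetric per-tensor scheme are chosen for this example only.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization with a scale computed from the input.

    Returns the quantized integers (clamped to [-127, 127]) and the scale
    needed to map them back to floats.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero when every value is zero (or the list is empty).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]
```

Storing one byte per value instead of four is where the memory saving comes from. The price is a rounding error of at most half the scale per value.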

RVC-Boss/GPT-SoVITS
58,724 stars
GPT-SoVITS is a text-to-speech synthesis engine and voice cloning toolkit designed for generating natural-sounding human speech. It functions as a neural audio processing pipeline that maps input text to high-fidelity audio waveforms, utilizing conditional variational autoencoders and flow-based decoders to ensure expressive output. The platform distinguishes itself through its ability to perform few-shot voice cloning and cross-lingual speech generation, allowing users to maintain a specific speaker's vocal identity and emotional delivery across multiple languages. By employing cross-modal latent alignment, the system effectively bridges text-based linguistic features with speaker-specific embeddings, while a generative adversarial network-based vocoder ensures the final audio maintains high time-domain quality. The software provides a modular pipeline that supports the entire lifecycle of custom voice model development, including data preprocessing, fine-tuning on small datasets, and inference. It incorporates self-supervised speech representation models to extract discrete linguistic units, facilitating robust voice conversion and automated audio content creation. The project includes documentation for model training, inference procedures, and command-line execution.
This is a specialized text-to-speech and voice cloning engine that provides robust support for multi-language synthesis and fine-tuning, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Cross-Lingual Speech Generators, Text-to-Speech Engines

QwenLM/Qwen3-TTS
11,976 stars
Qwen3-TTS is a large language model text-to-speech engine designed to convert written text into natural-sounding human speech. It functions as an audio tokenizer and a generative system for speech synthesis. The project features a promptable voice designer for creating synthetic vocal personas based on natural language descriptions. It also includes a zero-shot voice cloning tool that mimics a target speaker using a short reference audio clip and a transcript. The system provides a framework for speech model fine-tuning to improve speaker likeness and quality through supervised training. Additional capabilities include custom voice synthesis across different languages and a web interface launcher for interacting with the models.
This repository provides a specialized generative engine for text-to-speech synthesis and voice cloning, though it lacks the automatic speech recognition capabilities required for a comprehensive speech processing toolkit.
Python, Text-to-Speech, Voice Cloning

openai/whisper
102,828 stars
This project is a speech recognition and translation engine that utilizes a sequence-to-sequence transformer architecture to convert audio into text. It is built upon a weakly supervised learning framework, which leverages large-scale, weakly labeled audio-transcript pairs to create generalized speech representations capable of performing simultaneous transcription, language identification, and translation. The system distinguishes itself through a unified multi-task modeling approach that shares token sequences across different objectives, allowing it to handle diverse languages and vocabularies without language-specific rules. By employing byte-level tokenization and sliding window audio segmentation, the engine maintains memory efficiency and temporal consistency when processing long-form audio or varied acoustic environments. The toolkit provides both command-line and programmatic interfaces, enabling developers to integrate speech-to-text capabilities directly into custom software applications or automate high-volume batch processing of media libraries. It includes utilities for accessing multilingual and English-only speech corpora to support model validation and domain-specific performance tuning.
This is a robust automatic speech recognition and translation engine that provides pre-trained models and multi-language support, though it focuses exclusively on speech-to-text rather than text-to-speech synthesis.
Python, Automatic Speech Recognition, Speech Recognition Libraries, Speech Recognition Systems
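The sliding window audio segmentation mentioned in the Whisper description can be pictured with the following sketch, which cuts a long sample sequence into fixed-length windows. It is a generic illustration, not Whisper's internal code: the function name `sliding_windows` and its `window` and `hop` parameters are assumptions.

```python
from typing import Iterator, Sequence


def sliding_windows(
    samples: Sequence[float], window: int, hop: int
) -> Iterator[Sequence[float]]:
    """Yield consecutive windows of `window` samples, advancing by `hop`.

    The final window may be shorter than `window` if the audio does not
    divide evenly; a caller would typically pad it before inference.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    start = 0
    while start < len(samples):
        yield samples[start:start + window]
        # Stop once the current window already reaches the end of the audio.
        if start + window >= len(samples):
            break
        start += hop
```

For 16 kHz audio, `sliding_windows(samples, window=30 * 16000, hop=30 * 16000)` would produce back-to-back 30-second windows; choosing a `hop` smaller than `window` makes neighbouring windows overlap. Processing one window at a time keeps memory use bounded regardless of recording length.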

netease-youdao/EmotiVoice
8,446 stars
EmotiVoice is an emotional text-to-speech engine and bilingual speech synthesizer designed to generate synthetic audio in English and Chinese. It utilizes a deep learning architecture to produce high-fidelity speech with controllable emotional states and timbres. The project includes a voice cloning framework for replicating specific speaker identities by training custom acoustic models on personal audio datasets. It employs a jointly-trained acoustic-vocoder pipeline and style-embedding-based synthesis to manage expression and reduce audio artifacts. The system covers a broad range of speech processing capabilities, including grapheme-to-phoneme conversion for bilingual text, voice model fine-tuning, and mel spectrogram visualization for quality monitoring. Users can generate audio through a web-based synthesis dashboard, a command line interface, or a self-hosted HTTP API. The environment can be deployed as a containerized service using Docker for consistent execution across different systems.
This is a specialized text-to-speech engine that supports voice cloning and emotional synthesis, though it lacks the automatic speech recognition capabilities requested by the visitor.
Python, Text-to-Speech, Voice Cloning, Speech Synthesis