SenseVoice | Awesome Repository

SenseVoice is a multilingual speech large language model designed for audio transcription, speaker diarization, and emotion recognition. It functions as an automatic speech recognition system that converts spoken audio into text across multiple languages.

The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording.
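
As an illustration of how event and emotion labels might be consumed downstream, here is a minimal sketch. It assumes the recognizer returns a string whose leading labels are wrapped in `<|...|>` markers ahead of the transcribed text. That tag syntax, the label names, and the `split_tags` helper are assumptions made for this example and are not confirmed by this page.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed tag syntax: labels wrapped as <|name|> before the text.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


@dataclass
class TaggedTranscript:
    tags: List[str] = field(default_factory=list)
    text: str = ""


def split_tags(raw: str) -> TaggedTranscript:
    """Separate leading <|...|> labels from the transcribed text (hypothetical helper)."""
    tags = []
    pos = 0
    while True:
        # Only consume tags that start exactly at the current position.
        match = TAG_PATTERN.match(raw, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return TaggedTranscript(tags=tags, text=raw[pos:].strip())


example = split_tags("<|en|><|HAPPY|><|Laughter|>That was a great show.")
print(example.tags)
print(example.text)
```

Keeping the labels as an ordered list, rather than assigning them fixed meanings, avoids hard-coding a tag order that the actual model output may not follow.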

The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.

Features

Multilingual Transcription - Converts spoken audio into text across multiple languages using low-latency processing and automatic language detection.
Automatic Speech Recognition - Serves as a multilingual automatic speech recognition system that converts spoken audio to text.
Speech Synthesis - Generates natural-sounding human speech across multiple languages with controllable pitch, rate, and emotional tone.
Zero-Shot Voice Cloning - Synthesizes target voices from short reference clips without retraining.