SenseVoice is a multilingual speech model designed for audio transcription, speaker diarization, and emotion recognition. At its core, it is an automatic speech recognition (ASR) system that converts spoken audio into text across multiple languages.
Beyond transcription, the system integrates acoustic event detection, which identifies non-speech sounds such as laughter or applause, and speech emotion recognition, which classifies speech into discrete emotional states. It also includes a framework for speaker diarization, which tracks and labels the different speakers within a single recording.
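To make the idea of combined output concrete, the sketch below shows one way a transcript annotated with emotion and event labels could be separated into plain text and label lists. Everything in it is an assumption made for illustration: the `<|LABEL|>` tag syntax, the label sets, and the names `Segment` and `parse_tagged_transcript` are hypothetical and are not taken from the project's documented output format.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical tag syntax: labels embedded in the text as <|LABEL|>.
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")

# Illustrative label sets only; not an official list of supported labels.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENTS = {"Laughter", "Applause"}


@dataclass
class Segment:
    """Plain transcript text plus the labels found alongside it."""
    text: str
    emotions: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)


def parse_tagged_transcript(raw: str) -> Segment:
    """Split a tagged transcript into text, emotion labels, and event labels."""
    tags = TAG_PATTERN.findall(raw)
    # Remove the tags, then collapse any leftover runs of whitespace.
    text = " ".join(TAG_PATTERN.sub(" ", raw).split())
    return Segment(
        text=text,
        emotions=[t for t in tags if t in EMOTIONS],
        events=[t for t in tags if t in EVENTS],
    )


if __name__ == "__main__":
    segment = parse_tagged_transcript(
        "<|HAPPY|><|Laughter|> that was great <|Applause|>"
    )
    print(segment)
```

Keeping emotions and events in separate lists reflects the distinction drawn above: an emotion describes how something was said, while an acoustic event marks a sound that is not speech at all.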
The project's capabilities extend to speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. It also provides tools for fine-tuning speech models to improve performance in specific domains or on rare languages.