# funaudiollm/sensevoice

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/funaudiollm-sensevoice).**

7,536 stars · 702 forks · Python · other

## Links

- GitHub: https://github.com/FunAudioLLM/SenseVoice
- Homepage: https://funaudiollm.github.io/
- awesome-repositories: https://awesome-repositories.com/repository/funaudiollm-sensevoice.md

## Topics

`ai` `aigc` `asr` `audio-event-classification` `cross-lingual` `gpt-4o` `llm` `multilingual` `python` `pytorch` `speech-emotion-recognition` `speech-recognition` `speech-to-text`

## Description

SenseVoice is a multilingual speech model for audio transcription, speaker diarization, and emotion recognition. At its core it is an automatic speech recognition (ASR) system that converts spoken audio into text across multiple languages.

The system distinguishes itself by integrating acoustic event detection and speech emotion recognition, allowing it to identify non-speech sounds, such as laughter or applause, and discrete emotional states. It also includes a framework for speaker diarization to track and label different speakers within a single recording.
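To make the idea of emotion and event labels travelling alongside the transcript more concrete, here is a minimal sketch of how a consumer might separate such labels from the text. It assumes the labels arrive as inline `<|label|>` tags ahead of the transcript; that tag format, the label vocabularies, and the names `parse_tagged` and `TaggedTranscript` are all assumptions made for illustration and are not taken from this page.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label vocabularies, chosen only for this illustration.
LANGUAGES = {"zh", "en", "yue", "ja", "ko"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENTS = {"Speech", "Laughter", "Applause", "BGM"}

# Assumed tag syntax: a label wrapped in <| and |>.
TAG = re.compile(r"<\|([^|]*)\|>")


@dataclass
class TaggedTranscript:
    text: str
    language: Optional[str] = None
    emotion: Optional[str] = None
    events: List[str] = field(default_factory=list)


def parse_tagged(raw: str) -> TaggedTranscript:
    """Split an assumed '<|label|>...text' string into labels and plain text."""
    # Strip every tag to recover the plain transcript.
    result = TaggedTranscript(text=TAG.sub("", raw).strip())
    for label in TAG.findall(raw):
        if label in LANGUAGES and result.language is None:
            result.language = label
        elif label in EMOTIONS and result.emotion is None:
            result.emotion = label
        elif label in EVENTS:
            result.events.append(label)
        # Labels outside the assumed vocabularies are ignored.
    return result


if __name__ == "__main__":
    sample = "<|en|><|HAPPY|><|Laughter|>that was great"
    print(parse_tagged(sample))
```

Keeping the first language and emotion label but collecting every event label is a design choice of this sketch: a clip plausibly has one dominant language and mood while containing several acoustic events.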

Per its homepage, the wider FunAudioLLM project also covers speech synthesis, including expressive text-to-speech, zero-shot speaker identity cloning, and voice interpolation. SenseVoice further provides tools for speech model fine-tuning to optimize performance for specific domains or rare languages.

## Tags

### Artificial Intelligence & ML

- [Multilingual Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/multilingual-transcription.md) — Converts spoken audio into text across multiple languages using low-latency processing and automatic language detection. ([source](https://cdn.jsdelivr.net/gh/funaudiollm/sensevoice@main/README.md))
- [Automatic Speech Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/automatic-speech-recognition.md) — Functions as a multilingual automatic speech recognition system converting spoken audio to text.
- [Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis.md) — Generates natural-sounding human speech across multiple languages with controllable pitch, rate, and emotional tone. ([source](https://funaudiollm.github.io/))
- [Zero-Shot Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/zero-shot-voice-cloning.md) — Features zero-shot voice cloning to synthesize target voices from short reference clips without retraining.
- [Spoken Language Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/language-detection-tools/spoken-language-detection.md) — Detects spoken languages in audio samples to route data to the appropriate transcription pipelines. ([source](https://cdn.jsdelivr.net/gh/funaudiollm/sensevoice@main/README.md))
- [Multilingual Speech-to-Text](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-datasets/english/speech-to-text-translation/multilingual-speech-to-text.md) — Provides a large language model for low-latency transcription across multiple languages.
- [Speaker Diarization](https://awesome-repositories.com/f/artificial-intelligence-ml/speaker-diarization.md) — Labels different speakers in a recording using voice activity detection and speaker modeling. ([source](https://cdn.jsdelivr.net/gh/funaudiollm/sensevoice@main/README.md))
- [End-to-End Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/end-to-end-pipelines.md) — Implements an end-to-end pipeline mapping raw audio waveforms directly to text tokens.
- [Inference Latency Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-latency-optimizers.md) — Uses a streaming architecture for latency-optimized inference to decode speech tokens in real time.
- [Speech Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/speech-model-fine-tuning.md) — Offers frameworks for adapting speech models to specific business scenarios or rare language samples. ([source](https://cdn.jsdelivr.net/gh/funaudiollm/sensevoice@main/README.md))
- [Speech Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/training-frameworks/model-training-frameworks/speech-model-training.md) — Provides tools for fine-tuning speech models for specific business domains or rare languages.
- [Acoustic-Prosodic Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-datasets/english/speech-to-text-translation/multilingual-speech-to-text/acoustic-feature-processing/acoustic-prosodic-embeddings.md) — Implements acoustic-prosodic embeddings to encode emotional states and non-speech events as tokens.
- [Shared Acoustic Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-representation-learners/shared-acoustic-encoders.md) — Uses a multilingual shared encoder to extract acoustic features across diverse languages and events.
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Replicates specific human vocal characteristics from audio samples using zero-shot generation or fine-tuning. ([source](https://funaudiollm.github.io/))

### Graphics & Multimedia

- [Audio Emotion Classifiers](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-processing/audio-emotion-classifiers.md) — Analyzes acoustic signals to identify discrete emotional states like happiness or anger. ([source](https://funaudiollm.github.io/))
- [Audio Event Detection](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-processing/audio-event-detection.md) — Identifies non-speech acoustic events such as laughter or applause to categorize audio environments. ([source](https://cdn.jsdelivr.net/gh/funaudiollm/sensevoice@main/README.md))

### Part of an Awesome List

- [Audio and Speech Models](https://awesome-repositories.com/f/awesome-lists/media/audio-and-speech-models.md) — Multitask foundation model for speech, emotion, and audio events.