# snakers4/silero-vad

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/snakers4-silero-vad).**

8,209 stars · 730 forks · Python · MIT

## Links

- GitHub: https://github.com/snakers4/silero-vad
- awesome-repositories: https://awesome-repositories.com/repository/snakers4-silero-vad.md

## Topics

`onnx` `onnx-runtime` `onnxruntime` `pytorch` `speech` `speech-processing` `vad` `voice-activity-detection` `voice-commands` `voice-control` `voice-detection` `voice-recognition`

## Description

Silero VAD is a pre-trained voice activity detection (VAD) model: a deep learning classifier that distinguishes human speech from silence across diverse languages and noisy environments. It identifies speech segments in both static audio recordings and real-time audio streams.

The project includes a language identification tool for classifying spoken languages and a framework for fine-tuning audio models. It provides utilities for optimizing detection thresholds on validation datasets and for retraining the model on custom labeled audio to improve accuracy.

Its audio analysis capabilities include speech probability estimation, identification of speech start and end timestamps, and audio segment extraction. It can also preprocess recordings automatically by isolating and merging speech chunks to remove silence.
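
To illustrate how per-window speech probabilities can become speech boundaries, here is a minimal sketch in plain Python. It is not the repository's implementation: the function name `find_speech_runs`, the single fixed threshold of 0.5, and the sample probabilities are all assumptions made for illustration, and a real detector would likely add smoothing and minimum-duration rules.

```python
from typing import List, Tuple


def find_speech_runs(probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive windows whose probability is >= threshold.

    Returns (start, end) window-index pairs; `end` is exclusive.
    Hypothetical helper, not part of the silero-vad package.
    """
    runs = []
    start = None  # index where the current run began, or None while in silence
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # silence -> speech transition
        elif p < threshold and start is not None:
            runs.append((start, i))  # speech -> silence transition
            start = None
    if start is not None:
        runs.append((start, len(probs)))  # audio ended while still in speech
    return runs


# Made-up probabilities for eight consecutive windows.
probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1, 0.7, 0.95]
print(find_speech_runs(probs))  # [(2, 4), (6, 8)]
```

Keeping the result in window indices rather than seconds leaves the choice of window size and sample rate to the caller.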

## Tags

### Artificial Intelligence & ML

- [Pre-trained Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/pre-trained-speech-models.md) — Ships a pre-trained deep learning model designed to classify audio frames as speech or silence.
- [Voice Activity Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-activity-detection.md) — Implements high-performance voice activity detection to identify speech boundaries in real-time and static audio streams. ([source](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models))
- [Speech Boundary Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/transcription-buffering/audio-segmenting/speech-boundary-detection.md) — Provides the ability to locate exact start and end timestamps of spoken segments within audio recordings.
- [Deep Learning Classifiers](https://awesome-repositories.com/f/artificial-intelligence-ml/deep-learning-classifiers.md) — Uses a deep learning classifier to distinguish between human speech and silence across diverse environments.
- [Spoken Language Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/language-detection-tools/spoken-language-detection.md) — Identifies the specific language being spoken within an audio stream using pattern recognition. ([source](https://github.com/snakers4/silero-vad/blob/master/CITATION.cff))
- [Detection Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-trainers/voice-model-trainers/detection-model-fine-tuning.md) — Improves speech detection accuracy by retraining models using specific audio datasets and labeled timestamps.
- [Speech Model Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/fine-tuning-and-alignment/fine-tuning-frameworks/speech-model-fine-tuning.md) — Improves speech detection quality by retraining models using custom audio paths and time-stamped labels. ([source](https://github.com/snakers4/silero-vad/tree/master/tuning))
- [Speech Detection Fine-Tuning Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/fine-tuning-and-customization/model-fine-tuning/speech-detection-fine-tuning-frameworks.md) — Provides a framework for retraining voice detection models using custom labeled datasets and optimized thresholds.
- [Real-Time Speech Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/real-time-speech-processing.md) — Implements a real-time processing pipeline for detecting speech activity within live audio streams.
- [Speech Activity Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-activity-detection.md) — Identifies active voice segments within continuous audio streams using configurable confidence thresholds. ([source](https://github.com/snakers4/silero-vad/tree/master/examples/cpp))
- [Speech Segment Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/transcription-buffering/audio-segmenting/speech-segment-extraction.md) — Extracts speech timestamps and isolates voice segments from raw audio files.
- [Speech-Based Silence Removal](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preprocessing-tools/audio-dataset-preprocessing/speech-based-silence-removal.md) — Isolates and merges speech segments from a recording to remove silence before transcription.
- [Audio Detection Thresholds](https://awesome-repositories.com/f/artificial-intelligence-ml/decision-trees/verification-threshold-optimizers/audio-detection-thresholds.md) — Provides utilities to calculate optimal speech trigger levels using labeled validation datasets.
- [Prediction Thresholds](https://awesome-repositories.com/f/artificial-intelligence-ml/face-detection/confidence-filtering/prediction-thresholds.md) — Calculates confidence scores for audio segments to determine if they exceed defined speech triggers.
- [Decision Threshold Calibration](https://awesome-repositories.com/f/artificial-intelligence-ml/face-detection/confidence-filtering/prediction-thresholds/decision-threshold-calibration.md) — Calculates ideal input and output probability thresholds based on validation datasets to maximize detection accuracy. ([source](https://github.com/snakers4/silero-vad/tree/master/tuning))
- [Speech Probability Scoring](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/speech-translation-systems/simultaneous-speech-to-speech-translation/speech-probability-scoring.md) — Calculates a score between zero and one for each audio chunk to estimate the likelihood of human speech. ([source](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics))
- [Speech Segment Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-activity-detection/speech-segment-extraction.md) — Isolates and merges detected speech chunks from an audio recording into a single continuous file. ([source](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies))
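
The threshold-related entries above describe choosing a speech trigger level from a labeled validation set. One simple way to do that, sketched below under stated assumptions, is to sweep candidate thresholds and keep the one with the best F1 score. The names `f1_at` and `best_threshold`, the candidate grid, and the toy data are hypothetical; the repository's own tuning scripts may use a different metric or separate enter and exit thresholds.

```python
from typing import Sequence, Tuple


def f1_at(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 score when windows with prob >= threshold are predicted as speech."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted_speech = p >= threshold
        if predicted_speech and y == 1:
            tp += 1
        elif predicted_speech and y == 0:
            fp += 1
        elif not predicted_speech and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    probs: Sequence[float], labels: Sequence[int], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) for the best candidate; ties keep the earliest one."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        score = f1_at(probs, labels, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation set: 1 = speech window, 0 = non-speech window.
val_probs = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
val_labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(val_probs, val_labels, [0.1, 0.3, 0.5, 0.7, 0.9]))  # (0.5, 1.0)
```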

### Graphics & Multimedia

- [Lossless Audio Streaming](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-streaming-engines/lossless-audio-streaming.md) — Analyzes audio input incrementally to detect the presence of speech as it occurs in real-time. ([source](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models))

### Part of an Awesome List

- [Audio and Speech Models](https://awesome-repositories.com/f/awesome-lists/media/audio-and-speech-models.md) — Locates portions of audio containing speech and handles resampling to ensure consistent input quality. ([source](https://github.com/snakers4/silero-vad/blob/master/README.md))
- [Speech Boundary Timestamps](https://awesome-repositories.com/f/awesome-lists/media/audio-and-speech-models/speech-boundary-timestamps.md) — Locates the start and end times of speech segments within an audio file. ([source](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies))
- [Speech Probability Scoring](https://awesome-repositories.com/f/awesome-lists/media/audio-and-speech-models/speech-probability-scoring.md) — Calculates a numerical score for each audio window to estimate the likelihood of human speech. ([source](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies))

### Data & Databases

- [Audio Segment Offsets](https://awesome-repositories.com/f/data-databases/pointer-based-navigation/offset-based-addressing/timestamp-based-offset-lookups/audio-segment-offsets.md) — Maps model output indices to temporal offsets to isolate specific voice segments from recordings.
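
Mapping model output indices to temporal offsets is plain arithmetic once the window size and sample rate are fixed. The sketch below assumes, for illustration only, 512-sample windows at 16 kHz; `window_to_seconds` and `extract_speech` are hypothetical helper names, not functions from the repository.

```python
from typing import List, Sequence, Tuple


def window_to_seconds(index: int, window_samples: int = 512, sample_rate: int = 16000) -> float:
    """Time offset, in seconds, of the first sample of window `index`."""
    return index * window_samples / sample_rate


def extract_speech(
    samples: Sequence[float], runs: Sequence[Tuple[int, int]], window_samples: int = 512
) -> List[float]:
    """Concatenate the samples covered by (start, end) window runs; `end` is exclusive."""
    out: List[float] = []
    for start, end in runs:
        out.extend(samples[start * window_samples : end * window_samples])
    return out


# A run covering windows 2 and 3 spans 0.064 s to 0.128 s under the assumed settings.
print(window_to_seconds(2), window_to_seconds(4))  # 0.064 0.128

# Tiny demo with 2-sample windows so the slicing is easy to follow.
print(extract_speech(list(range(10)), [(1, 2), (4, 5)], window_samples=2))  # [2, 3, 8, 9]
```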
