FunClip is an open-source tool that transcribes speech in video files and clips segments selected by transcript text, by speaker, or by AI analysis. It combines speech recognition with speaker diarization, audio event detection, and visual content understanding to identify and extract relevant portions of a video.
Four integrated capabilities distinguish the tool. Hotword-weighted speech recognition improves transcription accuracy for specific terms, such as names or jargon, by boosting their probability during decoding. A large language model can interpret the transcript and select video segments automatically in response to natural-language prompts. Speaker diarization separates audio segments and labels them by speaker identity, which enables clipping by a chosen speaker. Finally, a visual-content understanding model analyzes video frames to select clips when the transcript alone is insufficient.
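To make text-based and speaker-based clipping concrete, here is a minimal Python sketch of the selection step. It is not FunClip's actual code: the `Sentence` record, the `select_spans` function, and the `max_gap` parameter are hypothetical names chosen for illustration, and the sketch assumes the recognizer has already produced sentences with timestamps and speaker labels, sorted by start time.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One recognized sentence (hypothetical record, not FunClip's data model)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    speaker: str   # diarization label, e.g. "spk0"
    text: str      # transcribed text


def select_spans(sentences, *, text=None, speaker=None, max_gap=0.5):
    """Return merged (start, end) spans of sentences matching the filters.

    `sentences` is assumed to be sorted by start time. A sentence matches
    when it contains `text` (case-insensitive) and was spoken by `speaker`;
    a filter left as None is ignored. Matching sentences separated by at
    most `max_gap` seconds are merged into one span.
    """
    spans = []
    for s in sentences:
        if text is not None and text.lower() not in s.text.lower():
            continue
        if speaker is not None and s.speaker != speaker:
            continue
        if spans and s.start - spans[-1][1] <= max_gap:
            # Extend the previous span instead of starting a new clip.
            spans[-1] = (spans[-1][0], max(spans[-1][1], s.end))
        else:
            spans.append((s.start, s.end))
    return spans
```

The resulting spans would then be handed to whatever tool cuts the video. Merging nearby matches avoids producing many tiny clips, at the cost of occasionally keeping a short pause between two matching sentences.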
FunClip also generates SRT subtitle files for both the full video and each clipped segment. It offers two interfaces: a command-line interface for headless, scriptable execution of the entire recognition and clipping pipeline, and a web service, reachable locally or over a network, for browser-based use.
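As an illustration of the subtitle output, the following Python sketch renders timestamped cues in the SRT format (numbered blocks with `HH:MM:SS,mmm --> HH:MM:SS,mmm` time ranges). The function names and the `offset` parameter are hypothetical and do not reflect FunClip's internals. The offset models one plausible way to produce subtitles for a clipped segment, namely shifting timestamps so that the clip starts at zero.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = max(0, int(round(seconds * 1000)))  # clamp negatives to zero
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues, offset: float = 0.0) -> str:
    """Render (start, end, text) cues as SRT text.

    `offset` (an assumption of this sketch) is subtracted from every
    timestamp, so a clip cut at `offset` seconds gets subtitles that
    start near 00:00:00,000.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(start - offset)} --> {srt_timestamp(end - offset)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # Blocks are separated by a blank line, as SRT requires.
    return "\n".join(blocks)
```

Calling `to_srt(cues)` with the default offset would yield subtitles for the full video, while passing a clip's start time as `offset` would yield subtitles aligned to that clip.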