30 open-source projects similar to timerring/bilive, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Bilive alternative.
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
This project is an AI-driven suite of tools designed to repurpose long-form video content into short-form clips. It integrates a speech-to-text engine for automated transcription, a highlighting system that ranks engaging segments based on emotional hooks, and a video processor that converts horizontal footage into vertical formats. The system distinguishes itself through intelligent video cropping that utilizes face tracking and motion smoothing to keep subjects centered. It also employs an analysis system to extract viral highlights by scoring segments for engagement and practical value. T
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
SmartSub is a cross-platform desktop application for AI-driven video transcription and subtitle generation. It converts audio and video files into text subtitles using local AI models and incorporates hardware acceleration to increase processing speed. The tool features a subtitle translator that uses large language models from providers such as OpenAI and DeepSeek to translate subtitles between languages. It includes a visual editor for proofreading and polishing transcribed text, paired with a video preview for frame-accurate synchronization. The software supports batch processing of multi
DashPlayer is a language learning video player designed for vocabulary and grammar study. It integrates an AI subtitle generator to create machine-translated captions and grammatical sentence analysis for video content. The project features a bilingual subtitle renderer that displays dual-language captions with toggleable visibility. It includes a remote media downloader to fetch online video content via URL and a utility to split long files into smaller segments for more manageable study sessions. The playback system supports sentence-based navigation, allowing users to jump between subtitl
This project is an AI video post-production suite that uses large language models and programmatic tools to automate editing, transcription, and subtitle generation. It functions as an AI editing agent that translates natural language instructions into shell commands, providing a programmatic interface for manipulating media via FFmpeg. The toolkit includes a motion graphics engine that generates technical animations and visual overlays through code-driven rendering and mathematical definitions. It distinguishes itself by combining an AI-powered transcriber for word-level timestamps with an a
Gifify is a tool for converting video files into optimized animated GIFs. It functions as a video to GIF converter and optimization utility that extracts specific clips from video files and burns text or subtitle overlays directly into the frames. The project differentiates itself through specialized GIF optimization, using lossy compression, color count limiting, and custom color palette generation to reduce file sizes. It also provides precise control over the output by allowing users to adjust playback speed, reverse playback direction, and resize dimensions. The software covers a broad s
Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi. The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains. The system supports real-time audio
Short video factory is a local AI content generator and automated video editing tool. It provides a production pipeline that uses large language models to transform text prompts into marketing scripts and rendered short-form videos. The system is designed for local-first execution, running all processing and asset management on the host machine to maintain data privacy. It distinguishes itself through a batch-processing workflow that can sequentially execute copywriting and rendering for multiple items using predefined presets. The software covers a broad range of media capabilities, includi
QtAV is a cross-platform media engine and multimedia framework that combines FFmpeg decoding with the Qt framework for audio and video rendering. It functions as a hardware-accelerated video player, an OpenGL video renderer, and a multimedia stream transcoder. The project distinguishes itself through a hardware-abstraction decoding layer that utilizes GPU interfaces such as VA-API and VideoToolbox to decode high-resolution video. It employs a zero-copy memory transfer path to move decoded video data directly to graphics APIs, reducing CPU overhead and enabling high-performance YUV rendering.
N_m3u8DL-CLI is a cross-platform .NET command-line interface designed for extracting and recording adaptive video streams. It functions as an HLS and DASH downloader that retrieves media from m3u8 and DASH playlist files, including the ability to capture ongoing live broadcasts with automatic duration limits. The tool includes a dedicated AES-128-CBC stream decryptor to handle protected video segments using provided keys and initialization vectors. To optimize transfer speeds, it utilizes a multi-threaded download model and supports custom HTTP header management to bypass server restrictions.
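To make the "provided keys and initialization vectors" part of the N_m3u8DL-CLI entry concrete, here is a minimal Python sketch of how an HLS playlist advertises AES-128 key information. It is not code from N_m3u8DL-CLI (which is a .NET tool); the function names `parse_key_tag` and `segment_iv` are hypothetical. It relies only on the HLS convention that a segment without an explicit IV uses its media sequence number as a 16-byte big-endian IV. The actual AES-128-CBC decryption would need a cryptography library and is left out.

```python
import re

# Attribute lists are comma-separated NAME=value pairs; quoted values may contain commas.
_ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_key_tag(line):
    """Parse an #EXT-X-KEY line into a dict of attribute names to values."""
    prefix = "#EXT-X-KEY:"
    if not line.startswith(prefix):
        raise ValueError("not an EXT-X-KEY tag")
    attrs = {}
    for name, value in _ATTR_RE.findall(line[len(prefix):]):
        attrs[name] = value.strip('"')
    return attrs


def segment_iv(attrs, media_sequence):
    """Return the 16-byte IV for a segment.

    Uses the explicit IV attribute (a 0x-prefixed hex string) when present,
    otherwise falls back to the media sequence number as a big-endian integer.
    """
    iv = attrs.get("IV")
    if iv:
        return bytes.fromhex(iv[2:].zfill(32))
    return media_sequence.to_bytes(16, "big")
```

A downloader would fetch the key bytes from the `URI` attribute and pass them, together with the IV returned here, to an AES-CBC implementation for each segment.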
MMF is a modular framework for building, training, and evaluating vision-and-language models. It provides a configuration-driven experiment system where model, dataset, and training parameters are defined through composable YAML files, alongside a curated model zoo of pretrained checkpoints for state-of-the-art multimodal architectures. The framework includes a multimodal dataset loader that downloads, processes, and batches vision-and-language data, and a vision-language model trainer supporting distributed training, mixed precision, and checkpoint-based resumption. The framework distinguish
Autosub is a command-line media processor and automatic subtitle generator that converts audio streams from video and audio files into timed text overlays. It functions as an AI speech-to-text converter that uses OpenAI Whisper to generate synchronized subtitles. The tool includes a language translation pipeline to convert transcribed speech into target languages, enabling multilingual video captioning. It manages the process from audio-stream extraction to the serialization of final subtitle files for local storage. The system covers audio-to-text transcription, time-stamped text mapping, a
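The time-stamped text mapping and subtitle serialization mentioned in the Autosub entry can be illustrated with a short Python sketch. This is not Autosub's own code; the function names and the `(start, end, text)` segment shape are assumptions chosen for illustration, and the output follows the common SubRip (SRT) layout.

```python
def format_srt_timestamp(seconds):
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Serialize (start, end, text) tuples into SRT text.

    Each cue is numbered from 1 and cues are separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)
```

A speech recognizer that returns segment start and end times in seconds could feed its results straight into `segments_to_srt` and write the returned string to a `.srt` file.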
Serve is a multimodal AI orchestrator and inference server designed for deploying and scaling machine learning models as cloud-native services. It functions as a containerized workflow engine and distributed service mesh that routes multimodal data through connected execution units. The framework provides specialized capabilities for large language models, including a token streaming gateway that delivers generated text incrementally to reduce perceived latency. It distinguishes itself by enabling the chaining of executors into complex data processing pipelines and the orchestration of these
This library provides a deep learning framework for training neural networks to perform speech recognition and audio classification. It utilizes sequence-to-sequence architectures to map variable-length audio inputs into text or numerical outputs, enabling the development of custom speech-to-text transcription models. The project distinguishes itself through integrated audio processing capabilities that transform raw waveforms into spectrograms and high-dimensional numerical vectors. These tools allow for the extraction of unique vocal characteristics to identify speakers, as well as the clas
obs-multi-rtmp is a plugin for OBS Studio that streams a single video feed to multiple RTMP destinations simultaneously. It extends the broadcasting software with output destination management for live streams. The tool duplicates a live video stream and sends it to several streaming platforms at once, so redundant or distributed endpoints can be served without duplicating encoders. The project manages multi-platform live streaming through multiplexed RTMP streaming and socket-based data replication. It employs async
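The idea of replicating one encoded stream to several destinations, as described in the obs-multi-rtmp entry, can be sketched in Python as a simple fan-out. This is only an illustration of the general pattern, not the plugin's implementation (which is native OBS code); the `FanOut` class, its method names, and the drop-on-full policy are all assumptions.

```python
import asyncio


class FanOut:
    """Copy each encoded packet to one bounded queue per destination."""

    def __init__(self, max_pending=64):
        self._queues = {}
        self._max_pending = max_pending

    def add_destination(self, name):
        """Register a destination and return the queue its sender reads from."""
        queue = asyncio.Queue(maxsize=self._max_pending)
        self._queues[name] = queue
        return queue

    def publish(self, packet):
        """Offer a packet to every destination without blocking.

        Returns the names of destinations whose queue was full, so a slow
        endpoint loses packets instead of stalling the others.
        """
        dropped = []
        for name, queue in self._queues.items():
            try:
                queue.put_nowait(packet)
            except asyncio.QueueFull:
                dropped.append(name)
        return dropped
```

In this sketch the encoder runs once and each destination gets its own sender task draining its queue, which matches the entry's point that endpoints are added without duplicating encoders.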
FunClip is an open-source tool that transcribes speech from video files and clips segments based on text, speaker, or AI analysis. It combines speech recognition with speaker diarization, audio event detection, and visual content understanding to identify and extract relevant portions of a video. The tool distinguishes itself through several integrated capabilities. It supports hotword-weighted speech recognition, which improves transcription accuracy for specific terms like names or jargon by boosting their probability during decoding. A large language model can interpret the transcribed tex
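Clipping "based on text", as the FunClip entry describes, amounts to mapping a phrase in the transcript back to a time range. The following Python sketch shows one way that lookup could work. It is not FunClip's code; the `find_phrase_span` name and the `(token, start, end)` word format are hypothetical, and it assumes the recognizer supplies word-level timestamps in seconds.

```python
def find_phrase_span(words, phrase):
    """Return (start, end) in seconds for the first occurrence of phrase.

    words is a list of (token, start, end) tuples in spoken order.
    Matching is case-insensitive on whole tokens; returns None if absent.
    """
    target = phrase.lower().split()
    if not target:
        return None
    tokens = [token.lower() for token, _, _ in words]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Span runs from the first matched word's start to the last one's end.
            return words[i][1], words[i + n - 1][2]
    return None
```

The returned pair could then be handed to a video cutter, for example as the start and end positions of an FFmpeg trim.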
biliup is an automated live stream archival and video management system designed to record broadcasts and upload content to platforms. It functions as a stream recorder, video upload tool, and cross-platform migrator that handles the transfer of content from various sources to a target service. The project enables cross-platform video migration by downloading content from external sources and redistributing it via automated pipelines. It supports headless video management through a server-based interface and programmatic uploading tools that operate without manual browser interaction. Core c
BililiveRecorder is a tool for automatically capturing and saving live broadcasts and associated chat logs from Bilibili to local storage. It functions as a live stream automation bot that monitors channel statuses in real time to trigger recording tasks without manual intervention. The project provides a web-based recording manager and a graphical interface for configuring capture settings and managing target channels. It supports recording multiple simultaneous broadcasts and includes a dedicated system for recovering corrupted media caused by server-side interruptions. The application man
mm-cot is a multimodal language model reasoning framework designed for training and evaluating models that perform chain-of-thought reasoning across text and image data. It provides core systems for implementing step-by-step logical rationales to improve the accuracy of predictions, including a vision-language model trainer and a multimodal benchmark evaluator. The framework distinguishes itself through a decoupled rationale generation process that separates the training of logical justifications from the inference of final answers. It utilizes vision-transformer feature extraction and image
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parall
Whisper.cpp is a high-performance, local-first speech recognition engine designed to run large-scale machine learning models on consumer hardware. It functions as a portable library that converts audio into text, supporting both static file transcription and real-time stream processing. By utilizing a lightweight inference engine and weight quantization, the project minimizes memory and compute overhead, allowing for efficient execution without reliance on external cloud APIs or internet connectivity. The project distinguishes itself through a hardware-agnostic compute abstraction that offloa
VideoCaptioner is an automated tool designed to generate and embed time-synchronized subtitles into video files. By leveraging speech recognition models, the software converts spoken audio into text and calculates precise timestamps to ensure captions align with the original media. The project operates as a local-first inference pipeline, performing all transcription tasks on the host machine to maintain data privacy. It utilizes a transformer-based neural network for speech recognition and integrates a multimedia framework to handle the technical aspects of video processing and subtitle stre
Pyvideotrans is an automated video localization platform designed to transcribe, translate, and dub media content for international distribution. It functions as an end-to-end workflow that combines speech recognition, text translation, and synthetic voice generation to process video files into localized versions. The system distinguishes itself by offering a choice between local model inference for privacy and integration with third-party cloud services via user-provided credentials. This architecture allows users to maintain control over their billing and data security while utilizing modul
Omnilingual-ASR is a multilingual automatic speech recognition framework and toolkit designed to transcribe audio across 1,600 languages. It provides a complete pipeline for converting speech to text, including a toolkit for fine-tuning pre-trained speech models to specific languages or datasets using custom training recipes. The system supports zero-shot speech recognition, allowing the model to predict text in unseen languages without extensive training data. It further enables few-shot language guidance through in-context examples and uses language codes to constrain transcription output t
Restreamer is a self-hosted video broadcast platform and RTMP streaming server. It functions as a live media processing gateway and a multi-destination stream relay, providing a web-based management interface to configure video codecs, hardware acceleration, and stream routing. The system enables multi-platform video streaming by duplicating a single live video source and forwarding it to various third-party broadcast services and external servers simultaneously. It also supports direct-to-website broadcasting, allowing users to host live content for private or public audiences via customizab
Jina is a cloud-native framework for building and deploying multimodal AI applications that process text, images, and audio across distributed microservices. It functions as an inference orchestrator and a distributed model gateway, providing a containerized stack to organize AI executors into operational pipelines. The system manages large language model workloads through token-streamed response delivery and dynamic batching to increase hardware throughput. It utilizes a protocol-agnostic communication layer to route data across different machine learning frameworks. The framework covers hi
PaddleSpeech is a comprehensive toolkit of neural models for speech recognition, synthesis, and translation built on the PaddlePaddle deep learning framework. It provides a collection of frameworks and tools for converting spoken audio into written text, synthesizing natural audio from text, and performing direct speech translation. The toolkit includes specialized capabilities for keyword spotting to detect trigger words and speaker verification systems that extract unique voiceprints to identify and distinguish between individuals. It also features end-to-end translation tools that map audi
Cactus is an on-device AI inference engine designed for executing large language models, vision models, and speech-to-text systems on mobile and wearable hardware. It provides a programmable tensor computation graph for defining sequences of matrix operations and activation functions, alongside a local retrieval augmented generation framework that grounds model responses using local text files. The project features a multiplatform SDK with language bindings for integrating AI capabilities into mobile applications and a model conversion system that transforms external model formats for optimiz
Grounded-Segment-Anything is a suite of specialized tools for multimodal visual analysis, text-based segmentation, and generative image editing. It integrates text-to-bounding-box detection and high-precision image segmentation masks to function as a text-based image segmenter and an automated visual labeling tool. The project enables text-driven image editing by identifying objects through natural language to perform inpainting and element replacement. It further extends visual analysis into three dimensions, allowing for 3D human reconstruction and the generation of 3D bounding boxes from t