51 Repos
Tools and models for analyzing, transforming, and synthesizing sound signals.
Distinguishing note: Focuses on audio-specific signal processing and machine learning tasks, distinct from general data processing.
51 GitHub repositories matching Artificial Intelligence & ML · Audio Processing.
VoxCPM is a multilingual speech synthesis system and text-to-speech inference server. It functions as an AI voice cloning tool and a synthetic voice designer, capable of generating natural speech across global languages and regional dialects using a GPU-accelerated audio generator. The project features a speech model fine-tuning framework that supports both full parameter updates and low-rank adaptation for customizing voice characteristics. It enables high-fidelity voice cloning from reference audio, including cross-lingual voice transfer and acoustic environment mimicry, as well as the crea
Offers professional audio utilities for denoising reference clips and upsampling low-resolution samples to studio quality.
This project is a comprehensive, curated knowledge base designed to support the development and maintenance of production-grade machine learning systems. It serves as a centralized repository of industry-standard technical literature, engineering case studies, and research papers, providing a structured reference for practitioners navigating the complexities of modern data science and machine learning engineering. The resource distinguishes itself through a cross-domain approach that bridges the gap between academic research and practical implementation. By synthesizing proven industry archit
Analyze and transform sound signals into digital representations for tasks like speech recognition, classification, or generative audio synthesis.
This project is a Python framework for building autonomous, event-driven agent systems. It provides a unified runtime for orchestrating multi-agent workflows, managing persistent conversation state, and executing code within secure, isolated sandbox environments. The framework is designed to handle complex task delegation, allowing agents to invoke other agents as tools while maintaining context across multi-turn interactions. The framework distinguishes itself through its deep integration with the Model Context Protocol, enabling agents to connect to external data sources and remote services
Processes static audio input buffers for voice pipeline ingestion and analysis.
faster-whisper is an automatic speech recognition framework: a reimplementation of the Whisper speech-to-text model that runs on the CTranslate2 inference engine to convert spoken audio into written text. It supports quantizing large audio model weights into lower-precision integer formats, which reduces memory usage and increases execution speed. The framework covers a broad range of capabilities including batch audio transcription for parallel processing and voice activity det
Provides parallel processing of audio segments to maximize transcription throughput and reduce latency.
Ultimate Vocal Remover is a desktop application designed for AI-driven audio source separation. It utilizes deep learning models to isolate vocals, drums, and other individual instruments from mixed audio files, providing a utility for professional production and creative editing workflows. The software distinguishes itself by leveraging GPU-accelerated tensor computation to perform complex signal processing tasks, significantly reducing the time required for high-fidelity audio extraction. It incorporates a modular plugin architecture that integrates external utilities to support a wide rang
Leverages GPU acceleration to perform complex audio source extraction and signal processing tasks.
Audiocraft is a deep learning audio library and machine learning framework designed for training, fine-tuning, and evaluating generative models for music and sound effects. It functions as a text-to-music generative model and a neural audio codec, providing the tools necessary to compress audio signals into discrete representations and synthesize high-fidelity waveforms from textual descriptions. The framework is distinguished by its ability to combine multiple conditioning signals, allowing for the generation of audio based on text prompts, melodic excerpts, or style-based audio clips. It al
Embeds invisible markers into audio signals to identify origin and protect content ownership.
Chatterbox is a comprehensive machine learning platform designed for multilingual speech synthesis and real-time audio generation. It functions as an engine that converts text into natural-sounding speech, capable of replicating specific human vocal characteristics and emotional expressions from short audio samples. The platform distinguishes itself through advanced control over the synthesis process, allowing for the manipulation of emotional intensity and the injection of non-verbal vocalizations such as laughter or coughing. It is engineered for low-latency performance, utilizing an optimi
Embeds imperceptible digital signatures into generated audio to ensure reliable provenance tracking.
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mec
Applies fine-tuned acoustic models to categorize audio inputs into specific classes based on learned patterns.
Faster-Whisper is a high-performance implementation of the Whisper speech-to-text model designed for efficient audio transcription. It provides an end-to-end processing pipeline that converts spoken audio into written text while maintaining lower memory consumption and faster execution speeds than standard implementations. The project achieves its performance through a specialized inference engine that utilizes optimized kernels and weight quantization to reduce computational complexity. It supports large-scale operations by grouping audio segments into dynamic batches and filtering out non-s
Supports transcribing multiple audio segments or files simultaneously through a dedicated pipeline to increase throughput for large-scale tasks.
This project is a comprehensive educational framework designed to teach the design, deployment, and performance optimization of machine learning systems. It provides a structured curriculum that covers the full stack of artificial intelligence engineering, ranging from the construction of core framework components like tensors and automatic differentiation engines to the orchestration of large-scale distributed training clusters. The platform distinguishes itself through its integration of physics-grounded systems modeling and interactive simulation environments. Users can experiment with dis
Implements on-device keyword spotting and voice command recognition for edge applications.
This software is a real-time voice changer that utilizes machine learning inference to transform live microphone input into target vocal characteristics. It functions as an artificial intelligence audio processing tool designed to modify vocal identity during active communication or live broadcasts. The application distinguishes itself by executing neural network models directly within the browser environment. It leverages web-based compute acceleration and dedicated audio threading to maintain low-latency performance, allowing users to switch between different voice profiles while processing
Provides a utility for applying deep learning inference to microphone streams for low-latency voice conversion.
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
Applies noise cancellation and signal conditioning to input audio to ensure high-quality voice recognition and interaction.
NeMo is a comprehensive framework designed for the development, training, and deployment of large-scale conversational and generative artificial intelligence models. It provides an integrated platform for building multimodal systems, encompassing speech processing, language modeling, and reinforcement learning alignment. The framework is built to handle the entire lifecycle of AI development, from data curation and model pretraining to production-ready service deployment. The platform distinguishes itself through advanced distributed training capabilities, including tensor and pipeline parall
Provides extensive tools for audio signal processing, including enhancement, restoration, and separation.
Vercel is a cloud platform for building, deploying, and scaling web applications. It provides a unified infrastructure that automates the build process by detecting project frameworks and distributing static and dynamic content through a global content delivery network. The platform executes application logic using serverless functions that scale automatically based on real-time traffic demand. The platform distinguishes itself through a centralized AI gateway that proxies requests to multiple model providers, enabling standardized authentication, observability, and cost tracking. It supports
Processes audio files to reduce background noise and improve sound clarity for interactive applications.
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
Extracts features and manipulates raw audio signals to prepare inputs for sound-based machine learning tasks.
Sidekiq is a background job processor and queue manager for Ruby that uses Redis to manage asynchronous tasks. It functions as a distributed task scheduler capable of handling periodic, delayed, and recurring jobs across a cluster of worker processes. The project features a job monitoring dashboard and administrative web interface for visualizing system state, tracking worker performance, and managing failed or dead jobs. It provides a distributed rate limiter to control execution frequency across multiple processes. The framework covers a broad range of operational capabilities, including j
Allows administrators to locate specific jobs within the management interface using search criteria.
This is a blind image watermarking and steganography tool designed to embed and extract hidden data from images without requiring the original source file. It functions as a framework for concealing text or bit arrays within images using mathematical transforms to ensure the marks remain invisible to the viewer. The system is designed for robust watermark extraction, allowing hidden information to be recovered even after images have undergone rotations, cropping, resizing, noise injection, or brightness changes. It utilizes a blind extraction mechanism that retrieves data using a shared passw
Retrieves hidden information from images that have been resized, cropped, or altered by noise and filters.
Chainlit is a Python framework designed for building and deploying interactive, stateful conversational AI interfaces. It provides a backend-driven platform that connects language models and agent frameworks to a web-based chat frontend, managing the complexities of session state, message history, and real-time communication. The framework distinguishes itself by offering a component-based UI builder that allows developers to inject interactive widgets, rich media, and data visualizations directly into the chat stream. It supports the visualization of complex agent workflows, enabling users t
Captures and processes audio segments from microphones for real-time voice interaction.
SpeechBrain is an all-in-one deep learning toolkit designed for speech and audio processing. Built as a modular library, it provides a structured environment for developing, training, and deploying neural network models across a wide range of tasks, including automatic speech recognition, speaker identification, and audio enhancement. The framework distinguishes itself through a configuration-driven approach that separates model architecture and training hyperparameters from application logic. By utilizing externalized configuration files and standardized recipes, it enables reproducible rese
Provides a comprehensive toolkit for feature extraction, signal augmentation, and model inference across speech and audio tasks.
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Logs and visualizes audio files with associated metadata during machine learning experiments.