30 open-source projects similar to deezer/spleeter, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Spleeter alternative.
SpeechBrain is an all-in-one deep learning toolkit designed for speech and audio processing. Built as a modular library, it provides a structured environment for developing, training, and deploying neural network models across a wide range of tasks, including automatic speech recognition, speaker identification, and audio enhancement. The framework distinguishes itself through a configuration-driven approach that separates model architecture and training hyperparameters from application logic. By utilizing externalized configuration files and standardized recipes, it enables reproducible rese
ESPnet is a comprehensive speech processing toolkit and PyTorch-based trainer designed for building end-to-end speech recognition, synthesis, and translation models. It provides a structured framework for developing automatic speech recognition systems using transducer and encoder-decoder architectures, alongside engines for text-to-speech synthesis and speech translation pipelines. The project distinguishes itself through a recipe-based workflow execution system that ensures experimental reproducibility by running standardized sequences of scripts for data preparation and model training. It
This project is a deep learning toolkit designed for audio source separation and music information retrieval. It provides a framework for decomposing polyphonic audio signals into distinct components, such as vocals, drums, and bass, by processing raw waveforms through neural network architectures. The library enables users to train custom separation models or fine-tune existing ones to improve accuracy on specific audio datasets. It supports the entire model lifecycle, including the conversion of raw audio into structured, indexed formats to optimize data loading and training efficiency. Th
Demucs is a deep learning stem splitter and AI music de-mixing software used to isolate vocals and instruments from a single audio file. It functions as a PyTorch audio source separation tool that splits mixed tracks into individual stems such as drums, bass, and vocals. The system is a hybrid spectrogram-and-waveform separator that combines spectral and waveform analysis. This approach allows the software to process audio in both the frequency and time domains to achieve high-fidelity source separation. The tool provides capabilities for audio source separation, including a cappella track extraction
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
This project is a diffusion model training framework and image synthesis pipeline. It provides the tools necessary to train generative models to learn image data distributions through an iterative denoising process. The framework includes a generative model evaluation tool consisting of automated scripts used to measure the quality and accuracy of produced samples. The system covers model training pipelines and performance evaluation for generative diffusion models.
Vocal Remover is a deep learning application designed for audio source separation. It functions as a command-line utility that decomposes complex audio signals into individual components, specifically isolating vocals and instrumental tracks from mixed recordings. The software utilizes a symmetric encoder-decoder neural network architecture to process audio spectrograms. By applying learned magnitude masks to the original signal phase, the system reconstructs output audio while maintaining temporal coherence. It supports both the execution of pre-trained models for track extraction and the tr
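The masking step mentioned in the Vocal Remover entry can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `apply_magnitude_mask`, the three example bins, and the mask values are all hypothetical, and a real system would operate on full spectrogram arrays produced by an STFT.

```python
import cmath

def apply_magnitude_mask(mixture_bins, mask):
    """Scale each complex STFT bin's magnitude by a mask weight, keeping the mixture's phase."""
    estimated = []
    for bin_value, weight in zip(mixture_bins, mask):
        magnitude, phase = cmath.polar(bin_value)  # split the bin into magnitude and phase
        # Recombine the masked magnitude with the ORIGINAL phase of the mixture.
        estimated.append(cmath.rect(weight * magnitude, phase))
    return estimated

# Hypothetical values: three frequency bins of one frame, and a mask a model might predict.
mixture_bins = [3 + 4j, 1 - 1j, -2 + 0j]
vocal_mask = [0.5, 1.0, 0.0]
vocal_bins = apply_magnitude_mask(mixture_bins, vocal_mask)
```

Reusing the mixture phase avoids having to estimate phase at all, which is why the network in this sketch only needs to output one weight per bin.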
Fairseq is a deep learning research toolkit and sequence-to-sequence framework built on PyTorch. It provides a system for training and deploying models that map input sequences to output sequences, with a primary focus on neural machine translation and speech recognition. The toolkit allows for the generation of text sequences through search algorithms such as beam search and nucleus sampling. It includes capabilities for producing synthetic parallel training data by translating monolingual text using reverse sequence models. The framework supports large scale model training through multi-de
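Nucleus (top-p) sampling, one of the search strategies named in the Fairseq entry, can be illustrated with a short standalone sketch. It does not use Fairseq's API; the function names, the toy vocabulary, and the probabilities are assumptions chosen only for illustration.

```python
import random

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / total for token, prob in kept}

def sample_token(probs, top_p, rng):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution from a model.
next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.75)
choice = sample_token(next_token_probs, top_p=0.75, rng=random.Random(0))
```

With `top_p=0.75` only the two most likely tokens survive, so low-probability candidates such as "zebra" can never be drawn.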
This project is a PyTorch sentiment analysis tutorial and a deep learning implementation for analyzing text. It provides a natural language processing sequence classification pipeline designed to clean text data and train neural networks to categorize sequences of words. The implementation focuses on adapting pretrained language models for specific text classification tasks using custom datasets. It includes a process for fine-tuning large-scale language models and implementing recurrent networks and transformers for emotional tone detection. The project covers the broader surface of text se
Matchering is an audio mastering tool and Python library designed to match the frequency balance and loudness of a target track to a specific reference track. It functions as a reference-based mastering system that aligns a target signal's spectral envelope, RMS, and peak amplitude with those of a chosen reference file. The project utilizes a multi-stage processing pipeline featuring an FFT spectral matching engine to adjust frequency response. It ensures output quality through the use of a brickwall limiter to prevent signal clipping while preserving the original waveform shape. The tool pr
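The loudness part of reference-based matching can be sketched as a single gain computed from the two RMS levels. This is an assumption-laden illustration, not Matchering's implementation: it covers only RMS alignment (no spectral matching, no limiter), and the helper names and sample values are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_rms(target, reference):
    """Scale the target so its RMS equals the reference's RMS."""
    target_level = rms(target)
    if target_level == 0:
        raise ValueError("target is silent; no finite gain can match the reference")
    gain = rms(reference) / target_level
    return [s * gain for s in target]

# Hypothetical short signals: a quiet target and a louder reference.
target = [0.1, -0.1, 0.1, -0.1]
reference = [0.5, -0.5, 0.5, -0.5]
matched = match_rms(target, reference)
```

A pure gain like this can push peaks past full scale, which is one reason a limiter stage follows loudness matching in a mastering chain.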
spaCy is a Python natural language processing framework designed for industrial-scale text processing. It converts raw text into structured data for machine learning pipelines through a combination of statistical language model trainers, transformer-based text processors, and syntactic dependency parsers. The project enables the integration of pretrained transformer architectures to perform complex linguistic analysis and multi-task learning. It also provides a specialized system for neural named entity recognition to identify and categorize key entities within text. The framework covers a b
AudioGPT is an LLM-driven audio framework and processing suite that uses large language models to orchestrate neural audio pipelines. It functions as a multimodal audio generator and processing system, integrating a collection of pretrained models to handle speech synthesis, sound generation, and audio manipulation. The system is distinguished by its ability to generate audio from diverse inputs, including text and images, and its capacity to produce synchronized talking head videos. It also operates as a neural speech translator, converting spoken language between languages while pre
MMF is a modular framework for building, training, and evaluating vision-and-language models. It provides a configuration-driven experiment system where model, dataset, and training parameters are defined through composable YAML files, alongside a curated model zoo of pretrained checkpoints for state-of-the-art multimodal architectures. The framework includes a multimodal dataset loader that downloads, processes, and batches vision-and-language data, and a vision-language model trainer supporting distributed training, mixed precision, and checkpoint-based resumption. The framework distinguish
Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing framework. It provides a multimodal machine learning framework and a distributed pipeline system for sequencing annotators to process large-scale linguistic data. The library includes a transformer text processor for generating contextual vector embeddings and a dedicated inference engine for managing large language models. The project distinguishes itself through its ability to process heterogeneous data types, including text, audio, and images, within a unified vision-langu
nlp-recipes is a collection of implementation guides and reference templates for applying natural language processing techniques to real-world tasks. It provides standardized workflows and code examples for developing NLP pipelines, from dataset preparation and model training to performance evaluation. The project focuses on the practical application of transformer-based models, offering patterns for fine-tuning pretrained architectures for tasks such as text classification, named entity recognition, and question answering. It also includes a toolkit for model interpretability, allowing users
PaddleDetection is an object detection framework designed for the end-to-end development, training, and deployment of computer vision models. It provides a comprehensive library of modular neural network architectures and pipelines that support object detection, instance segmentation, and multi-object tracking tasks. The project distinguishes itself through a configuration-driven approach that decouples model components like backbones and heads, allowing for the flexible assembly of custom vision workflows. It incorporates advanced techniques such as anchor-free detection logic, joint detecti
Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms. The project is distinguished by its ability to handle datasets that exceed available system memory through incremental corpus streaming, which processes documents one at a time from disk. It utilizes sparse vector representations and dictionary-based token mapping to
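The streaming idea in the Gensim entry (one document at a time, tokens mapped to ids, sparse vectors) can be sketched in plain Python. This deliberately does not call Gensim's API; the function names are hypothetical, and an in-memory `io.StringIO` stands in for a file on disk.

```python
import io
from collections import Counter

def stream_documents(handle):
    """Yield one tokenized document per line, so only one document is held at a time."""
    for line in handle:
        tokens = line.lower().split()
        if tokens:
            yield tokens

def to_sparse_vector(tokens, token_ids):
    """Map tokens to integer ids and return sorted (id, count) pairs; new tokens get new ids."""
    counts = Counter()
    for token in tokens:
        token_id = token_ids.setdefault(token, len(token_ids))
        counts[token_id] += 1
    return sorted(counts.items())

# Stand-in for a large corpus file; each line is one document.
corpus_file = io.StringIO("audio source separation\nsource code model\n")
token_ids = {}
# Collected into a list only for demonstration; a real pipeline would consume vectors lazily.
vectors = [to_sparse_vector(doc, token_ids) for doc in stream_documents(corpus_file)]
```

Only tokens that actually occur in a document appear in its vector, so memory per document scales with document length rather than vocabulary size.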
Starcoder is a large language model and associated framework designed to generate, complete, and evaluate source code across multiple programming languages. It functions as a source code model that can produce complete function implementations and predict subsequent characters in a line of code based on provided prompts. The project provides a specialized toolkit for adapting base models to specific coding tasks and instruction-following behaviors. This includes a conversational code assistant framework for training models to generate code via natural language chat, as well as a parameter-eff
DeepPavlov is a deep learning conversational AI framework designed for building end-to-end dialog systems and chatbots. It functions as an NLP model training library and a pipeline system that connects multiple natural language processing models into a single operational chain. The framework provides a REST API model server to expose trained deep learning models as web endpoints. This allows conversational agents to be deployed as web services that handle incoming HTTP requests and return predictions. The system covers the full lifecycle of conversational AI development, including NLP pipeli
Data manipulation and transformation for audio signal processing, powered by PyTorch
DeepSpeech is an open-source speech-to-text framework and machine learning engine designed to convert spoken audio into written text locally on a device. It provides on-device speech recognition that operates without requiring an internet connection to external servers. The system supports real-time speech transcription across a variety of hardware platforms, ranging from single-board computers and edge devices to GPU servers. This allows for audio analysis and processing directly on the local hardware.
wav2letter is an automatic speech recognition toolkit and deep learning framework designed to convert audio speech signals into written text. It functions as a distributed training system and an inference engine for building and deploying neural network architectures. The system enables the training of large-scale speech models across multiple compute nodes using custom architecture files and structured recipes. It includes an inference engine that allows these trained models to be executed within Python workflows to transform audio sequences into text. The framework covers the full speech r
DeepPavlov is a conversational AI framework and deep learning NLP library designed for building end-to-end dialogue systems and chatbots. It functions as an NLP pipeline orchestrator that allows users to compose pre-trained models and text processing components into sequential data flows for complex linguistic tasks. The system is distinguished by its ability to act as a chatbot deployment server, exposing trained conversational models as web services via REST and Socket APIs. It utilizes JSON-based pipeline configurations and dynamic variable interpolation to decouple model logic from infras
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
SpleeterGui is a graphical interface for the Spleeter machine learning library, serving as an AI source separation tool and audio stem extractor. It allows users to separate mixed audio files into individual source tracks, such as vocals, drums, and bass, using a visual application. The project functions as a wrapper for the Spleeter engine, removing the requirement to use command line tools for music stem isolation and audio source separation. It provides a visual method for managing audio source isolation and preparing instrument tracks. The interface includes tools for output directory ma
Vocal-separate is an audio processing tool designed to isolate vocal and instrumental tracks from audio and video files. It functions as a local artificial intelligence engine that performs source separation directly on the user's machine, ensuring data privacy by eliminating the need for external server connectivity. The system provides a browser-based control interface for managing media uploads and monitoring processing tasks. To handle intensive signal decomposition, it utilizes hardware-accelerated tensor processing, which offloads complex mathematical calculations to dedicated graphics
CNTK is a deep learning toolkit used for the design, construction, and training of neural networks. It defines model architectures as computational graphs and optimizes network parameters using an automatic differentiation engine and stochastic gradient descent. The project emphasizes large scale model distribution, spreading training workloads across multiple hardware nodes and GPUs. It features specialized support for dynamic sequence handling, allowing filters to be convolved across both spatial and dynamic sequence axes to process data of variable lengths. The toolkit provides hardware-a
Ultimate Vocal Remover is a desktop application designed for AI-driven audio source separation. It utilizes deep learning models to isolate vocals, drums, and other individual instruments from mixed audio files, providing a utility for professional production and creative editing workflows. The software distinguishes itself by leveraging GPU-accelerated tensor computation to perform complex signal processing tasks, significantly reducing the time required for high-fidelity audio extraction. It incorporates a modular plugin architecture that integrates external utilities to support a wide rang
This project is a deep learning research toolkit and generative model library providing implementations of Variational Autoencoders using the PyTorch framework. It serves as a framework for training and evaluating autoencoder architectures to learn latent representations for data reconstruction and the generation of synthetic data samples. The toolkit focuses on unsupervised feature learning and generative model training, featuring a system for mapping external configuration files to model hyperparameters to ensure reproducible experimental runs. It includes mechanisms for tracking training p