22 open-source projects similar to iamyuanchung/autoregressive-predictive-coding, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Autoregressive Predictive Coding alternative.
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization,
Robust yet lenient forced-aligner built on Kaldi. A tool for aligning speech with text.
The Montreal Forced Aligner is a command line utility for performing forced alignment of speech datasets using Kaldi (http://kaldi-asr.org/).
DeepSpeech is an open-source speech-to-text framework and machine learning engine designed to convert spoken audio into written text locally on a device. It provides on-device speech recognition that operates without requiring an internet connection to external servers. The system supports real-time speech transcription across a variety of hardware platforms, ranging from single-board computers and edge devices to GPU servers. This allows for audio analysis and processing directly on the local hardware.
Implementation of the classical and extended Short Term Objective Intelligibility measures
Tortoise-tts is a neural text-to-speech engine and voice cloning toolkit designed for high-quality audio generation. It functions as a zero-shot synthesis system, meaning it can generate speech for unseen speakers without requiring additional training or fine-tuning for each new voice. The system specializes in replicating human vocal characteristics using small sets of reference audio clips. It allows for the extraction of voice latents to mimic specific speakers, the generation of random synthetic identities, and the blending of multiple voice profiles to create hybrid vocal identities. Th
PAddle PARAllel text-to-speech toolKIT (supporting Tacotron2, Transformer TTS, FastSpeech2/FastPitch, SpeedySpeech, WaveFlow and Parallel WaveGAN)
Pyannote.audio is a PyTorch toolkit for speaker diarization, speaker identification, and speech activity detection. Its primary purpose is to partition audio recordings into segments and assign each segment to a specific speaker identity to determine who spoke when. The project includes a framework for classifying speaker identities and a pipeline for distinguishing human speech from background noise. It provides specialized tools for handling symmetric-overlap speech, where multiple speakers talk simultaneously, and employs learnable band-pass filters for raw waveform feature extraction. Th
Fairseq is a deep learning research toolkit and sequence-to-sequence framework built on PyTorch. It provides a system for training and deploying models that map input sequences to output sequences, with a primary focus on neural machine translation and speech recognition. The toolkit allows for the generation of text sequences through search algorithms such as beam search and nucleus sampling. It includes capabilities for producing synthetic parallel training data by translating monolingual text using reverse sequence models. The framework supports large scale model training through multi-de
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment).
An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
Create, Customize and Talk to your AI Character/Companion in Realtime (All in One Codebase!). Have a natural seamless conversation with AI everywhere (mobile, web and terminal) using LLM OpenAI GPT3.5/4, Anthropic Claude2, Chroma Vector DB, Whisper Speech2Text, ElevenLabs Text2Speech
A pure Python implementation of Google's ViSQOL (Virtual Speech Quality Objective Listener) for objective audio/speech quality assessment.
pyAudioAnalysis is a Python library and framework for audio signal processing and analysis. It provides tools for extracting mathematical representations of sound, such as spectrograms, and implements a system for training and evaluating machine learning models to classify audio segments based on acoustic patterns. The project includes dedicated utilities for audio segmentation, which allow for the removal of silence and the detection of specific audio events to divide recordings into meaningful sections. It also provides data visualization capabilities that use dimensionality reduction to ma
This project is a high-throughput transcription engine and PyTorch inference wrapper designed to convert spoken audio files into text using the OpenAI Whisper model. It functions as a hardware-accelerated speech-to-text transcriber that runs locally on a user's machine. The system focuses on AI model performance tuning to maximize hardware throughput. It utilizes GPU acceleration, half-precision floating point tensors, and Flash-Attention to reduce processing time and memory overhead during transcription. The implementation covers large-scale transcription workflows and local speech-to-text
Pypesq is a Python wrapper for the PESQ score calculation C routine. It can only be used for evaluation purposes.
Parselmouth is a Python library for the Praat software.