Heartlib is an audio processing library for large language models that provides tools for audio tokenization, compression, and cross-modal alignment. It implements core models for audio-text embedding, automatic speech recognition, neural codecs, and text-driven audio synthesis.
The project features a text-to-audio synthesis engine capable of generating high-fidelity music and speech from text descriptions or reference files. It also includes a neural audio codec designed for low-bitrate compression that preserves acoustic structure and sound quality.
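To make "low-bitrate" concrete, the following is a minimal sketch of how the bitrate of a token-based codec can be estimated. It assumes the codec emits a fixed number of discrete codes per frame, which the description above does not confirm. The function name `codec_bitrate_bps` and all numeric values are hypothetical illustrations, not Heartlib's API or measured figures.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Estimate the bitrate of a discrete-token codec in bits per second.

    Assumption (not taken from Heartlib): each frame is encoded as
    `num_codebooks` indices, each drawn from `codebook_size` entries.
    """
    if frame_rate_hz <= 0 or num_codebooks <= 0 or codebook_size < 2:
        raise ValueError("frame rate and codebook count must be positive, size >= 2")
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# Hypothetical settings chosen only for illustration.
example_bps = codec_bitrate_bps(frame_rate_hz=50.0, num_codebooks=8, codebook_size=1024)

# Uncompressed mono PCM at 44.1 kHz, 16 bits per sample, for comparison.
pcm_bps = 44100 * 16 * 1
compression_ratio = pcm_bps / example_bps

print(f"codec: {example_bps:.0f} bps, PCM: {pcm_bps} bps, ratio: {compression_ratio:.1f}x")
```

Under these illustrative settings the token stream is a small fraction of the raw PCM rate, which is the sense in which such codecs are described as low-bitrate.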
Heartlib also aligns audio and text in a shared latent space to support cross-modal retrieval, and it provides transcription tools that convert sung vocals and lyrics into written text.
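Retrieval over a shared latent space typically reduces to comparing embedding vectors. The sketch below ranks audio clips against a text query by cosine similarity. It assumes embeddings are already available as plain lists of floats; the helper names and the toy vectors are hypothetical and do not reflect Heartlib's actual interface or model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidate_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in candidate_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings standing in for real model outputs.
text_query = [0.9, 0.1, 0.0]
audio_clips = {
    "piano_ballad": [0.8, 0.2, 0.1],
    "drum_loop": [0.0, 0.1, 0.9],
    "spoken_intro": [0.1, 0.9, 0.2],
}

ranking = rank_by_similarity(text_query, audio_clips)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because both modalities live in the same space, the same ranking function works in either direction: a text query against audio clips, or an audio query against text descriptions.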