30 open-source projects similar to gabrielchua/open-notebooklm, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best open-notebooklm alternative.
ChatTTS-ui is a web-based interface and API wrapper for the ChatTTS model, designed to convert written text and mixed language input into spoken audio. It functions as an AI speech synthesis dashboard and a programmatic generator for creating naturalistic voice output. The project focuses on custom voice profiling and speech nuance control. It allows for the maintenance of consistent speaker characteristics using seed values and data files, while providing controls for tone, laughter, and pauses through behavioral prompts and sampling parameters. The system includes a client-server architect
This Python SDK provides a comprehensive toolkit for synthetic audio generation, voice cloning, and the development of conversational AI agents. It enables the creation of lifelike spoken audio from text, the replication of human voices through custom cloning, and the deployment of real-time voice agents capable of interacting with external large language models. The library distinguishes itself through deep integration of conversational AI capabilities, including the design of agent personas and the execution of real-time actions via APIs. It supports professional-grade audio production thro
ShortGPT is an automated short-form video creation framework that combines large language model-driven scripting with neural voice synthesis, visual asset retrieval, and programmatic video editing. The project provides a modular pipeline architecture that chains script generation, voiceover synthesis, caption rendering, and video assembly into automated workflows, enabling the production of complete short videos from a topic prompt. The framework distinguishes itself through an LLM-oriented editing language that controls video assembly and rendering tasks programmatically, and a multilingual
Pixelle-Video is a text-to-video automation platform and generation engine that converts text topics into complete videos with synchronized narration, images, and music. It functions as a modular system for producing short-form content, utilizing large language models to automate script composition, visual asset generation, and voiceover production. The platform features a node-based workflow orchestrator that allows the composition of custom generation pipelines by linking different AI models. It includes a dynamic video layout designer that uses HTML templates to define aspect ratios and vi
Podcastfy is an AI content-to-podcast generator that converts text, URLs, PDFs, images, and videos into conversational audio podcasts. It integrates with over 100 language models for transcript creation and multiple text-to-speech engines for audio output, with support for customizable dialogue style and optional local transcript generation for privacy. The project distinguishes itself through a flexible architecture that decouples job submission from result retrieval via asynchronous polling, normalizes heterogeneous inputs into uniform text, and routes content through pluggable LLM and TTS
openai-go is an LLM SDK for Go and a client for interacting with OpenAI services. It provides type-safe bindings to generate text, images, and audio via REST endpoints, enabling the integration of large language models and AI assistant orchestration into Go applications. The library serves as an agent orchestration tool for managing stateful conversation threads and autonomous agents with integrated tool calling and file search. It also functions as an asynchronous batch processing client for monitoring large-scale request groups and fine-tuning jobs, alongside a management SDK for controllin
MoneyPrinterPlus is an automated video production system designed for the mass creation of short-form AI content. It functions as an end-to-end pipeline that uses large language models to generate scripts, synthesize voiceovers, and produce visual assets to assemble complete videos. The project is distinguished by its ability to batch-process high volumes of unique content through automated mixing and randomized asset pairing. It includes a social media auto-publisher that uses browser simulation to automate the upload and distribution of generated videos to platforms such as TikTok and Xiaoh
KnowledgeGraphData is a collection of structured datasets and corpora designed to provide a foundational layer for cognitive intelligence and artificial intelligence systems. It primarily consists of large-scale Chinese knowledge graph datasets, including entity-relation data and NLP training sets used to drive semantic understanding and automated question answering. The project focuses on the construction and export of massive entity-attribute-value graphs, organizing knowledge into portable formats. It provides specialized domain partitioning to tailor information retrieval for professional
This project is an educational curriculum and architectural framework for building autonomous AI agents and multi-agent systems. It provides a structured learning path focused on the development of independent software components capable of planning, executing tasks, and utilizing external tools to achieve high-level goals. The framework emphasizes multi-agent system orchestration through distributed architectures where specialized agents collaborate using standardized communication protocols. It details specific design patterns such as dual-memory systems for maintaining short-term plans and
This project is a comprehensive suite for neural speech synthesis, featuring a deep learning text-to-speech engine, a neural speech synthesis trainer, and a voice cloning toolkit. It provides a system for synthesizing human-like speech from text using neural network models and high-fidelity vocoders. The suite includes a speech model conversion utility to transform deep learning models between different formats for deployment across various hardware runtimes. It also provides a self-contained HTTP server to expose pre-trained text-to-speech models as a remote audio API. Capabilities include
Audiocraft is a deep learning audio library and machine learning framework designed for training, fine-tuning, and evaluating generative models for music and sound effects. It functions as a text-to-music generative model and a neural audio codec, providing the tools necessary to compress audio signals into discrete representations and synthesize high-fidelity waveforms from textual descriptions. The framework is distinguished by its ability to combine multiple conditioning signals, allowing for the generation of audio based on text prompts, melodic excerpts, or style-based audio clips. It al
This is a collection of pre-trained neural models for speech recognition, synthesis, and voice activity detection. It provides a library of assets designed for speech-to-text, text-to-speech, and the identification of human speech segments within audio. The project features text-to-speech synthesis with support for multiple languages and the use of Speech Synthesis Markup Language to control prosody, pitch, and timing. For speech recognition, the system includes capabilities for transcribing audio to text with word-level timestamp extraction and an automated punctuation restorer to insert cap
Bark is a generative audio engine and machine learning inference library designed to convert written text into high-fidelity speech and sound effects. It functions as a text-to-audio transformer, utilizing multi-stage neural network architectures to map semantic input tokens into detailed audio codebooks for synthesis. The system distinguishes itself through a hierarchical transformer stacking approach that separates semantic understanding from acoustic realization. By employing autoregressive token prediction and vector quantized codebook mapping, the engine bridges linguistic and sonic doma
Heartlib is an audio processing library for large language models that provides tools for audio tokenization, compression, and cross-modal alignment. It implements core models for audio-text embedding, automatic speech recognition, neural codecs, and text-driven audio synthesis. The project features a text-to-audio synthesis engine capable of generating high-fidelity music and speech from text descriptions or reference files. It also includes a neural audio codec designed for low-bitrate compression that preserves acoustic structure and sound quality. Additional capabilities cover audio-text
This project is a neural text-to-speech system and voice trainer that converts written text into spoken audio across a variety of global languages and regional dialects. It functions as an ONNX-based engine capable of performing fast offline inference and uses a phoneme-based controller to manage precise pronunciation. The system distinguishes itself through a comprehensive toolkit for neural voice training, allowing for the creation of custom single-speaker or multi-speaker models. It supports the export of these models to a standardized open format and provides hardware acceleration via gra
Audiblez is a text-to-speech audiobook generator that converts digital e-books into spoken audio files. The system processes written documents using speech synthesis and configurable voice profiles to produce audiobooks. The tool utilizes a graphical interface to manage the conversion workflow and task orchestration. It employs CUDA-accelerated processing to offload neural network computations to the GPU, increasing the speed of audio generation. The system includes capabilities for chapter-based file parsing and selective chapter conversion. Users can adjust synthesis parameters, including
Friendly ID is an ActiveRecord slugging plugin that generates human-readable URL slugs from model attributes, replacing numeric IDs for cleaner permalinks in Rails applications. It resolves database records by matching a slug string instead of the numeric primary key in finder methods, enabling friendlier URLs throughout an application. The plugin provides a slug conflict resolution system that appends a UUID or uses candidate attribute combinations to guarantee unique slugs when the primary choice is already taken. It also offers a scoped uniqueness engine that restricts slug uniqueness with
Logocreator is an open-source AI logo generator that creates professional logos from text descriptions. It uses the Flux AI model hosted on Together AI to interpret a company name, style preference, and optional background into a visual design, producing branded logos without requiring design skills or expensive software. The tool operates through a React frontend that manages user input and logo display, with a serverless backend that routes image generation requests to external AI APIs for scalable processing. It includes a prompt engineering pipeline that transforms user descriptions into
AudioLDM is a latent diffusion framework for generating high-fidelity audio, music, and sound effects. It functions as a text-to-audio generator that converts natural language descriptions into synthetic audio signals with control over pitch and environment. The system provides specialized tools for audio-to-audio synthesis and generative repair. This includes the ability to perform audio style transfer and replicate specific acoustic events based on existing files. The project covers a broad range of audio transformation tasks, including audio super-resolution for increasing signal fidelity
Bob is an extensible macOS utility designed for screen text extraction, translation aggregation, and speech synthesis. It functions as a wrapper that integrates multiple optical character recognition and translation services into a single interface, allowing users to capture screen areas, decode QR codes, and convert visual text into editable strings. The tool distinguishes itself through a plugin-based architecture that supports the integration of custom translation, speech synthesis, and image recognition APIs. It enables multi-engine parallel execution, allowing a single request to be proc
KittenTTS is a neural text-to-speech engine and text-to-audio synthesis tool that converts written text into spoken audio using lightweight neural network models. It functions as both a speech synthesizer and an audio file generator, producing spoken audio for offline playback. The system includes a text normalization processor that expands numbers and abbreviations into full spoken words to improve the naturalness of the synthesized speech. It supports diverse voice options and provides the ability to adjust playback speed.
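The text normalization step described above can be illustrated with a minimal sketch. This is not KittenTTS's implementation: the abbreviation table, the function names `number_to_words` and `normalize`, and the 0-999 number range are all assumptions chosen for illustration.

```python
import re

# Hypothetical abbreviation table; a real engine would use a far larger one.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999 (illustrative limit)."""
    if n < 0 or n > 999:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out standalone 1-3 digit numbers."""
    for abbr, full in ABBREVIATIONS.items():
        # \b keeps the match from starting in the middle of a longer word.
        text = re.sub(r"\b" + re.escape(abbr), full, text)
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)


print(normalize("Dr. Smith lives at 42 Elm St."))
```

Longer numbers, decimals, dates, and currency are deliberately left out; a production normalizer would need dedicated rules for each.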
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a Web
🎤 A Microsoft speech synthesis tool, built with Electron, Vue, ElementPlus, and Vite.
This project is a GPU-accelerated speech engine and AI voice cloning tool. It functions as a text-to-speech synthesizer and voice-to-voice converter that replicates specific human voices to generate synthetic speech. The system creates digital voice profiles by analyzing short audio samples or capturing live microphone input. These profiles enable the transformation of existing audio recordings into a target speaker's voice or the synthesis of new audio from written text. The engine supports subtitle-based speech generation for batch processing and automated dubbing workflows. A web-based au
Bert-VITS2 is a neural speech synthesis system and AI voice generator designed to convert written text into natural-sounding audio. It utilizes a VITS2 engine and a neural speech synthesis model to produce high-fidelity human voices. The system incorporates a multilingual BERT language processor to improve the prosody and emotional accuracy of the generated speech. It supports multilingual voice generation and custom voice cloning to replicate specific human speech patterns and tones. The architecture covers text-to-speech synthesis through a multi-stage pipeline involving phoneme alignment,
This project is a comprehensive reference guide and directory of web browser capabilities. It serves as a technical map for accessing native operating system functions, hardware interfaces, and standard web APIs to bridge the gap between web applications and desktop or mobile environments. The resource provides detailed guidance on implementing Progressive Web App features, including offline caching, push notifications, and native installation prompts. It also catalogs methods for interacting with hardware peripherals via USB, Bluetooth, and NFC, as well as reading raw data from device sensor
OpenMAIC is an LLM multi-agent education platform designed to create immersive, interactive classroom simulations. It functions as a learning environment where multiple AI agents collaborate through a state-machine orchestration framework to coordinate conversational turns and interactions. The platform features an AI-driven interactive lesson generator that transforms documents and topics into educational experiences including slides, quizzes, and project activities. It integrates a speech-enabled interface that combines speech-to-text and text-to-speech for voice-based interaction, alongsid
This project is a scalable, containerized pipeline designed to transform digital documents and image-based ebooks into narrated audiobooks. It functions as an end-to-end production platform that integrates text-to-speech synthesis, optical character recognition, and automated workflow management to convert various file formats into spoken audio. The system distinguishes itself through advanced linguistic analysis and voice synthesis capabilities, including the ability to identify characters within a text and assign them distinct voice profiles for multi-speaker narration. Users can further pe
Magenta is a comprehensive toolkit for training, synthesizing, and performing music through neural models and hardware-integrated engines. It functions as a machine learning framework that enables the generation, manipulation, and real-time performance of audio, providing the structural foundations for musical intelligence through hierarchical sequence modeling and symbolic processing. The project distinguishes itself by enabling real-time, low-latency neural audio synthesis that can be integrated directly into professional digital audio workstations. It supports interactive musical jamming a