# AI Voice Cloning and Synthesis

> Search results for `AI voice cloning and text-to-speech` on awesome-repositories.com. 102 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/ai-voice-cloning-and-text-to-speech

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/ai-voice-cloning-and-text-to-speech).**

## Results

- [corentinj/real-time-voice-cloning](https://awesome-repositories.com/repository/corentinj-real-time-voice-cloning.md) (59,918 ⭐) — This project is a neural text-to-speech engine and voice cloning toolkit designed to generate synthetic speech that mimics the vocal characteristics of a target speaker. It functions as a real-time audio synthesizer, utilizing a deep learning pipeline to convert written text into high-fidelity speech output with minimal latency.

The system employs a transfer learning framework that leverages pre-trained speaker verification models to adapt synthesis to new, unseen vocal identities. By using an encoder-based speaker embedding process, the toolkit maps variable-length audio samples into a latent space to preserve unique speaker characteristics. The architecture is organized into a modular pipeline that separates the encoding, synthesis, and vocoder stages, allowing for independent optimization of each component.

The synthesis process relies on autoregressive sequence generation to transform text into acoustic representations, which are then converted into time-domain waveforms by a neural vocoder. Users can interact with the system through both command-line and graphical interfaces to process custom recordings or pre-trained models for speech generation.
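
The speaker-embedding idea described above (variable-length audio mapped to a fixed-size vector) can be illustrated with a toy sketch. This is not the repository's encoder: mean pooling stands in for the trained network, the per-frame feature vectors are made-up numbers, and the names `embed_utterance` and `cosine_similarity` are hypothetical.

```python
import math

def embed_utterance(frames):
    """Mean-pool per-frame feature vectors, then L2-normalise the result.

    Assumption: a real encoder would use a trained neural network here;
    mean pooling is only a stand-in that yields a fixed-size vector.
    """
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def cosine_similarity(a, b):
    # Both inputs are unit-length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))

# Clips of different lengths (2, 4, and 3 frames) with 3 features per frame.
short_clip = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]]
long_clip = [[0.9, 0.1, 1.0], [1.1, 0.1, 0.9], [1.0, 0.0, 1.0], [1.0, 0.2, 0.9]]
other_voice = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]

same = cosine_similarity(embed_utterance(short_clip), embed_utterance(long_clip))
diff = cosine_similarity(embed_utterance(short_clip), embed_utterance(other_voice))
print(f"same speaker: {same:.3f}, different speaker: {diff:.3f}")
```

Whatever the clip length, the embedding has the same dimensionality, which is what lets a downstream synthesizer condition on it.
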
- [jianchang512/clone-voice](https://awesome-repositories.com/repository/jianchang512-clone-voice.md) (8,959 ⭐) — This project is a GPU-accelerated speech engine and AI voice cloning tool. It functions as a text-to-speech synthesizer and voice-to-voice converter that replicates specific human voices to generate synthetic speech.

The system creates digital voice profiles by analyzing short audio samples or capturing live microphone input. These profiles enable the transformation of existing audio recordings into a target speaker's voice or the synthesis of new audio from written text.

The engine supports subtitle-based speech generation for batch processing and automated dubbing workflows. A web-based audio interface provides a dashboard for recording voice samples and managing synthesis tasks.
- [myshell-ai/openvoice](https://awesome-repositories.com/repository/myshell-ai-openvoice.md) (36,720 ⭐) — OpenVoice is a multilingual text-to-speech framework and voice cloning AI model designed for high-fidelity voice replication and low-latency audio generation. It functions as an instant speech synthesis engine that converts text to audio while replicating a specific speaker's tone and color.

The system is distinguished by its ability to perform cross-lingual cloning, allowing the vocal characteristics of a reference speaker to be applied to speech in different languages regardless of the original training data. It utilizes a decoupled representation to separate the physical identity of a voice from its emotional and rhythmic delivery.

This tool provides granular speech control over audio generation, enabling adjustments to parameters such as emotion, accent, rhythm, and intonation. These capabilities allow for the creation of digital replicas using short audio samples to synthesize expressive speech.
- [fishaudio/fish-speech](https://awesome-repositories.com/repository/fishaudio-fish-speech.md) (24,928 ⭐) — This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns.

The platform distinguishes itself through a production-ready inference server that employs continuous batching to maximize hardware utilization and reduce latency. It includes a comprehensive voice cloning toolkit that replicates unique vocal characteristics from short reference audio samples without requiring additional model training. Users can further customize output through low-rank adaptation fine-tuning, which allows for efficient style adjustments, and speaker-specific token embeddings that manage distinct voice characteristics during multi-speaker generation.

Beyond core synthesis, the project provides a full suite of utilities for training and alignment, including reinforcement learning techniques to optimize for semantic accuracy and instruction adherence. It supports a variety of operational interfaces, including a command-line tool, a web-based dashboard, and an authenticated HTTP server for remote generation workloads. The system also includes data preparation and serialization tools to streamline the process of organizing and normalizing audio datasets for model training.
- [elevenlabs/elevenlabs-python](https://awesome-repositories.com/repository/elevenlabs-elevenlabs-python.md) (2,873 ⭐) — This Python SDK provides a comprehensive toolkit for synthetic audio generation, voice cloning, and the development of conversational AI agents. It enables the creation of lifelike spoken audio from text, the replication of human voices through custom cloning, and the deployment of real-time voice agents capable of interacting with external large language models.

The library distinguishes itself through deep integration of conversational AI capabilities, including the design of agent personas and the execution of real-time actions via APIs. It supports professional-grade audio production through a variety of specialized tools for multilingual dubbing, studio-quality music generation, and high-fidelity sound effects.

The SDK covers a broad surface of speech and media processing, including real-time audio streaming via WebSockets, speech-to-text transcription with speaker diarization, and the synchronization of audio with visual elements. It also provides utilities for monitoring generation costs and managing agent security through response guardrails and access controls.
- [babysor/mockingbird](https://awesome-repositories.com/repository/babysor-mockingbird.md) (36,903 ⭐) — MockingBird is an AI voice cloning tool and text-to-speech system designed to generate synthetic speech. It functions as a voice synthesis trainer for building custom models from audio datasets, a command-line generator for producing audio files, and a text-to-speech server for remote application integration.

The project specializes in real-time voice cloning, which extracts vocal characteristics from short audio samples to mimic a target speaker's unique timbre. It utilizes reference-driven audio synthesis to condition pre-trained models on specific audio samples, allowing for the generation of arbitrary speech that maintains a specific voice identity.

The system includes a neural text-to-speech pipeline and capabilities for dataset-driven model training to master specific languages or speaking styles. Users can interact with the software through a command-line interface or via a web server that exposes synthesis functionality as an API.
- [livekit/livekit](https://awesome-repositories.com/repository/livekit-livekit.md) (19,358 ⭐) — LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections.

The platform distinguishes itself through its modular pipeline-based media processing, which chains specialized speech-to-text, language, and text-to-speech services into cohesive workflows. It includes advanced capabilities for real-time voice activity detection, enabling natural turn-taking and interruption handling, alongside remote procedure call tooling that allows agents to execute external functions or access local resources during a conversation. Developers can further extend these interactions by integrating photorealistic virtual avatars that synchronize visual expressions with the agent's audio output.

Beyond core conversational logic, the system offers extensive support for telephony integration, allowing agents to connect to public networks via SIP for inbound and outbound calling. It provides a robust suite of observability and monitoring tools to track agent performance, connection quality, and session events, ensuring reliability in production environments. The platform also includes specialized utilities for task automation, such as capturing and validating structured user data, and supports multi-step workflow orchestration to handle complex, context-aware interactions.

The project provides a command-line interface for scaffolding, deploying, and testing agent applications, with documentation available in machine-readable formats to assist in development.
- [drewthomasson/ebook2audiobook](https://awesome-repositories.com/repository/drewthomasson-ebook2audiobook.md) (19,291 ⭐) — This project is a scalable, containerized pipeline designed to transform digital documents and image-based ebooks into narrated audiobooks. It functions as an end-to-end production platform that integrates text-to-speech synthesis, optical character recognition, and automated workflow management to convert various file formats into spoken audio.

The system distinguishes itself through advanced linguistic analysis and voice synthesis capabilities, including the ability to identify characters within a text and assign them distinct voice profiles for multi-speaker narration. Users can further personalize the output by training custom voice models on audio samples or by using markup tags to exert fine-grained control over pacing, pauses, and speaker switching during the generation process.

The platform supports high-volume production through parallel task orchestration and batch processing, with the option to offload resource-intensive rendering tasks to remote cloud environments or local graphics hardware. It provides both a command-line interface and a web-based dashboard to manage file uploads, voice assignments, and the lifecycle of audio generation tasks. The entire application stack is packaged into containerized environments to ensure consistent execution across diverse infrastructure.
- [lipku/livetalking](https://awesome-repositories.com/repository/lipku-livetalking.md) (8,042 ⭐) — LiveTalking is an interactive talking head engine and AI avatar management platform designed to synchronize synthetic speech with facial movements. It functions as a real-time orchestrator that connects large language models and text-to-speech services to neural-rendered digital humans.

The project distinguishes itself through low-latency streaming capabilities and the ability to handle real-time conversational interruptions. It supports advanced audio-visual customization, including human voice cloning and the ability to drive avatar expressions using real-time webcam data.

The platform covers a broad range of capabilities, including digital human animation, real-time video streaming via WebRTC and RTMP, and virtual camera broadcasting. It also provides tools for managing character profiles, coordinating idle animations, and rendering multiple avatars within a single frame.

The engine can be deployed via container images or cloud instances to ensure consistent environment management.
- [nari-labs/dia](https://awesome-repositories.com/repository/nari-labs-dia.md) (19,324 ⭐) — Dia is a generative AI audio tool and text-to-speech synthesis engine designed for the production-ready deployment of machine learning models. It provides a framework for creating lifelike synthetic speech by conditioning generation on reference audio samples to replicate specific vocal characteristics, emotional tones, and delivery styles.

The system distinguishes itself through its ability to perform custom voice cloning and precise control over audio output. Users can adjust generation parameters such as temperature and guidance scale to modify the pacing, creativity, and style of the synthesized speech. Additionally, the platform supports the injection of nonverbal vocal expressions, such as laughter or gasps, through the use of specialized text markers.

The framework integrates with standard machine learning ecosystems to facilitate the management and scaling of generative services. It supports modular model orchestration, ensuring that complex audio synthesis tasks remain consistent and performant within production environments.
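
As a rough illustration of how a temperature and a guidance scale can shape sampling, the sketch below assumes the guidance scale is applied in the classifier-free-guidance style (mixing conditional and unconditional logits). That is an assumption, not something the description confirms, and every function name here is hypothetical rather than part of Dia's API.

```python
import math
import random

def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Push logits away from the unconditional prediction (assumed CFG-style)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(cond_logits, uncond_logits, guidance_scale, temperature, rng):
    """Draw one token index from the guided, temperature-scaled distribution."""
    guided = apply_guidance(cond_logits, uncond_logits, guidance_scale)
    probs = softmax_with_temperature(guided, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
token = sample_token([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], 3.0, 0.8, rng)
print("sampled token index:", token)
```
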
- [coqui-ai/tts](https://awesome-repositories.com/repository/coqui-ai-tts.md) (45,568 ⭐) — This project is a deep learning text-to-speech toolkit used for training and deploying neural speech synthesis models. It provides a comprehensive framework for converting written text into spoken audio, utilizing neural vocoders to transform synthesized spectrograms into high-fidelity audio waveforms.

The toolkit includes a voice cloning system that replicates specific human voices by extracting speaker embeddings from short audio samples. It also supports multi-speaker audio synthesis, allowing the generation of speech across different vocal identities using specialized model architectures.

The system covers the full speech synthesis pipeline, including tools for speech dataset curation, custom model training with performance tracking, and a command-line interface for audio generation. For network access, it provides a self-hosted HTTP server to deploy speech synthesis models as an API.
- [jasonppy/voicecraft](https://awesome-repositories.com/repository/jasonppy-voicecraft.md) (8,500 ⭐) — VoiceCraft is a neural speech generation and manipulation system consisting of a text-to-speech system, a voice cloning tool, and an audio inpainting engine. It uses a large language model approach to synthesize high-fidelity audio from text and replicate speaker identities.

The system provides zero-shot voice cloning and speech editing capabilities, allowing users to modify spoken content within existing recordings. This includes an audio inpainting engine that replaces specific sections of audio with new speech while preserving the original acoustic characteristics and speaker identity.

The project covers high-level capabilities for text-to-speech synthesis, custom voice model training through phoneme-based tokenization, and acoustic speech refinement. It utilizes autoregressive synthesis and latent space representations to decouple speaker identity from linguistic content.
- [aidc-ai/pixelle-video](https://awesome-repositories.com/repository/aidc-ai-pixelle-video.md) (23,403 ⭐) — Pixelle-Video is a text-to-video automation platform and generation engine that converts text topics into complete videos with synchronized narration, images, and music. It functions as a modular system for producing short-form content, utilizing large language models to automate script composition, visual asset generation, and voiceover production.

The platform features a node-based workflow orchestrator that allows the composition of custom generation pipelines by linking different AI models. It includes a dynamic video layout designer that uses HTML templates to define aspect ratios and visual arrangements, as well as a voice cloning system that creates synthetic speech by analyzing uploaded audio reference files.

The system covers broad capabilities in audio-visual styling, including the application of aesthetic themes and global style configurations. It manages content production through AI-driven script synchronization, motion video generation, and multi-engine audio mixing.

Infrastructure and backend management include the configuration of AI model endpoints, cloud compute resource management for GPU memory and concurrency, and the integration of custom JSON-defined workflows.
- [nvidia/personaplex](https://awesome-repositories.com/repository/nvidia-personaplex.md) (10,030 ⭐) — Personaplex is an LLM speech-to-speech framework and conversational AI persona engine designed for real-time voice interfaces. It provides a system for defining AI identities and vocal characteristics through a combination of text-based role prompts and audio reference files.

The project features a real-time AI voice interface that supports full-duplex human-AI dialogue, enabling multiple parties to speak and listen simultaneously via bidirectional audio streaming. It includes a GPU-accelerated audio processor and a speech-to-speech pipeline to facilitate low-latency conversations.

The framework incorporates resource management tools, such as CPU model layer offloading to move components between video and system memory. It also supports offline audio processing for asynchronous generation and session routing to direct clients to specific worker instances.
- [w-okada/voice-changer](https://awesome-repositories.com/repository/w-okada-voice-changer.md) (19,729 ⭐) — This software is a real-time voice changer that utilizes machine learning inference to transform live microphone input into target vocal characteristics. It functions as an artificial intelligence audio processing tool designed to modify vocal identity during active communication or live broadcasts.

The application distinguishes itself by executing neural network models directly within the browser environment. It leverages web-based compute acceleration and dedicated audio threading to maintain low-latency performance, allowing users to switch between different voice profiles while processing audio streams in real time.

The system integrates with external communication platforms by injecting processed media streams directly into the audio pipeline. It supports a range of audio engineering tasks, enabling the application of complex signal transformations for virtual content creation and live vocal modification.
- [mastra-ai/mastra](https://awesome-repositories.com/repository/mastra-ai-mastra.md) (21,221 ⭐) — Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention.

The framework distinguishes itself through its focus on observability and secure, isolated execution. It features a built-in telemetry pipeline that captures structured execution traces, logs, and performance metrics, allowing for real-time debugging and evaluation of agent behavior. Furthermore, it utilizes sandboxed environments to isolate code execution and filesystem operations, ensuring that agent interactions remain secure and reproducible.

Mastra covers a broad capability surface, including multi-agent delegation hierarchies, schema-validated tool execution, and real-time voice interaction. It supports advanced orchestration patterns such as human-in-the-loop approvals, persistent state management for long-running workflows, and retrieval-augmented generation using vector-based semantic memory. These features are designed to work together to support the entire lifecycle of AI-powered applications, from initial development and testing to production deployment.

The project is built for TypeScript environments and provides a modular architecture that integrates with existing web stacks and infrastructure. It includes a client SDK for interacting with remote agents and supports various authentication providers to secure API endpoints and agent resources.
- [openbmb/minicpm-o](https://awesome-repositories.com/repository/openbmb-minicpm-o.md) (23,850 ⭐) — MiniCPM-o is a multimodal large language model designed to function as a real-time conversational assistant on edge devices. By mapping text, image, video, and audio inputs into a unified latent space, the system enables simultaneous cross-modal reasoning and full-duplex interaction. It is built as an edge-side inference engine, utilizing quantized model weights to maintain high-performance processing on consumer hardware.

The system distinguishes itself through its integrated speech synthesis and voice cloning capabilities, which allow for the generation of expressive, personalized vocal output from short audio samples without additional training. Users can modulate the emotional tone, speed, and emphasis of synthesized speech in real time using latent prosody control tokens. Furthermore, the model supports the adoption of specific personas and roles, facilitating immersive, situation-aware dialogue.

Beyond its core conversational features, the framework provides tools for proactive visual assistance, such as monitoring environments to trigger navigation or scheduling alerts. The architecture is configurable, allowing for adjustments to visual token compression and frame sampling rates to balance accuracy and speed. The project supports fine-tuning for specialized domains, enabling developers to adapt the model to custom tasks using standard training frameworks.
- [tanchaowen84/voice-clone](https://awesome-repositories.com/repository/tanchaowen84-voice-clone.md) (23 ⭐) — Voice Clone is an AI-powered tool that lets you instantly clone any voice in just seconds. Built for creators, developers, and businesses, it delivers high-quality, natural-sounding results with a simple API and web interface. Perfect for podcasts, videos, games, and more — no recording studio required.
- [appwrite/appwrite](https://awesome-repositories.com/repository/appwrite-appwrite.md) (56,318 ⭐) — Appwrite is a backend-as-a-service platform that provides a unified development environment for building full-stack applications. It integrates essential infrastructure components—including authentication, databases, storage, and serverless functions—into a single, centralized interface to simplify application development and resource management.

The platform distinguishes itself through a container-based microservices architecture that ensures consistent execution across diverse infrastructure. It features a versatile connectivity layer that links frontend applications with third-party services, databases, and external APIs through standardized interfaces. Developers can manage and automate the configuration of these backend resources using infrastructure-as-code tools, while granular role-based access control enforces security policies across all platform resources and API endpoints.

Beyond its core services, the platform offers a broad capability surface that includes cross-platform data synchronization, event-driven webhooks, and comprehensive billing and usage monitoring. It supports extensive integrations for AI utilities, payment processing, messaging, and logging, allowing developers to extend application functionality through modular, event-driven workflows.

The platform is designed for both managed and self-hosted deployments, providing tools for production environment optimization, data migration, and custom domain configuration.
- [picovoice/speech-to-text-benchmark](https://awesome-repositories.com/repository/picovoice-speech-to-text-benchmark.md) (693 ⭐) — A benchmark framework for speech-to-text engines.
- [capacitor-community/text-to-speech](https://awesome-repositories.com/repository/capacitor-community-text-to-speech.md) (0 ⭐) — Capacitor community plugin for synthesizing speech from text.
- [danielmiessler/fabric](https://awesome-repositories.com/repository/danielmiessler-fabric.md) (42,408 ⭐) — Fabric is a command-line orchestrator designed to automate complex data processing and content generation tasks by chaining artificial intelligence models with modular prompt templates. It functions as a terminal-based tool that utilizes standard input and output streams, allowing users to pipe data directly into predefined reasoning strategies. By providing a model-agnostic abstraction layer, the system decouples execution logic from specific artificial intelligence vendors, normalizing requests and responses across different service providers.

The platform distinguishes itself through its pattern-based orchestration, which enables the organization, storage, and reuse of custom prompt collections for consistent task execution. It includes a built-in server component that exposes these local prompt workflows as standard web endpoints, allowing external software and graphical interfaces to interact with custom logic as if it were a native model. Users can manage these interactions through a dedicated directory for private templates or via a graphical web dashboard, providing flexibility in how automated workflows are configured and monitored.

Beyond its core orchestration capabilities, the tool offers a suite of utilities for development tasks, including document analysis, code context generation, and system interaction. It supports advanced reasoning techniques, such as chain-of-thought processing, and allows for specific model-to-pattern mapping to balance performance and operational costs. The system maintains state and configuration through local filesystem storage, ensuring portability across different operating environments.
- [rafalwilinski/serverless-medium-text-to-speech](https://awesome-repositories.com/repository/rafalwilinski-serverless-medium-text-to-speech.md) (0 ⭐) — Serverless-based, text-to-speech service for Medium articles.
- [rvc-boss/gpt-sovits](https://awesome-repositories.com/repository/rvc-boss-gpt-sovits.md) (58,724 ⭐) — GPT-SoVITS is a text-to-speech synthesis engine and voice cloning toolkit designed for generating natural-sounding human speech. It functions as a neural audio processing pipeline that maps input text to high-fidelity audio waveforms, utilizing conditional variational autoencoders and flow-based decoders to ensure expressive output.

The platform distinguishes itself through its ability to perform few-shot voice cloning and cross-lingual speech generation, allowing users to maintain a specific speaker's vocal identity and emotional delivery across multiple languages. By employing cross-modal latent alignment, the system effectively bridges text-based linguistic features with speaker-specific embeddings, while a generative adversarial network-based vocoder ensures the final audio maintains high time-domain quality.

The software provides a modular pipeline that supports the entire lifecycle of custom voice model development, including data preprocessing, fine-tuning on small datasets, and inference. It incorporates self-supervised speech representation models to extract discrete linguistic units, facilitating robust voice conversion and automated audio content creation. The project includes documentation for model training, inference procedures, and command-line execution.
- [kevinwang676/chatglm2-voice-cloning](https://awesome-repositories.com/repository/kevinwang676-chatglm2-voice-cloning.md) (616 ⭐) — Chat with any character you like: immersive conversations built on ChatGLM2, SadTalker, voice cloning, and video dialogue.
- [google-gemini/cookbook](https://awesome-repositories.com/repository/google-gemini-cookbook.md) (17,418 ⭐) — The Gemini Cookbook is a comprehensive collection of implementation patterns, code samples, and development guides designed for building applications with Google Gemini models. It serves as a central resource for developers to integrate multimodal generative artificial intelligence into their software, providing the necessary frameworks to manage model interactions, stateful workflows, and structured data extraction.

The repository distinguishes itself by offering specialized toolkits for autonomous agent orchestration, enabling the construction of agents that can execute code, browse the web, and perform multi-step tasks in sandboxed environments. It provides deep support for real-time conversational interfaces, including bidirectional streaming for audio, video, and text, as well as advanced capabilities for multimodal content generation and long-context data processing.

Beyond core model integration, the project covers a broad capability surface including retrieval-augmented generation, batch processing for high-throughput workloads, and observability tools for monitoring token usage and debugging API interactions. It also provides guidance on security primitives, such as authentication and content safety, alongside operational strategies for cost optimization and infrastructure management.

The documentation is structured as a series of Jupyter Notebooks, offering interactive examples that demonstrate how to implement these features within production-grade artificial intelligence systems.
- [kyutai-labs/pocket-tts](https://awesome-repositories.com/repository/kyutai-labs-pocket-tts.md) (3,301 ⭐) — Pocket-tts is a text-to-speech server and neural speech synthesizer that converts written text into audible speech. It includes a CPU-optimized inference engine and a voice cloning tool capable of analyzing audio samples to reproduce specific speaker characteristics.

The system differentiates itself through the use of dynamic int8 quantization to reduce memory usage and increase generation speed on processors. It supports real-time speech synthesis by streaming audio chunks incrementally and utilizes voice state caching to store processed embeddings as portable files, bypassing redundant processing during speaker cloning.

The project covers a broad range of capabilities, including local model hosting and self-hosted API services for remote audio generation. It provides utilities for model initialization across multiple languages and a native backend to handle computationally intensive synthesis operations.
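
To make the int8 quantization mentioned above concrete, here is a minimal sketch of symmetric per-tensor quantization with the scale computed from the data at run time. It is an illustration under stated assumptions, not the project's implementation; `quantize_int8` and `dequantize` are hypothetical names, and the weights are made-up values.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation: returns (int8 codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp into the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.0, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"worst error={worst_error:.5f}")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step.
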
- [azex-ai/speech](https://awesome-repositories.com/repository/azex-ai-speech.md) (0 ⭐) — macOS native voice input for Crypto + AI professionals — offline ASR, domain vocabulary, implicit learning
- [encoredev/encore](https://awesome-repositories.com/repository/encoredev-encore.md) (12,049 ⭐) — Encore is a distributed systems framework designed to unify backend development, infrastructure provisioning, and observability. It functions as an infrastructure-as-code platform that allows developers to define cloud resources, databases, and messaging topics directly within their application code. By analyzing these declarations at compile-time, the system automatically manages the deployment of cloud resources and security policies, ensuring parity between local development and production environments.

The platform distinguishes itself through its integrated development experience, which includes a local workspace that mirrors production infrastructure to facilitate testing and debugging. It provides automated AI-assisted development tools that leverage application metadata and runtime telemetry to aid in code generation and performance analysis. Furthermore, the framework enforces architectural standards and automates the creation of ephemeral, production-like environments for every pull request, streamlining the validation process before deployment.

Beyond its core orchestration capabilities, the framework includes a comprehensive suite for building type-safe APIs and event-driven services. It handles the complexities of service communication, including automated client library generation, request validation, and distributed tracing instrumentation. The system also incorporates robust security primitives, such as identity token validation, secret management, and automated traffic control, to support the development of secure, scalable backend architectures.
- [emotional-text-to-speech/dl-for-emo-tts](https://awesome-repositories.com/repository/emotional-text-to-speech-dl-for-emo-tts.md) (458 ⭐) — A summary of our attempts at using deep learning approaches for emotional text-to-speech
- [getstream/vision-agents](https://awesome-repositories.com/repository/getstream-vision-agents.md) (6,029 ⭐)
- [liquidgalaxylab/lg-gesture-and-voice-control](https://awesome-repositories.com/repository/liquidgalaxylab-lg-gesture-and-voice-control.md) (0 ⭐) — LG Gesture and Voice Control: an app that provides gesture and voice control for Liquid Galaxy.
- [mudler/localai](https://awesome-repositories.com/repository/mudler-localai.md) (46,889 ⭐) — LocalAI is a self-hosted inference server that enables the execution of machine learning models directly on local hardware. By providing a unified interface for text, image, and audio processing, it allows users to maintain full control over data privacy and infrastructure costs while eliminating dependencies on external network services.

The platform functions as an API gateway that mimics standard cloud-based artificial intelligence interfaces, allowing existing applications to integrate local models as drop-in replacements. It utilizes a container-based architecture to package runtimes and dependencies, ensuring consistent deployment across diverse hardware configurations. To optimize system performance, the server employs an on-demand orchestration layer that dynamically loads and unloads models based on active requests, minimizing memory usage during periods of inactivity.

The system supports a wide range of model architectures through a flexible backend abstraction that allows for driver switching at runtime. Users can manage their models and interact with the service through a web interface or via standard web requests, which the proxy translates into model-specific execution commands. The software is distributed as a containerized application to facilitate deployment across various server and cloud environments.
- [facebookresearch/fairseq](https://awesome-repositories.com/repository/facebookresearch-fairseq.md) (32,228 ⭐) — Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning.

The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specialized tools for data engineering, such as parallel data mining for unsupervised learning and back-translation for expanding training corpora.

Its capability surface extends to comprehensive inference and generation tools, including beam search and lexical constraint enforcement, as well as model compression techniques like layer pruning and product quantization. The toolkit also provides utilities for feature extraction, model evaluation via metrics like perplexity and BLEU scores, and a registry-based system for extending models and tasks.

Training and evaluation workflows are managed through a command-line interface that orchestrates hyperparameter configuration and model execution.
- [paarthneekhara/text-to-image](https://awesome-repositories.com/repository/paarthneekhara-text-to-image.md) (2,160 ⭐) — Text to image synthesis using thought vectors
- [modstart-lib/aigcpanel](https://awesome-repositories.com/repository/modstart-lib-aigcpanel.md) (4,576 ⭐) — Aigcpanel is a visual workflow automation tool and model lifecycle manager designed for generative AI media pipelines. It provides a unified interface to install, launch, and configure both local and remote AI model endpoints, acting as an orchestration platform for large language models and AI tools.

The system features a drag-and-drop node editor for chaining AI models and scripts into automated processing pipelines. It distinguishes itself with a breakpoint-aware execution model that allows users to pause and resume long media tasks from specific points in the workflow. Additionally, it includes a command line interface for executing model functions and managing deployments via external scripts.

The suite covers specialized media generation capabilities, including digital human synthesis through voice cloning and lip-sync video generation. It also provides tools for audio and video processing, such as speech-to-text transcription and background removal, alongside an automation engine for monitoring live stream chat comments to trigger automated responses.
- [open-speech/speech-aligner](https://awesome-repositories.com/repository/open-speech-speech-aligner.md) (410 ⭐) — speech-aligner is a tool that generates phoneme-level time alignments between human speech and its transcription.
- [fmhy/fmhy](https://awesome-repositories.com/repository/fmhy-fmhy.md) (13,150 ⭐) — FMHY is a community-driven index designed to organize and distribute decentralized digital content through standardized metadata and protocol-agnostic linking. It functions as a resilient, distributed map of internet resources, providing a structured directory that facilitates the discovery of media, software, and educational tools without reliance on centralized control.

The project distinguishes itself by maintaining a massive, human-verified repository of external links that span diverse digital ecosystems, including peer-to-peer networks, Usenet, and direct download servers. By utilizing lightweight, version-controlled text files, the platform enables easy mirroring and local hosting, ensuring that its comprehensive index remains accessible and redundant across various environments.

The directory covers a broad operational surface, including tools for digital media acquisition, retro gaming emulation, and self-directed academic learning. It also provides extensive resources for system privacy and security, artificial intelligence integration, and professional development, offering a centralized hub for navigating complex online information.

The project is documented through a series of structured, navigable directories that allow users to filter and locate specific resources efficiently.
- [zsdonghao/text-to-image](https://awesome-repositories.com/repository/zsdonghao-text-to-image.md) (599 ⭐) — Generative Adversarial Text to Image Synthesis
- [microsoft/vibevoice](https://awesome-repositories.com/repository/microsoft-vibevoice.md) (49,394 ⭐) — VibeVoice is a generative artificial intelligence platform designed for text-to-speech synthesis. It functions as a neural audio generation framework that converts written text into natural-sounding spoken audio, specifically engineered to maintain consistent vocal characteristics and narrative prosody across extended passages of content.

The system distinguishes itself through its ability to generate long-form conversational speech while preserving speaker identity and linguistic content. By utilizing latent space disentanglement, the model separates speaker traits from the input text, allowing for consistent voice cloning. Its architecture supports real-time streaming inference, which processes audio in sequential chunks to minimize latency during generation.

The framework covers a broad range of capabilities for automated content narration and high-quality speech synthesis. It employs hierarchical context encoding and token-based audio quantization to manage long-range dependencies and improve the efficiency of generating extended audio sequences.
- [vhpoet/alfred-text-to-calendar](https://awesome-repositories.com/repository/vhpoet-alfred-text-to-calendar.md) (8 ⭐) — An Alfred workflow that uses AI to intelligently convert text into calendar events. Simply select text or type a command to create calendar events with natural language.
- [jamiepine/voicebox](https://awesome-repositories.com/repository/jamiepine-voicebox.md) (30,041 ⭐) — Voicebox is a local speech processing system that provides text-to-speech generation, speech-to-text transcription, and voice cloning. It utilizes local machine learning inference and GPU acceleration to process audio and text data without relying on external API calls.

The project features a voice cloning toolkit for creating synthetic profiles from audio samples and a timeline-based voice editor for composing multi-character conversations. It also includes an AI voice management API that allows external applications and AI agents to programmatically manage voice profiles and generate speech.

Capabilities cover audio processing pipelines for effects like pitch shifts and reverb, as well as real-time and file-based transcription with filler word removal. The system supports persona-based dialogue generation, batch synthesis with prompt caching, and global text dictation for inserting transcripts directly into the operating system clipboard.

The processing engine can be hosted on local hardware or remote GPU servers.
- [abus-aikorea/voice-pro](https://awesome-repositories.com/repository/abus-aikorea-voice-pro.md) (6,255 ⭐) — Voice Pro is a comprehensive speech and audio processing toolkit that combines text-to-speech synthesis, voice cloning, speech recognition, and translation capabilities into a single application. At its core, the project enables users to generate natural-sounding speech from text, clone voices from short audio samples without requiring prior training data, and perform real-time speech translation across more than 100 languages.

The platform distinguishes itself through its integrated multimedia workflow, allowing users to download YouTube videos, extract audio, separate voice tracks, generate word-timed subtitles, and produce dubbed content in over 100 languages through a unified pipeline. It supports multiple speech synthesis engines including Edge-TTS, F5-TTS, E2-TTS, CosyVoice, and kokoro, while also providing the ability to train custom TTS models on user-provided datasets and export trained models to ONNX format for deployment.

Beyond core speech generation, the application offers extensive audio processing features such as transcribing speech to text with word-level subtitle generation, translating subtitle files while preserving formatting, and performing real-time speech recognition and translation with customizable audio inputs. The system also includes capabilities for extracting audio from video, removing noise, and managing the application's installation and dependencies through built-in cleanup utilities.
- [getpaseo/paseo](https://awesome-repositories.com/repository/getpaseo-paseo.md) (9,118 ⭐) — Paseo is an LLM coding agent orchestrator and multi-agent workflow manager designed to coordinate multiple AI agents across isolated git worktrees. It provides a unified control interface for managing these agents and their associated environments to execute complex programming tasks.

The system distinguishes itself through a remote agent daemon that enables secure access to local coding agents via encrypted relays. It employs a git worktree environment manager to isolate parallel tasks into dedicated directories and branch-based server URLs, preventing file collisions and network port conflicts between concurrent agents.

The platform covers wide-ranging capabilities including multi-agent orchestration via specialized agent committees, iterative worker-verifier execution loops, and comprehensive git workflow management. It includes tools for visual code review, GitHub API integration, and a command line interface for streaming real-time output and managing agent sessions.

The architecture utilizes a headless daemon and a standardized JSON-RPC protocol to communicate with agent binaries over stdio.
- [boson-ai/higgs-audio](https://awesome-repositories.com/repository/boson-ai-higgs-audio.md) (7,919 ⭐) — Higgs-audio is a generative text-to-speech engine that transforms text into natural conversational speech using large language model architectures. It functions as a multilingual speech synthesizer capable of generating high-fidelity audio across different languages with control over emotional tone and prosody.

The system includes a voice cloning tool that creates synthetic replicas of specific speakers from short audio samples without requiring extensive model training. It also provides a streaming audio API designed to deliver generated speech incrementally to minimize playback delay.

The project covers a broad capability surface including real-time audio streaming, custom voice cloning, and the synthesis of conversational speech with a focus on realistic prosody and tonal control.
- [paulwoitaschek/voice](https://awesome-repositories.com/repository/paulwoitaschek-voice.md) (0 ⭐) — Voice
- [microsoft/vscode](https://awesome-repositories.com/repository/microsoft-vscode.md) (186,401 ⭐) — This project is a cross-platform code editor designed for software development, offering a comprehensive suite of tools for text editing, workspace management, and task automation. It includes native support for version control, an integrated terminal, and a flexible task runner that allows for the execution of build, test, and deployment workflows directly within the environment.

The editor features an extensive AI-driven development assistant system, which provides conversational chat interfaces, inline code suggestions, and autonomous agents capable of executing multi-step coding tasks. These AI capabilities are supported by a framework for implementation planning, context curation, and custom agent configuration, allowing developers to tailor the editor's behavior to specific project standards.

To support diverse development needs, the editor provides a robust extension framework that enables the integration of language-specific tools, custom UI elements, and specialized build system support. Administrative controls are available for enterprise environments, allowing for the management of extensions, network configurations, and compliance policies. The software is available as a downloadable application with support for portable execution and frequent release channels.
- [gitbrew/voices](https://awesome-repositories.com/repository/gitbrew-voices.md) (0 ⭐) — voices
- [plachtaa/seed-vc](https://awesome-repositories.com/repository/plachtaa-seed-vc.md) (3,590 ⭐) — seed-vc is an AI voice conversion tool and voice cloning system designed to transform the timbre, accent, and emotion of speech recordings. It provides a framework for replicating specific speaker identities and singing styles using short reference audio samples.

The project includes a voice fine-tuning framework for training models on custom audio datasets to increase the accuracy of voice clones. It also features speech anonymization tools that remove unique speaker traits to produce a generic average voice for identity protection.

The system covers a broad range of audio processing capabilities, including zero-shot voice conversion, talking pace control, and the modification of emotional delivery and accents. It supports both spoken speech and singing voice conversion to transfer styles between source and target recordings.
- [hammerspoon/hammerspoon](https://awesome-repositories.com/repository/hammerspoon-hammerspoon.md) (14,497 ⭐) — Hammerspoon is a programmable automation engine for macOS that enables deep system-level control through a Lua scripting environment. By bridging high-level scripts with native Objective-C APIs, it allows users to interact with the operating system's accessibility tree, intercept hardware input streams, and manage the lifecycle of running applications.

The project distinguishes itself through an event-driven architecture that registers asynchronous hooks for system notifications and hardware events. This allows for real-time automation, such as remapping keyboard and mouse inputs, managing window layouts via grid-based positioning, and responding to changes in network status, battery levels, or display configurations. Its modular extension system supports the loading of self-contained units of functionality, enabling users to tailor the environment to specific workflows.

Beyond core automation, the platform provides a comprehensive suite of capabilities for network integration, media and hardware control, and data persistence. It includes tools for managing audio devices, interacting with professional control panels, rendering custom graphical overlays, and executing shell commands or system scripts. The environment also supports complex window management, including spatial navigation and tabbed grouping, alongside monitoring utilities for system hardware and diagnostic logging.

The project provides a command-line interface for managing configurations and includes built-in documentation servers to assist with script development.
