30 open-source projects similar to ten-framework/ten-framework, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Ten Framework alternative.
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
This project is a framework for developing multimodal AI agents that function as programmable participants in real-time communication rooms. It enables the construction of agents that can see, hear, and speak by integrating speech-to-text, large language models, and text-to-speech pipelines to facilitate low-latency, natural conversations. The system is distinguished by its advanced orchestration of real-time media and conversational flow, including support for full-duplex speech, preemptive response generation, and sophisticated interruption management. It further differentiates itself throu
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
MiniCPM-V is a multimodal large language model and vision-language system designed for complex visual and linguistic understanding. It is a compact neural network that runs on-device and processes text, images, and video. The project targets edge deployment, using quantization and weight sharding to run on memory-constrained mobile chipsets, so multimodal inference can run locally on mobile operating systems. Its capabilities cover multimodal content analysis of high-resolution im
Vocode-core is a framework for building real-time conversational AI voice agents. It serves as a conversational orchestrator and pipeline that integrates speech-to-text, large language models, and text-to-speech services to enable low-latency voice interactions. The project features a provider-agnostic interface that allows for swappable speech and language model providers, including support for both cloud APIs and local binaries. It distinguishes itself through a specialized telephony integration layer that enables agents to be deployed across phone lines, WebRTC, and virtual meeting platfor
Personaplex is an LLM speech-to-speech framework and conversational AI persona engine designed for real-time voice interfaces. It provides a system for defining AI identities and vocal characteristics through a combination of text-based role prompts and audio reference files. The project features a real-time AI voice interface that supports full-duplex human-AI dialogue, enabling multiple parties to speak and listen simultaneously via bidirectional audio streaming. It includes a GPU-accelerated audio processor and a speech-to-speech pipeline to facilitate low-latency conversations. The frame
This project is a framework for building local voice assistants and a real-time audio streaming server. It functions as a containerized inference engine and a multilingual speech pipeline that orchestrates speech-to-text, language models, and text-to-speech components to convert spoken input into spoken output. The system is distinguished by its use of WebSocket-based bidirectional streaming for low-latency interactions. It features a voice activity detection system that manages speech boundaries and handles user barge-in interruptions during assistant playback. It also supports custom voice
NeMo is a multimodal AI framework and toolkit designed for the development, training, and scaling of large language models, generative AI systems, and speech-based models. It functions as an automatic speech recognition toolkit, a text-to-speech engine, and a framework for building models that process and generate combinations of text, image, and audio data. The project serves as a conversational AI orchestrator capable of managing real-time, interruptible voice interactions. It provides specialized workflows for speech translation, converting spoken audio from one language into text or speec
Sherpa-ONNX is an ONNX-based speech processing toolkit that provides a local speech recognition engine, an on-device voice synthesis tool, and a speaker identification framework. It is designed as a cross-platform speech API that enables speech-to-text, text-to-speech, and speaker verification tasks to be executed locally on a device without requiring network access. The project is distinguished by its ability to perform zero-shot voice cloning and speaker diarization on-device. It supports a wide range of hardware accelerations, including GPU and various NPU architectures, and provides a Web
RealtimeSTT is a local speech-to-text engine and real-time automatic speech recognition server. It utilizes transformer-based recognition and omnilingual pipelines to convert live audio streams into text, providing a WebSocket-based streaming API for raw PCM audio transmission. The project is distinguished by a dual-backend transcription pipeline that uses a lightweight engine for immediate partial suggestions and a heavier model for final high-accuracy results. It includes a wake word detection system to trigger recording and employs a shared-resource inference model to distribute heavy spee
Nexent is an enterprise AI control plane and LLM agent orchestration platform. It provides a zero-code environment for designing, deploying, and managing production AI agents through a multi-agent collaboration framework that coordinates specialized autonomous agents using standardized messaging protocols. The platform integrates the Model Context Protocol to connect agents with external tools, plugins, and services via a universal communication interface. It further distinguishes itself with a dedicated RAG knowledge base manager that imports unstructured documents and utilizes hybrid search
LiveTalking is an interactive talking head engine and AI avatar management platform designed to synchronize synthetic speech with facial movements. It functions as a real-time orchestrator that connects large language models and text-to-speech services to neural-rendered digital humans. The project distinguishes itself through low-latency streaming capabilities and the ability to handle real-time conversational interruptions. It supports advanced audio-visual customization, including human voice cloning and the ability to drive avatar expressions using real-time webcam data. The platform cov
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
This Python SDK provides a comprehensive toolkit for synthetic audio generation, voice cloning, and the development of conversational AI agents. It enables the creation of lifelike spoken audio from text, the replication of human voices through custom cloning, and the deployment of real-time voice agents capable of interacting with external large language models. The library distinguishes itself through deep integration of conversational AI capabilities, including the design of agent personas and the execution of real-time actions via APIs. It supports professional-grade audio production thro
Vosk is an offline speech-to-text engine and API that converts spoken audio into text locally on a device. It provides a cross-platform speech toolkit with language bindings for integrating voice recognition into server environments, Android, iOS, and Raspberry Pi. The project includes a speaker identification tool to distinguish between different voices and an acoustic model trainer for building custom neural network models. These training tools enable speech feature extraction and model accuracy evaluation to improve recognition for specialized domains. The system supports real-time audio
This project is a multimodal translation framework and large language model capable of speech-to-speech, speech-to-text, and text-to-text translation across nearly 100 languages. It provides a real-time speech translation engine and a comprehensive toolkit for converting spoken audio between languages. The system is distinguished by its ability to preserve the original speaker's tone, pace, and prosody during translation. It utilizes a specialized on-device inference toolkit that converts model checkpoints into C-based libraries, enabling low-latency execution on mobile and edge hardware with
Fonoster is a conversational AI framework and multi-tenant communications platform as a service. It serves as a programmable voice gateway and SIP telephony platform, enabling the creation of voice-based assistants and automated communication workflows using large language models. The project distinguishes itself through a vendor-agnostic speech integration engine that abstracts speech-to-text and text-to-speech providers. It features a multi-tenant architecture that isolates telephony resources and user identities into distinct organizational workspaces. The system covers a broad range of t
Duix-Mobile is a software development kit for deploying real-time conversational AI characters on mobile devices. It enables the creation of interactive digital humans capable of fluid voice-to-voice interactions, featuring low-latency speech recognition and synchronized lip movements. The project distinguishes itself through the ability to integrate custom external language models and speech providers to define an avatar's intelligence and voice. It supports the generation of real-time multilingual subtitles and provides mechanisms to track the training status of newly created digital charac
WhisperLive is a real-time speech-to-text server that converts live audio streams into text using Whisper models. It functions as a backend service that receives microphone input via WebSockets and provides incremental transcriptions with word-level timestamps. The system utilizes a GPU-accelerated inference engine and a keyword-boosted transcription API to improve the recognition accuracy of domain-specific jargon, acronyms, and product names. It also includes a speaker diarization tool that clusters audio embeddings to identify and label different participants within a recording. Additiona
The Gemini Cookbook is a comprehensive collection of implementation patterns, code samples, and development guides designed for building applications with Google Gemini models. It serves as a central resource for developers to integrate multimodal generative artificial intelligence into their software, providing the necessary frameworks to manage model interactions, stateful workflows, and structured data extraction. The repository distinguishes itself by offering specialized toolkits for autonomous agent orchestration, enabling the construction of agents that can execute code, browse the web
FunASR is an automatic speech recognition toolkit and multilingual speech-to-text engine designed to convert spoken audio into written text across more than fifty languages. It provides a framework for speaker diarization, an OpenAI-compatible transcription API for local server hosting, and speech models compatible with the ONNX format. The project distinguishes itself by supporting high-performance inference on edge hardware via self-contained binaries and portable model exports. It incorporates specialized capabilities for natural speech generation with adjustable timbre and emotional expre
Omi is an open-source wearable AI platform that captures audio and screen data to provide real-time conversational assistance and memory. It integrates a wearable hardware development kit with a vector memory database and large language model capabilities to create a persistent digital record of user interactions. The platform is distinguished by its BLE audio streaming pipeline, which transmits raw audio from wearable hardware for real-time transcription and speaker identification. It utilizes a plugin-based agent tool framework that allows AI assistants to autonomously invoke custom functio
MiniCPM-o is a multimodal large language model designed to function as a real-time conversational assistant on edge devices. By mapping text, image, video, and audio inputs into a unified latent space, the system enables simultaneous cross-modal reasoning and full-duplex interaction. It is built as an edge-side inference engine, utilizing quantized model weights to maintain high-performance processing on consumer hardware. The system distinguishes itself through its integrated speech synthesis and voice cloning capabilities, which allow for the generation of expressive, personalized vocal out
Qwen2.5-Omni is a multimodal large language model designed to process text, audio, images, and video and to generate text and speech. It functions as a real-time speech AI, utilizing an end-to-end architecture to maintain synchronous voice conversations with low-latency responses. The project emphasizes efficiency through quantized edge models, allowing for local inference on mobile hardware and resource-constrained devices. It employs 4-bit weight quantization, CPU-based process offloading, and on-demand weight loading to reduce GPU memory requirements. The system integrates specia
This project is a comprehensive toolkit for on-device speech recognition, synthesis, and audio processing, specifically engineered for Apple Silicon. It provides a framework for building real-time, full-duplex voice agents that operate entirely offline, leveraging native hardware acceleration to maintain performance and privacy. By utilizing optimized machine learning models, the library enables local execution of complex audio tasks without reliance on external cloud services. The library distinguishes itself through its specialized focus on local, high-performance voice interaction. It incl
whisper.cpp is a C++ implementation of the Whisper speech-to-text model, serving as a lightweight machine learning inference engine and quantized runtime. It provides high-performance automatic speech recognition and real-time audio transcription without requiring a Python environment. The project utilizes model quantization to reduce memory usage and increase inference speed on local hardware. It incorporates hardware acceleration to optimize processing speed across different processors. The system covers audio processing capabilities including voice activity detection, speaker diarization,
Vibe is a cross-platform transcription tool that converts spoken audio into text by running Whisper neural models directly on your device, with no cloud dependency. It can transcribe audio from files, microphones, system output, and network streams, and supports both batch processing of multiple files and real-time captioning from continuous input. Beyond basic transcription, Vibe identifies and labels different speakers through speaker diarization, and offers a choice of a command-line interface or an HTTP API for automated and remote workflows. It also includes plugins to export transcripts to c