Ten Framework is a framework for building low-latency, multimodal conversational agents on top of large language models. It combines voice, text, and visual inputs in real time so that agents can hold live conversations with people.
The project includes a real-time speech processing pipeline for streaming transcription, voice activity detection, and speaker diarization. It also features an avatar synchronization engine that coordinates character lip animations and visual outputs with synthesized speech.
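To make the voice activity detection step more concrete, here is a minimal, framework-independent sketch in Python. It is not the project's implementation and uses none of its APIs. It assumes audio arrives as frames of float samples in the range [-1.0, 1.0]. It marks a frame as speech when its RMS energy crosses a threshold, and keeps the speech state alive for a few quiet frames so short pauses do not split an utterance. The names `frame_rms`, `detect_speech`, `threshold`, and `hangover` are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.1,  # assumed energy threshold, tune per microphone
    hangover: int = 2,       # quiet frames tolerated before speech ends
) -> List[bool]:
    """Return one flag per frame: True while speech is considered active."""
    flags: List[bool] = []
    active = False
    quiet = 0  # consecutive below-threshold frames seen while active
    for frame in frames:
        if frame_rms(frame) >= threshold:
            active = True
            quiet = 0
        elif active:
            quiet += 1
            if quiet > hangover:
                active = False
        flags.append(active)
    return flags


if __name__ == "__main__":
    silence = [0.0] * 4
    speech = [0.5] * 4
    print(detect_speech([silence, speech, silence, silence, silence], hangover=1))
```

Because the detector processes one frame at a time and keeps only two small pieces of state, it suits a streaming pipeline. A production system would more likely use a trained model than a fixed energy threshold.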
The framework supports edge AI deployment through containerized packaging and direct integration with embedded hardware boards. Additional capabilities include a telephony gateway that connects agents to phone networks via the Session Initiation Protocol (SIP) and tools for generating sketches and doodles in real time.