Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI.
The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue management, and WebRTC-based streaming for bidirectional media connectivity.
The framework covers a broad surface of capabilities, including AI integration with various foundation models, asynchronous tool execution for external function calls, and telephony integration with providers such as Twilio and Genesys Cloud. It also includes tools for distributed session management, long-term agent memory, and cloud deployment orchestration for scaling agent instances.
The project provides command-line utilities for project scaffolding, deployment auditing, and technical documentation indexing.