Pipecat

Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It uses a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, which lets developers orchestrate low-latency conversational AI.
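
The frame-based pipeline described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Pipecat's actual API: the names Frame, TextFrame, AudioFrame, Processor, Uppercase, FakeTTS, and Pipeline are hypothetical and exist only for this example.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Frame:
    """Base unit of data moving through the pipeline (hypothetical)."""


@dataclass
class TextFrame(Frame):
    text: str


@dataclass
class AudioFrame(Frame):
    pcm: bytes


class Processor:
    """Receives one frame and returns zero or more frames for the next stage."""

    async def process(self, frame):
        return [frame]  # default behavior: pass the frame through unchanged


class Uppercase(Processor):
    """Rewrites text frames and ignores every other frame type."""

    async def process(self, frame):
        if isinstance(frame, TextFrame):
            return [TextFrame(frame.text.upper())]
        return [frame]


class FakeTTS(Processor):
    """Stand-in for a text-to-speech stage: text in, audio bytes out."""

    async def process(self, frame):
        if isinstance(frame, TextFrame):
            return [AudioFrame(frame.text.encode("utf-8"))]
        return [frame]


class Pipeline:
    """Runs frames through each processor in order."""

    def __init__(self, processors):
        self.processors = processors

    async def run(self, frames):
        for stage in self.processors:
            out = []
            for frame in frames:
                out.extend(await stage.process(frame))
            frames = out
        return frames


def run_demo():
    pipeline = Pipeline([Uppercase(), FakeTTS()])
    frames = [TextFrame("hello"), AudioFrame(b"\x00\x01")]
    return asyncio.run(pipeline.run(frames))
```

Calling run_demo() pushes one text frame and one audio frame through both stages. Each processor acts only on the frame types it understands and passes the rest along, which is what makes the stages interchangeable.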

The project coordinates multiple multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue management, and WebRTC-based streaming for bidirectional media connectivity.
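
A state-machine conversation flow of the kind mentioned above might look roughly like the following. This is a minimal sketch under assumed names (Node, ConversationFlow, build_demo_flow, and the event strings are all hypothetical); it does not reproduce how Pipecat defines its flows.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One conversation state: what the bot says and where it can go next."""
    prompt: str
    transitions: dict = field(default_factory=dict)  # event name -> next node name


class ConversationFlow:
    def __init__(self, nodes, start):
        self.nodes = nodes
        self.current = start
        self.history = [start]

    def handle(self, event):
        """Move to the next node if the event is known, else re-prompt."""
        next_name = self.nodes[self.current].transitions.get(event)
        if next_name is None:
            return self.nodes[self.current].prompt  # stay in the same state
        self.current = next_name
        self.history.append(next_name)
        return self.nodes[next_name].prompt


def build_demo_flow():
    nodes = {
        "greet": Node("How can I help?", {"wants_booking": "collect_date"}),
        "collect_date": Node("Which date?", {"date_given": "confirm"}),
        "confirm": Node("Booked. Anything else?"),
    }
    return ConversationFlow(nodes, "greet")
```

An unrecognized event leaves the flow in its current node and repeats that node's prompt, so the dialogue never reaches an undefined state.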

The framework covers a broad surface of capabilities, including AI integration with various foundation models, asynchronous tool execution for external function calls, and telephony integration with providers such as Twilio and Genesys Cloud. It also includes tools for distributed session management, long-term agent memory, and cloud deployment orchestration for scaling agent instances.
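
Asynchronous tool execution can be illustrated with plain asyncio. The sketch below assumes a hypothetical tool registry (TOOLS) and a made-up get_weather function; it shows the general non-blocking pattern, not the framework's own function-calling interface.

```python
import asyncio


async def get_weather(city):
    """Hypothetical tool; the sleep stands in for a slow external API call."""
    await asyncio.sleep(0.05)
    return f"Sunny in {city}"


# Hypothetical registry mapping tool names to coroutine functions.
TOOLS = {"get_weather": get_weather}


async def run_turn(tool_calls):
    """Start every requested tool as a task, keep talking, then collect results."""
    events = []
    tasks = [asyncio.create_task(TOOLS[name](**args)) for name, args in tool_calls]
    # This line is recorded before any tool has finished, because the tasks
    # only make progress once this coroutine awaits.
    events.append("bot: one moment while I check")
    results = await asyncio.gather(*tasks)
    for (name, _), result in zip(tool_calls, results):
        events.append(f"tool {name}: {result}")
    return events


def run_demo():
    return asyncio.run(run_turn([("get_weather", {"city": "Oslo"})]))
```

Because the tool runs as a task, the agent can emit filler speech or handle other frames while the external call is in flight, and the result is folded back in once it arrives.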

The project provides command-line utilities for project scaffolding, deployment auditing, and technical documentation indexing.

Features

Multimodal AI Orchestrators - Coordinates vision, speech, and language models to enable unified real-time multimodal agent workflows.

Multimodal Service Orchestration - Coordinates simultaneous streams of audio, video, and text to create interactive agents with visual presence.

Data Flow Orchestrators - Wraps audio, text, and images into frames that flow through a modular pipeline of processors.

Message-Passing Agent Orchestrators - Implements a system where agents interact by exchanging structured messages through a central hub to manage tasks.

Agent Client Protocols - Orchestrates the communication layer and transport protocols between the user client and the AI agent.

Agent Deployment - Bootstraps new project structures and manages the full lifecycle of deploying AI agents to production.

Agent Session Management - Manages the lifecycle of conversational sessions, including identity, authentication, and runtime parameters.

Session Initializers - Initializes new instances of deployed agents and creates communication rooms to establish connectivity.

AI Agent Capabilities - Generates streaming responses and executes functions using specialized formatting for conversational agents.

Conversational Turn Detection - Analyzes speech context and acoustic markers to identify when a user has finished speaking to trigger bot responses.

Conversation History Trackers - Tracks conversation history and applies system instructions to maintain consistent agent behavior across turns.

Realtime Voice Conversation Facilitators - Facilitates real-time, low-latency voice transcription to enable seamless multimodal AI interactions.

Session Lifecycle Managers - Coordinates the lifecycle of voice AI interactions by managing pipeline termination and session state.

Voice Activity Detection - Uses server-side and local analyzers to identify when a user starts and stops speaking.

AI Execution Triggers - Provides mechanisms to initiate model runs or update context within the AI processing pipeline.

Language Model Integrations - Provides streaming interfaces and adapters to connect applications to various hosted or local language models.

Asynchronous Model Execution - Implements a non-blocking execution loop for handling model-initiated function calls and external tool execution.

Real-Time Transcription - Provides instantaneous conversion of live user audio streams into text transcripts for real-time processing.

Context Management Tools - Manages the input context, tools, and configurations sent to language models to trigger responses.

Conversation History Management - Retrieves the sequence of developer, user, and assistant messages to maintain persistent interaction history.

Conversation Management Systems - Routes data and system frames through a modular pipeline to manage dialogue flow and coherence.

Conversation State Management - Manages the conversation context by determining when to append messages to history or reset it during transitions.

Conversation Flow Design - Enables the definition of conversation paths and transitions to manage the logic of AI interactions.

Conversation State Managers - Distributes client context and messages across the application to maintain the state of the conversation.

Conversational Agent SDKs - Provides a comprehensive SDK for deploying AI bots across web, mobile, and telephony platforms with session management.

External Tool Integration - Enables conversational agents to interact with external APIs and retrieve data to perform actions.

Tool Calling - Provides non-blocking asynchronous execution for external tool calls requested by AI models.

LLM Model Integrations - Provides the necessary tools and settings to integrate large language models into the conversational AI ecosystem.

LLM Conversational AI Frameworks - Offers a framework for building real-time multimodal agents that coordinate speech-to-text, language models, and text-to-speech.

LLM Tool Calling - Maps natural language intents to executable functions by handling provider-specific format conversions.

Realtime Processing Pipelines - Utilizes integrated audio processing pipelines that handle input and output within a single model to minimize latency.

Multimodal AI Pipeline Orchestration - Integrates speech-to-text, language models, text-to-speech, and video services into a coordinated real-time processing pipeline.

Multimodal Input Processors - Controls multimodal input by pausing or resuming audio and video streams to balance performance and cost.

User Interruption Detection - Detects when a user starts or stops speaking to trigger bot responses and manage interruptions.

Interruption Response Handling - Ensures immediate responsiveness by discarding pending data frames when a user interrupts the bot.
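
As a rough illustration of discarding pending frames on interruption, the sketch below uses a hypothetical PlaybackQueue class built on collections.deque. The class and method names are assumptions for this example only.

```python
from collections import deque


class PlaybackQueue:
    """Holds bot output chunks that are waiting to be played (hypothetical)."""

    def __init__(self):
        self._pending = deque()
        self.played = []

    def push(self, chunk):
        self._pending.append(chunk)

    def play_next(self):
        """Play the oldest pending chunk, if there is one."""
        if self._pending:
            self.played.append(self._pending.popleft())

    def interrupt(self):
        """Drop everything not yet played and report how many chunks were dropped."""
        dropped = len(self._pending)
        self._pending.clear()
        return dropped


def run_demo():
    queue = PlaybackQueue()
    for chunk in ["Hello, ", "I can ", "help you ", "with that."]:
        queue.push(chunk)
    queue.play_next()            # only the first chunk reaches the user
    dropped = queue.interrupt()  # the user starts speaking: discard the rest
    queue.push("Sure, go ahead.")
    queue.play_next()
    return queue.played, dropped
```

Clearing the queue at the moment of interruption means stale output never plays over the user's speech, and the next response starts from an empty buffer.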

Real-Time Conversational AI Frameworks - Integrates speech-to-text, language models, and text-to-speech into a single pipeline for low-latency multimodal agents.

Real-time Voice Response Generation - Produces low-latency audio and text responses using audio-native models for real-time interactions.

Realtime AI Session Managers - Manages persistent, low-latency bidirectional WebRTC connections for real-time AI voice communication.

Speech-to-Text Engines - Transcribes spoken audio into text in real time using high-speed inference engines with automatic language detection.

Speech-to-Text Integrations - Connects conversational agents to audio transcription services for real-time multilingual speech-to-text processing.

Unified Speech Pipelines - Integrates transcription and synthesis to enable seamless voice-to-voice AI conversations with low latency.

Speech Transcription - Processes interim partial and final committed transcriptions from speech-to-text services in real time.

Text-to-Speech - Synthesizes spoken audio from text using streaming models with customizable language and quality options.

Conversational Audio Streams - Manages the full conversational loop of real-time audio streaming for speech-to-speech interactions.

Speech-to-Speech Models - Implements low-latency pipelines that process audio input directly into spoken output via integrated transcription and synthesis.

Tool Calling - Processes function names and arguments returned by a model to run external code and return results.

Tool-Execution Loops - Orchestrates the loop where a model requests a tool call, the client executes it, and the result is returned.

Voice Activity Detection - Analyzes speech content and acoustic data to automatically determine when a user has finished speaking.

Bot - Provides a framework for creating and managing the lifecycle and event callbacks of conversational agents.

Function Execution Coordination - Signals the start of function executions and handles cancellations resulting from user interruptions.

Text Processing Pipelines - Routes text through modular workflows to be consumed by aggregators, speech services, or processors.

Agent-Integrated Functions - Implements a mechanism for AI agents to trigger external code execution within a conversational loop.

Pipeline Termination - Controls the processing pipeline lifecycle, including starting, graceful shutdown, and immediate cancellation.

Generative Audio Chunking - Streams audio waveform chunks as they are generated to enable immediate, low-latency playback.

Media Input Controls - Provides controls to toggle the enabled state of the local participant's camera and microphone.

Media Track Management - Retrieves local and remote audio and video tracks for playback and real-time processing.

Media Streaming - Provides tools and frameworks for managing and routing continuous media data streams for real-time playback.

Agent-Client Communication Protocols - Manages the bidirectional exchange of messages and service configurations between the client and a conversational agent.

LLM - Produces streaming text responses from large language models based on text or audio inputs.

Messaging Client Connections - Implements a structured request-response pattern for exchanging messages between the server and connected AI agent clients.

Pipeline Framing - Implements a frame-based data pipeline to route audio, video, and text through a sequence of processors.

Real-Time Communication Systems - Manages WebRTC connections and device handling for low-latency audio and video streaming between clients and agents.

Distributed State Management - Synchronizes agent state and conversation history across distributed environments using Redis or PostgreSQL backends.

Message Stream Handlers - Provides utilities for managing the lifecycle of real-time message and speech progress streams.

Peer-to-Peer Streaming - Implements bidirectional audio and video streaming between clients and servers via WebRTC for low-latency communication.

Real-Time Media Transport - Manages the bidirectional exchange of real-time audio and video streams between users and AI agents.

Telephony Session Managers - Enables the attachment of AI agent sessions to live phone calls by converting internal data frames into telephony media streams.

WebRTC Media Orchestration - Handles device management and WebRTC connections to enable low-latency streaming between clients and bots.

Pipeline Lifecycle Hooks - Manages the pipeline lifecycle, including shutting down or halting data flow while optionally preserving processor state.

WebRTC Facilitators - Facilitates the establishment of direct real-time audio and video streams between clients and remote agents using WebRTC.

Reasoning Configuration - Provides controls to adjust the depth and exposure of a model's internal reasoning process.

Long-term Memory Injection - Fetches user-scoped context and appends it as a system message before the AI generates a response.

Multi-Agent Coordination Systems - Manages multiple specialized agents that hand off tasks or communicate over a shared bus.

Execution Control Flows - Implements controls to pause, resume, or interrupt specific agent processor threads to manage data flow.

Agent-to-Agent Communication - Coordinates multiple specialized agents using a shared message bus and priority queues for inter-agent communication.

Agent Connectivity Interfaces - Implements protocols for establishing real-time communication between local clients and remote conversational agent systems.

Agent Evaluation Tools - Runs scripted scenarios and uses model judges to verify the reasoning and output quality of agents.

Agent Memory Stores - Provides persistent storage for maintaining state, user preferences, and conversation history across agent sessions.

Agent Capability Extensions - Integrates third-party speech and language services using standardized base classes and patterns to expand agent capabilities.

Programmatic Participants - Allows the AI agent to act as a programmatic participant in media rooms, subscribing to audio and video tracks.

Multi-Agent Routing Systems - Coordinates multi-agent communication using a pub-sub system with priority queues and system commands.

Distributed Agent Systems - Extends agent communication to distributed environments using backends like Redis or PostgreSQL for state synchronization.

Reasoning Effort Configurations - Allows adjusting the depth and computational effort of model thinking processes to balance quality and latency.

Reasoning Effort Budgets - Configures the internal thinking process and token budgets for reasoning models to balance response depth and speed.

Agent Configuration Profiles - Initializes AI interactions using pre-configured profiles, inline settings, or existing call sessions.

Agent Message Proxies - Forwards conversation bus messages between local and remote agents over WebSocket connections.

AI Model Integrations - Provides connectivity to enterprise language models using Google Vertex AI within the processing pipeline.

AI Observability Tracing - Captures and analyzes execution traces of AI pipeline spans for observability and performance monitoring.

AI Request Routing - Directs prompts to a unified gateway providing access to various AI models through one interface.

Function Registries - Defines and exposes a registry of functions as executable tools for AI model interaction across the conversation.

LLM Tooling Integrations - Provides interfaces to configure the set of external tools and constraints available to the language model mid-conversation.

Audio Noise Cancellation - Integrates audio filter models to remove background noise from agent voice streams in real time.

Chat Completion Services - Produces streaming text responses using a chat-completion interface with support for function calling.

Client-Side Tool Execution - Registers callbacks that execute on the client when the bot requests a specific function call.

Cloud Provider Integrations - Connects to models via alternative cloud backends using custom client instances for flexible deployment.

Context-Aware Retrieval - Provides mechanisms for context-aware information retrieval to inform AI agent responses.

Conversation Memory Stores - Saves and retrieves past conversation data to maintain continuity across interactions.

Conversation State Persistence - Records user transcriptions and assistant responses to a remote layer to preserve context across sessions.

Conversational Models - Generates streaming conversational text and vision-based responses using large language models.

Conversational State Managers - Provides shared dictionaries to store and retrieve persistent data throughout a session to maintain context.

Tool Result Aggregators - Returns the output of executed tools back to the context aggregator to inform the next conversational turn.

Gemini Integrations - Provides dedicated integration for the Gemini AI platform to support multimodal conversational agents.

Inference Integration Layers - Connects to high-speed inference engines to handle streaming responses and context management.

Search-Enhanced Generation - Integrates language models with internet search to provide up-to-date information in real-time conversations.

Session Rotation Strategies - Prevents loss of context and interruptions during extended interactions through session rotation and audio buffering.

Local Model Integrations - Enables the connection of locally hosted language model services to the pipeline for improved privacy and cost control.

Local Speech-to-Text - Converts spoken audio to text using local hardware to ensure user privacy and eliminate API dependencies.

AI Model Integrations - Connects to managed enterprise foundation models for streaming text and multimodal input processing.

Inference Lifecycle Tracking - Tracks the lifecycle of language model processes, including inference start and the streaming of tokens.

Multilingual Speech Translation - Converts spoken audio from one language into translated speech and text in real time.

Speech-to-Text Translation - Converts spoken audio from multiple languages into written English text.

MCP Servers - Implements an MCP server to provide AI agents with programmatic access to a local technical index.

Model Configuration Settings - Provides the ability to adjust model parameters like temperature and token limits during active conversations.

Model Parameters - Modifies parameters like temperature and token limits during a live conversation to change agent behavior.

Model Response Aggregation - Tracks the boundaries of streaming model responses, including identifying extended thinking phases.

Streaming Response Aggregators - Aggregates and streams partial AI model outputs in real time via WebSockets or HTTP to minimize latency.

iOS Integrations - Provides a Swift library to integrate voice and multimodal AI capabilities into iOS applications.

Open Models - Connects to hosted open-source language models for streaming responses and context management.

Prompt Caching - Implements prompt caching to lower API costs and reduce latency for long conversation histories.

Realtime Avatar Integration - Implements frameworks for streaming synchronized, interactive virtual characters into low-latency AI sessions.

Reasoning Models - Integrates conversational pipelines with language models optimized for complex logical deduction and multi-step reasoning.

Reasoning Model Integrations - Connects specialized reasoning models to the pipeline to handle complex conversational tasks.

Agent Interaction Logs - Fetches detailed execution and interaction logs for specific agents using filtering and pagination.

Speaker Diarization - Identifies different speakers in an audio stream to attribute transcribed text to specific individuals.

Primary Speaker Isolation - Suppresses background noise and focuses on the main voice based on microphone proximity.

Prosody Controls - Enables adjustment of synthesized speech delivery, including speed, volume, pitch, and inflection.

UI State Awareness - Sends periodic snapshots of the page DOM and accessibility tree to enable awareness of UI state.

Conversation Analytics - Tracks call lifecycles, transcripts, and audio recordings to provide detailed observability for conversational agents.

Voice API Connections - Establishes WebRTC voice connections to manage media devices and streams for real-time conversational AI.

Desktop AI Clients - Enables the development of native desktop applications with integrated voice and multimodal AI capabilities.

Frame-Based - Injects and routes discrete data frames into the pipeline from the beginning or end for processing.

Model Output Controls - Controls whether generated tokens are sent to audio synthesis or retained as conversation context, including prompt caching.

Parallel Processing - Runs multiple independent processing branches simultaneously and coordinates their output into a single stream.
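
Running independent branches at once and merging their output can be sketched with asyncio.gather. The two branch functions here (transcript_branch and sentiment_branch) are hypothetical stand-ins, not part of any real service.

```python
import asyncio


async def transcript_branch(text):
    """Hypothetical branch that cleans up a transcript."""
    await asyncio.sleep(0.02)
    return ("transcript", text.strip())


async def sentiment_branch(text):
    """Hypothetical branch that labels the same input with a coarse sentiment."""
    await asyncio.sleep(0.01)
    label = "positive" if "thanks" in text.lower() else "neutral"
    return ("sentiment", label)


async def run_parallel(text):
    """Run both branches concurrently and merge them into one ordered list."""
    results = await asyncio.gather(transcript_branch(text), sentiment_branch(text))
    return list(results)


def run_demo():
    return asyncio.run(run_parallel("  Thanks for the help  "))
```

asyncio.gather returns results in the order the branches were passed in, regardless of which one finishes first, so the merged stream has a predictable layout.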

Vector Indexing - Integrates vector-based indexing to enable semantic search and knowledge injection into model contexts.

State Transition Actions - Triggers specific tasks, such as audio playback, immediately before or after moving between conversation states.

Documentation Indexing - Creates a local vector database of documentation and code to provide AI tools with current context.

Pipeline Execution Monitors - Tracks runtime statistics and triggers custom logic based on pipeline lifecycle changes like startup or errors.

Project Scaffolding - Ships a guided wizard to bootstrap new project structures with standardized configurations.

Agent Autoscaling - Configures minimum and maximum active agent instances to optimize availability and cost.

Agent Lifecycle Management - Provides utilities for managing the full lifecycle of AI agent instances, including activation and termination.

Cloud Agent Deployers - Automates the containerization and registration of agent services for cloud hosting in production environments.

Contact Center Integrations - Manages bidirectional audio streaming and session handshakes via the Genesys Cloud AudioHook protocol.

Deployment Updates - Updates agent configurations by modifying container images, scaling limits, and associated secrets.

Conversation Node Transitions - Implements triggers to move between conversation nodes to update active states and task instructions.

Runtime Configurations - Allows dynamic updates to speech and language service configurations without requiring a pipeline rebuild.

Audio Processing - Provides controls to enable, disable, or update settings for audio filters and mixers during active sessions.

Audio Streaming Engines - Facilitates low-latency bidirectional audio streaming and data messaging using WebRTC.

Audio Stream Filtering - Reduces background noise and isolates the speaker's voice in real-time audio input streams.

Inactivity Detection - Monitors conversational pipeline activity to trigger events when no interactions occur for a defined duration.
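
An inactivity detector can be reduced to a timestamp comparison. The sketch below assumes a hypothetical IdleMonitor class and takes the current time as an argument so that the behavior is deterministic; a real deployment would more likely read a monotonic clock.

```python
class IdleMonitor:
    """Fires a callback once when no activity is seen for timeout_s seconds."""

    def __init__(self, timeout_s, on_idle):
        self.timeout_s = timeout_s
        self.on_idle = on_idle
        self._last_activity = 0.0
        self._fired = False

    def activity(self, now):
        """Record an interaction and re-arm the monitor."""
        self._last_activity = now
        self._fired = False

    def tick(self, now):
        """Check the idle window; call on_idle with the idle duration if exceeded."""
        idle_for = now - self._last_activity
        if not self._fired and idle_for >= self.timeout_s:
            self._fired = True
            self.on_idle(idle_for)


def run_demo():
    events = []
    monitor = IdleMonitor(5.0, events.append)
    monitor.activity(now=0.0)
    monitor.tick(now=3.0)    # still inside the 5-second window
    monitor.tick(now=6.0)    # idle for 6 seconds: callback fires
    monitor.tick(now=7.0)    # already fired, no repeat
    monitor.activity(now=8.0)
    monitor.tick(now=13.0)   # idle for 5 seconds after re-arming: fires again
    return events
```

The fired flag keeps the callback from triggering on every tick once the timeout has passed; any new activity resets it.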

Cloud Provider Integrations - Integrates Azure speech services via WebSocket and HTTP for low-latency or batch text-to-speech synthesis.

Video Input Processing - Processes streaming video input to generate textual descriptions for AI model context.

Media Stream Processing - Processes real-time audio streams from Twilio via WebSockets and REST APIs for conversational AI.

Frame-to-Stream Serialization - Converts between frames and media streams to enable real-time communication over WebSockets.

Video Streaming - Displays video tracks for local or remote participants with mirroring and fit settings.

Android Integrations - Provides a Kotlin library to implement voice and multimodal agent capabilities within Android applications.

Mobile Capabilities - Enables mobile applications to connect to AI bots for real-time messaging and media stream management.

AI Agent SDKs - Provides a dedicated SDK to embed real-time voice and multimodal AI agents into React Native applications.

Multimodal Data Streams - Establishes bidirectional communication to stream combined audio and visual frames between client and server.

Bot Connectivity Endpoints - Connects conversational bots to external triggers via room-based URLs or WebSocket connections.

Bot Session Protocols - Provides a standardized protocol for exchanging structured messages to manage session state and agent interactions.

Participant Interaction Hooks - Coordinates real-time audio and video streaming between bots, avatars, and human participants.

Communication Protocols and Standards - Implements a standardized communication protocol to synchronize transcriptions and audio delivery between users and bots.

Connection Establishment Protocols - Connects the client to the agent using specific transport parameters and server endpoints to initiate the session.

WebSocket Connection Managers - Provides lifecycle event handling and custom logic for persistent WebSocket connections.

Custom Data Channels - Enables the exchange of flexible, server-defined data structures between clients and servers via a generic channel.

External Integration Protocols - Translates internal pipeline data formats into the specific message structures required by third-party communication APIs.

Transport Customizers - Allows customization of audio and video parameters to control the media transport layer interaction.

Pipeline-to-Bus Bridges - Connects processing pipelines to message buses to exchange data frames across multiple agents.

Scheduled Task Cancellation - Provides mechanisms to identify and abort specific groups of in-flight asynchronous agent tasks.

Connection and Session Management - Maintains stable links between clients and bots through device management and media transmission regulation.

Connection Management - Attempts to reconnect and resume sessions using resumption handles to preserve conversation history.

Connection Lifecycle Managers - Implements event handlers to trigger custom logic during client connection and session lifecycle events.

Connection State Recovery - Automatically restores session state and synchronizes data after network interruptions to maintain stability.

Non-Audio Data Channels - Facilitates the exchange of raw data packets through a dedicated channel for non-audio communication.

Peer-to-Peer Networking - Configures STUN and TURN servers to ensure stable peer-to-peer WebRTC connections across firewalls.

Processor Resumption - Buffers incoming data in a processor queue and resumes processing upon receiving a trigger.

Call Control Interfaces - Supports dial-in and dial-out phone capabilities including the processing of keypad tones.

Voice Platform Integrations - Streams bidirectional audio and handles touch-tone events through integration with the Plivo voice platform.

WebSocket Services - Establishes bidirectional real-time audio connections over WebSockets for telephony and server-side applications.

Custom Action Handlers - Binds asynchronous handlers to specific action types that trigger during conversation node transitions.

Pipeline Parameter Configurators - Updates voice activity and idle timeout settings at runtime without requiring a pipeline restart.

Function Execution Engines - Monitors the lifecycle of model-initiated function calls to manage progress and handle user interruptions.

Observability Tools - Provides monitoring for the lifecycle, arguments, and results of model-initiated function calls.

Agent Health Monitoring - Tracks deployment status and monitors real-time CPU and memory usage metrics for AI agents.

Agent Performance Monitoring - Tracks operational metrics including processing duration and time to first byte for frame processors.

AI Session Monitoring - Unifies conversation audio, logs, traces, and metrics to troubleshoot interruptions and latency issues in real time.

Agent Trajectory Logs - Retrieves and filters system logs by severity or session ID to debug and inspect agent reasoning trajectories.

Pipeline Health Monitors - Tracks system vitality using heartbeat frames and triggers warnings on pipeline timeouts.

Multi-track Streamers - Streams multiple custom audio and video tracks simultaneously, including screen sharing destinations.

Agent Health Metrics - Fetches comprehensive agent details including deployment status, health, and scaling configurations.

Metric and Performance Monitors - Reports high-frequency numerical performance data and service metadata, including speech-to-text latency.

Conversation Event Monitoring - Tracks room lifecycles, participant changes, and call states using programmable event handlers.

Behavioral Evaluations - Provides tools to test agent responses against scenario-based expectations to verify correct behavior.

Interactive Video Avatar Generators - Produces synchronized video and audio output by integrating realistic virtual avatars into communication rooms.

Media Renderers - Provides specialized components to handle the display and playback of audio and video streams.

State Syncing Reactivity - Synchronizes the application GUI state by processing inbound client events and sending updates from the bot.

Pipeline Event Synchronization - Translates internal pipeline events and performance metrics into messages to sync the conversation state with the client.

Real-Time Media Servers - Manages the transmission of live audio and video between clients and servers using WebRTC infrastructure.

Application Frameworks - Provides a framework for building voice and multimodal conversational AI applications.

pipecat-ai/pipecat