# getstream/vision-agents

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/getstream-vision-agents).**

6,029 stars · 471 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/GetStream/Vision-Agents
- Homepage: https://visionagents.ai
- awesome-repositories: https://awesome-repositories.com/repository/getstream-vision-agents.md

## Topics

`agentic-ai` `agents` `ai` `ai-agents` `realtime` `stt` `tts` `video-agents` `video-ai` `vision-ai` `voice-ai`

## Tags

### Artificial Intelligence & ML

- [Multi-Modal Component Coordinators](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-agent-orchestration/multi-modal-component-coordinators.md) — Coordinates vision, audio, and language components into a single interactive agent for real-time video. ([source](https://visionagents.ai/core/overview.md))
- [Local Agent Deployments](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-deployment/local-agent-deployments.md) — Runs AI agents on local hardware using microphone, speakers, and camera for development and demos. ([source](https://visionagents.ai/integrations/edge-transport/local.md))
- [Agent Knowledge Bases](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-knowledge-bases.md) — Registers a search function that a language model can call to retrieve documents from a vector store. ([source](https://visionagents.ai/integrations/infrastructure/turbopuffer.md))
- [LLM Backend Attachments](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-llm-frameworks/video-editing-agents/llm-backend-attachments.md) — Attaches a language model from a supported provider to a real-time video agent so it can process and respond to visual input. ([source](https://visionagents.ai/integrations/openrouter))
- [Knowledge Base Retrieval](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-rag-development/knowledge-base-retrieval.md) — Queries a RAG backend in real time to supply the agent with relevant information while a call is active. ([source](https://visionagents.ai/examples/phone-and-rag.md))
- [Component Metrics Collectors](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-capabilities-skills-tooling/agent-capability-extensions/data-collection-agents/component-metrics-collectors.md) — Automatically collects latency, token usage, and error metrics from LLM, STT, TTS, and video processors. ([source](https://visionagents.ai/core/telemetry.md))
- [Audio Stream Receivers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-capabilities-skills-tooling/ai-agent-capabilities/programmatic-participants/audio-stream-receivers.md) — Receives audio data chunks from call participants for real-time analysis and processing. ([source](https://visionagents.ai/core/processors-core.md))
- [Tool Call Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-capabilities-skills-tooling/ai-agent-capabilities/tool-call-configurations.md) — Configures how many consecutive tool-calling rounds the language model may perform before returning control. ([source](https://visionagents.ai/integrations/xai))
- [Cross-Session Conversation Memories](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-reasoning-engines/agent-context-management/cross-agent-context-managers/cross-session-conversation-memories.md) — Stores conversation history and user state so the agent remembers past interactions across separate calls. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Conversational Turn Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/conversational-ai-agents/conversational-turn-detection.md) — Monitors silence gaps to signal when a speaker has finished talking. ([source](https://visionagents.ai/integrations/stt/assemblyai.md))
- [Conversational Flow Controllers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/conversational-ai-agents/conversational-turn-detection/conversational-flow-controllers.md) — Uses voice activity detection and diarization to create natural, interruption-aware dialogue flows. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Agent](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/conversational-ai-agents/conversational-turn-detection/speech-start-detectors/agent.md) — Emits an event when the agent begins speaking, marking the start of an agent turn. ([source](https://visionagents.ai/reference/events-reference.md))
- [Turn Event Emitters](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/conversational-ai-agents/conversational-turn-detection/turn-event-emitters.md) — Notifies when a user or agent starts or stops speaking, allowing coordination of conversation flow. ([source](https://visionagents.ai/guides/event-system.md))
- [Voice Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents.md) — Creates a real-time voice assistant that users can talk to in a browser. ([source](https://visionagents.ai/introduction/quickstart.md))
- [Custom Pipeline Assemblers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/custom-pipeline-assemblers.md) — Assembles voice agents by plugging in separate STT, LLM, and TTS components from different providers for full pipeline control. ([source](https://visionagents.ai/introduction/voice-agents.md))
- [Speech-to-Text Provider Selection](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/realtime-speech-to-speech-agents/speech-to-text-provider-selection.md) — Switches between different STT services like Deepgram or Wizper to balance accuracy, language support, and processing speed. ([source](https://visionagents.ai/ai-technologies/speech-to-text.md))
- [Speech Interruption Handlers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/speech-interruption-handlers.md) — Detects when a user speaks over the agent and automatically stops the current response to listen to the new input. ([source](https://visionagents.ai/guides/interruption-handling.md))
- [Voice Activity Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/voice-activity-detection.md) — Identifies when a person starts and stops speaking using configurable sensitivity and silence thresholds. ([source](https://visionagents.ai/integrations/gemini))
- [Programmatic Agent Spawning](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/integration-deployment/agent-frameworks/management-and-discovery/agent-registries/programmatic-agent-spawning.md) — Creates new agent sessions on demand via a POST endpoint for real-time call interactions. ([source](https://visionagents.ai/guides/http-server.md))
- [OpenAI-Compatible APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/ai-integration-apis/openai-compatible-apis.md) — Connects to any service that exposes an OpenAI-compatible API using the standard OpenAI plugin for integration. ([source](https://visionagents.ai/integrations/infrastructure/baseten.md))
- [MCP Server Connections](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/model-integration-serving/model-integration-interfaces/model-context-protocol/mcp-server-management/mcp-server-connections.md) — Attaches local or remote MCP servers so the agent can discover and use external tools. ([source](https://visionagents.ai/guides/mcp-tool-calling.md))
- [Streaming Chat Responses](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-chat-clients/streaming-chat-responses.md) — Sends model output incrementally as it is generated so users see results before the full response finishes. ([source](https://visionagents.ai/integrations/xai))
- [AI Provider Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-provider-integrations.md) — Wraps provider APIs with a consistent interface so providers can be swapped without rewriting agent logic. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Twilio Voice Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-telephony-systems/twilio-voice-integrations.md) — Handles inbound and outbound voice calls over Twilio with bidirectional audio streaming. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [AI Voice and Video Integration](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-voice-and-video-integration.md) — Connects any LLM, speech, or vision model from 25+ providers to create agents that process live audio and video streams. ([source](https://visionagents.ai/))
- [Real-Time Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/real-time-transcription.md) — Streams audio to a speech-to-text service via WebSocket and returns low-latency transcriptions with automatic language detection. ([source](https://visionagents.ai/ai-technologies/speech-to-text.md))
- [Local Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/edge-object-detection/local-object-detection.md) — Runs object detection models on-device to avoid API calls and network latency. ([source](https://visionagents.ai/integrations/roboflow))
- [Real-Time Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/real-time-object-detection.md) — Identifies objects in real-time video frames using local detection models and emits events with bounding boxes and confidence scores. ([source](https://visionagents.ai/examples/football-commentator.md))
- [Cross-Session Conversation Memories](https://awesome-repositories.com/f/artificial-intelligence-ml/conversational-agent-sessions/cross-session-conversation-memories.md) — Persists messages between interactions so the agent recalls prior exchanges and user details across separate calls. ([source](https://visionagents.ai/guides/chat-and-memory.md))
- [External Tool Integration](https://awesome-repositories.com/f/artificial-intelligence-ml/external-tool-integration.md) — Connects to external tools via the Model Context Protocol to extend the agent's capabilities. ([source](https://visionagents.ai/core/agent-core.md))
- [Live Video](https://awesome-repositories.com/f/artificial-intelligence-ml/face-recognition/live-video.md) — Identifies and registers known faces from a live camera feed using a face recognition model. ([source](https://visionagents.ai/examples/security-camera.md))
- [Function Calling Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/function-calling-interfaces.md) — Registers Python functions that the language model can invoke during a conversation to fetch data or perform actions. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Automatic Tool Executions](https://awesome-repositories.com/f/artificial-intelligence-ml/function-calling-interfaces/automatic-tool-executions.md) — Automatically executes external tools when the language model decides to call a function. ([source](https://visionagents.ai/integrations/kimi))
- [LLM Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/llm-model-integrations.md) — Generates streamed text responses and handles function calling by implementing the LLM base class. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Expressive Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/expressive-synthesis.md) — Generates natural-sounding speech from text with emotional nuance and vocal style. ([source](https://visionagents.ai/integrations/inworld))
- [Stage Direction Controllers](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/natural/stage-direction-controllers.md) — Embeds natural-language instructions in text to control articulation, intonation, volume, pitch, speed, and non-verbal sounds during speech synthesis. ([source](https://visionagents.ai/integrations/tts/inworld.md))
- [Voice Cloning Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/voice-cloning-tools.md) — Creates custom voices from WAV file samples for speech generation. ([source](https://visionagents.ai/integrations/pocket))
- [Spoken Language Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/language-detection-tools/spoken-language-detection.md) — Automatically identifies the spoken language from audio streams without manual selection. ([source](https://visionagents.ai/integrations/fish))
- [Final Response Events](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-response-streaming/final-response-events.md) — Emits an event when the language model finishes generating a complete response. ([source](https://visionagents.ai/reference/events-reference.md))
- [Security Activity Querying](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/natural-language-querying/security-activity-querying.md) — Provides a conversational AI agent that answers spoken questions about security activity in real time. ([source](https://visionagents.ai/examples/security-camera.md))
- [Real-Time Frame Processors](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/machine-learning-concepts/domain-specific-modeling/computer-vision-modelings/real-time-frame-processors.md) — Intercepts video frames to run object detection, pose estimation, or custom machine learning models and forwards results to the language model. ([source](https://visionagents.ai/introduction/video-agents.md))
- [Voice Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis.md) — Converts text to spoken audio using multiple expressive voices with configurable settings. ([source](https://visionagents.ai/integrations/tts/xai.md))
- [Tool Call Executions](https://awesome-repositories.com/f/artificial-intelligence-ml/mcp-tool-connectors/tool-call-executions.md) — Executes code or connects to MCP servers to perform actions like creating tickets or checking weather during a call. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Model Parameter Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/model-parameter-configurations.md) — Sets temperature, top_p, and deep thinking mode for a language model with sensible defaults. ([source](https://visionagents.ai/integrations/llm/minimax.md))
- [Local Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/on-device-models/vision-language-models/local-execution.md) — Executes object detection and vision-language tasks on-device without cloud API calls, using a local GPU. ([source](https://visionagents.ai/integrations/moondream))
- [OpenAI API Clients](https://awesome-repositories.com/f/artificial-intelligence-ml/openai-api-clients.md) — Connects an agent to OpenAI's language models via the Responses API or ChatCompletions API for conversational reasoning. ([source](https://visionagents.ai/integrations/llm/openai.md))
- [Interruption Response Handling](https://awesome-repositories.com/f/artificial-intelligence-ml/planning-interruption-callbacks/user-interruption-detection/interruption-response-handling.md) — Flags application-triggered speech so it can be stopped mid-utterance when the user interrupts. ([source](https://visionagents.ai/guides/interruption-handling.md))
- [Barge-In Handlers](https://awesome-repositories.com/f/artificial-intelligence-ml/planning-interruption-callbacks/user-interruption-detection/interruption-response-handling/progress-bar-interruption-responses/barge-in-handlers.md) — Stops speech output at the provider when a barge-in event occurs during conversation. ([source](https://visionagents.ai/core/avatar-core.md))
- [Retrieval Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/retrieval-agents.md) — Pulls relevant information from a vector database or file search to ground the agent's responses. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Retrieval-Augmented Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/retrieval-augmented-agents.md) — Retrieves relevant document chunks from a managed store to provide context for an agent's responses. ([source](https://visionagents.ai/guides/rag.md))
- [End-of-Speech Detectors](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-activity-detection/end-of-speech-detectors.md) — Automatically determines when a caller has finished speaking to trigger the next response. ([source](https://visionagents.ai/examples/simple-agent.md))
- [Voice Cloning Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-synthesis-models/voice-cloning-engines.md) — Generates custom synthetic voices from reference audio for TTS. ([source](https://visionagents.ai/integrations/tts/fish.md))
- [Speech-to-Text Conversions](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-conversions.md) — Converts real-time audio input into text using pluggable providers, emitting partial transcripts for responsive UI. ([source](https://visionagents.ai/core/stt-tts-core.md))
- [Speech-to-Text Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-integrations.md) — Processes incoming audio and emits transcript events by implementing a single abstract method on the STT base class. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Text-to-Speech Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-integrations/text-to-speech-integrations.md) — Provides a TTS integration that converts text to audio chunks with interruption support via stream and stop methods. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Speech to Text Transcription](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-transcription.md) — Converts spoken audio into written text with automatic language detection, usable alongside text-to-speech in the same agent. ([source](https://visionagents.ai/examples/simple-agent.md))
- [Text Generation](https://awesome-repositories.com/f/artificial-intelligence-ml/text-generation.md) — Generates text responses from user input using language models, supporting both single-turn and conversational interactions. ([source](https://visionagents.ai/core/llm-core.md))
- [Audio-to-Audio Conversational Loops](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-audio-synthesis/audio-to-audio-conversational-loops.md) — Routes real-time audio from a phone call through WebSocket to an AI agent for listening and speaking during the conversation. ([source](https://visionagents.ai/guides/calling.md))
- [Audio Track Publishers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-audio-synthesis/audio-to-audio-conversational-loops/audio-track-publishers.md) — Outputs a custom audio track that is heard by participants in the live session. ([source](https://visionagents.ai/core/processors-core.md))
- [Text-to-Speech](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech.md) — Synthesizes text into lifelike spoken audio using a text-to-speech service. ([source](https://visionagents.ai/examples/simple-agent.md))
- [Speech-to-Speech Models](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models.md) — Accepts spoken language as input and produces a spoken response without separate STT or TTS services. ([source](https://visionagents.ai/core/llm-core.md))
- [Speech-to-Speech Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-to-speech-frameworks.md) — Streams real-time speech-to-speech with optional video over WebSocket, eliminating separate speech services. ([source](https://visionagents.ai/integrations/gemini))
- [Integrated STT/TTS Audio Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-to-speech-frameworks/speech-to-speech-with-video-streams/integrated-stt-tts-audio-streams.md) — Processes real-time audio input and output over WebSocket using integrated speech-to-text and text-to-speech, eliminating external speech services. ([source](https://visionagents.ai/integrations/realtime/qwen.md))
- [OpenAI Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-embeddings/openai-model-integrations.md) — Connects to OpenAI's Responses API or any OpenAI-compatible endpoint to power agent reasoning and tool use. ([source](https://visionagents.ai/integrations/openai))
- [Live Video Outfit Swapping](https://awesome-repositories.com/f/artificial-intelligence-ml/video-generation/image-to-video-generation/live-video-outfit-swapping.md) — Swaps a user's outfit on live video by combining a text prompt with a reference image, applied atomically to avoid partial frames. ([source](https://visionagents.ai/examples/visual-storyteller.md))
- [YOLO Object Detectors](https://awesome-repositories.com/f/artificial-intelligence-ml/video-object-tracking/yolo-object-detectors.md) — Runs YOLO object detection on video frames in real time to identify and track objects. ([source](https://visionagents.ai/introduction/quickstart))
- [Visual Question Answering](https://awesome-repositories.com/f/artificial-intelligence-ml/visual-question-answering.md) — Responds to natural-language questions about the content of video frames using a vision-language model. ([source](https://visionagents.ai/integrations/moondream))
- [Voice Cloning](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning.md) — Provides voice cloning from reference audio samples for personalized speech output in real-time agents. ([source](https://visionagents.ai/integrations/tts/pocket.md))
- [Conversational Coaching Generators](https://awesome-repositories.com/f/artificial-intelligence-ml/adversarial-coaching/conversational-coaching-generators.md) — Generates real-time coaching or guidance tailored to a conversation by analyzing transcribed speech. ([source](https://visionagents.ai/examples/sales-assistant.md))
- [Transport-Agnostic Agent Launchers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-communication-protocols/agent-to-agent-communication/transport-agnostic-agent-launchers.md) — Uses the transport-agnostic agent launcher directly to serve agents via gRPC, WebSocket, or other protocols. ([source](https://visionagents.ai/guides/http-server.md))
- [Agent Deployment Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-deployment-servers.md) — Starts the agent as a server handling session creation, health checks, authentication, and metrics. ([source](https://visionagents.ai/guides/deploying-overview.md))
- [HTTP Agent Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-integration-apis/http-agent-servers.md) — Hosts the agent logic as an HTTP server for companion applications to connect and exchange data. ([source](https://visionagents.ai/examples/sales-assistant.md))
- [Idle Resource Terminators](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-lifecycle-management/idle-resource-terminators.md) — Closes agent sessions automatically after a configurable idle timeout or maximum duration. ([source](https://visionagents.ai/guides/http-server.md))
- [Tool Execution Observers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-observability-tools/tool-execution-observers.md) — Emits start and end events for every tool call, reporting its name, arguments, success, and duration. ([source](https://visionagents.ai/guides/mcp-tool-calling.md))
- [Regional Latency Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-optimization/end-to-end-system-optimizers/regional-latency-optimizations.md) — Optimizes end-to-end latency in Asia by pairing MiniMax with Tencent RTC edge transport. ([source](https://visionagents.ai/integrations/llm/minimax.md))
- [Anthropic Claude Connections](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-llm-frameworks/anthropic-claude-connections.md) — Connects Claude models to agents for streaming text responses and function-calling decisions. ([source](https://visionagents.ai/integrations/llm/anthropic.md))
- [Regional Language Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-llm-frameworks/regional-language-model-integrations.md) — Connects to Sarvam AI's endpoint to use language models optimized for Hindi, English, and other Indian languages. ([source](https://visionagents.ai/integrations/llm/sarvam.md))
- [Hybrid Search Retrievers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-rag-development/knowledge-base-retrieval/hybrid-search-retrievers.md) — Combines vector similarity and BM25 keyword matching using Reciprocal Rank Fusion for document retrieval. ([source](https://visionagents.ai/integrations/infrastructure/turbopuffer.md))
- [Local Document Indexing](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-rag-development/knowledge-base-retrieval/local-document-indexing.md) — Ingests all files in a local folder, chunks them, and indexes them into a vector database for later retrieval. ([source](https://visionagents.ai/integrations/infrastructure/turbopuffer.md))
- [Personality Configurators](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-capabilities-skills-tooling/ai-agent-capabilities/personality-configurators.md) — Sets up the behavior and tone of the AI agent to match the desired interaction style. ([source](https://visionagents.ai/ai-technologies/speech-to-speech.md))
- [Participant Join-Leave Emitters](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-capabilities-skills-tooling/ai-agent-capabilities/programmatic-participants/participant-join-leave-emitters.md) — Emits events when human participants join or leave a call for lifecycle tracking. ([source](https://visionagents.ai/reference/events-reference.md))
- [Interruption Sensitivity Configuration](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-orchestration-multi-agent/autonomous-agents/agent-configurations/interruption-sensitivity-configuration.md) — Adjusts turn detection parameters to control how readily the agent stops speaking when the user starts talking. ([source](https://visionagents.ai/guides/interruption-handling.md))
- [Native Speech-to-Speech Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/native-speech-to-speech-agents.md) — Uses OpenAI's speech-to-speech model over WebRTC to handle both speech recognition and synthesis without separate services. ([source](https://visionagents.ai/integrations/realtime/openai.md))
- [Realtime Speech-to-Speech Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/realtime-speech-to-speech-agents.md) — Creates a voice agent using a speech-to-speech model that handles audio input and output natively without separate components. ([source](https://visionagents.ai/introduction/voice-agents.md))
- [Agent Speech Turn Detections](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/realtime-speech-to-speech-agents/agent-speech-turn-detections.md) — Emits an event when the agent stops speaking, indicating whether the turn was interrupted by the user. ([source](https://visionagents.ai/reference/events-reference.md))
- [Storytelling Narrators](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/storytelling-narrators.md) — Ships a voice agent that listens to prompts, generates creative stories with an LLM, and speaks them back expressively. ([source](https://visionagents.ai/examples/cartesia-narrator.md))
- [Audio Filter Sensitivity Tunings](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/conversational-voice-interaction/voice-agents/voice-activity-detection/threshold-tunings/audio-filter-sensitivity-tunings.md) — Adjusts the speech detection threshold and silence release duration to match the acoustic environment and expected pause lengths. ([source](https://visionagents.ai/guides/multiple-speakers.md))
- [Tool Execution Event Reactions](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/integration-deployment/agent-frameworks/tool-use-and-execution/agent-tool-execution/tool-execution-event-reactions.md) — Emits events at the start and end of a function or tool call, providing visibility into tool usage and performance. ([source](https://visionagents.ai/guides/event-system.md))
- [URL Content Fetchers](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/integration-deployment/agentic-domains/agentic-web-browsing/url-content-fetchers.md) — Fetches and incorporates text from specific URLs to inform the agent's responses. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Web Search Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/integration-deployment/ai-agent-tooling/web-search-tools.md) — Augments agent replies by retrieving real-time information from the web through a built-in search tool. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Model Request Routing](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-model-clients/model-request-routing.md) — Directs requests to a custom model deployed on a specific endpoint by providing the model's unique URL and API key. ([source](https://visionagents.ai/integrations/infrastructure/baseten.md))
- [Provider-Specific Model Selectors](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-model-configurations/model-selection-policies/provider-specific-model-selectors.md) — Chooses among three model sizes from Sarvam AI to balance capability and performance for the agent's task. ([source](https://visionagents.ai/integrations/llm/sarvam.md))
- [AI Agent Plugins](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-tooling/agent-and-tool-integrations/ai-agent-plugins.md) — Wraps any AI provider's API with a consistent interface so the agent framework can use it for speech, text, or video processing. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Meeting Transcriptions](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/meeting-transcriptions.md) — Transcribes multi-speaker conversations in real time, identifying each speaker using a speech-to-text provider. ([source](https://visionagents.ai/examples/sales-assistant.md))
- [Final Transcript Subscriptions](https://awesome-repositories.com/f/artificial-intelligence-ml/audio-transcription/meeting-transcriptions/final-transcript-subscriptions.md) — Fires a handler when a user's speech transcription or the agent's LLM response is finalized, supporting logging or UI updates. ([source](https://visionagents.ai/guides/event-system.md))
- [Agent Workflow Scripting](https://awesome-repositories.com/f/artificial-intelligence-ml/code-execution-agents/agent-workflow-scripting.md) — Runs arbitrary Python scripts within the agent's workflow for computations and data processing. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Cloud-Hosted Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/edge-object-detection/local-object-detection/cloud-hosted-inference.md) — Uses hosted inference to run pre-trained object-detection models without a local GPU. ([source](https://visionagents.ai/integrations/roboflow))
- [Natural Language Object Detections](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/object-detection/natural-language-object-detections.md) — Identifies objects in video frames by describing them in natural language, without requiring pre-trained object classes. ([source](https://visionagents.ai/integrations/moondream))
- [Call Infrastructure Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/conversational-voice-ai/call-infrastructure-integrations.md) — Adds a real-time conversational AI into existing call infrastructure without separate audio channels or complex routing. ([source](https://visionagents.ai/ai-technologies/speech-to-speech.md))
- [Model Tier Selectors](https://awesome-repositories.com/f/artificial-intelligence-ml/decision-trees/minimax/model-tier-selectors.md) — Provides a configuration interface for selecting among MiniMax model tiers with different context windows and speeds. ([source](https://visionagents.ai/integrations/llm/minimax.md))
- [Audio Routing Queues](https://awesome-repositories.com/f/artificial-intelligence-ml/detection-error-handling/voice-activity-detection/speaker-diarizers/audio-routing-queues.md) — Routes audio from each participant through a separate queue and uses a first-speaker-wins filter to decide whose speech reaches the agent. ([source](https://visionagents.ai/guides/multiple-speakers.md))
- [Event-Triggered Notifications](https://awesome-repositories.com/f/artificial-intelligence-ml/face-recognition/live-video/event-triggered-notifications.md) — Triggers notifications when face recognition, package detection, or other frame-level events are detected in a live feed. ([source](https://visionagents.ai/))
- [MiniMax Connections](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/llm-model-integrations/minimax-connections.md) — Configures MiniMax large language models as the agent's reasoning engine with multiple model tiers. ([source](https://visionagents.ai/integrations/llm/minimax.md))
- [xAI Grok Connections](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/llm-model-integrations/xai-grok-connections.md) — Connects to xAI's Grok models for conversation memory, streaming responses, and function calling. ([source](https://visionagents.ai/integrations/llm/xai.md))
- [Local Model Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/local-model-execution.md) — Runs open-weight text language models on local hardware with streaming and function calling. ([source](https://visionagents.ai/integrations/llm/huggingface-transformers.md))
- [Local Speech-to-Text](https://awesome-repositories.com/f/artificial-intelligence-ml/local-speech-to-text.md) — Transcribes speech to text on the local machine using an accelerated Whisper model, with no API key required. ([source](https://visionagents.ai/integrations/stt/fast-whisper.md))
- [Pre-Deployed Endpoint Callers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/model-hubs-and-pre-made-models/pre-made-models/pre-deployed-endpoint-callers.md) — Uses ready-made API endpoints for popular open-source models without requiring any deployment setup. ([source](https://visionagents.ai/integrations/infrastructure/baseten.md))
- [Voice Identity Selections](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/voice-synthesis/modular-voice-configurations/voice-identity-selections.md) — Selects the synthesis engine and speaker voice to control speech tone. ([source](https://visionagents.ai/integrations/tts/aws-polly.md))
- [Media Transport Connections](https://awesome-repositories.com/f/artificial-intelligence-ml/mcp-servers/transport-layer-connectivity/media-transport-connections.md) — Accepts any transport layer for sending and receiving media, allowing the agent to work outside the default video pipeline. ([source](https://visionagents.ai/core/overview.md))
- [AI Provider Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/model-capability-extensions/ai-provider-interfaces.md) — Switches between different AI models for video processing with a single configuration change. ([source](https://visionagents.ai/examples/golf-coach.md))
- [Hot-Swappable Providers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-capability-extensions/ai-provider-interfaces/hot-swappable-providers.md) — Enables hot-swapping between different realtime AI models with a single configuration change. ([source](https://visionagents.ai/examples/football-commentator.md))
- [HuggingFace Evaluations](https://awesome-repositories.com/f/artificial-intelligence-ml/model-performance-evaluators/evaluation-configurations/huggingface-evaluations.md) — Routes text-only language model requests through HuggingFace's unified API with streaming and function calling. ([source](https://visionagents.ai/integrations/huggingface))
- [Runtime Provider Switching](https://awesome-repositories.com/f/artificial-intelligence-ml/model-provider-configurations/runtime-provider-switching.md) — Changes the backend provider for model inference by setting a single configuration parameter. ([source](https://visionagents.ai/integrations/infrastructure/huggingface.md))
- [Hand](https://awesome-repositories.com/f/artificial-intelligence-ml/pose-estimation/hand.md) — Highlights wrist positions and draws hand skeleton connections on detected poses during live video processing. ([source](https://visionagents.ai/integrations/ultralytics))
- [Real-Time Speech Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/real-time-speech-translation.md) — Translates transcribed speech into over 99 languages in real time, supporting ISO-639-1 language codes. ([source](https://visionagents.ai/integrations/stt/wizper.md))
- [LLM Swap Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/realtime-ai-session-managers/realtime-avatar-integration/llm-swap-integrations.md) — Swaps LLM backends for avatars by subscribing to TTS and realtime audio without changing the avatar setup. ([source](https://visionagents.ai/integrations/avatars/anam.md))
- [Session Connect-Disconnect Emitters](https://awesome-repositories.com/f/artificial-intelligence-ml/realtime-ai-session-managers/session-connect-disconnect-emitters.md) — Emits events when realtime sessions connect or disconnect with session identifiers and reasons. ([source](https://visionagents.ai/reference/events-reference.md))
- [Grok Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/reasoning-models/reasoning-pipelines/reasoning-model-integrations/grok-model-integrations.md) — Connects to xAI's Grok models for advanced reasoning and real-time knowledge in AI agent pipelines. ([source](https://visionagents.ai/integrations/xai))
- [Self-Hosted AI Models](https://awesome-repositories.com/f/artificial-intelligence-ml/self-hosted-ai-models.md) — Hosts open-source inference model routing and vector search on self-managed hardware. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Custom STT/LLM/TTS Pipeline Assemblers](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-to-text-integrations/unified-speech-pipelines/custom-stt-llm-tts-pipeline-assemblers.md) — Assembles speech-to-text, language model, and text-to-speech components into a custom processing pipeline. ([source](https://visionagents.ai/introduction/quickstart))
- [Accelerated Transcriptions](https://awesome-repositories.com/f/artificial-intelligence-ml/speech-transcription-engines/accelerated-transcriptions.md) — Transcribes speech 2-4 times faster than standard methods by using optimized CPU and GPU compute engines. ([source](https://visionagents.ai/integrations/stt/fast-whisper.md))
- [Local Speech Synthesis](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/local-speech-synthesis.md) — Generates speech from text locally on a CPU with ~200ms latency, no GPU or external API required. ([source](https://visionagents.ai/integrations/pocket))
- [Speech Synthesis Markup](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-emphasis-controls/speech-synthesis-markup.md) — Controls vocal delivery by inserting tags for emotions, pauses, and emphasis into text before it is spoken aloud. ([source](https://visionagents.ai/examples/cartesia-narrator.md))
- [Speech Parameter Configuration](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-parameter-configuration.md) — Sets voice identity, language, and AWS region for text-to-speech generation through standard credential resolution. ([source](https://visionagents.ai/integrations/aws-polly))
- [Speech Synthesis Markup Controls](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-synthesis-markup-controls.md) — Generates natural-sounding speech from text with inline tags for whisper, laughter, and emotional tone adjustments. ([source](https://visionagents.ai/integrations/tts/fish.md))
- [AWS Bedrock Speech-to-Speech Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/aws-bedrock-speech-to-speech-streams.md) — Transcribes and synthesizes speech in real time using Amazon Nova models with automatic session management. ([source](https://visionagents.ai/integrations/aws-bedrock))
- [Indian Language Speech Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/indian-language-speech-streams.md) — Generates natural-sounding speech from text using Sarvam's Bulbul model for Indian languages. ([source](https://visionagents.ai/integrations/tts/sarvam.md))
- [Vision-Language Speech Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-to-speech-frameworks/speech-integration-engines/vision-language-speech-integrations.md) — Integrates separate STT and TTS providers alongside a vision language model for full conversational control. ([source](https://visionagents.ai/introduction/video-agents.md))
- [Speech-to-Speech with Video Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/speech-to-speech-models/speech-to-speech-frameworks/speech-to-speech-with-video-streams.md) — Sends and receives real-time audio and optional video over WebSocket without separate speech recognition or synthesis services. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Vocal Nuance Controllers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-to-speech/vocal-nuance-controllers.md) — Inserts pauses, breaths, laughs, and other vocal cues directly into text for fine-grained timing and expression. ([source](https://visionagents.ai/integrations/tts/xai.md))
- [Tool Execution Trackers](https://awesome-repositories.com/f/artificial-intelligence-ml/tool-execution-loops/tool-execution-trackers.md) — Emits events when a tool call starts and completes, providing the tool name, arguments, result, and execution time. ([source](https://visionagents.ai/reference/events-reference.md))
- [Unified API Text Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-inference-engines/huggingface-transformers-loaders/unified-api-text-inference.md) — Routes text-only language model requests through HuggingFace's unified API with streaming and function calling. ([source](https://visionagents.ai/integrations/infrastructure/huggingface.md))
- [Video Captioning](https://awesome-repositories.com/f/artificial-intelligence-ml/video-captioning.md) — Generates descriptive text captions automatically for each video frame as it is processed. ([source](https://visionagents.ai/integrations/moondream))
- [Mid-Call Reference Image Swaps](https://awesome-repositories.com/f/artificial-intelligence-ml/video-generation/image-to-video-generation/live-video-outfit-swapping/mid-call-reference-image-swaps.md) — Updates the reference image used for visual transformation atomically while the video stream is active. ([source](https://visionagents.ai/integrations/decart))
- [Voice-Triggered Outfit Changes](https://awesome-repositories.com/f/artificial-intelligence-ml/video-generation/image-to-video-generation/live-video-outfit-swapping/voice-triggered-outfit-changes.md) — Listens for spoken requests and triggers a costume swap on the video feed based on the voice input. ([source](https://visionagents.ai/examples/visual-storyteller.md))
- [Detection Event Emitters](https://awesome-repositories.com/f/artificial-intelligence-ml/video-object-tracking/detection-event-emitters.md) — Emits an event when a video processor completes object detection, providing model ID, inference time, and detection count. ([source](https://visionagents.ai/reference/events-reference.md))
- [Vision-Language Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-inference.md) — Processes video frames with vision-language models through HuggingFace's API with automatic frame buffering. ([source](https://visionagents.ai/integrations/huggingface))
- [HuggingFace](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-inference/huggingface.md) — Routes vision-language model requests through HuggingFace's unified API with automatic video frame buffering. ([source](https://visionagents.ai/integrations/infrastructure/huggingface.md))
- [Multi-Voice Synthesis Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/multi-voice-synthesis-engines.md) — Generates spoken responses from text using cloud or local models, ranging from expressive to ultra-low-latency. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Engine Selection Configurations](https://awesome-repositories.com/f/artificial-intelligence-ml/voice-cloning/voice-identity-conversions/multi-voice-synthesis-engines/engine-selection-configurations.md) — Chooses between standard and neural speech synthesis engines through configuration. ([source](https://visionagents.ai/integrations/aws-polly))
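
The first-speaker-wins filter described under Audio Routing Queues can be pictured as a lock on the conversational floor: whoever starts speaking while the floor is free holds it until they fall silent. The sketch below is a minimal illustration of that idea only. The class name, the `is_speech` flag (assumed to come from some voice-activity detector), and the release threshold are all hypothetical and are not taken from the Vision Agents API; per-participant queues are omitted for brevity.

```python
from typing import Optional


class FirstSpeakerWinsFilter:
    """Hypothetical floor lock; not the Vision Agents implementation."""

    def __init__(self, release_after_silent_chunks: int = 3) -> None:
        self.active: Optional[str] = None  # participant currently holding the floor
        self._silent_run = 0               # consecutive silent chunks from that participant
        self._release_after = release_after_silent_chunks

    def push(self, participant_id: str, chunk: bytes, is_speech: bool) -> Optional[bytes]:
        """Return the chunk if it should reach the agent, otherwise None."""
        # The floor is free and this participant started talking: they win it.
        if self.active is None and is_speech:
            self.active = participant_id
            self._silent_run = 0

        # Anyone who does not hold the floor is dropped.
        if participant_id != self.active:
            return None

        # Track silence from the active speaker and free the floor when it lasts.
        self._silent_run = 0 if is_speech else self._silent_run + 1
        if self._silent_run >= self._release_after:
            self.active = None
        return chunk
```

With `release_after_silent_chunks=2`, a second participant who talks over the first is ignored until the first has produced two silent chunks in a row.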

### Part of an Awesome List

- [Multimodal LLM Models](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-llm-models.md) — Connects a multimodal reasoning model via an OpenAI-compatible API to process video and audio in real time. ([source](https://visionagents.ai/integrations/kimi))
- [Video Language Model Integrations](https://awesome-repositories.com/f/awesome-lists/ai/video-language-models/video-language-model-integrations.md) — Processes video frames alongside text by implementing a VideoLLM base class and managing a frame buffer. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Video Pose Estimation](https://awesome-repositories.com/f/awesome-lists/ai/video-pose-estimation.md) — Identifies key body joints and draws skeleton overlays on video frames in real time using a pre-trained pose model. ([source](https://visionagents.ai/guides/video-processors.md))
- [Voice AI Agents](https://awesome-repositories.com/f/awesome-lists/ai/voice-ai-agents.md) — Builds a voice AI agent that listens, processes with an LLM, and responds with natural-sounding speech. ([source](https://visionagents.ai/examples/simple-agent.md))
- [Lip-Sync Animations](https://awesome-repositories.com/f/awesome-lists/ai/avatar-generation/lip-sync-animations.md) — Provides lip-sync animation for digital avatars to match spoken responses. ([source](https://visionagents.ai/guides/video-processors.md))
- [Provider-Backed Characters](https://awesome-repositories.com/f/awesome-lists/ai/avatar-generation/provider-backed-characters.md) — Provides an abstract base class for implementing custom provider-backed animated characters. ([source](https://visionagents.ai/core/avatar-core.md))
- [Real-Time Video Overlayers](https://awesome-repositories.com/f/awesome-lists/ai/video-annotation/real-time-video-overlayers.md) — Receives a video stream, applies modifications or overlays (like bounding boxes), and publishes the altered frames back to the call. ([source](https://visionagents.ai/core/processors-core.md))
- [NVIDIA Vision Model Integrations](https://awesome-repositories.com/f/awesome-lists/ai/video-language-models/video-language-model-integrations/nvidia-vision-model-integrations.md) — Processes real-time video frames through NVIDIA's vision language models, buffering frames automatically for continuous understanding. ([source](https://visionagents.ai/integrations/nvidia))
- [Direct Text Speaking Utilities](https://awesome-repositories.com/f/awesome-lists/learning/speaking-practice/direct-text-speaking-utilities.md) — Speaks a given text string using text-to-speech, bypassing the language model entirely. ([source](https://visionagents.ai/core/agent-core.md))

### Data & Databases

- [Lifecycle Managers](https://awesome-repositories.com/f/data-databases/event-tracking/avatar-lifecycle-events/lifecycle-managers.md) — Manages the avatar's connection, audio consumption, and teardown sequence for real-time interaction. ([source](https://visionagents.ai/core/avatar-core.md))
- [Information Retrieval](https://awesome-repositories.com/f/data-databases/information-retrieval.md) — Answers queries by searching over uploaded documents using automatic chunking and retrieval. ([source](https://visionagents.ai/integrations/llm/gemini.md))
- [Cross-Node State Sharing](https://awesome-repositories.com/f/data-databases/shared-memory-data-exchange/reactive-data-sharing/cross-node-state-sharing.md) — Shares session state across multiple servers via a shared key-value store for distributed agent management. ([source](https://visionagents.ai/guides/horizontal-scaling.md))
- [Conversation History Backends](https://awesome-repositories.com/f/data-databases/pluggable-storage-drivers/conversation-history-backends.md) — Accepts a user-defined storage backend by implementing an abstract conversation interface for message operations. ([source](https://visionagents.ai/guides/chat-and-memory.md))
- [Transcription Term Boosts](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-indexing/search-information-retrieval/query-interfaces-dsls/multi-term-search-processors/term-weighting-algorithms/transcription-term-boosts.md) — Accepts a list of domain-specific words or phrases to improve transcription accuracy for those terms. ([source](https://visionagents.ai/integrations/stt/assemblyai.md))
- [Vector Similarity Search](https://awesome-repositories.com/f/data-databases/vector-similarity-search.md) — Finds documents by semantic meaning, returning results even when query words differ from the indexed text. ([source](https://visionagents.ai/integrations/infrastructure/turbopuffer.md))
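
The Conversation History Backends entry describes a storage backend plugged in by implementing an abstract conversation interface. A rough sketch of that pattern, using Python's `abc` module, might look like the following. `Message`, `ConversationStore`, and `InMemoryStore` are hypothetical names chosen for illustration; the real interface and its method names may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    role: str      # e.g. "user" or "assistant"
    content: str


class ConversationStore(ABC):
    """Hypothetical abstract interface for message operations."""

    @abstractmethod
    def add(self, message: Message) -> None:
        """Persist one message."""

    @abstractmethod
    def history(self, limit: Optional[int] = None) -> List[Message]:
        """Return stored messages, oldest first, optionally only the last `limit`."""


class InMemoryStore(ConversationStore):
    """Simplest possible backend: keep everything in a list."""

    def __init__(self) -> None:
        self._messages: List[Message] = []

    def add(self, message: Message) -> None:
        self._messages.append(message)

    def history(self, limit: Optional[int] = None) -> List[Message]:
        if limit is None:
            return list(self._messages)
        # Guard against limit == 0, because a [-0:] slice would return everything.
        return self._messages[-limit:] if limit > 0 else []
```

A database- or Redis-backed store would subclass the same interface and replace the list with its own reads and writes.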

### Development Tools & Productivity

- [Agent-Integrated Functions](https://awesome-repositories.com/f/development-tools-productivity/local-function-execution/agent-integrated-functions.md) — Attaches custom Python functions to the agent that the language model can invoke as tools. ([source](https://visionagents.ai/introduction/voice-agents.md))
- [Tool Function Registrations](https://awesome-repositories.com/f/development-tools-productivity/local-function-execution/agent-integrated-functions/tool-function-registrations.md) — Attaches Python functions to the agent that the language model can automatically invoke as tool calls during a conversation. ([source](https://visionagents.ai/guides/mcp-tool-calling.md))
- [TTS Provider Selectors](https://awesome-repositories.com/f/development-tools-productivity/dynamic-configuration-providers/dynamic-provider-registration/automated-provider-selection/tts-provider-selectors.md) — Selects a voice from supported services like ElevenLabs or Cartesia and routes audio through the call automatically. ([source](https://visionagents.ai/ai-technologies/text-to-speech.md))
- [Session Concurrency Limiters](https://awesome-repositories.com/f/development-tools-productivity/parallel-execution/custom-parallel-task-execution/parallel-task-orchestrators/agent-session-parallelization/session-concurrency-limiters.md) — Caps the number of simultaneous agents and sessions per call to prevent resource exhaustion. ([source](https://visionagents.ai/guides/http-server.md))
- [Keyword Matching](https://awesome-repositories.com/f/development-tools-productivity/search-query-utilities/keyword-matching.md) — Matches exact query terms against indexed documents using BM25 for precise technical lookups. ([source](https://visionagents.ai/integrations/infrastructure/turbopuffer.md))
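
The Session Concurrency Limiters entry mentions two caps: one on simultaneous sessions overall and one per call. As a sketch of how such a pair of limits could be enforced, the counter below refuses a new session when either cap is reached. The class and parameter names are assumptions made for this example, not the framework's configuration options.

```python
from collections import Counter


class SessionLimiter:
    """Hypothetical two-level cap: total sessions and sessions per call."""

    def __init__(self, max_total: int, max_per_call: int) -> None:
        self.max_total = max_total
        self.max_per_call = max_per_call
        self._per_call = Counter()  # call_id -> number of active sessions
        self._total = 0

    def try_acquire(self, call_id: str) -> bool:
        """Reserve a slot for `call_id`; return False if either cap is hit."""
        if self._total >= self.max_total:
            return False
        if self._per_call[call_id] >= self.max_per_call:
            return False
        self._per_call[call_id] += 1
        self._total += 1
        return True

    def release(self, call_id: str) -> None:
        """Free a slot previously reserved for `call_id`."""
        if self._per_call[call_id] > 0:
            self._per_call[call_id] -= 1
            self._total -= 1
```

A server would call `try_acquire` before starting an agent and `release` when the session ends, rejecting the request when `try_acquire` returns `False`.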

### DevOps & Infrastructure

- [Production Deployments](https://awesome-repositories.com/f/devops-infrastructure/cloud-agent-orchestration/cloud-agent-deployers/production-deployments.md) — Runs as an HTTP server with Prometheus metrics, horizontal scaling, and Kubernetes support. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [AI Agent Deployments](https://awesome-repositories.com/f/devops-infrastructure/kubernetes-deployments/ai-agent-deployments.md) — Ships a Helm chart for deploying multi-modal AI agents to any Kubernetes cluster. ([source](https://visionagents.ai/guides/kubernetes-deployment.md))
- [Containerized Agent Packages](https://awesome-repositories.com/f/devops-infrastructure/cloud-infrastructure-deployment/managed-infrastructure-deployment/agent-deployments/containerized-agent-packages.md) — Packages a multi-modal AI agent into a container using CPU or GPU Dockerfiles for production. ([source](https://visionagents.ai/guides/deployment.md))
- [Docker Image Building](https://awesome-repositories.com/f/devops-infrastructure/container-orchestration/container-runtimes/runtime-configuration-interfaces/docker-socket-orchestrators/docker-target-configurators/docker-container-deployments/docker-image-building.md) — Builds a containerized version of the agent for CPU or GPU environments to run anywhere. ([source](https://visionagents.ai/guides/deploying-overview.md))
- [Redis-Backed Session Stores](https://awesome-repositories.com/f/devops-infrastructure/deployment-scaling/session-based-scaling/redis-backed-session-stores.md) — Adds a Redis-backed session store so multiple replicas can manage any session across nodes. ([source](https://visionagents.ai/guides/deploying-overview.md))
- [Edge Network Deployment](https://awesome-repositories.com/f/devops-infrastructure/edge-network-deployment.md) — Deploys agents on distributed edge infrastructure to minimize latency for real-time voice interaction. ([source](https://visionagents.ai/examples/cartesia-narrator.md))
- [Tool Execution Round Limits](https://awesome-repositories.com/f/devops-infrastructure/execution-rate-limiters/execution-time-limits/tool-execution-round-limits.md) — Configures how many consecutive tool-calling rounds the language model may perform before returning control. ([source](https://visionagents.ai/guides/mcp-tool-calling.md))
- [Package Presence Detections](https://awesome-repositories.com/f/devops-infrastructure/package-installations/asset-package-detection/package-presence-detections.md) — Monitors video for packages using a custom object detection model and alerts when packages are moved or stolen. ([source](https://visionagents.ai/examples/security-camera.md))
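
The Tool Execution Round Limits entry refers to capping how many consecutive tool-calling rounds a language model may perform. The loop below sketches the general technique with a stand-in model: `model_step` is a hypothetical callable that returns either a `(tool_name, arguments)` pair or `None` when it has nothing more to call. None of these names come from the Vision Agents API.

```python
from typing import Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, dict]


def run_with_round_limit(
    model_step: Callable[[List[str]], Optional[ToolCall]],
    tools: Dict[str, Callable[..., object]],
    prompt: str,
    max_rounds: int = 3,
) -> List[str]:
    """Run tool calls requested by `model_step`, stopping after `max_rounds`."""
    transcript = [prompt]
    for _ in range(max_rounds):
        call = model_step(transcript)
        if call is None:
            break  # the model is done calling tools
        name, args = call
        result = tools[name](**args)
        # Feed the tool result back so the next round can see it.
        transcript.append(f"{name} -> {result}")
    return transcript
```

Even a model that requests a tool on every step returns control after `max_rounds` iterations, which is the property the limit exists to guarantee.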

### Graphics & Multimedia

- [Lip-Sync Stream Synchronization](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-streaming-engines/audio-playback-engines/chunked-audio-streaming/real-time-synthesis-streaming/lip-sync-stream-synchronization.md) — Streams a real-time interactive avatar with lip-sync, delivering synchronized video and audio. ([source](https://visionagents.ai/integrations/avatars/liveavatar.md))
- [Speech Synthesis & TTS](https://awesome-repositories.com/f/graphics-multimedia/audio-music/speech-synthesis-tts.md) — Converts text to natural-sounding speech using cloud-based neural or standard engines. ([source](https://visionagents.ai/integrations/tts/aws-polly.md))
- [Frame-Based Question Answering](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/chunked-video-processing/video-processing-apis/video-input-processing/frame-based-question-answering.md) — Receives video frames as input and processes them to answer questions or provide descriptions about what is visible in the footage. ([source](https://visionagents.ai/core/llm-core.md))
- [Real-Time Video Analysis](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/chunked-video-processing/video-processing-apis/video-input-processing/real-time-video-analysis.md) — Processes each frame of a participant's video track through custom or built-in analysis routines at a configurable frame rate. ([source](https://visionagents.ai/guides/video-processors.md))
- [Object Detection and Transformations](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/chunked-video-processing/video-processing-apis/video-input-processing/real-time-video-analysis/object-detection-and-transformations.md) — Detects objects, analyzes video content, and applies transformations like style transfer in real time. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Vision-Language Model Analyses](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/chunked-video-processing/video-processing-apis/video-input-processing/real-time-video-analysis/vision-language-model-analyses.md) — Processes live video frames through a vision-language model to extract understanding and generate responses. ([source](https://visionagents.ai/introduction/quickstart))
- [Real-Time Stream Transformations](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/real-time-stream-transformations.md) — Applies visual transformations to a video stream and publishes the modified frames back into the call for other participants to see. ([source](https://visionagents.ai/guides/video-processors.md))
- [Real-Time Style Transfer](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/real-time-style-transfer.md) — Transforms a live video stream by applying a chosen artistic style or prompt-based visual effect during the call. ([source](https://visionagents.ai/integrations/decart))
- [Vision-Language Video Agents](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/vision-language-video-agents.md) — Combines vision models with LLMs to watch, listen, and respond to live video streams with low latency. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Frame Buffering Pipelines](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/vision-language-video-agents/frame-buffering-pipelines.md) — Buffers video frames as JPEGs and sends them alongside text prompts for multimodal reasoning and analysis. ([source](https://visionagents.ai/integrations/gemini))
- [Pluggable Processing Pipelines](https://awesome-repositories.com/f/graphics-multimedia/video-frame-processing/pluggable-processing-pipelines.md) — Runs custom computer vision models like YOLO or Roboflow on video frames before or after an LLM call. ([source](https://cdn.jsdelivr.net/gh/getstream/vision-agents@main/README.md))
- [Real-Time Model Inference on Frames](https://awesome-repositories.com/f/graphics-multimedia/video-frame-processing/real-time-model-inference-on-frames.md) — Runs YOLO, Roboflow, or user-defined models on every frame of a live video stream for real-time detection and analysis. ([source](https://visionagents.ai/))
- [Human Pose Detections](https://awesome-repositories.com/f/graphics-multimedia/video-frame-processing/real-time-model-inference-on-frames/human-pose-detections.md) — Runs a YOLO pose model on each video frame to identify keypoints and draw skeleton overlays as the stream arrives. ([source](https://visionagents.ai/integrations/ultralytics))
- [Live Video Stream Monitoring](https://awesome-repositories.com/f/graphics-multimedia/video-stream-processing/live-video-stream-monitoring.md) — Processes real-time video feeds to detect and track people, packages, and events as they happen. ([source](https://visionagents.ai/examples/security-camera.md))
- [Packet-Level Audio Receivers](https://awesome-repositories.com/f/graphics-multimedia/audio-music/audio-capture-and-playback/raw-audio-captures/packet-level-audio-receivers.md) — Emits events for each audio packet received from participants for custom processing. ([source](https://visionagents.ai/reference/events-reference.md))
- [Audio-Video Synchronization](https://awesome-repositories.com/f/graphics-multimedia/audio-video-synchronization.md) — Delays video frames to match audio buffer depth for accurate lip-sync. ([source](https://visionagents.ai/core/avatar-core.md))
- [Sports Swing Analyses](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/video-transformation-enhancement/chunked-video-processing/video-processing-apis/video-input-processing/real-time-video-analysis/sports-swing-analyses.md) — Uses pose detection to track body position from live video and provides spoken coaching feedback on the swing. ([source](https://visionagents.ai/examples/golf-coach.md))
- [Media Track Management](https://awesome-repositories.com/f/graphics-multimedia/media-track-management.md) — Emits events when audio or video tracks are added to or removed from a call. ([source](https://visionagents.ai/reference/events-reference.md))
- [Real-Time Suggestion Overlays](https://awesome-repositories.com/f/graphics-multimedia/on-screen-debug-text/video-overlay-displays/real-time-suggestion-overlays.md) — Shows real-time suggestions on a semi-transparent overlay that stays visible without interrupting other applications. ([source](https://visionagents.ai/examples/sales-assistant.md))
- [Content Moderation Filters](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/content-moderation-filters.md) — Detects offensive gestures using a custom model running locally and applies a Gaussian blur to censor the video stream. ([source](https://visionagents.ai/examples/video-moderator.md))
- [Play-by-Play Narration Generations](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/play-by-play-narration-generations.md) — Feeds object-tracking data from a video stream to an LLM to produce live, spoken play-by-play narration. ([source](https://visionagents.ai/))
- [Sports Commentary Generations](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/sports-commentary-generations.md) — Combines object detection with real-time AI models to annotate live video and trigger commentary based on detected game events. ([source](https://visionagents.ai/examples/football-commentator.md))
- [Chained Vision Processors](https://awesome-repositories.com/f/graphics-multimedia/real-time-video-analytics/vision-language-video-agents/chained-vision-processors.md) — Runs a sequence of computer vision processors on video frames and passes annotated results to the language model. ([source](https://visionagents.ai/introduction/video-agents.md))
- [Edge Network Routings](https://awesome-repositories.com/f/graphics-multimedia/streaming-distribution/streaming-broadcasting/broadcasting-streaming/live-video-broadcasting/webrtc-video-chat/edge-network-routings.md) — Routes audio and video through a global edge network with sub-500 ms latency and provides frontend SDKs. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Native Vision Model Streams](https://awesome-repositories.com/f/graphics-multimedia/streaming-distribution/streaming-broadcasting/media-streaming/video-streaming/realtime-video-streamers/native-vision-model-streams.md) — Sends live video frames directly to a model with native vision support over WebRTC or WebSocket for the lowest latency. ([source](https://visionagents.ai/introduction/video-agents.md))
- [Live Style Swaps](https://awesome-repositories.com/f/graphics-multimedia/video-frame-styling/live-style-swaps.md) — Alters the active visual style of a video stream on the fly through a function-calling interface. ([source](https://visionagents.ai/integrations/decart))

### Networking & Communication

- [Bidirectional Audio Streaming](https://awesome-repositories.com/f/networking-communication/bidirectional-audio-streaming.md) — Transmits audio in both directions over WebSocket to enable real-time voice interaction between caller and agent. ([source](https://visionagents.ai/examples/phone-and-rag.md))
- [Custom Video Track Publishers](https://awesome-repositories.com/f/networking-communication/communication-platforms-services/video-communication-tools/video-call-integrations/custom-video-track-publishers.md) — Outputs a custom video track (e.g., AI-generated content or avatars) that participants see in the live session. ([source](https://visionagents.ai/core/processors-core.md))
- [Programmatic Call Joiners](https://awesome-repositories.com/f/networking-communication/communication-platforms-services/video-communication-tools/video-call-integrations/programmatic-call-joiners.md) — Joins a video call through an async context manager and waits for participants before proceeding with the conversation. ([source](https://visionagents.ai/core/agent-core.md))
- [Twilio Call Connectors](https://awesome-repositories.com/f/networking-communication/inbound-call-routers/ai-powered-inbound-call-answerers/twilio-call-connectors.md) — Links a voice agent to Twilio for handling both inbound and outbound telephone calls. ([source](https://visionagents.ai/introduction/voice-agents.md))
- [Join-Leave Reactions](https://awesome-repositories.com/f/networking-communication/communication-protocols-architectures/communication-paradigms/group-membership-management/participant-management/participant-interaction-hooks/join-leave-reactions.md) — Triggers custom logic when participants join or leave a call for personalized interactions. ([source](https://visionagents.ai/guides/event-system.md))
- [AI-Powered Inbound Call Answerers](https://awesome-repositories.com/f/networking-communication/inbound-call-routers/ai-powered-inbound-call-answerers.md) — Answers incoming phone calls with an AI agent that uses a knowledge base to provide product information. ([source](https://visionagents.ai/examples/phone-and-rag.md))
- [RAG-Enhanced Call Answerers](https://awesome-repositories.com/f/networking-communication/inbound-call-routers/rag-enhanced-call-answerers.md) — Answers Twilio-powered voice calls and responds using knowledge retrieved from a RAG-backed vector store. ([source](https://visionagents.ai/))
- [Webhook-Based Call Acceptors](https://awesome-repositories.com/f/networking-communication/inbound-call-routers/webhook-based-call-acceptors.md) — Accepts incoming phone calls via webhook, validates the request, and starts a bidirectional media stream for AI processing. ([source](https://visionagents.ai/guides/calling.md))
- [Outbound Call Initiators](https://awesome-repositories.com/f/networking-communication/outbound-call-initiators.md) — Programmatically places phone calls through the REST API and connects them to a media stream for real-time AI interaction. ([source](https://visionagents.ai/guides/calling.md))
- [Automated Outbound Dialers](https://awesome-repositories.com/f/networking-communication/outbound-call-initiators/automated-outbound-dialers.md) — Places phone calls automatically for tasks such as booking reservations without human initiation. ([source](https://visionagents.ai/examples/phone-and-rag.md))
- [WebSocket PCM Audio Streams](https://awesome-repositories.com/f/networking-communication/socket-networking/audio-streaming-servers/pcm-audio-streaming/websocket-pcm-audio-streams.md) — Delivers synthesized speech as a continuous 16-bit PCM audio stream over a bidirectional WebSocket connection at a configurable sample rate. ([source](https://visionagents.ai/integrations/inworld))
- [Call Lifecycle Management](https://awesome-repositories.com/f/networking-communication/telephony-services/call-control-interfaces/call-lifecycle-management.md) — Emits events when the agent joins or leaves a call, and when the call itself ends. ([source](https://visionagents.ai/reference/events-reference.md))

### System Administration & Monitoring

- [Agent Performance Monitoring](https://awesome-repositories.com/f/system-administration-monitoring/agent-performance-monitoring.md) — Tracks latency, token usage, and errors across all components using OpenTelemetry, Prometheus, and Jaeger. ([source](https://visionagents.ai/guides/deploying-overview.md))
- [Agent Execution Tracing](https://awesome-repositories.com/f/system-administration-monitoring/agent-execution-tracing.md) — Traces requests across agent components to debug latency issues using OpenTelemetry and Jaeger. ([source](https://visionagents.ai/core/telemetry.md))
- [Security Event Notifications](https://awesome-repositories.com/f/system-administration-monitoring/alert-thresholds/agent-anomaly-alerting/security-event-notifications.md) — Creates and posts notifications, including suspect images, to external services when a security event is detected. ([source](https://visionagents.ai/examples/security-camera.md))
- [Prometheus-Based Metric Exporters](https://awesome-repositories.com/f/system-administration-monitoring/prometheus-exporters/prometheus-based-metric-exporters.md) — Exports OpenTelemetry metrics to Prometheus for dashboarding and alerting on agent performance. ([source](https://visionagents.ai/core/telemetry.md))

### User Interface & Experience

- [Interactive Video Avatar Generators](https://awesome-repositories.com/f/user-interface-experience/avatars/realtime-avatar-renderers/interactive-video-avatar-generators.md) — Drives an animated avatar that sees, hears, and responds via real-time audio and video streams. ([source](https://visionagents.ai/))
- [Interruption Handlers](https://awesome-repositories.com/f/user-interface-experience/avatars/realtime-avatar-renderers/interactive-video-avatar-generators/avatar-speech-control/interruption-handlers.md) — Pauses avatar output when the user speaks, enabling natural conversational turn-taking. ([source](https://visionagents.ai/integrations/avatars/anam.md))
- [Synchronization Pipelines](https://awesome-repositories.com/f/user-interface-experience/avatars/realtime-avatar-renderers/interactive-video-avatar-generators/avatar-speech-control/synchronization-pipelines.md) — Streams TTS audio to generate lip-synced avatar video and audio frames for call participants. ([source](https://visionagents.ai/integrations/avatars/anam.md))
- [Lip Synchronization Engines](https://awesome-repositories.com/f/user-interface-experience/avatars/realtime-avatar-renderers/lip-synchronization-engines.md) — Produces a real-time visual character with lip movements synchronized to agent speech. ([source](https://visionagents.ai/integrations/introduction-to-integrations.md))
- [Video Feed Adjustments](https://awesome-repositories.com/f/user-interface-experience/avatars/avatar-appearance-configurators/video-feed-adjustments.md) — Provides configurable video feed parameters for avatar appearance customization. ([source](https://visionagents.ai/integrations/avatars/lemonslice.md))
- [Audio Routing Pipelines](https://awesome-repositories.com/f/user-interface-experience/avatars/llm-driven-avatar-frameworks/audio-routing-pipelines.md) — Implements audio routing through LLM pipelines for avatar video generation. ([source](https://visionagents.ai/integrations/lemonslice))
- [Plugin Event Emissions](https://awesome-repositories.com/f/user-interface-experience/component-utilities/ui-frameworks/component-apis/event-communication-systems/event-emission-declarations/custom-event-emission/plugin-event-emissions.md) — Fires typed events from a plugin to communicate transcripts, errors, or custom data back to the agent framework. ([source](https://visionagents.ai/integrations/create-your-own-plugin.md))
- [Video Processing Event Emissions](https://awesome-repositories.com/f/user-interface-experience/component-utilities/ui-frameworks/component-apis/event-communication-systems/event-emission-declarations/custom-event-emission/video-processing-event-emissions.md) — Fires application-defined events during frame processing so other parts of the system can react to detected objects or conditions. ([source](https://visionagents.ai/guides/video-processors.md))
- [Component Events](https://awesome-repositories.com/f/user-interface-experience/form-and-input-management/interaction-and-event-handling/event-handling-architectures/component-events.md) — Listens to events from all components in a single place using their respective event types. ([source](https://visionagents.ai/core/agent-core.md))
- [Turn Completion Detection](https://awesome-repositories.com/f/user-interface-experience/interaction-detection/turn-completion-detection.md) — Uses neural models to predict when a speaker has finished a conversational turn, enabling natural turn-taking. ([source](https://visionagents.ai/integrations/vogent))

### Web Development

- [Provider-Agnostic LLM Routing](https://awesome-repositories.com/f/web-development/provider-agnostic-llm-routing.md) — Sends requests to any supported language model provider through a single OpenAI-compatible interface, so models can be switched without code changes. ([source](https://visionagents.ai/integrations/openrouter))
- [Realtime Connection State Listeners](https://awesome-repositories.com/f/web-development/event-listeners/realtime-connection-state-listeners.md) — Emits events when a realtime session connects or disconnects, providing session configuration and disconnection reason. ([source](https://visionagents.ai/core/realtime-core.md))

### Security & Cryptography

- [Custom Session Storage Providers](https://awesome-repositories.com/f/security-cryptography/identity-access-management/session-management/custom-session-storage-providers.md) — Lets you plug in any TTL-capable key-value backend by implementing a small abstract interface for session data. ([source](https://visionagents.ai/guides/horizontal-scaling.md))
- [Performance Metrics Querying](https://awesome-repositories.com/f/security-cryptography/process-sandboxes/session-resumption/ai-agent-sessions/performance-metrics-querying.md) — Returns real-time performance data for running agent sessions including latency and token counts. ([source](https://visionagents.ai/guides/http-server.md))
- [Session Access Controllers](https://awesome-repositories.com/f/security-cryptography/process-sandboxes/session-resumption/ai-agent-sessions/session-access-controllers.md) — Uses FastAPI permission callbacks to control who can start, view, close, or inspect sessions. ([source](https://visionagents.ai/guides/http-server.md))
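
The Custom Session Storage Providers entry above describes a small abstract interface over a TTL-capable key-value backend. A minimal sketch of that idea follows. `SessionStore`, `InMemorySessionStore`, and every method name here are hypothetical and are not taken from the Vision Agents API; a real backend would likely delegate expiry to the store itself.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Tuple


class SessionStore(ABC):
    """Hypothetical interface; the real framework's interface may differ."""

    @abstractmethod
    def set(self, key: str, value: bytes, ttl_seconds: float) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class InMemorySessionStore(SessionStore):
    """Illustrative backend that expires entries lazily on read."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def set(self, key: str, value: bytes, ttl_seconds: float) -> None:
        # Store the absolute expiry time next to the value.
        self._data[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry has outlived its TTL: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)
```

Keeping the interface to three methods means any backend with per-key expiry could be adapted by forwarding `ttl_seconds` to its native expiry mechanism.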

### Software Engineering & Architecture

- [Agent Error Handlers](https://awesome-repositories.com/f/software-engineering-architecture/error-handling/error-management/agent-error-handlers.md) — Captures errors from specific components and wraps handler exceptions into a unified error event. ([source](https://visionagents.ai/guides/event-system.md))
- [Non-Realtime Error Events](https://awesome-repositories.com/f/software-engineering-architecture/error-handling/error-management/agent-error-handlers/non-realtime-error-events.md) — Emits an event carrying error details and recoverability status when a non-realtime language model error occurs. ([source](https://visionagents.ai/reference/events-reference.md))
- [Event Subscriber Exception Catchers](https://awesome-repositories.com/f/software-engineering-architecture/error-handling/error-management/network-exception-handlers/event-subscriber-exception-catchers.md) — Catches unhandled exceptions from event subscribers and emits them as a single error event. ([source](https://visionagents.ai/reference/events-reference.md))
- [Escalating Verbal Warning Issuers](https://awesome-repositories.com/f/software-engineering-architecture/warning-issuance-systems/warning-suppressions/member-warning-issuers/escalating-verbal-warning-issuers.md) — Generates spoken warnings that increase in severity with each offense using text-to-speech synthesis. ([source](https://visionagents.ai/examples/video-moderator.md))
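
The Event Subscriber Exception Catchers entry above describes catching unhandled subscriber exceptions and re-emitting them as a single error event. The general pattern can be sketched in plain Python as follows. `EventBus`, `HandlerErrorEvent`, and the `"error"` event name are hypothetical illustrations, not the framework's actual types, and the sketch is synchronous for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, DefaultDict, List


@dataclass
class HandlerErrorEvent:
    """Hypothetical unified error event; not the framework's real type."""
    source_event: str
    handler_name: str
    error: Exception


class EventBus:
    """Illustrative bus that isolates subscriber failures from each other."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Any], None]) -> None:
        self._subs[event_type].append(handler)

    def emit(self, event_type: str, payload: Any) -> None:
        for handler in list(self._subs[event_type]):
            try:
                handler(payload)
            except Exception as exc:
                if event_type == "error":
                    # A failing error handler is skipped to avoid infinite recursion.
                    continue
                name = getattr(handler, "__name__", repr(handler))
                # Wrap the failure and route it to the "error" subscribers.
                self.emit("error", HandlerErrorEvent(event_type, name, exc))
```

With this shape, one failing subscriber does not stop the remaining subscribers of the same event, and all failures arrive at one place for logging or recovery.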