01 - control computers with your voice

01 is a voice-to-code agent and a voice interface framework for language models that enables natural language control of computers and devices. It combines a real-time audio streaming server with a cross-platform voice client, translating spoken instructions into executable code to automate software, manage files, and browse the web.

The system supports both local and cloud-based language models, alongside local or hosted speech-to-text and text-to-speech engines. It is designed for custom hardware integration, providing the means to build embedded AI voice controllers using microcontrollers like the ESP32, including 3D-printable files for case fabrication and hardware assembly guidance.
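
The local-or-hosted choice described above can be pictured as interchangeable providers behind a common interface. The following is a minimal sketch under that assumption, not 01's actual API: the names `SpeechToText`, `LanguageModel`, `TextToSpeech`, and `VoicePipeline`, and the three stand-in providers, are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Hypothetical pipeline: any provider matching the protocols can be plugged in."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # speech -> text
        reply = self.llm.complete(text)     # text -> model reply
        return self.tts.synthesize(reply)   # reply -> audio bytes


# Stand-in providers so the sketch runs without any model or audio engine.
class EchoStt:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseModel:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class BytesTts:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(EchoStt(), UppercaseModel(), BytesTts())
```

Swapping a local engine for a hosted one would then mean passing a different object to `VoicePipeline`, leaving the rest of the flow unchanged.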

The project covers a broad range of capabilities, including audio processing via WebSockets, agent behavior configuration through profile management, and remote access via server tunneling. It also includes security features such as execution environment isolation and system change auditing to manage the risks of autonomous code execution.
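
As an illustration of the change auditing mentioned above, one simple approach is to hash every file under a directory before and after a session and compare the two snapshots. This is a sketch of that general technique only; the function names `snapshot` and `diff_snapshots` are hypothetical, and the project's own auditing mechanism may work differently.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report which files were added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
    }
```

Taking one snapshot before the agent runs and another afterwards gives a list of files to review. Reverting them would additionally require keeping copies of the original contents, which this sketch does not do.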

The system can be deployed across various platforms, from low-power microcontrollers to full desktop operating systems, using a unified server and client execution model.

Features

Voice Assistant Hardware - Provides instructions and materials for building physical voice interface devices using ESP32 chips and microcontrollers.
Voice Controlled Computing - Enables the execution of system-level operations and complex computer tasks via spoken natural language commands.
Language Model Integrations - Provides adapters and streaming interfaces to connect the system to various hosted or local language model providers.
Natural Language Command Translation - Translates natural language voice input into executable system commands for desktop and file management automation.
Local Model Integrations - Integrates on-device inference providers to enable intelligence layers without reliance on external cloud APIs.
Cross-Platform Deployments - Enables deployment across a diverse hardware spectrum, from ESP32 microcontrollers to full desktop operating systems.
Speech-to-Text and Text-to-Speech Integrations - Integrates speech-to-text and text-to-speech engines, including local options that process voice data without external cloud APIs.
Assistant Personalization - Supports the definition of custom system messages to establish the identity and behavioral constraints of the voice assistant.
Voice-to-Code Translation - Converts spoken natural language requests into executable code to automate file management and software operations.
Real-Time Audio WebSockets - Transmits real-time audio byte streams between clients and servers using WebSockets for low-latency communication.
Natural Language Automation - Translates natural language instructions into cross-platform operating system tasks, web browsing, and software operations.
Hardware Audio Streaming - Streams audio bytes directly to physical hardware devices for real-time voice interaction and playback.
IoT Device Server Implementation - Implements a server backend optimized for managing communication and logic for IoT hardware.
Voice Clients - Ships a flexible voice client for capturing audio and playing responses across desktop, mobile, and ESP32 platforms.
Real-Time Voice Backend Hosting - Runs a real-time server facilitating bidirectional audio streams via WebSockets for AI agents.
Audio Streaming Servers - Implements a dedicated server for low-latency audio streaming between hardware clients and an intelligence backend.
Voice and Vision Processing - Processes natural language audio input to trigger system actions, file management, and software control.
Natural Language Interfaces - Provides a hardware-integrated system for deploying natural language interfaces on low-power chips and custom electronics.
WebSocket Connection Management - Configures WebSocket addresses to establish live, low-latency communication between a mobile device and the computer.
Microcontroller Audio Interfaces - Integrates microcontrollers with audio and wireless capabilities to provide a physical interface for voice interactions.
Voice Interaction Interfaces - Provides the user interface layer for speech-based input and synthesized audio output to control a computer.
Voice Interfaces - Installs the necessary runtime dependencies and environments to enable natural language computer control across various operating systems.
Headless Server Hosting - Hosts non-graphical backend logic to provide API access and a communication link between computers and hardware.
Multimodal Context Providers - Monitors ambient audio and surroundings to provide multimodal situational awareness to the agent when not actively prompted.
Voice-Activated Triggers - Captures audio input via manual buttons or voice activity detection to trigger server requests.
Agent Behavioral Configuration - Allows customizing AI agent responses and behaviors through specialized instructions and system messages.
Agent Model Profiles - Defines language models and context windows through profiles to customize agent intelligence and voice output.
Isolated Execution Environments - Runs autonomous code in virtual machines or restricted accounts to prevent unintended changes to the host system.
Cross-Platform Compatibility - Supports a flexible server design that operates across diverse environments from ESP32 microcontrollers to desktop OSs.
Assistant Behavioral Profiles - Allows users to create and switch between named configuration profiles to adjust model settings and agent behaviors.
Audio Input Capture - Captures raw audio data from microphones using push-to-talk or voice activity detection for system interaction.
Hardware Assembly Guides - Enables the assembly of portable or desktop intelligent devices using microcontrollers and custom physical configurations.
Physical Construction Instructions - Provides comprehensive visual and written instructions for physically constructing the voice interface hardware.
WebSocket Clients and Servers - Implements both client and server components for bidirectional communication via WebSockets to link interfaces and backends.
Local Server Tunnels - Tunnels a local server to a public URL using a proxy to facilitate remote access from external networks.
Remote Access & Control - Provides interfaces and protocols for remotely interacting with a home server's files and applications via a mobile device.
Remote Device Connectivity - Sets network and server credentials via firmware or captive portals to establish connectivity between hardware and servers.
Voice Server Client Libraries - Provides dedicated client libraries to connect various programming languages and platforms to the voice server.
Client-Server Hardware Architectures - Separates the high-level audio capture and playback interface from the heavy computational logic on the server.
Change Auditing - Compares system files and settings before and after sessions to identify and revert unexpected modifications made by the agent.
Network Access Restrictions - Uses firewalls and VPNs to implement policies that limit the agent's ability to interact with the broader network.
Modular Provider Interfaces - Implements architectural patterns that allow swapping between local on-device inference and cloud-based APIs for processing.
Remote Agent Hardware Linking - Establishes architectural links between mobile devices and home machines to allow agent-driven control of files and IoT devices.
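
The voice activity detection named in the Voice-Activated Triggers and Audio Input Capture entries can be approximated with a simple energy threshold. The sketch below uses that stand-in technique and is not necessarily what 01 does. It assumes frames of 16-bit little-endian mono PCM, and the names `frame_rms` and `speech_segments`, the `threshold` value, and the `hangover` parameter are all hypothetical.

```python
from __future__ import annotations

import math
import struct
from typing import Iterable, Iterator


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_segments(
    frames: Iterable[bytes], threshold: float = 500.0, hangover: int = 2
) -> Iterator[list[bytes]]:
    """Yield runs of frames judged to contain speech.

    A segment starts at the first frame whose RMS reaches `threshold` and ends
    once `hangover` consecutive quieter frames follow; trailing quiet frames
    are trimmed from the yielded segment.
    """
    segment: list[bytes] = []
    quiet = 0
    for frame in frames:
        if frame_rms(frame) >= threshold:
            segment.append(frame)
            quiet = 0
        elif segment:
            segment.append(frame)
            quiet += 1
            if quiet >= hangover:
                yield segment[:-quiet]
                segment, quiet = [], 0
    if segment:
        yield segment[:-quiet] if quiet else segment
```

In a client, each yielded segment would be the audio sent to the server as one request. A push-to-talk button would replace this logic with explicit start and stop events.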

huggingface/speech-to-speech

4,895 stars on GitHub

This project is a framework for building local voice assistants and a real-time audio streaming server. It functions as a containerized inference engine and a multilingual speech pipeline that orchestrates speech-to-text, language models, and text-to-speech components to convert spoken input into spoken output. The system is distinguished by its use of WebSocket-based bidirectional streaming for low-latency interactions. It features a voice activity detection system that manages speech boundaries and handles user barge-in interruptions during assistant playback. It also supports custom voice…

jasperproject/jasper-client

4,523 stars on GitHub

Jasper Client is a voice computing client and extensible speech framework designed to translate natural language speech into hardware actions and service requests. It functions as a voice command interface that manages the end-to-end process of audio capture, transcription, and action execution. The system features a modular architecture that allows for the integration of custom plugins, various speech recognition engines, and synthesis providers. This plugin-based approach supports the addition of new speakers and regional language capabilities without altering the core logic. The client in…

benawad/dogehouse

9,025 stars on GitHub

Dogehouse is an open-source voice chat platform that enables users to create and join real-time voice conversation rooms with moderation controls. The platform is built around a room-based channel architecture where users are organized into isolated virtual rooms, with audio streams routed only to participants within each room. The platform separates its voice processing logic into a standalone server component, distinct from the client interface, and uses server-side audio mixing to combine multiple incoming audio streams before broadcasting to reduce client bandwidth. Real-time voice data i…

KillianLucas/open-interpreter

64,024 stars on GitHub

Open Interpreter is a coding agent that uses large language models to write and execute code directly on a local host machine. It functions as a system for performing operating system tasks and file manipulations through a natural language interface. The project features a model orchestrator that allows switching between different language model providers and emulation harnesses. It employs a loop-based reasoning process to iteratively generate code and process execution output until a goal is achieved. Its capabilities include cross-platform system automation, local model integration for da…
