Self Operating Computer

This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces.

The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates.
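As a rough illustration of the text-to-coordinate step described above, the sketch below turns OCR bounding boxes into a click point. It is a minimal sketch under stated assumptions, not the project's implementation: `OcrBox` and `find_click_point` are hypothetical names, and the boxes are assumed to come from some OCR engine that reports left, top, width, and height in pixels.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OcrBox:
    """One recognized text fragment and its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        # The center of the box is a reasonable place to click.
        return (self.left + self.width // 2, self.top + self.height // 2)


def find_click_point(boxes: List[OcrBox], label: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first box whose text matches label, ignoring case."""
    wanted = label.strip().lower()
    for box in boxes:
        if box.text.strip().lower() == wanted:
            return box.center()
    return None  # The label is not visible on screen.
```

Returning `None` when the label is missing lets the caller decide whether to take a fresh screenshot or ask the model for a different target.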

The framework supports voice-controlled computing by translating spoken commands into text-based objectives. It manages a full automation loop encompassing state observation through screenshots, action planning via cloud or local APIs, and the execution of synthetic inputs.
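The observe, plan, and execute loop described above can be sketched as follows. This is an illustrative outline only: `run_agent_loop`, the `"operation"` key, and the `"done"` value are assumptions made for the example, and the three callables stand in for screenshot capture, a model API call, and input simulation.

```python
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


def run_agent_loop(
    objective: str,
    observe: Callable[[], bytes],
    plan: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat observe -> plan -> execute until the planner reports completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()                         # state observation
        action = plan(objective, screenshot, history)  # action planning
        history.append(action)
        if action.get("operation") == "done":          # assumed stop signal
            break
        execute(action)                                # synthetic input
    return history
```

Capping the loop with `max_steps` keeps a confused planner from issuing inputs indefinitely. A spoken command would enter this loop only as the `objective` string, after speech-to-text conversion.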

Features

GUI and Computer Agents - Implements an autonomous agent that uses vision models to interact with operating system GUIs and automate desktop tasks.

Computer Automation Interfaces - Provides a control layer that simulates human input and uses visual analysis to automate desktop tasks.

Autonomous UI Interaction - Interacts with software applications by mapping on-screen text and visual elements to precise clickable coordinates.

Multimodal AI Orchestrators - Coordinates vision and language models to simulate mouse and keyboard actions for unified agentic workflows.

Multimodal Vision Interfaces - Provides the integration layer for processing screen images through multimodal AI models for high-level action planning.

Visual UI Labeling - Overlays numerical markers on detected UI components to help the AI reference specific elements by ID.

Screen Text Extractors - Uses OCR on arbitrary screen regions to map clickable text and buttons to screen coordinates.

Text-to-Coordinate Mapping - Generates a coordinate map of on-screen text using OCR to allow precise clicking of specific elements.

Visual Grounding - Overlays visual markers on UI components found by detection models so the AI can target buttons and other elements more accurately.

Multimodal Agents - Implements an autonomous agent capable of reasoning across vision and language to perform actions.

Virtual Input Simulation - Programmatically simulates mouse movements and keystrokes to automate operating system interactions.

Multimodal Desktop Observers - Combines screenshots with multimodal models to monitor and determine the current state of the desktop environment.

Computer Vision Screen Interaction Tools - Locates visual elements on a display through multimodal vision models to execute automated interactions.

OCR Coordinate Mapping - Uses optical character recognition to translate on-screen text labels into precise X and Y pixel coordinates.

Visual Localization Tools - Identifies and localizes UI components through visual analysis and coordinate mapping.

AI Model Integrations - Provides adapters and interfaces for connecting to both cloud-based and local vision models to drive interactions.

Voice Controlled Computing - Executes complex computer tasks and system objectives through spoken commands captured by audio hardware.

Voice-Controlled Goal Definition - Translates spoken user commands into text-based objectives to seed the autonomous agent loop.

Speech-to-Text Pipelines - Implements automated workflows that convert spoken audio input into text-based goals for the agent loop.

Visual Element Identification - Identifies interface components by searching for specific images or patterns within screen captures.

AI Agents - Experimental framework for AI-driven computer operation.

Autonomous Agent Frameworks - Enables multimodal models to control computer interfaces autonomously.

Computer Use - Multimodal framework for operating desktop applications.

Personal AI Assistants - Automates repetitive desktop and browser tasks via human-like interaction.
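The visual labeling feature listed above, where the model refers to elements by numeric ID, could work along these lines. This is a hedged sketch rather than the project's code: `label_elements` and `resolve_label` are hypothetical helpers, and the boxes are assumed to be `(left, top, right, bottom)` pixel tuples produced by some detection model.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def label_elements(boxes: List[Box]) -> Dict[int, Box]:
    """Assign IDs starting at 1 in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {label_id: box for label_id, box in enumerate(ordered, start=1)}


def resolve_label(labels: Dict[int, Box], label_id: int) -> Tuple[int, int]:
    """Map an ID chosen by the model back to the center of its box."""
    left, top, right, bottom = labels[label_id]
    return ((left + right) // 2, (top + bottom) // 2)
```

The IDs would be drawn onto the screenshot before it is sent to the model, so the model can answer with a small integer instead of estimating raw pixel coordinates.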

OthersideAI/self-operating-computer
