Agent S

Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions.

The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to move data between disparate software.

Its broader capabilities cover hierarchical task planning, multimodal state observation, and native code execution for problem solving. The system also includes comprehensive media handling for screen capture and audio transcription, filesystem management, and interaction error recovery to refine task outcomes.

The framework provides a command-line interface for executing standalone automation scripts without a separate build step.

Features

AI Agent Orchestrators - Integrates multimodal models to autonomously execute complex goals described in natural language.

GUI and Computer Agents - Performs high-level interactions on interface components, including clicking buttons and setting text.

Desktop Automation - Controls mouse and keyboard inputs to automate complex tasks across various OS applications.

Model Provider Integrations - Implements unified interfaces to connect and configure various external and local language model providers.

Computer Vision Localization - Translates conceptual descriptions of UI elements into precise pixel coordinates on screenshots.

Cross-Application Workflow Automation - Automates multi-step sequences to move data between disparate desktop apps and web browsers.

Desktop Automation Frameworks - Uses large language models to control the OS by combining accessibility tree data and visual analysis.

Language Model Querying - Provides the core capability to send text and image prompts to LLMs to retrieve reasoning and responses.

AI Model Integrations - Provides adapters and interfaces to connect multimodal models including chat, visual grounding, and speech-to-text.

OCR Engines - Uses an OCR engine to identify and locate bounding boxes of text within the user interface.

Task Planning Systems - Decomposes high-level goals into manageable subtasks using hierarchical planning and real-time knowledge.

Vision-Language Grounding Models - Utilizes models that map natural language instructions to specific spatial coordinates on a visual user interface.

Multimodal Agents - Processes screenshots and accessibility data using vision and language models to execute complex tasks.

Natural Language Automation - Executes sequences of primitive actions to complete goals described in plain text.

Application Lifecycle Managers - Locates and launches software applications and navigates to web URLs within the desktop environment.

Desktop Applications - Manages the launching and navigation of desktop applications and web addresses.

Keyboard Input Automation - Sends text, key commands, and raw physical key presses directly to the operating system.

Mouse Control Automation - Simulates mouse button events to trigger clicks at specific screen coordinates across different OSs.

Keyboard Shortcuts - Executes specific key combinations with modifiers to trigger application-level commands.

Screen Capture Tools - Captures full-screen snapshots of the desktop, including an option to hide the cursor.

Screen Capture Utilities - Captures the graphical user interface as base64 strings to enable visual perception for the AI agent.

Focus Navigation Controllers - Manages window focus and visibility transitions when switching between different applications.

Region Capture - Takes screenshots of specific coordinate-defined regions to isolate particular UI areas for analysis.

Accessibility Tree Accessors - Retrieves the root accessibility node of the currently focused application or of a specific process ID.

Desktop Application Automation - Provides the ability to identify and retrieve handles for specific desktop applications and the default browser.

Accessibility Tree Automation - Activates the accessibility tree of focused applications to enable programmatic inspection and interaction.

GUI Element Localizations - Finds specific interface nodes using direct child traversal or natural-language concept searches.

GUI Structure Inspections - Captures a snapshot of a window's accessibility tree to identify interactable components.

Window Context Binding - Connects the agent to a target interface using foreground window handles, process IDs, or platform identifiers.

Multimodal Desktop Observers - Implements a multimodal state observation system combining accessibility snapshots and image data.

Desktop Application Workflows - Executes multi-step sequences across different software to move data between disparate applications.

Accessibility-Tree-Based Locators - Traverses the system accessibility layer to locate interactive UI elements via semantic roles and names.

User Interaction Simulation - Simulates human-like keystrokes and mouse events using a cartesian coordinate system.

Accessibility Role Mapping - Uses standardized accessibility roles to consistently discover and identify specific interface components.

Window Visibility Controllers - Determines whether launched applications remain visible or hidden to manage the user interface state.

Coordinate Converters - Translates relative screen coordinates into global physical pixels to ensure precise GUI interaction.
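
The conversion described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Rect` type and the `relative_to_global` function are hypothetical names, and the sketch assumes relative coordinates are fractions in the range 0.0 to 1.0 of a region whose origin is already known in global physical pixels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangle in global physical pixels (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def relative_to_global(rel_x: float, rel_y: float, region: Rect) -> tuple[int, int]:
    """Map a point given as fractions (0.0-1.0) of a region to global pixels."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("relative coordinates must lie in [0.0, 1.0]")
    # Scale by the region size, then offset by the region's global origin.
    x = region.left + round(rel_x * region.width)
    y = region.top + round(rel_y * region.height)
    return x, y


window = Rect(left=100, top=50, width=800, height=600)
print(relative_to_global(0.5, 0.5, window))  # (500, 350)
```

Display scaling (DPI) is left out on purpose; a real converter would likely need to account for it.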

Semantic - Locates elements within the accessibility tree using natural-language queries and similarity scoring.
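
One way to picture similarity scoring over element names is shown below. The source does not say which scoring method the framework uses; this sketch substitutes plain string similarity from Python's standard `difflib` module, and the `best_match` function and its `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_match(
    query: str, element_names: list[str], threshold: float = 0.5
) -> Optional[tuple[str, float]]:
    """Return (name, score) for the closest element name, or None if no
    candidate reaches the threshold."""
    scored = [
        (name, SequenceMatcher(None, query.lower(), name.lower()).ratio())
        for name in element_names
    ]
    if not scored:
        return None
    # Pick the candidate with the highest similarity ratio.
    name, score = max(scored, key=lambda pair: pair[1])
    return (name, score) if score >= threshold else None


result = best_match("save as", ["Save", "Save As...", "Cancel"])
print(result[0] if result else None)  # Save As...
```

Character-level similarity only approximates a natural-language query; an embedding-based score would be a closer fit to "semantic" matching but needs an external model.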

Visual Analysis Processors - Captures structural text and imagery of the screen for processing by vision models.

Visual Localization Tools - Identifies physical pixel coordinates of visual concepts within screenshots using grounding models.

Process-Level Input Injectors - Simulates low-level keyboard and mouse events to interact with graphical user interfaces at the OS level.

UI Element Selectors - Searches for interface elements using a combination of ARIA roles, traversal orders, and name filters.
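
A selector that combines a role, a traversal order, and a name filter might look like the sketch below. The `Node` class is a hypothetical stand-in for an accessibility node, and `find_first` is an illustrative name; neither is taken from the project's API. The traversal order assumed here is depth-first pre-order.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an accessibility node."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield nodes in depth-first, pre-order traversal."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_first(
    root: Node, role: str, name_contains: Optional[str] = None
) -> Optional[Node]:
    """Return the first node with the given role whose name contains the
    filter text (case-insensitive), or None if nothing matches."""
    for node in walk(root):
        if node.role != role:
            continue
        if name_contains is None or name_contains.lower() in node.name.lower():
            return node
    return None


tree = Node("window", "Editor", [
    Node("toolbar", "Main", [Node("button", "Save"), Node("button", "Open")]),
    Node("textbox", "Document"),
])
match = find_first(tree, role="button", name_contains="open")
print(match.name if match else None)  # Open
```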

UI Inspection Tools - Retrieves global physical bounding boxes of UI elements and ensures they are scrolled into view.

Interaction Area Definitions - Specifies rectangular regions on the screen using coordinates to identify elements for agent interaction.

Window Management - Adjusts the size and visibility of on-screen windows through minimize, maximize, and close actions.

Window State Controls - Modifies window states such as minimizing, maximizing, or closing application instances.

Window Lifecycle Controllers - Controls top-level window states including minimizing, maximizing, and closing.

Agent Capability Extensions - Provides mechanisms to integrate custom tools and capabilities to extend the agent's behavior for specific use cases.

Audio Transcription - Converts captured audio chunks into text transcripts using integrated speech-to-text models.

Code Execution Agents - Generates and executes native code directly to solve complex problems through system interactions.

Coordinate Normalization Utilities - Determines whether screen coordinates are absolute pixel positions or relative percentages for visual grounding.
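
The check described above could be approximated with a simple range heuristic. The rule used here (both values inside 0 to 1 means a relative fraction) is an assumption for illustration, as are the function names; the source does not state how the framework decides.

```python
def is_relative(x: float, y: float) -> bool:
    """Heuristic (assumed): a pair inside [0, 1] on both axes is a fraction."""
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0


def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Return absolute pixel coordinates for a grounding result that may be
    either a relative fraction or already an absolute position."""
    if is_relative(x, y):
        return round(x * width), round(y * height)
    return round(x), round(y)


print(to_pixels(0.25, 0.5, 1920, 1080))  # (480, 540)
print(to_pixels(480, 540, 1920, 1080))   # (480, 540)
```

The heuristic is ambiguous near the origin: the pixel positions (0, 0) and (1, 1) would be read as fractions, so a production check would need extra context such as the model's declared output format.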

Dynamic Plan Refinement - Adjusts subgoals and action sequences based on environmental feedback and new observations.

Knowledge and Memory - Stores and retrieves successful task trajectories to refine and optimize future automation actions.

Visual State Verifications - Evaluates whether the current screen satisfies specific conditions or contains required interface elements.

Screen Capture Extraction - Captures screen regions and uses language models to parse visual content into structured data.

Metadata Extraction - Retrieves specific metadata from a UI element, such as its role, title, or current value.

Accessibility Tree Exporting - Renders a window's accessibility subtree as a structured text representation for AI analysis.
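
As a rough illustration of a structured text rendering, the sketch below prints a subtree as an indented outline. The `Node` class and the `[role] name` line format are assumptions chosen for readability, not the format the framework emits.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an accessibility node."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> str:
    """Render a subtree as indented text, one node per line."""
    # Two spaces of indentation per tree level; trailing space is trimmed
    # when the node has no name.
    line = f"{'  ' * depth}[{node.role}] {node.name}".rstrip()
    lines = [line]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


tree = Node("window", "Editor", [
    Node("toolbar", "", [Node("button", "Save")]),
])
print(render(tree))
```

A compact outline like this keeps the hierarchy visible while staying small enough to place in a model prompt.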

System Clipboard Access - Reads and writes data to the system clipboard to facilitate information transfer between disparate applications.

Display Automation Tools - Retrieves pixel dimensions and coordinates of physical screens to determine the available workspace area.

Execution Sampling Strategies - Optimizes task success rates by running multiple attempts in parallel and sampling the best outcome.

Web Search Integrations - Integrates external search engine APIs to retrieve real-time web information for task context.

Image Annotation Tools - Draws reference grids over images to provide spatial coordinates for visual grounding.

Image Processing - Performs pixel-level transformations including resizing, compression, and grid overlays for visual analysis.

Image Compression Tools - Compresses images into JPEG format with configurable quality settings to optimize size.

Base64 Image Decoders - Encodes images to base64 data URLs and decodes base64 strings back into image objects.
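
The round trip between raw image bytes and a base64 data URL can be sketched with the standard library alone. The function names are hypothetical, and the sketch stops at raw bytes: turning those bytes into an image object would require an imaging library, which is left out here.

```python
import base64


def to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def from_data_url(url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into its MIME type and decoded bytes."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)


png_signature = b"\x89PNG\r\n\x1a\n"  # first eight bytes of any PNG file
url = to_data_url(png_signature)
print(url)
print(from_data_url(url) == ("image/png", png_signature))  # True
```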

Image Transformation Utilities - Resizes images to fit within specified bounds while maintaining the original aspect ratio.
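
The arithmetic behind an aspect-preserving resize is short enough to show directly. This sketch only computes the target dimensions; `fit_within` is a hypothetical name, and the choice never to upscale a smaller image is an assumption rather than documented behavior.

```python
def fit_within(
    width: int, height: int, max_width: int, max_height: int
) -> tuple[int, int]:
    """Return the largest size that fits inside the bounds while keeping
    the original aspect ratio. Images already within bounds are unchanged."""
    if min(width, height, max_width, max_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the tighter of the two axis constraints; cap at 1.0 so the image
    # is never enlarged (assumed behavior).
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1920, 1080, 1280, 1280))  # (1280, 720)
print(fit_within(800, 600, 1280, 1280))    # (800, 600)
```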

Cursor State Retrieval - Returns the current x and y coordinates of the mouse cursor in pixels.

Window Handle Management - Provides the ability to retrieve handles for all visible top-level windows or specific processes.

Error Recovery - Detects failures in GUI actions and self-corrects by adapting navigation or grounding methods.
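
Falling back from one grounding method to another can be pictured as trying a list of strategies in order. Everything named below (`locate_with_fallback` and the two demo strategies) is hypothetical and stands in for whatever accessibility-based and vision-based locators a real system would provide.

```python
from typing import Callable, Optional, Sequence

Point = tuple[int, int]
Strategy = Callable[[str], Optional[Point]]


def locate_with_fallback(target: str, strategies: Sequence[Strategy]) -> Point:
    """Try each locating strategy in order and return the first hit."""
    failures = []
    for strategy in strategies:
        try:
            point = strategy(target)
        except Exception as exc:  # a failed strategy should not stop the search
            failures.append(f"{strategy.__name__}: {exc}")
            continue
        if point is not None:
            return point
        failures.append(f"{strategy.__name__}: not found")
    raise LookupError(f"could not locate {target!r}: " + "; ".join(failures))


def by_accessibility_tree(target: str) -> Optional[Point]:
    return None  # demo: pretend the element is missing from the tree


def by_vision(target: str) -> Optional[Point]:
    return (640, 360)  # demo: pretend a grounding model found it


print(locate_with_fallback("Save button", [by_accessibility_tree, by_vision]))  # (640, 360)
```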

Success-Based Parallel Sampling - Executes multiple task attempts in parallel to select the most successful outcome.
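
Running several attempts in parallel and keeping the best one follows a best-of-n pattern, sketched below with a thread pool. The `best_of_n` helper, the `attempt` callable, and the scoring function are hypothetical; how the framework actually judges an outcome is not described in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(
    attempt: Callable[[int], T], score: Callable[[T], float], n: int = 4
) -> T:
    """Run n independent attempts in parallel and return the highest scoring."""
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each attempt receives its index so runs can be seeded differently.
        results = list(pool.map(attempt, range(n)))
    return max(results, key=score)


def demo_attempt(seed: int) -> dict:
    # Stand-in for a full task rollout; the numbers are arbitrary.
    return {"seed": seed, "steps_completed": (seed * 3) % 5}


best = best_of_n(demo_attempt, score=lambda r: r["steps_completed"], n=4)
print(best)  # {'seed': 3, 'steps_completed': 4}
```

Threads suit attempts that mostly wait on model or network calls; attempts that drive a real desktop would each need an isolated environment, since they cannot share one screen.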

Execution Logs - Records and streams workflow-level execution history to facilitate real-time monitoring and debugging.

Accessibility Inspection Tools - Analyzes the system accessibility layer to identify UI components by roles and structure.

Focus Management - Moves keyboard focus to specific elements and manages foreground window state.

Drag and Drop Simulations - Enables moving objects and data between different desktop applications through simulated drag-and-drop actions.

Programmatic Scrolling - Implements programmatic vertical and horizontal scrolling using pixel deltas to navigate user interfaces.

Agent Frameworks - Framework for agents that interact with computers.

Agent Memory Systems - Open agentic framework for computer control using human-like interaction.

AI Agents - GUI agent capable of multi-app collaboration and self-learning.

AI Agents and Automation - Open agentic framework that autonomously interacts with computer GUIs like a human.

Computer Use - Framework for human-like computer interaction.

simular-ai/Agent-S
