Autonomous software agents designed to interact with desktop operating systems and applications to perform tasks.
UI-TARS is an LLM GUI automation framework and multimodal action grounding system. It functions as a GUI agent orchestrator and cross-platform device controller that uses large language models to interpret graphical interfaces and execute actions across desktop and mobile operating systems. The system translates model-generated coordinates into precise screen positions to interact with visual user interface elements. It employs a multimodal approach to interpret screen layouts and decomposes complex goals into multi-step trajectories through reasoning and error correction. The project provides capabilities for cross-platform interface control, including clicking, typing, and scrolling across web, mobile, and desktop environments. It includes tools for desktop and mobile GUI interaction, automation script generation, and visual grounding evaluation to measure coordinate precision. The framework supports hosting models on cloud platforms to provide scalable inference endpoints.
UI-TARS is a framework built specifically for GUI automation that uses multimodal LLMs for vision-based reasoning and executes precise mouse and keyboard actions across desktop and mobile environments.
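The translation from model-generated coordinates to screen positions mentioned for UI-TARS can be illustrated with a minimal sketch. It assumes, purely for illustration, that the model emits points on a 0 to 1000 grid that is independent of the screenshot size. The function name `to_screen_pixels` and the `grid` parameter are hypothetical and are not taken from the UI-TARS codebase.

```python
def to_screen_pixels(norm_x, norm_y, screen_width, screen_height, grid=1000):
    """Map a point on an assumed grid-by-grid model space to screen pixels."""
    if not (0 <= norm_x <= grid and 0 <= norm_y <= grid):
        raise ValueError("model coordinate lies outside the assumed grid")
    # Scale to the real resolution, then clamp so the result is a valid pixel index.
    px = min(screen_width - 1, round(norm_x / grid * screen_width))
    py = min(screen_height - 1, round(norm_y / grid * screen_height))
    return px, py


if __name__ == "__main__":
    print(to_screen_pixels(500, 500, 1920, 1080))  # (960, 540)
```

Working in a resolution-independent space like this is one way to let the same model output drive displays of different sizes. The actual convention used by a given model may differ.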
UI-TARS-desktop is a cross-platform desktop application designed to automate software interface interactions. It functions as a local agent environment that interprets graphical user interfaces through multimodal visual-language model reasoning, allowing it to navigate and manipulate software by simulating human-like mouse and keyboard inputs. The platform distinguishes itself by executing all visual recognition and decision-making logic directly on the host machine. This local inference model ensures that screen data and sensitive information remain private, as no processing is offloaded to external servers. By mapping visual analysis to low-level operating system input drivers, the tool provides a consistent method for controlling both desktop applications and web browser environments. Beyond basic interface interaction, the software includes a modular tool server protocol that allows for the integration of external functional modules. This framework enables the agent to extend its capabilities beyond graphical tasks, connecting to external systems and services to perform complex, multi-step workflows.
This platform is a purpose-built AI agent environment that integrates multimodal vision reasoning with native mouse and keyboard control to automate desktop and GUI applications locally.
This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing by translating spoken commands into text-based objectives. It manages a full automation loop encompassing state observation through screenshots, action planning via cloud or local APIs, and the execution of synthetic inputs.
This framework is designed specifically to enable AI agents to perceive and control desktop environments through vision-based reasoning and synthetic mouse and keyboard inputs, directly fulfilling the requirements for an AI desktop automation agent.
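The automation loop described above (observe a screenshot, plan an action, execute synthetic input) could be sketched as follows. This is a hedged illustration rather than the project's code: `Action`, `run_agent_loop`, and the three injected callables `capture`, `plan`, and `execute` are hypothetical names, and a real system would plug a screen grabber, a vision-model call, and an input driver into them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    kind: str                   # e.g. "click", "type", or "done"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


def run_agent_loop(objective, capture, plan, execute, max_steps=10):
    """Repeat observe -> plan -> act until the planner says "done" or steps run out."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()                        # observe current screen state
        action = plan(objective, screenshot, history)  # ask the model for the next step
        if action.kind == "done":
            break
        execute(action)                               # send the synthetic input
        history.append(action)
    return history
```

Passing the three stages in as functions keeps the loop testable with stubs and leaves the choice between a cloud API and a local model to the caller.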
cc-haha is a cross-platform desktop agent and computer use framework that enables large language models to control local operating systems through screenshots, clicks, and keystrokes. It functions as an AI coding workbench and orchestration platform, allowing for the management of multi-project workflows and the coordination of multiple agents executing complex tasks in parallel. The system includes a model backend gateway to connect various artificial intelligence providers and local models to autonomous agents. It features a centralized permission gate for authorizing sensitive commands and a side-by-side diff visualization tool for verifying automated code edits. The platform covers broad capability areas including AI-assisted software development, remote desktop control via token-based sessioning, and the extension of agent skills through a plugin-driven architecture. It also provides tools for scheduling background tasks and monitoring token usage.
This framework is specifically designed for AI-driven desktop automation, providing the necessary vision-based reasoning, GUI interaction, and OS-level input control required to enable LLMs to operate desktop applications.
OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycle of observation and command execution, the system grounds high-level natural language instructions into precise, coordinate-based actions. The project distinguishes itself by utilizing vision-based parsing to interact with software interfaces without requiring access to underlying application programming interfaces or platform-specific accessibility frameworks. It decomposes complex screenshots into structured semantic elements and maps raw pixel data to labeled interactive components. This approach enables consistent automated workflows across varying display resolutions by normalizing coordinate spaces and relying on visual recognition rather than code-level hooks. The software provides a comprehensive framework for autonomous agent development, allowing for the transformation of static interface captures into structured data representations. This capability facilitates accurate element identification and interaction for vision-based models during repetitive desktop tasks.
OmniParser is a dedicated multimodal interaction engine that enables AI agents to perform desktop automation by grounding natural language instructions into coordinate-based mouse and keyboard actions through vision-based UI parsing.
Cua is an agent benchmarking and desktop automation platform designed to evaluate autonomous agents and execute repetitive tasks within isolated, virtualized environments. It provides a framework for provisioning consistent workspaces and measuring agent performance against standardized desktop operations. The platform distinguishes itself by integrating virtual machine orchestration with headless interaction capabilities. By leveraging hypervisor-based virtualization, it runs operating systems at near-native speeds, while its automation layer injects commands directly into application processes to perform data extraction and form filling without requiring active window focus or physical input devices. The system supports the full lifecycle of agent development, from infrastructure-as-code workspace provisioning to the collection of verified interaction logs. These logs enable the benchmarking of agent decision-making accuracy and the refinement of automated workflows through deterministic execution analysis.
Cua is a platform for benchmarking and executing AI agents within virtualized desktop environments, providing the necessary infrastructure for GUI interaction and automated task performance.
Osaurus is a local AI workflow engine and LLM agent orchestration framework designed for private execution on local hardware. It functions as a desktop application automator and a voice-controlled AI interface, enabling the development of autonomous agents that can write code, execute tools, and operate a computer without keyboard or mouse input. The system is distinguished by its ability to control native desktop applications via accessibility APIs and manage web interactions through a headless browser automation tool. It supports a local-first execution model and on-premises deployment within private networks to ensure data privacy and offline functionality. The project covers a broad range of automation capabilities, including codebase task automation, vision and image processing, and the programmatic generation of spreadsheets and presentations. It also includes integrations for web search, third-party messaging, and a plugin architecture to extend agent capabilities.
Osaurus is a dedicated AI workflow engine designed for desktop automation that leverages accessibility APIs and vision-based reasoning to control GUI applications and system inputs, directly fulfilling the requirements for an AI desktop agent platform.
Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions. The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to move data between disparate software. Its broader capabilities cover hierarchical task planning, multimodal state observation, and native code execution for problem solving. The system also includes comprehensive media handling for screen capture and audio transcription, filesystem management, and interaction error recovery to refine task outcomes. The framework provides a command-line interface for executing standalone automation scripts without a separate build step.
Agent-S is a comprehensive framework specifically built for AI-driven desktop automation, providing the necessary vision-based reasoning, GUI element localization, and low-level input control required to operate desktop applications.
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots and translate goals into precise UI actions. The platform includes a comprehensive suite of orchestration tools for managing asynchronous task lifecycles, programmatic desktop control via REST, and real-time state streaming via WebSockets. It supports hybrid control modes, allowing users to monitor agent execution through a browser-based viewer and intervene manually when necessary. Deployment is supported through Docker Compose, Helm charts for Kubernetes orchestration, and one-click cloud templates for private infrastructure hosting.
Bytebot is a dedicated framework for AI desktop automation that provides containerized environments, vision-based reasoning, and direct mouse and keyboard control, covering the full scope of this category.
Openwork is an LLM agent orchestration platform and cross-platform desktop application designed for building and running automated workflows. It serves as a local AI agent host and session manager, allowing users to connect local project folders to various large language models and remote cloud workers. The project distinguishes itself through a local-first execution model that enables agents to process files directly on a host machine. It implements human-in-the-loop permissioning to intercept agent resource requests, requiring explicit user approval before accessing specific local system files or directories. Additionally, it uses a plugin-based skills interface to extend agent capabilities and supports template-based workflow persistence for saving and sharing repeatable prompt sequences. The platform includes capabilities for agentic task monitoring through execution plan visualization and action auditing. It provides a hybrid processing model that links a local interface to remote cloud workers and utilizes server-sent events for real-time progress updates and permission requests. The application supports multi-language interface localization for a global user base.
This platform provides a cross-platform desktop environment for hosting AI agents with plugin-based skill extensions and local file access, though it focuses more on workflow orchestration and file processing than direct GUI-based mouse and keyboard interaction.
Openwork is an AI agent for desktop automation that uses large language models to execute browser tasks, manage local files, and automate desktop workflows. It operates on a local-first execution model, translating natural language prompts into sequences of tool calls to perform digital chores. The system functions as a framework for defining and saving repeatable sequences of actions as reusable skills. It integrates large language models with third-party services and local APIs to synchronize data and share files. The agent includes capabilities for headless browser automation to conduct research and complete online forms, as well as tools for sorting, renaming, and organizing local disk files based on content rules. To maintain data privacy, the system uses directory-based permission scoping to restrict file system access to a predefined list of allowed folders. Users configure the intelligence of the agent by connecting to model providers via API keys or local hosting.
Openwork is a desktop automation framework that uses LLMs to execute workflows and manage local files, though it focuses more on tool-based task execution than direct vision-based GUI interaction.
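The directory-based permission scoping mentioned above, where file access is restricted to a predefined list of allowed folders, might look roughly like the sketch below. The function name `is_path_allowed` is hypothetical and this is not Openwork's implementation. The point of the example is that paths should be resolved before comparison, so that `..` segments and look-alike folder names cannot escape the allowed roots.

```python
from pathlib import Path


def is_path_allowed(candidate, allowed_dirs):
    """Return True only if `candidate` resolves to a location inside an allowed folder."""
    resolved = Path(candidate).resolve()  # collapses ".." and follows symlinks
    for allowed in allowed_dirs:
        root = Path(allowed).resolve()
        # Compare whole path components, not string prefixes, so that
        # "/tmp/workspace" is not mistaken for a child of "/tmp/work".
        if resolved == root or root in resolved.parents:
            return True
    return False
```

An agent runtime would call a check like this before every file tool invocation and refuse the call when it returns `False`.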
MobileAgent is an LLM-powered mobile automation agent and framework designed to navigate mobile user interfaces and execute multi-step tasks. It functions as a device interface automation system that maps semantic commands to screen coordinates to perform input events across mobile operating systems. The project operates as a cross-app workflow orchestrator, switching between native on-screen interface actions and external API tools to complete sophisticated operations. It includes a visual grounding system that analyzes screenshots and interface metadata to identify elements and validate the success of actions through a feedback loop. As a long-horizon task planner, the agent decomposes complex high-level goals into sequential executable steps. This process is supported by hierarchical state tracking and memory to maintain progress across multi-step automation workflows.
This project is a specialized automation agent for mobile operating systems rather than a desktop-focused framework, placing it in a neighbouring category that does not cover desktop GUI interaction.
Accomplish is an artificial intelligence action framework and desktop automation agent designed to execute productivity tasks through natural language prompts. It functions as a workflow orchestrator that manages connections between various cloud and local language model providers to perform cross-platform operations. The system distinguishes itself through the ability to define and save stateful, reusable custom skills for recurring workflows. It integrates local application programming interfaces with third-party services to synchronize data and manage information across different platforms. The platform covers a broad range of automation capabilities, including browser-based research and form filling, local file system analysis and management, and the generation of professional documents and reports. These actions are coordinated through a provider-agnostic model gateway that abstracts different language model integrations.
This framework provides the necessary orchestration and skill-based architecture to build AI agents capable of executing cross-platform desktop and browser-based productivity tasks.
This project is a framework for managing generative AI services through a unified provider interface and adapter layer. It provides a standardized API for calling multiple cloud-based and locally hosted models, translating provider-specific parameters and responses into a uniform format. The system includes an agent orchestrator designed for long-running tasks, featuring state persistence for resuming runs and execution tracing to monitor decision-making processes. It integrates the Model Context Protocol to connect models to external servers and filesystems and employs a policy-based execution system with approval lists to control tool calling. Additional capabilities cover automated tool execution through schema generation, local desktop automation, and speech-to-text transcription. The project also provides a conversational coding interface for file editing and shell command execution, as well as specialized subagents for read-only code review.
This framework provides a unified interface for AI agents that includes specific capabilities for local desktop automation, tool execution, and LLM integration, making it a suitable platform for building desktop-based agents.
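The policy-based execution system with approval lists described above can be illustrated with a small sketch. All names here (`Decision`, `ToolPolicy`, `call_tool`, and the `confirm` callback) are hypothetical, and the three-way allow, ask, deny split is an assumption about how such a policy could be organised, not a description of this project's API.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


class ToolPolicy:
    def __init__(self, auto_approved, blocked=()):
        self.auto_approved = set(auto_approved)
        self.blocked = set(blocked)

    def decide(self, tool_name):
        # The block list wins over the approval list; unknown tools need confirmation.
        if tool_name in self.blocked:
            return Decision.DENY
        if tool_name in self.auto_approved:
            return Decision.ALLOW
        return Decision.ASK


def call_tool(policy, tools, name, args, confirm):
    """Run a tool only if the policy allows it or the user confirms it."""
    decision = policy.decide(name)
    if decision is Decision.DENY or (
        decision is Decision.ASK and not confirm(name, args)
    ):
        raise PermissionError(f"tool call not approved: {name}")
    return tools[name](**args)
```

Defaulting unlisted tools to a confirmation prompt keeps the approval list short while still stopping unreviewed calls from running silently.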
AIOS is an LLM agent operating system and orchestration kernel designed to manage memory, resource scheduling, and tool execution for multiple autonomous AI agents. It serves as a comprehensive framework for developing and deploying agents, featuring a dedicated resource manager that coordinates model backends, GPU memory, and isolated kernel instances. The system distinguishes itself through a semantic memory engine that uses vector search and autonomous clustering for long-term knowledge management, and a semantic file system that allows users to control computer files and system operations via natural language. It also implements a virtualization layer for multi-kernel scheduling and provides a compatibility layer to run agents developed in third-party frameworks. Broad capabilities include a unified model provider interface for routing requests across cloud and local backends, a tool orchestrator for executing external functions with structured JSON output, and secure virtual machine sandboxing for system interactions. The project also provides mechanisms for agent and tool distribution through remote hubs and a command-line interface for local testing and management.
AIOS is an agent orchestration framework that provides the necessary kernel and tool execution environment to manage system-level operations, though it focuses more on resource scheduling and file system management than direct GUI-based mouse and keyboard automation.
KeymouseGo is an input automation tool and macro recorder designed to capture, edit, and replay keyboard and mouse sequences to automate repetitive desktop tasks. It functions as a scriptable input automator that translates recorded user interactions into reusable blueprints for automated playback. The system distinguishes itself through a logic-based scripting framework that supports conditional branching, sub-routine calls, and jump-to-labels for complex workflow control. It further extends runtime behavior via a plugin system that allows for the registration of custom functions to modify timing and event parameters during execution. The tool provides a command line interface for launching automation scripts with configurable repetition and loop settings. It also includes a system for triggering the start or immediate termination of active scripts using designated keyboard shortcuts.
This is a macro recorder and input automation tool for replaying recorded sequences, but it lacks the LLM integration and vision-based reasoning required to function as an AI-driven desktop agent.
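To make the idea of a scripted macro with labels, jumps, and conditional branching concrete, here is a minimal interpreter sketch. The step format (dictionaries with an `op` key) and the operation names `label`, `event`, `jump`, and `jump_if_count_below` are invented for illustration and are not KeymouseGo's script format. Sub-routine calls and plugins are left out.

```python
def run_script(steps, emit, max_ops=1000):
    """Replay a list of steps, following labels and jumps; `emit` receives each input event."""
    labels = {s["name"]: i for i, s in enumerate(steps) if s["op"] == "label"}
    counters = {}
    pc = 0       # index of the step being executed
    ops = 0      # safety budget against scripts that never terminate
    while pc < len(steps):
        ops += 1
        if ops > max_ops:
            raise RuntimeError("script exceeded max_ops; possible infinite loop")
        step = steps[pc]
        op = step["op"]
        if op == "event":
            emit(step["input"])
        elif op == "jump":
            pc = labels[step["to"]]
            continue
        elif op == "jump_if_count_below":
            # Count visits to this step and jump back while the count is below the limit.
            n = counters.get(step["counter"], 0) + 1
            counters[step["counter"]] = n
            if n < step["limit"]:
                pc = labels[step["to"]]
                continue
        elif op != "label":
            raise ValueError(f"unknown op: {op}")
        pc += 1


if __name__ == "__main__":
    script = [
        {"op": "label", "name": "loop"},
        {"op": "event", "input": "click"},
        {"op": "jump_if_count_below", "counter": "c", "limit": 3, "to": "loop"},
        {"op": "event", "input": "enter"},
    ]
    run_script(script, print)  # prints click three times, then enter
```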