30 open-source projects similar to othersideai/self-operating-computer, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Self Operating Computer alternative.
Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions. The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to mov
UI-TARS is an LLM GUI automation framework and multimodal action grounding system. It functions as a GUI agent orchestrator and cross-platform device controller that uses large language models to interpret graphical interfaces and execute actions across desktop and mobile operating systems. The system translates model-generated coordinates into precise screen positions to interact with visual user interface elements. It employs a multimodal approach to interpret screen layouts and decomposes complex goals into multi-step trajectories through reasoning and error correction. The project provid
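The coordinate translation mentioned in the UI-TARS entry can be illustrated with a minimal sketch. It assumes the model emits coordinates on a fixed 0 to 1000 grid that is independent of the real screen size; that range, the function name `to_screen_pixels`, and its parameters are hypothetical choices for illustration and are not taken from the UI-TARS code base.

```python
def to_screen_pixels(norm_x, norm_y, screen_width, screen_height, scale=1000):
    """Map model coordinates in [0, scale] to integer pixel positions.

    The 0..scale grid is an assumption made for this sketch, not a
    documented property of any particular model.
    """
    if not (0 <= norm_x <= scale and 0 <= norm_y <= scale):
        raise ValueError("coordinate outside the assumed model output range")
    # Clamp to the last valid pixel index so that norm == scale stays on screen.
    px = min(int(norm_x / scale * screen_width), screen_width - 1)
    py = min(int(norm_y / scale * screen_height), screen_height - 1)
    return px, py


# Example: the centre of the grid maps to the centre of a 1920x1080 display.
center = to_screen_pixels(500, 500, 1920, 1080)
```

Keeping the model's output resolution-independent means the same prediction can be replayed on displays of different sizes; only the final scaling step changes.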
UI-TARS-desktop is a cross-platform desktop application designed to automate software interface interactions. It functions as a local agent environment that interprets graphical user interfaces through multimodal visual-language model reasoning, allowing it to navigate and manipulate software by simulating human-like mouse and keyboard inputs. The platform distinguishes itself by executing all visual recognition and decision-making logic directly on the host machine. This local-inference approach ensures that screen data and sensitive information remain private, as no processing is offloaded to
Open Interpreter is an autonomous agent runtime that translates natural language instructions into executable code to interact with local software and operating systems. It functions as an orchestration framework that connects language models to a secure execution environment, enabling the development of agents capable of managing system resources and performing complex tasks. To ensure safety, the system mandates explicit user verification before executing any generated code and provides robust isolation through containerized sandboxing. The project distinguishes itself through its deep inte
OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycle of observation and command execution, the system grounds high-level natural language instructions into precise, coordinate-based actions. The project distinguishes itself by utilizing vision-based parsing to interact with software interfaces without requiring access to underlying application progr
AzurLaneAutoScript is a mobile game automation system designed to perform repetitive gameplay tasks unattended. It functions as a screenshot-driven bot that controls Android devices, emulators, and cloud phones via ADB and uiautomator2, using computer vision to make interaction decisions instead of fixed timers. The project distinguishes itself through an advanced computer vision suite that includes local optical character recognition and perspective-aware grid detection. These tools allow the bot to parse 3D game maps, compute vanishing points, and normalize grid-centered objects for precise
MobileAgent is an LLM-powered mobile automation agent and framework designed to navigate mobile user interfaces and execute multi-step tasks. It functions as a device interface automation system that maps semantic commands to screen coordinates to perform input events across mobile operating systems. The project operates as a cross-app workflow orchestrator, switching between native on-screen interface actions and external API tools to complete sophisticated operations. It includes a visual grounding system that analyzes screenshots and interface metadata to identify elements and validate the
PandaOCR is a desktop application for extracting text from images and screen captures using optical character recognition. It functions as a mathematical formula digitizer, a table data extractor, a multilingual translation utility, and a text-to-speech interface. The project distinguishes itself through specialized recognition routing that distributes data across different providers based on whether the content is standard text, tables, or formulas. It provides real-time software interface localization by rendering translated text layers directly over active application windows using coordin
PyAutoGUI is a Python GUI automation library and desktop automation framework. It provides a set of tools for programmatically controlling the mouse and keyboard to automate user interface interactions across different operating systems. The project functions as a cross-platform input simulator and computer vision screen scanner. It enables the simulation of keystrokes and cursor movements to perform repetitive tasks and utilizes screen analysis to locate specific images or pixel colors on the display. Its capability surface includes mouse and keyboard input simulation, screen image capture,
Pipecat is a framework and software development kit for building real-time multimodal AI agents and speech-to-speech systems. It utilizes a frame-based data pipeline to route audio, video, and text through a modular sequence of processors, enabling the orchestration of low-latency conversational AI. The project is distinguished by its ability to coordinate complex multimodal services, including speech-to-text, language models, and text-to-speech, within a single pipeline. It features semantic voice activity detection for natural turn-taking, state-machine conversation flows for dialogue manag
Qwen2.5-VL is an autoregressive multimodal transformer designed to process interleaved sequences of text and visual tokens. It integrates visual feature embeddings into a shared language model space to perform cross-modal reasoning and generate coherent responses or structured layout code. The project distinguishes itself through vision-language-action mapping, allowing it to perceive visual interfaces and translate that perception into actionable commands for operating digital screens and robotic hardware. It employs dynamic-resolution image encoding and temporal-frame video indexing to hand
Robotgo is a cross-platform desktop automation framework for the Go programming language. It provides a comprehensive toolkit for programmatically interacting with graphical user interfaces, enabling developers to simulate human input, manage application windows, and monitor system-wide hardware events. The library distinguishes itself through its low-level system integration, utilizing a foreign function interface to interact directly with native operating system APIs. It employs pixel-buffer memory mapping and real-time screen capture to perform visual element identification, allowing for i
RPA-Python is a robotic process automation framework for automating repetitive tasks across web browsers, desktop applications, and operating systems using Python scripts. It functions as a desktop process automator and browser automation tool designed to reduce manual labor and human error in digital workflows. The project includes an OCR screen data extractor for capturing snapshots and extracting text from images via optical character recognition. It also provides a system command wrapper for executing shell commands and managing local file operations, such as downloading files from URLs a
Better Genshin Impact is a computer vision-based automation framework designed to perform repetitive tasks and combat sequences within game environments. It functions as a macro scripting engine that utilizes synthetic input injection to simulate human interaction with the operating system, allowing for hands-free execution of complex gameplay loops. The system distinguishes itself through a combination of template-matching visual recognition and state-machine logic, which enables the software to identify on-screen game elements and transition between operational states in real time. By mappi
LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections. The platform distinguishes itself through it
Easydict is a macOS dictionary and translator application that integrates system dictionaries, external translation services, and large language model services such as OpenAI and Gemini. It functions as an OCR text extractor and a text-to-speech reader, allowing users to look up words and translate text directly on the desktop. The application features a local OCR engine that captures screen areas to recognize and translate text that cannot be highlighted or copied. It utilizes a provider-agnostic translation pipeline and adapter-based service integration to standardize responses from various cloud a
Superagent is a framework for AI assistant orchestration and agent security. It provides the tools to build intelligent assistants that integrate external APIs and maintain conversation memory to automate complex tasks. The project focuses on AI agent security through adversarial testing, red teaming, and the detection of prompt injections and malicious tool calls. It includes automated vulnerability patching, which scans codebases and configurations for security flaws and generates pull requests with fixes. The platform supports retrieval augmented generation by connecting language models t
Visual-ChatGPT is a visual orchestration framework and multimodal AI pipeline designed to coordinate large language models with visual foundation models. It functions as an integration layer that enables the exchange of text and images between different AI models to automate image analysis and editing tasks without requiring additional model training. The system differentiates itself through model-chain orchestration and prompt-based task dispatching, allowing natural language instructions to trigger specific vision models or tools. It utilizes coordinate-based region mapping and iterative ma
gptme is an autonomous AI agent server and framework designed for local system automation, software development, and code execution. It operates as a local execution engine that enables language models to run shell commands, modify local files, and interact with the operating system. The project functions as a Model Context Protocol client, integrating with external servers to expand agent capabilities with standardized tools and data sources. It features a provider-agnostic routing system to orchestrate tasks across multiple proprietary cloud APIs and local AI backends. The system includes
Serve is a multimodal AI orchestrator and inference server designed for deploying and scaling machine learning models as cloud-native services. It functions as a containerized workflow engine and distributed service mesh that routes multimodal data through connected execution units. The framework provides specialized capabilities for large language models, including a token streaming gateway that delivers generated text incrementally to reduce perceived latency. It distinguishes itself by enabling the chaining of executors into complex data processing pipelines and the orchestration of these
OM1 is a multimodal AI agent runtime and orchestration framework designed to connect large language models to physical robot hardware and sensors. It provides an execution environment that processes audio, video, and sensor data to drive autonomous decisions and actions in real-world settings. The system integrates a robotics SLAM and navigation stack with a hardware abstraction layer, allowing high-level AI commands to be translated into low-level motor and actuator instructions. It distinguishes itself by incorporating blockchain-based governance to enforce immutable operational rules and p
ml-ferret is a multimodal large language model framework and visual reasoning engine designed to reason about images and user interfaces. It functions as a UI grounding model and referring expression comprehension tool that maps natural language descriptions to precise pixel coordinates. The system focuses on high-resolution image analysis to identify and locate specific interface components. It employs multi-resolution image processing and region-aware visual encoding to preserve detail across different aspect ratios, enabling the model to analyze spatial relationships and functional layouts
AstrBot is an orchestration framework designed for building and managing autonomous agents that integrate multimodal artificial intelligence with secure, isolated execution environments. It serves as a platform for coordinating complex agentic workflows, allowing users to connect diverse language, speech, and vision models while maintaining personalized agent personas and domain-specific knowledge bases. The platform distinguishes itself through a modular plugin architecture and a centralized visual dashboard, which together enable users to extend agent capabilities and manage operational set
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
LunaTranslator is a real-time translation tool designed for visual novels and games. It functions as a multi-engine translation hub and text extractor that captures dialogue via memory hooking or optical character recognition to convert it into a target language. The project distinguishes itself through specialized linguistic tools, including a Japanese text analyzer for sentence segmentation and phonetic readings. It also operates as a digital dictionary aggregator, querying multiple online and offline databases simultaneously to provide comprehensive vocabulary definitions for language lear
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots
Narrator is an artificial intelligence system that converts real-time video feeds into natural language audio descriptions. It functions as a multimodal vision narrator and scene descriptor, using computer vision to transform environmental data from a camera into synthetic speech. The tool operates as a pipeline that captures periodic images from a feed and uses a multimodal large language model to analyze visual events. These analyses are then converted via text-to-speech synthesis into a voiceover that describes real-world activities and surroundings. The system supports automated environm
gpt4all-ui is a web-based user interface designed for local large language model execution and management. It provides a local execution environment that runs AI models on a user's own hardware to ensure data privacy and eliminate external telemetry. The project features a peer-to-peer inference distribution system that shares computational loads across multiple network nodes to increase processing speed. It includes a multimodal orchestrator that combines text, image, video, and audio models into a single interface, as well as a layered autonomy model for organizing specialized AI agents int
MisakaTranslator is a real-time game translation tool designed to extract text from games and manga and provide machine translations via external engines. It functions as a text extractor using both memory hooking to retrieve raw text directly from running processes and optical character recognition to convert images of in-game text into editable strings. The tool includes a speech synthesizer to read translated dialogue and sentences aloud. To maintain accuracy, it utilizes a custom translation dictionary to manage specialized word lists and manual phrase mappings for character names and loc
This project is a Telegram bot that integrates large language models, such as OpenAI and Claude, to provide an AI chat interface within the messaging app. It functions as a multi-model AI gateway that routes prompts to various providers via API keys and YAML configurations. The implementation includes a provider-agnostic routing system and response streaming to deliver text word-by-word. It distinguishes itself with a token-based cost tracker that calculates the monetary expenditure of API requests and a whitelist-based access control system to restrict usage to authorized users. The bot sup
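The token-based cost tracking described in the Telegram bot entry can be sketched roughly as follows. Everything here is an assumption made for illustration: the `CostTracker` class, the model names, and the per-1,000-token prices are hypothetical and do not come from the project or from any provider's price list.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens; real provider pricing differs.
PRICES_PER_1K = {
    "model-a": {"prompt": 0.50, "completion": 1.50},
    "model-b": {"prompt": 3.00, "completion": 15.00},
}


@dataclass
class CostTracker:
    """Accumulates estimated API spend per user (illustrative only)."""

    totals: dict = field(default_factory=dict)

    def record(self, user_id, model, prompt_tokens, completion_tokens):
        """Add one request's estimated cost to the user's total and return it."""
        price = PRICES_PER_1K[model]
        cost = (
            prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]
        ) / 1000
        self.totals[user_id] = self.totals.get(user_id, 0.0) + cost
        return cost


tracker = CostTracker()
first = tracker.record("alice", "model-a", prompt_tokens=1000, completion_tokens=2000)
second = tracker.record("alice", "model-b", prompt_tokens=500, completion_tokens=100)
```

Prompt and completion tokens are priced separately because providers commonly charge different rates for input and output; a real bot would read the token counts from each API response rather than pass them in by hand.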