# web-infra-dev/midscene

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/web-infra-dev-midscene).**

11,720 stars · 847 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/web-infra-dev/midscene
- Homepage: https://midscenejs.com
- awesome-repositories: https://awesome-repositories.com/repository/web-infra-dev-midscene.md

## Topics

`ai` `ai-test` `browser-use` `computer-use` `gpt-operator` `javascript` `phone-use` `testing`

## Description

Midscene is a multimodal automation framework designed to enable AI agents to perceive, navigate, and manipulate graphical user interfaces across web, mobile, and desktop environments. By leveraging vision-capable AI models, the platform interprets interface screenshots to execute tasks based on natural language instructions, removing the reliance on traditional, brittle code-based selectors.

The framework decomposes high-level goals into autonomous, multi-step sequences that run consistently across platforms. A visual grounding feedback loop maps natural language commands to specific screen coordinates, and interactive execution traces and visual reports let developers replay and troubleshoot the agent's decision-making process.
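To make the grounding step concrete, here is a minimal sketch of turning a bounding box that a vision model might report into a tap point. It assumes the box is given in screenshot pixels and that a device pixel ratio is known. The names `BoundingBox`, `Point`, and `boxToTapPoint` are hypothetical and are not taken from Midscene's API.

```typescript
// Hypothetical types for illustration; these names are not part of Midscene.
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Take the centre of a box measured in screenshot pixels and convert it to
// device-independent pixels by dividing by the pixel ratio.
function boxToTapPoint(box: BoundingBox, devicePixelRatio = 1): Point {
  if (box.width <= 0 || box.height <= 0 || devicePixelRatio <= 0) {
    throw new RangeError("box dimensions and pixel ratio must be positive");
  }
  return {
    x: Math.round((box.left + box.width / 2) / devicePixelRatio),
    y: Math.round((box.top + box.height / 2) / devicePixelRatio),
  };
}
```

Tapping the centre of the box, rather than a corner, is a common choice because it tolerates small errors in the reported box edges.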

Beyond core automation, the project supports structured data extraction from visual elements and integrates with existing development pipelines through native interfaces for Python and Java. It also provides command-line and tool-based exposure, allowing external AI coding assistants to trigger interface actions or inspect application states programmatically.
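As an illustration of what structured extraction involves on the calling side, the sketch below checks that a model's raw text reply really is the JSON shape the caller asked for before it is used. It is an assumption-laden example, not Midscene's implementation: the `ProductRow` shape and the `isProductRow` and `parseProductRows` helpers are hypothetical.

```typescript
// Hypothetical record shape; the field names are illustrative only.
interface ProductRow {
  title: string;
  price: number;
}

// Runtime type guard: true only for objects with a string title and a
// finite numeric price.
function isProductRow(value: unknown): value is ProductRow {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return (
    typeof row.title === "string" &&
    typeof row.price === "number" &&
    Number.isFinite(row.price)
  );
}

// Parse a model's raw text reply and accept it only if every item matches.
function parseProductRows(modelReply: string): ProductRow[] {
  const parsed: unknown = JSON.parse(modelReply);
  if (!Array.isArray(parsed) || !parsed.every(isProductRow)) {
    throw new TypeError("model reply is not an array of {title, price} rows");
  }
  return parsed;
}
```

Validating at this boundary keeps a malformed reply from propagating into test assertions or downstream pipeline steps.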

The framework includes utilities for managing application lifecycles, attaching to active browser sessions, and connecting to remote or headless environments. Execution plan caching and real-time screenshot streaming reduce latency during automated workflows.
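The idea behind execution plan caching can be sketched as a lookup keyed by the instruction text, so that a repeated instruction reuses an earlier plan instead of calling the model again. This is a simplified illustration and not Midscene's cache format: `PlannedStep` and `PlanCache` are hypothetical, and the planner is synchronous here for brevity, whereas a real model call would be asynchronous.

```typescript
// Hypothetical shapes for illustration; not Midscene's cache format.
type PlannedStep = {
  action: "tap" | "input" | "scroll";
  target: string;
  value?: string;
};

class PlanCache {
  private readonly plans = new Map<string, PlannedStep[]>();
  hits = 0;
  misses = 0;

  // Normalise whitespace and case so near-identical instructions share an entry.
  private static key(instruction: string): string {
    return instruction.trim().replace(/\s+/g, " ").toLowerCase();
  }

  // Return the cached plan if present; otherwise call the planner once and store it.
  getOrPlan(
    instruction: string,
    planner: (instruction: string) => PlannedStep[],
  ): PlannedStep[] {
    const key = PlanCache.key(instruction);
    const cached = this.plans.get(key);
    if (cached !== undefined) {
      this.hits += 1;
      return cached;
    }
    this.misses += 1;
    const plan = planner(instruction);
    this.plans.set(key, plan);
    return plan;
  }
}
```

A cache like this only helps when the interface is unchanged between runs, so a practical version would also need a way to invalidate entries when a cached plan no longer matches the screen.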

## Tags

### Artificial Intelligence & ML

- [Autonomous Web Agents](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-orchestration-multi-agent/autonomous-agents/autonomous-web-agents.md) — Enables AI agents to autonomously plan and execute multi-step sequences across graphical interfaces.
- [AI Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-model-integrations.md) — Integrates vision-capable AI models to interpret interface screenshots and execute automation tasks. ([source](https://midscenejs.com/model-common-config.html))
- [Vision-Based UI Parsers](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-based-ui-parsers.md) — Identifies UI components and interacts with interfaces using screenshot analysis instead of traditional document selectors. ([source](https://midscenejs.com/android-api-reference.html))
- [Task Planning Systems](https://awesome-repositories.com/f/artificial-intelligence-ml/task-planning-systems.md) — Decomposes high-level natural language goals into sequential atomic actions for autonomous execution.
- [Vision-Language Grounding Models](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-language-grounding-models.md) — Maps natural language instructions to specific screen coordinates using visual grounding.
- [AI Observability Tracing](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-observability-tracing.md) — Generates interactive reports that replay every action and decision step to help troubleshoot AI-driven browser tasks. ([source](https://midscenejs.com/quick-start.html))
- [Tool Exposure Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/agentic-systems-frameworks/agent-protocols-interoperability/tool-exposure-frameworks.md) — Exposes automation capabilities as standard tools for AI agents to inspect and interact with interfaces. ([source](https://midscenejs.com/mcp.html))
- [Report Generation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/report-generation-frameworks.md) — Produces detailed visual reports that document the steps taken and the AI's interpretation of the interface. ([source](https://midscenejs.com/android-getting-started.html))

### Development Tools & Productivity

- [AI Agent Development Tools](https://awesome-repositories.com/f/development-tools-productivity/ai-agent-development-tools.md) — Provides a toolkit for building and testing autonomous AI agents that perceive and manipulate graphical interfaces.
- [Visual Interface Test Runners](https://awesome-repositories.com/f/development-tools-productivity/debugging-profiling-testing/test-execution-management/automated-test-execution/visual-interface-test-runners.md) — Validates application behavior by analyzing visual interface states instead of relying on brittle document selectors.
- [Natural Language Interfaces](https://awesome-repositories.com/f/development-tools-productivity/natural-language-interfaces.md) — Enables software testing and workflow validation through natural language instructions instead of imperative scripts.
- [Automation Visualizers](https://awesome-repositories.com/f/development-tools-productivity/debugging-profiling-testing/debugging-diagnostics/debugging-inspection-tools/debugging-and-inspection-tools/automation-visualizers.md) — Provides interactive visual reports and step-by-step replays to troubleshoot AI-driven interface interactions.
- [Agentic Workflow Automations](https://awesome-repositories.com/f/development-tools-productivity/workflow-automation-tools/automation-execution-frameworks/automation-frameworks/agentic-workflow-automations.md) — Provides atomic and flow-based interfaces to script interactions, replay steps, and integrate with AI agents. ([source](https://midscenejs.com/))

### Testing & Quality Assurance

- [UI Automation](https://awesome-repositories.com/f/testing-quality-assurance/automation-interaction-tools/ui-automation.md) — Automates interactions with graphical user interfaces by interpreting visual screenshots through multimodal AI models.
- [End-to-End Testing](https://awesome-repositories.com/f/testing-quality-assurance/software-testing/e2e-integration-testing/end-to-end-testing.md) — Automates complex user workflows across web, mobile, and desktop environments to verify system-wide functional correctness.
- [Visual Assertion Validators](https://awesome-repositories.com/f/testing-quality-assurance/validation-verification/input-validation/agent-input-and-output-validators/automated-assertion-validators/visual-assertion-validators.md) — Verifies UI content and application behavior by querying visual elements and asserting outcomes through natural language prompts. ([source](https://midscenejs.com/android-getting-started.html))

### User Interface & Experience

- [Multimodal Automation Frameworks](https://awesome-repositories.com/f/user-interface-experience/cross-platform-ui-frameworks/multimodal-automation-frameworks.md) — Uses vision-capable AI models to interpret screenshots and execute cross-platform interface interactions via natural language.
- [Visual Localization Tools](https://awesome-repositories.com/f/user-interface-experience/interactive-ui-elements/visual-localization-tools.md) — Identifies UI components through visual analysis and coordinate mapping to ensure consistent interaction.
- [Automation Controllers](https://awesome-repositories.com/f/user-interface-experience/cross-platform-ui-toolkits/automation-controllers.md) — Automates mouse, keyboard, and touch inputs across diverse operating systems using declarative configuration files.
- [Touch Gesture Handlers](https://awesome-repositories.com/f/user-interface-experience/touch-gesture-handlers.md) — Performs touch, text input, and gesture commands on mobile and desktop interfaces using natural language instructions. ([source](https://midscenejs.com/ios-api-reference.html))

### Software Engineering & Architecture

- [Cross-Platform Abstraction Layers](https://awesome-repositories.com/f/software-engineering-architecture/cross-platform-abstraction-layers.md) — Provides unified interaction protocols across web, mobile, and desktop environments.
- [Workflow Orchestrators](https://awesome-repositories.com/f/software-engineering-architecture/workflow-orchestrators.md) — Manages stateful, multi-step automation sequences defined in configuration files.
- [Custom Action Handlers](https://awesome-repositories.com/f/software-engineering-architecture/custom-action-handlers.md) — Allows developers to register custom logic for user-triggered tasks within the automation framework. ([source](https://midscenejs.com/integrate-with-playwright.html))

### System Administration & Monitoring

- [Agent Execution Tracing](https://awesome-repositories.com/f/system-administration-monitoring/agent-execution-tracing.md) — Captures and visualizes end-to-end agent reasoning and tool usage for debugging.

### Data & Databases

- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Retrieves information from interfaces by converting visual elements into structured data formats. ([source](https://midscenejs.com/use-javascript-to-optimize-ai-automation-code.html))

### DevOps & Infrastructure

- [Remote Desktop Infrastructure](https://awesome-repositories.com/f/devops-infrastructure/execution-environments/remote-desktop-infrastructure.md) — Operates remote desktop environments by translating natural language commands into interactions. ([source](https://midscenejs.com/computer-getting-started.html))
