3 Repos
Tools for identifying UI components through visual analysis and coordinate mapping.
Distinct from Interactive UI Elements: focuses on the localization mechanism for visual element identification rather than the elements themselves.
Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions. The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to mov
Identifies physical pixel coordinates of visual concepts within screenshots using grounding models.
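As a rough illustration of the coordinate-mapping step described above, the sketch below converts a normalized bounding box into a click point. It is not taken from Agent-S: the `NormalizedBox` type, the `box_center_to_click_point` function, and the `scale` parameter (physical pixels per logical pixel, as on high-DPI displays) are hypothetical names, and it assumes the grounding model reports boxes as fractions of the screenshot size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedBox:
    """Hypothetical grounding output: box edges as fractions in [0, 1]."""
    left: float
    top: float
    right: float
    bottom: float


def box_center_to_click_point(box, width, height, scale=1.0):
    """Map the box center to logical screen coordinates.

    width, height: screenshot size in physical pixels.
    scale: physical pixels per logical pixel (assumed; 2.0 on many high-DPI screens).
    """
    # Center of the box in physical pixels.
    cx = (box.left + box.right) / 2 * width
    cy = (box.top + box.bottom) / 2 * height
    # Convert to logical coordinates, which input-simulation APIs often expect.
    x = int(round(cx / scale))
    y = int(round(cy / scale))
    # Clamp so the point always lies inside the logical screen.
    max_x = int(width / scale) - 1
    max_y = int(height / scale) - 1
    return (min(max(x, 0), max_x), min(max(y, 0), max_y))


if __name__ == "__main__":
    button = NormalizedBox(left=0.25, top=0.5, right=0.75, bottom=1.0)
    print(box_center_to_click_point(button, 1920, 1080))
    print(box_center_to_click_point(button, 1920, 1080, scale=2.0))
```

Clamping matters because a box that touches the right or bottom edge would otherwise yield a coordinate one pixel outside the screen.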
Midscene is a multimodal automation framework designed to enable AI agents to perceive, navigate, and manipulate graphical user interfaces across web, mobile, and desktop environments. By leveraging vision-capable AI models, the platform interprets interface screenshots to execute tasks based on natural language instructions, removing the reliance on traditional, brittle code-based selectors. The framework distinguishes itself through its ability to decompose high-level goals into autonomous, multi-step sequences that function consistently across diverse platforms. It provides a visual ground
Identifies UI components through visual analysis and coordinate mapping to ensure consistent interaction.
This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing
Identifies and localizes UI components through visual analysis and coordinate mapping.
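The numbered-marker and text-to-coordinate ideas mentioned in the description above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: `Detection`, `assign_markers`, and `find_marker_by_text` are hypothetical names, and the detections are hand-written stand-ins for what an object detector or OCR engine might return.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector/OCR result: recognized text and a pixel box."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        # Pixel at the middle of the box, a common click target.
        return (self.left + self.width // 2, self.top + self.height // 2)


def assign_markers(detections: List[Detection]) -> Dict[int, Detection]:
    """Number elements in reading order (top to bottom, then left to right)."""
    ordered = sorted(detections, key=lambda d: (d.top, d.left))
    return {number: d for number, d in enumerate(ordered, start=1)}


def find_marker_by_text(markers: Dict[int, Detection], query: str) -> Optional[int]:
    """Return the first marker whose text contains the query, ignoring case."""
    needle = query.casefold()
    for number, detection in markers.items():
        if needle in detection.text.casefold():
            return number
    return None


if __name__ == "__main__":
    markers = assign_markers([
        Detection("Cancel", 300, 400, 80, 30),
        Detection("File", 10, 5, 40, 20),
        Detection("Save", 200, 400, 80, 30),
    ])
    number = find_marker_by_text(markers, "save")
    if number is not None:
        print(number, markers[number].center())
```

A vision model can then answer with a marker number instead of raw coordinates, and the number is resolved to a pixel position through the marker table.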