1 repo

Vision-Language Grounding Models

Models that map natural language instructions to specific spatial coordinates on a visual interface.

Distinguishing note: Specifically addresses the grounding of language into spatial bounding boxes.
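To make "grounding language into spatial bounding boxes" concrete, here is a minimal Python sketch of the final step: turning a chosen bounding box into a pixel coordinate that an agent could click. Everything in it is an assumption made for illustration. The `BoundingBox` type, the normalized-coordinate convention, and the word-overlap matcher `ground_instruction` are hypothetical stand-ins, not the API or method of any listed repository; a real grounding model would replace the toy matcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical box with coordinates normalized to [0, 1] of the screenshot."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_to_click_point(box: BoundingBox, width: int, height: int) -> tuple[int, int]:
    """Return the pixel coordinates of the box center for a width x height screen."""
    center_x = (box.x_min + box.x_max) / 2 * width
    center_y = (box.y_min + box.y_max) / 2 * height
    return round(center_x), round(center_y)


def ground_instruction(instruction: str, labeled_boxes: dict[str, BoundingBox]) -> str:
    """Toy grounding step (an assumption, not a real model): pick the label
    that shares the most words with the instruction."""
    words = set(instruction.lower().split())
    best_label, best_score = None, 0
    for label in labeled_boxes:
        score = len(set(label.lower().split()) & words)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None:
        raise ValueError("no labeled element matches the instruction")
    return best_label


# Hypothetical detected interface elements on a 1920x1080 screenshot.
elements = {
    "submit button": BoundingBox(0.25, 0.5, 0.75, 0.75),
    "cancel button": BoundingBox(0.8, 0.9, 0.95, 0.98),
}
target = ground_instruction("Click the Submit button", elements)
click_point = box_to_click_point(elements[target], 1920, 1080)
```

The box center is used as the click target because it is the point least likely to miss the element when the detected box is slightly off; other choices are equally possible.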

1 GitHub repository in Artificial Intelligence & ML · Vision-Language Grounding Models.

microsoft/OmniParser
24,377 · View on GitHub
OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycle of observation and command execution, the system grounds high-level natural language instructions into precise, coordinate-based actions. The project distinguishes itself by utilizing vision-based parsing to interact with software interfaces without requiring access to underlying application progr
Maps natural language instructions to specific coordinate-based bounding boxes on a visual interface.
Jupyter Notebook