This project is a computer control framework that uses multimodal vision models to drive simulated mouse and keyboard input for automating desktop tasks. It acts as an autonomous agent and vision-based orchestrator that interprets the screen to interact with user interfaces. The system uses vision language models and object detection to locate and click interface elements, visual grounding to overlay numerical markers on UI components, and optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing
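The marker overlay and text-to-coordinate mapping described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `TextBox` structure, the `number_boxes` and `locate_text` helpers, and the sample boxes are all hypothetical, and the boxes are assumed to come from some OCR or detection step that reports a pixel rectangle per element.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextBox:
    """Hypothetical OCR result: recognized text plus its pixel rectangle."""
    text: str
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        # Clicking the center of the box is a common, simple choice.
        return (self.left + self.width // 2, self.top + self.height // 2)


def number_boxes(boxes: list[TextBox]) -> dict[int, TextBox]:
    """Assign 1-based numeric markers in reading order (top, then left)."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {marker: box for marker, box in enumerate(ordered, start=1)}


def locate_text(boxes: list[TextBox], query: str) -> Optional[tuple[int, int]]:
    """Return the click point for the first box whose text matches the query."""
    wanted = query.strip().lower()
    for box in boxes:
        if box.text.strip().lower() == wanted:
            return box.center()
    return None


# Made-up sample data standing in for real OCR output.
sample = [
    TextBox("Cancel", 300, 400, 80, 30),
    TextBox("File", 10, 5, 40, 20),
    TextBox("OK", 200, 400, 60, 30),
]

markers = number_boxes(sample)
click_point = locate_text(sample, "ok")
```

In a setup like this, a model could refer to an element either by its marker number (looked up in `markers`) or by its visible text (resolved with `locate_text`), and the resulting point would then be passed to whatever input-simulation layer is in use.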
Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions. The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to mov
Jaaz is a self-hosted AI design suite and multimodal workspace for generating and editing images and videos. Users produce visual content and assets through a combination of local and cloud-based AI models. The project features a hybrid model orchestrator that routes requests between local model runners and remote APIs to balance data privacy with processing performance. It provides a collaborative infinite canvas for organizing storyboards and assets, and includes an image prompt optimizer to translate rough ideas into detailed generati
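One way to picture the routing decision between local runners and remote APIs is the sketch below. It is not Jaaz's implementation: the `Request` type, the `route` function, the `LOCAL_TASKS` set, and the policy itself (private requests stay local, everything else goes remote) are assumptions chosen only to illustrate the privacy versus performance trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical generation request."""
    task: str      # e.g. "image" or "video"
    private: bool  # True if the input must not leave the machine


# Assumption: the set of tasks a local model runner can handle.
LOCAL_TASKS = frozenset({"image"})


def route(req: Request, local_tasks: frozenset = LOCAL_TASKS) -> str:
    """Pick a backend under one simple, assumed policy.

    Private requests must run locally; if no local runner supports the
    task, the request is rejected rather than sent off-machine.
    Non-private requests go to the remote API for performance.
    """
    if req.private:
        if req.task not in local_tasks:
            raise ValueError(f"no local runner for private task: {req.task}")
        return "local"
    return "remote"


private_image = route(Request(task="image", private=True))
public_video = route(Request(task="video", private=False))
```

A real orchestrator would likely weigh more signals, such as model availability, hardware limits, or cost, but the same shape applies: classify the request, then choose a backend.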
The mission of JARVIS is to explore artificial general intelligence (AGI) and deliver cutting-edge research to the whole community.