This project is a browser automation system that connects Google's Gemini API to a web browser, enabling an AI agent to perform tasks on a user's behalf by interpreting natural language instructions. At its core, it operates through a continuous screenshot-based action loop, where the agent captures the browser's current state, sends the image to the Gemini model, and executes the model's returned commands to click, type, and navigate.
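The loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `Action` shape, the `Browser` protocol, and the `ask_model` callback are all hypothetical names, and the model call is left abstract so that any Gemini client could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Action:
    """One command returned by the model (hypothetical shape)."""
    kind: str          # "click", "type", "navigate", or "done"
    x: int = 0
    y: int = 0
    text: str = ""
    url: str = ""


class Browser(Protocol):
    """Minimal browser surface the loop needs (hypothetical interface)."""
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...
    def navigate(self, url: str) -> None: ...


def run_agent(
    browser: Browser,
    ask_model: Callable[[str, bytes], Action],
    task: str,
    max_steps: int = 20,
) -> int:
    """Run the screenshot/act loop and return the number of actions executed."""
    executed = 0
    for _ in range(max_steps):
        image = browser.screenshot()      # capture the current browser state
        action = ask_model(task, image)   # send task and image to the model
        if action.kind == "done":         # the model signals completion
            break
        if action.kind == "click":
            browser.click(action.x, action.y)
        elif action.kind == "type":
            browser.type_text(action.text)
        elif action.kind == "navigate":
            browser.navigate(action.url)
        else:
            raise ValueError(f"unsupported action: {action.kind!r}")
        executed += 1
    return executed
```

The `max_steps` cap is an assumption added here so that a model that never reports completion cannot run indefinitely.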
A browser backend abstraction supports both local Playwright-controlled browsers and remote Browserbase cloud instances, and the backend can be switched at runtime. A Vertex AI routing switch directs model inference requests to either the public Gemini API or Vertex AI endpoints. A mouse cursor overlay is injected into the page so that the cursor position is visible in the screenshots sent to the model, which helps with debugging and with tracking agent actions.
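One way to picture the backend abstraction is a shared base class with a small factory that picks an implementation by name. The sketch below is an assumption about the shape of such a design, not the project's real classes: `BrowserBackend`, `LocalPlaywrightBackend`, `BrowserbaseBackend`, and `make_backend` are hypothetical, and the `start` methods are placeholders that stand in for launching a real browser or cloud session.

```python
from abc import ABC, abstractmethod


class BrowserBackend(ABC):
    """Common interface every backend implements (hypothetical)."""
    name: str

    @abstractmethod
    def start(self, url: str) -> str:
        """Open the startup URL and return a short status string."""


class LocalPlaywrightBackend(BrowserBackend):
    name = "playwright"

    def start(self, url: str) -> str:
        # Placeholder: a real backend would launch a local browser here.
        return f"local browser opened {url}"


class BrowserbaseBackend(BrowserBackend):
    name = "browserbase"

    def start(self, url: str) -> str:
        # Placeholder: a real backend would connect to a cloud session here.
        return f"cloud session opened {url}"


# Registry keyed by backend name, so adding a backend is a one-line change.
_BACKENDS = {cls.name: cls for cls in (LocalPlaywrightBackend, BrowserbaseBackend)}


def make_backend(name: str) -> BrowserBackend:
    """Return a backend instance for the given (case-insensitive) name."""
    key = name.strip().lower()
    if key not in _BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {sorted(_BACKENDS)}")
    return _BACKENDS[key]()
```

Because callers depend only on `BrowserBackend`, the agent loop does not need to know which backend it is driving.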
The project provides separate agent implementations, including a Playwright browser agent, a Browserbase cloud browser agent, and a Vertex AI browser agent, all driven by the same natural language interface. Configuration is managed through environment variables and a .env file, with runtime settings for browser backend selection, headless mode, cloud region, model version, and startup URL. The system also includes a workaround for dropdown menus that are rendered by the operating system and that Playwright cannot natively capture.
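The configuration settings listed above could be collected into a single typed object along the following lines. The environment variable names (`BROWSER_BACKEND`, `HEADLESS`, `CLOUD_REGION`, `MODEL_NAME`, `START_URL`) and the defaults are assumptions made for illustration; the project's real names may differ. Loading the .env file itself is out of scope here and would typically be handled by a library such as python-dotenv before this function runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    backend: str      # which browser backend to use
    headless: bool    # run the browser without a visible window
    region: str       # cloud region for the remote backend
    model: str        # model version identifier
    start_url: str    # page opened when the agent starts


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        backend=env.get("BROWSER_BACKEND", "playwright"),  # assumed default
        headless=_as_bool(env.get("HEADLESS", "false")),
        region=env.get("CLOUD_REGION", ""),
        model=env.get("MODEL_NAME", ""),                   # no default model assumed
        start_url=env.get("START_URL", "about:blank"),
    )
```

Passing the mapping explicitly, as in `load_settings({"HEADLESS": "true"})`, keeps the function easy to test without touching the real environment.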