Computer Use Preview

This project is a browser automation system that connects Google's Gemini API to a web browser, enabling an AI agent to perform tasks on a user's behalf by interpreting natural language instructions. At its core, it operates through a continuous screenshot-based action loop, where the agent captures the browser's current state, sends the image to the Gemini model, and executes the model's returned commands to click, type, and navigate.
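The capture, ask, execute cycle described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `Action`, `FakeBrowser`, `run_agent`, and `scripted_model` are hypothetical names, and the scripted policy stands in for the real Gemini call so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One command returned by the model (hypothetical shape)."""
    name: str                                  # e.g. "click", "type", "navigate", "done"
    args: dict = field(default_factory=dict)


class FakeBrowser:
    """Stand-in for a real browser backend; records what it was asked to do."""

    def __init__(self) -> None:
        self.url = "about:blank"
        self.log = []

    def screenshot(self) -> bytes:
        # A real backend would return image bytes of the current page.
        return f"screenshot of {self.url}".encode()

    def execute(self, action: Action) -> None:
        if action.name == "navigate":
            self.url = action.args["url"]
        self.log.append(action.name)


def run_agent(browser, model: Callable[[str, bytes], Action],
              task: str, max_steps: int = 10) -> int:
    """Capture state, ask the model, execute; stop on "done" or at max_steps.

    Returns the number of actions executed before the loop ended.
    """
    for step in range(max_steps):
        action = model(task, browser.screenshot())
        if action.name == "done":
            return step
        browser.execute(action)
    return max_steps


def scripted_model(task: str, screenshot: bytes) -> Action:
    """Toy policy standing in for the model: navigate once, then finish."""
    if b"about:blank" in screenshot:
        return Action("navigate", {"url": "https://example.com"})
    return Action("done")
```

The `max_steps` bound is a design choice of this sketch: it keeps a confused model from looping forever on a page it cannot make progress on.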

The system distinguishes itself through a dual browser backend abstraction, supporting both local Playwright-controlled browsers and remote Browserbase cloud instances, with the ability to switch between them at runtime. It also offers a Vertex AI routing switch, allowing model inference requests to be directed to either the public Gemini API or Vertex AI endpoints. A mouse cursor overlay injection feature visually marks the cursor position on screenshots sent to the model, aiding in debugging and tracking agent actions.
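One way to picture the dual backend abstraction is a shared interface with a small factory that picks the implementation at runtime. The sketch below is an assumption about the shape of such a design: `BrowserBackend`, `LocalPlaywrightBackend`, `RemoteBrowserbaseBackend`, and `make_backend` are hypothetical names, and the method bodies are placeholders rather than real Playwright or Browserbase calls.

```python
from typing import Protocol


class BrowserBackend(Protocol):
    """Operations the agent loop needs, regardless of where the browser runs."""

    name: str

    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...


class LocalPlaywrightBackend:
    """Placeholder for a locally launched, Playwright-controlled browser."""

    name = "playwright"

    def screenshot(self) -> bytes:
        return b"local-screenshot"   # a real backend would return image bytes

    def click(self, x: int, y: int) -> None:
        pass                         # a real backend would forward this to the page


class RemoteBrowserbaseBackend:
    """Placeholder for a remote Browserbase cloud browser session."""

    name = "browserbase"

    def screenshot(self) -> bytes:
        return b"remote-screenshot"

    def click(self, x: int, y: int) -> None:
        pass


# Registry keyed by the backend's name, so adding a backend is one line.
_BACKENDS = {cls.name: cls for cls in (LocalPlaywrightBackend, RemoteBrowserbaseBackend)}


def make_backend(kind: str) -> BrowserBackend:
    """Instantiate the backend selected at runtime, or fail with a clear error."""
    try:
        return _BACKENDS[kind]()
    except KeyError:
        raise ValueError(
            f"unknown backend {kind!r}; expected one of {sorted(_BACKENDS)}"
        ) from None
```

Because the agent loop would only depend on the `BrowserBackend` interface, the same loop can drive either implementation unchanged.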

The project provides three separate agent entry points: a Playwright browser agent, a Browserbase cloud browser agent, and a Vertex AI browser agent, all driven by the same natural language interface. Configuration is managed through environment variables and a .env file, with runtime settings for browser backend selection, headless mode, cloud region, model version, and startup URL. The system also includes a workaround for dropdown menus that are rendered by the operating system rather than by the page, which Playwright cannot natively capture.
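As an illustration of environment-driven configuration, the sketch below reads the settings listed above into a single immutable object. The variable names (`AGENT_BACKEND`, `AGENT_HEADLESS`, `AGENT_USE_VERTEX`, `AGENT_REGION`, `AGENT_MODEL`, `AGENT_START_URL`) and the defaults are hypothetical, chosen only for this example; the project's real variable names may differ. A .env loader, if used, would populate `os.environ` before `load_settings` is called.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}


def _flag(env: Mapping[str, str], key: str, default: bool) -> bool:
    """Parse a boolean environment variable, falling back to a default."""
    raw = env.get(key)
    return default if raw is None else raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class Settings:
    backend: str       # "playwright" or "browserbase"
    headless: bool     # run the browser without a visible window
    use_vertex: bool   # route inference through Vertex AI instead of the public API
    region: str        # cloud region, only meaningful for cloud routing
    model: str         # model version identifier (left empty if unset)
    start_url: str     # page opened when the browser starts


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    backend = env.get("AGENT_BACKEND", "playwright")   # hypothetical variable name
    if backend not in ("playwright", "browserbase"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return Settings(
        backend=backend,
        headless=_flag(env, "AGENT_HEADLESS", False),
        use_vertex=_flag(env, "AGENT_USE_VERTEX", False),
        region=env.get("AGENT_REGION", ""),
        model=env.get("AGENT_MODEL", ""),
        start_url=env.get("AGENT_START_URL", "about:blank"),
    )
```

Passing the mapping explicitly, rather than always reading `os.environ`, makes the loader easy to exercise with a plain dictionary.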

Features

Natural Language Browser Control - Controls a web browser by translating plain-English instructions into clicks, typing, and navigation.

AI Powered Web Automation - Executes multi-step web workflows like form filling and data extraction via an AI agent.

Browser Automation Agents - Executes natural language instructions by controlling a web browser to perform specified tasks.

Gemini-Powered Agents - Uses the Gemini API to control a web browser through natural language instructions.

Gemini Integrations - Connects to the Gemini API to power an agent that perceives and interacts with GUIs.

Natural Language Command Translation - Translates plain-English commands into browser actions like clicking, typing, and navigating.

AI-Driven Action Loops - Operates through a continuous screenshot-based action loop where the agent captures browser state and executes model commands.

Local and Remote Backend Switching - Ships a dual browser backend abstraction that switches between local Playwright and remote Browserbase instances at runtime.

Debugging Overlays - Provides debugging tools like mouse cursor overlays and headless mode toggling for agent actions.

API and Vertex AI Routing - Routes model inference requests to either the public Gemini API or Vertex AI endpoints based on a runtime toggle.

CLI-Driven Browser Agents - Executes a natural-language instruction by launching a browser, navigating web pages, and performing actions via the Gemini API from the CLI.

Cloud Browser Agent Execution - Launches a Browserbase cloud browser, connects it to a Gemini-powered agent, and executes a natural-language task loop.

Playwright Agent Execution - Launches a Playwright-controlled browser, connects it to a Gemini-powered agent, and executes a natural-language task loop.

Vertex AI Agent Execution - Routes browser automation requests through Vertex AI for model inference and task execution.

Browser Backend Selection - Selects either a local Playwright-controlled browser or a remote Browserbase instance to run the agent.

API-Driven Executions - Leverages Playwright to launch and control a browser, executing tasks via the Gemini API.

Cloud Browser Provisioners - Configures remote Browserbase cloud browser instances for AI-driven automation.

Cloud Browser Integrations - Connects a Gemini-powered agent to a remote Browserbase cloud browser for task loops.

google-gemini/computer-use-preview
