FARA is a visual computer-use agent model that controls a browser by predicting screen coordinates for clicking, typing, and scrolling, without relying on DOM or accessibility trees. It is designed to automate multi-step web tasks such as searching, form filling, booking, and shopping by reasoning over visual state and decomposing tasks into sequential actions.
The model uses a compact 7-billion-parameter decoder-only transformer that can run on consumer GPUs for low-latency on-device inference, or be deployed as a managed endpoint on Azure Foundry for cloud-based inference without local infrastructure. It also supports self-hosted serving via vLLM, LM Studio, or Ollama, giving users full control over the inference environment.
FARA includes a reproducible evaluation framework that runs agent benchmarks on 609 real, live web-browsing tasks with automatic retry handling for time-sensitive and error-prone scenarios. The framework provides standardized scoring rubrics to compare agent performance across different task descriptions and versions.