30 open-source projects similar to firecrawl/firecrawl, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Firecrawl alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
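To make the idea of resource-aware concurrency concrete, here is a minimal sketch in Python. It is not Crawlee's implementation or API: the names `ConcurrencyLimits` and `next_concurrency`, the thresholds, and the halve-on-overload, add-one-otherwise policy are all assumptions chosen for illustration. The caller is assumed to supply CPU and memory usage as fractions between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class ConcurrencyLimits:
    # All values below are hypothetical defaults, not taken from any library.
    min_concurrency: int = 1
    max_concurrency: int = 32
    cpu_ceiling: float = 0.85  # fraction of CPU in use considered "too busy"
    mem_ceiling: float = 0.80  # fraction of memory in use considered "too busy"


def next_concurrency(current: int, cpu: float, mem: float,
                     limits: ConcurrencyLimits = ConcurrencyLimits()) -> int:
    """Return the number of parallel tasks to allow on the next tick."""
    overloaded = cpu > limits.cpu_ceiling or mem > limits.mem_ceiling
    if overloaded:
        # Back off quickly: halve the number of parallel tasks.
        proposed = current // 2
    else:
        # Probe upward slowly: allow one more task at a time.
        proposed = current + 1
    # Clamp the result to the configured bounds.
    return max(limits.min_concurrency, min(limits.max_concurrency, proposed))
```

A scheduler would call `next_concurrency` periodically with fresh usage readings and start or pause tasks to match the returned value.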
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
This project is a Model Context Protocol server that connects large language models to web scraping and crawling tools. It functions as a bridge, allowing LLM clients to utilize a web crawling engine and scraping utilities to extract and process web data. The server integrates a markdown web converter that transforms dynamic web pages and PDF documents into clean markdown to optimize consumption by AI models. It also provides a browser automation interface for controlling headless sessions and bypassing access restrictions. The system covers broad capabilities including large-scale website d
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Katana is a web crawler and spider designed for security reconnaissance and web application mapping. It functions as a utility for identifying endpoints, forms, and API structures across web targets by combining standard HTTP request traversal with headless browser automation to render dynamic, JavaScript-heavy content. The tool distinguishes itself through its ability to maintain authenticated sessions and handle complex web interactions, such as automated form submission and captcha resolution. It provides granular control over the discovery process, allowing users to define specific crawl
This project is an MCP browser automation server that connects large language models to headless cloud browsers. It functions as an autonomous web workflow engine and an LLM web agent interface, enabling the translation of natural language instructions into browser actions and structured data retrieval. The system distinguishes itself through a managed headless browser cloud API that supports concurrent Chromium sessions with integrated stealth modes, CAPTCHA solving, and proxy traffic routing. It utilizes self-healing element selection to maintain automation resilience when page structures c
Skyvern is an autonomous web navigation agent and browser-based workflow orchestrator that uses large language models to execute multi-step tasks on websites. By translating natural language instructions into actionable browser commands, the framework enables the automation of complex user workflows, including data extraction and interface interaction, without manual intervention. The platform distinguishes itself through a focus on secure, self-hosted infrastructure and stealth-oriented execution. It utilizes containerized browser isolation to maintain consistent environments and employs pro
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
This project is an agentic framework designed to enable autonomous web navigation and browser automation. It functions as a controller that translates natural language instructions into deterministic browser actions, allowing agents to interact with websites, perform data extraction, and manage complex authentication flows. By leveraging accessibility trees and semantic element resolution, the framework mimics human-like navigation, moving beyond brittle DOM selectors to interact reliably with modern web interfaces. The framework distinguishes itself through its focus on secure, scalable exec
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
This project is a comprehensive suite of AI tools and frameworks, featuring an LLM multi-agent orchestrator, an autonomous agent runtime, and a stateful application framework. It provides the infrastructure to build and manage specialized AI agents capable of coordinating complex tasks through graph-based workflows and shared state. The system is distinguished by its implementation of the Model Context Protocol, allowing for standardized resource discovery and communication between AI clients and servers. It further includes an AI-powered documentation generator designed to analyze source cod
Stagehand is an AI-native browser automation framework that enables developers to build reliable web automations using a hybrid of natural language instructions and deterministic TypeScript code.
FlareSolverr is a proxy server designed to provide programmatic access to websites protected by automated security challenges and firewall restrictions. It functions by orchestrating headless browser instances to render web pages, execute JavaScript, and retrieve the necessary cookies and content required to bypass common security hurdles. The service distinguishes itself by maintaining persistent browser sessions in memory, which allows for the reuse of authenticated states across multiple requests. It integrates with external captcha resolution services to handle interactive security challe
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
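Domain-based rate limiting, in general terms, means tracking when each host may next be contacted. The sketch below shows one way to do that in Python; it is not Colly's code (Colly is written in Go), and the class name `DomainRateLimiter` and its `wait_time` method are hypothetical. Time is passed in explicitly so the logic stays easy to test.

```python
from urllib.parse import urlparse


class DomainRateLimiter:
    """Tracks the earliest time each domain may be requested again."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._next_allowed = {}  # domain -> earliest permitted request time

    def wait_time(self, url, now):
        """Return how long to wait before fetching `url`, and reserve the slot."""
        domain = urlparse(url).netloc
        # The request can go out now, or when the domain's slot frees up.
        ready_at = max(now, self._next_allowed.get(domain, now))
        # Reserve the following slot for the next request to this domain.
        self._next_allowed[domain] = ready_at + self.delay
        return ready_at - now
```

Requests to different domains do not delay each other, while repeated requests to the same domain are spaced at least `delay_seconds` apart.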
Goutte is a PHP web scraper and DOM crawler designed for extracting data from websites. It functions as an HTTP client wrapper that enables the retrieval of web pages and the parsing of HTML content. The project provides a web form automator to programmatically fill and submit HTML forms to remote servers. It also includes a mechanism for automated website crawling by following links to discover and archive web content. The system supports stateful session management to maintain cookies and headers across requests. It further covers HTML data extraction through DOM-based element selection an
This project is an LLM-powered web crawler and data extractor that uses large language models to navigate websites and parse content into structured JSON or Markdown formats. It functions as an automated browser orchestrator and domain discovery engine, interpreting plain English instructions to identify relevant pages and extract specific information. The system distinguishes itself through agentic browser automation, allowing it to perform human-like interactions such as clicking buttons and scrolling based on natural language commands. It employs goal-oriented crawling to analyze website s
geckodriver is a browser automation driver and W3C WebDriver implementation. It functions as a proxy server that translates standardized WebDriver commands into internal instructions for web browsers based on the Gecko engine. The project enables the programmatic control of Gecko-based browsers to simulate user interactions and automate repetitive web tasks. It supports both standard browser automation and headless browser orchestration for workflows executing without a graphical user interface. The software is used for automated web testing to verify website functionality and user interface
This project is a reference library of architectural blueprints, study materials, and design patterns for building scalable, high-availability distributed systems. It serves as a technical guide for scalability engineering, providing structural solutions for common engineering challenges. The repository focuses on distributed systems design, covering essential patterns for data replication, consensus algorithms, and transaction management. It distinguishes itself by offering detailed blueprints for specialized domains, including real-time data streaming, large-scale data storage, and high-ava
Browserless is a service-oriented platform designed for remote browser automation and headless execution. It provides a distributed infrastructure that manages browser sessions through containerized isolation, allowing users to execute scripts and interact with web content without maintaining local browser state or infrastructure. The platform functions as a remote API and WebSocket-based control layer, enabling stateless HTTP requests for tasks like document generation and real-time browser interaction. It incorporates proxy-based routing to manage traffic signatures and supports the integra
gstack is an AI agent framework and development workflow system designed to automate the software development lifecycle. It coordinates specialized AI personas to manage tasks across product design, engineering management, and quality assurance, transforming product intent into technical specifications and final releases. The project is distinguished by its deep integration of headless browser automation and semantic code memory. It utilizes a persistent Chromium daemon for web scraping and visual auditing, and implements a searchable knowledge base that logs architectural decisions and repos
Scrapegraph-ai is a Python framework that uses large language models to automate the extraction of structured data from websites and documents. It functions as an AI-driven data extraction pipeline that converts unstructured web content into structured formats using natural language processing and graph-based logic. The project utilizes graph-based task orchestration to model scraping workflows as interconnected nodes. It features a pluggable model interface for connecting to cloud or local artificial intelligence providers and can generate executable Python code on the fly to handle site-spe
Photon is a command-line web crawler designed for security reconnaissance and information gathering. It systematically traverses websites to discover URLs, map domain infrastructure, and identify associated subdomains by retrieving DNS records. The tool distinguishes itself through its ability to perform deep content analysis, including the extraction of sensitive data such as API keys and authentication tokens using user-defined regular expressions. It supports offline inspection by cloning crawled web content to the local filesystem, allowing for structural analysis without additional netwo
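Regex-driven extraction of this kind can be sketched in a few lines of Python. This is an illustration rather than Photon's own code: the `PATTERNS` table and the `find_secrets` helper are hypothetical, and the two patterns are simplified examples of what a user might define, not a complete or authoritative set.

```python
import re

# Hypothetical user-defined patterns; real credential formats vary by provider.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"Bearer\s+([A-Za-z0-9\-._~+/]+=*)"),
}


def find_secrets(text):
    """Return (pattern_name, matched_value) pairs found in crawled text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the first capture group when the pattern defines one,
            # otherwise fall back to the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            hits.append((name, value))
    return hits
```

A crawler would run `find_secrets` over each fetched page body and record the hits alongside the page URL.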
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Reader is an AI data ingestion pipeline and web content parser designed to convert websites and documents into clean markdown for use with large language models. It functions as a headless browser content extractor and web-to-markdown converter, transforming URLs and PDF files into structured text formats while removing irrelevant web clutter. The system optimizes retrieval augmented generation by acting as a search optimizer that retrieves web results and applies re-ranking to improve context relevance. It further enhances content accessibility by using vision models to generate descriptive
This project is a high-performance headless browser engine designed for scalable web automation, data extraction, and AI agent integration. It provides a specialized environment that allows autonomous agents and testing frameworks to interact with web content through standardized remote control protocols. By executing pages in a lightweight, headless state, the engine minimizes resource consumption while maintaining the ability to perform complex navigation and dynamic content rendering. The platform distinguishes itself through deep integration with AI-centric communication layers and advanc
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Nuclei is a modular security scanning framework designed for automated vulnerability detection and infrastructure reconnaissance. It functions as a template-driven engine that executes security checks across diverse network protocols, allowing users to define custom detection logic to identify vulnerabilities, misconfigurations, and exposed assets. The platform distinguishes itself through its highly extensible architecture, which supports distributed scanning, headless browser automation for dynamic web content, and out-of-band interaction monitoring to detect blind vulnerabilities. It integ