30 open-source projects similar to friendsofphp/goutte, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Goutte alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Crawlab is a distributed web scraping platform designed to centralize the management, deployment, and execution of large-scale data extraction tasks. It functions as a control plane that orchestrates scraping scripts and automated workflows across multiple nodes, providing a unified environment for managing complex data collection operations. The platform distinguishes itself through a distributed architecture that coordinates worker nodes via a central master, utilizing real-time communication to maintain oversight of all active processes. It ensures operational consistency by isolating task
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Scraperr is a self-hosted web scraping and crawling platform designed for extracting structured data from websites using XPath selectors. It functions as a containerized system for managing scraping jobs through a queue and analyzing the resulting content using artificial intelligence. The project differentiates itself through its Kubernetes-native architecture, allowing for scalable deployment and management via package managers. It includes a crawling engine capable of domain-level spidering to discover linked pages and a data analyzer that uses artificial intelligence to query extracted we
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager to interact with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing for programmatic website navigation and the simulation of complex browser interactions. Its capabilities cover automated form population and submission using CSS selectors
Photon is a command-line web crawler designed for security reconnaissance and information gathering. It systematically traverses websites to discover URLs, map domain infrastructure, and identify associated subdomains by retrieving DNS records. The tool distinguishes itself through its ability to perform deep content analysis, including the extraction of sensitive data such as API keys and authentication tokens using user-defined regular expressions. It supports offline inspection by cloning crawled web content to the local filesystem, allowing for structural analysis without additional netwo
node-fetch is a promise-based HTTP client library that provides a lightweight implementation of the Fetch API for the Node.js runtime. It serves as a network interface for performing asynchronous HTTP requests, handling server communication, and managing headers. The library utilizes a promise-based request lifecycle to wrap network calls, ensuring asynchronous behavior. It incorporates stream-based handling for both requests and responses to process large payloads efficiently without overloading system memory. Its capabilities cover a broad range of network communication tasks, including th
Defuddle is a command line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and associated metadata. The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient. The sy
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
php-webdriver is a WebDriver PHP client and browser automation framework that implements the W3C WebDriver standard. It serves as a programmatic interface for controlling web browsers, executing JavaScript, and managing browser sessions in both headed and headless environments. The library functions as a Selenium protocol implementation, allowing PHP applications to communicate with browser drivers such as ChromeDriver or GeckoDriver. It provides the ability to automate user actions, navigate pages, and validate DOM elements for web UI testing. Its capabilities cover broad areas of browser i
MechanicalSoup is a Python web automation library and scraping framework designed to simulate browser sessions and navigate websites without requiring JavaScript execution. It functions as an HTML parsing tool and HTTP session manager, allowing for the programmatic retrieval of page content and the automation of web interactions. The library distinguishes itself by combining session persistence with automated form interaction. It maps user data to HTML input fields and selection boxes for programmatic submission and maintains authenticated states by managing cookies and user-agent headers acr
This project is a distributed scraping engine designed to extract business details, customer reviews, and lead information from Google Maps. It functions as a business scraper and data extractor that can be deployed as a permanent system or as on-demand serverless functions. The system utilizes a proxy-routed web crawler to manage request origins via SOCKS5, HTTP, and HTTPS proxies. To locate contact information, it includes an email extraction tool that recursively crawls business websites linked within map listings. The software supports coordinate-based radius searches for efficient data
Katana is a web crawler and spider designed for security reconnaissance and web application mapping. It functions as a utility for identifying endpoints, forms, and API structures across web targets by combining standard HTTP request traversal with headless browser automation to render dynamic, JavaScript-heavy content. The tool distinguishes itself through its ability to maintain authenticated sessions and handle complex web interactions, such as automated form submission and captcha resolution. It provides granular control over the discovery process, allowing users to define specific crawl
Curl is a command-line tool and portable library for transferring data across a wide range of network protocols. It functions as a unified engine that abstracts diverse communication standards, allowing users and developers to move files and information between servers using a consistent interface. The project provides both a versatile command-line client for terminal-based automation and a stable programmatic interface for integrating complex network operations into applications. The system is distinguished by its protocol-agnostic core and its ability to manage both synchronous and asynchro
Mechanize is a Ruby library for web browser automation and headless browser emulation. It allows for programmatically navigating websites and simulating human behavior without a graphical user interface. The library provides an automated interface for populating and submitting web forms, including text fields, checkboxes, and file uploads. It manages stateful sessions by automatically storing and sending cookies across multiple requests to maintain user authentication and identity. Additional capabilities include web data scraping, the ability to download remote web content, and the maintena
This project is a Python-based automation toolkit designed to manage programmatic authentication and session persistence across web services. It provides a framework for executing automated login sequences, including the handling of interactive security challenges such as QR code verification and captcha resolution. The toolkit distinguishes itself by simulating native mobile application environments, allowing for the execution of scripts that require specific device-level headers and behaviors. It also incorporates hook-based interception to monitor workflow states and manage exceptions duri
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
DotnetSpider is a .NET web crawler framework for traversing websites and capturing structured data from web pages. It functions as a distributed crawling engine: crawling tasks can be spread across multiple servers to process large volumes of web content. This architecture supports high-performance web scraping and enterprise data collection workflows.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
This project is a Python web scraping library and automated data collection suite. It provides tools for extracting structured data from websites, implementing web crawlers to navigate site links, and parsing HTML DOM structures to isolate specific elements and attributes. The toolkit includes a pipeline for processing unstructured text and cleaning raw web content to extract meaningful information. It also features capabilities for image data extraction and the integration of external APIs to retrieve structured data from remote endpoints. The system covers broad capability areas including
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
Subfinder is a security reconnaissance framework designed for subdomain enumeration and attack surface management. It functions as a discovery engine that identifies and maps internet-exposed infrastructure, cloud-hosted assets, and network ranges to maintain a comprehensive inventory of an organization's digital footprint. The project distinguishes itself through a modular, template-driven scanning engine that executes security checks against discovered assets. It leverages cloud-native asset discovery to query provider APIs and infrastructure metadata, while supporting distributed agent orc
DotnetSpider is a .NET web crawling framework and C# data extraction tool designed for automated web page discovery and the retrieval of structured data from the internet at scale. It functions as a high-level web scraping library for collecting information from various websites. The framework provides capabilities for automated web crawling and large-scale data scraping. It enables web content extraction to facilitate the creation of local databases or the analysis of online information through programmatic web automation within the .NET ecosystem. The system utilizes a pipeline-based data
OkHttpUtils is a convenience wrapper for the OkHttp HTTP client that simplifies common networking operations on Android. It provides a straightforward interface for executing GET and POST requests, including sending form parameters and JSON payloads, as well as uploading files via multipart form data and downloading remote files to local storage. The library distinguishes itself through a set of practical utilities built on top of OkHttp's core architecture. It wraps synchronous calls into an asynchronous callback pattern, includes an interceptor-based logging layer for request and response d
This project is a comprehensive Python network request framework designed for both synchronous and asynchronous HTTP communication. It provides a high-performance client capable of executing non-blocking requests within event-driven applications, while also supporting standard blocking calls for simpler scripts. The library is built to operate natively across diverse asynchronous runtimes, automatically detecting and utilizing the underlying event loop for concurrency. What distinguishes this library is its modular architecture, which decouples request construction from network execution thro