We curate 29 open-source GitHub repositories matching "open-source web scraping and data extraction tools". Results are ranked by relevance to your query.
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Pholcus is a Go-based distributed web crawler framework that integrates a headless browser for dynamic content extraction, includes proxy and user-agent rotation for anti-detection, and offers a web management interface — fitting the core scraping workflow and most of the required features like headless browser support, selectors, export, and automation.
EasySpider is a no-code automation platform designed to orchestrate repetitive web interactions and data collection processes. It functions as a browser task orchestrator, providing a visual environment where users can build and execute complex workflows through point-and-click configuration rather than manual programming. The platform distinguishes itself by enabling visual web scraping design, allowing users to create data extraction tasks by interacting directly with web elements. It utilizes a headless browser engine to simulate human navigation and event-driven interactions, mapping thes
EasySpider is a no-code visual web scraping tool that uses a headless browser to automate data extraction, fitting the requirement for automated scraping and dynamic content handling, though explicit anti-detection and proxy support are not highlighted.
CyberScraper-2077 is an AI-powered web scraping tool that uses large language models to extract and structure data from websites into organized formats. It functions as an LLM web scraper and AI content parser, transforming unstructured raw web text into specific data schemas. The project distinguishes itself through a suite of anonymity and evasion tools, including proxy rotation, SOCKS-based identity masking, and the ability to route traffic through the Tor network to access hidden onion services. It further includes a bot detection bypass system that employs stealth parameters and custom n
CyberScraper-2077 is an AI-powered web scraper that uses headless browser automation, proxy rotation, and anti-detection techniques to extract structured data, fitting the search for a programmatic scraping tool, though it does not highlight CSS/XPath selectors, scheduling, or a visual interface.
Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for managing automated information gathering. It functions as a no-code tool, allowing users to define complex data collection pipelines through a visual, drag-and-drop interface rather than manual programming. The platform distinguishes itself through a graph-based workflow orchestration system where users link discrete nodes to define navigation and parsing logic. It supports dynamic content crawling by integrating headless browsers to execute JavaScript and render page content that
Spider-flow is a visual, no-code web scraping platform with headless browser and XPath selector support, making it a solid fit for this search, though its graph-based workflow is code-first from a design perspective and it lacks explicit support for scheduling or anti-detection features.
Maxun is an open-source web scraping and automation platform designed to transform dynamic website content into structured data. By leveraging artificial intelligence to interpret natural language prompts, the system identifies page elements and extracts information without requiring manual selector configuration. It serves as a bridge between raw web content and intelligent workflows, providing structured outputs in formats optimized for large language model ingestion and agent-based applications. The platform distinguishes itself through its ability to handle complex, authenticated, and dyn
Maxun is an open-source web scraping platform that uses AI and a headless browser (Playwright) to extract structured data from dynamic websites, with no-code natural language input, automation, and self-hosting capabilities – making it a strong fit for your scraping needs.
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Crawl4AI is an AI-powered, self-hosted web crawling engine that uses headless browsers to navigate dynamic sites and extract structured data into multiple formats like JSON and Markdown, directly fitting the request for programmatic scraping with support for dynamic content and multiple export formats.
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
Pholcus is a distributed, headless-browser-powered web scraping system that supports large-scale crawling with proxy rotation and anti-detection, fitting your search for a programmatic scraping tool — though it forgoes a visual/no-code interface and does not explicitly advertise CSS/XPath selectors or scheduling automation.
Obscura is a web scraping infrastructure and headless browser server designed for AI agents. It provides a system for AI models to control browser sessions, interact with websites, and extract web data using a WebSocket implementation of the Chrome DevTools Protocol. The project focuses on bot detection evasion by randomizing browser fingerprints, masking native functions, and blocking tracking scripts to mimic human behavior. It further secures identities through a traffic layer that routes network requests via HTTP or SOCKS5 proxies. The system supports large-scale data extraction through
Obscura is a headless browser server and scraping infrastructure with built-in anti-detection and proxy support, directly fitting the need for programmatic web scraping; however, it lacks explicit scheduling, multiple export format options, and a visual interface, so it is a capable but narrower tool than the full-featured category leaders.
This project is a LinkedIn data scraper and professional profile extractor designed to collect information from professional networking sites. It functions as a headless browser scraper that extracts professional profiles, company details, and job listings using automated browser sessions. The tool includes a session manager that saves and loads authentication cookies to maintain persistent access to protected profiles. It employs configurable browser settings and user-agent mimicry to simulate human activity and bypass bot detection. Data extraction capabilities cover person profiles, compa
This is a dedicated LinkedIn scraper that uses a headless browser with session management and anti-detection, so it fits the scraping category; however, it is limited to a single site and lacks general CSS/XPath selectors, multiple export formats, scheduling, and visual interfaces.
ai-goofish-monitor is an AI-driven marketplace monitor and containerized web scraper designed to track online listings. It uses multimodal large language models and natural language prompts to analyze product text and images, determining if items meet specific requirements. The system employs an anti-detection workflow that rotates network proxies and authenticated accounts to bypass rate limits. It captures browser cookies and session states to mimic real user behavior during automated requests. The project includes a task scheduler using cron expressions and an embedded SQLite database for
This is a Playwright-based web scraper with anti-detection and cron scheduling built for monitoring online listings, so it directly supports programmatic scraping and extraction, though it lacks multiple export formats and a visual interface.
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
node-crawler is a Node.js library for programmable web crawling and data extraction that supports proxy rotation and rate limiting, fitting your need for a programmatic scraping tool, but it lacks headless browser support, scheduling, and a visual interface.
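The task deduplication described for node-crawler can be illustrated with a small sketch. This is not node-crawler's own code or API: `normalize` and `DedupQueue` are hypothetical names, and the normalization rules (lowercase host, drop the fragment, treat an empty path as `/`) are assumptions chosen for the example.

```python
from collections import deque
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lowercase scheme and host, keep the query string, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class DedupQueue:
    """FIFO queue that ignores URLs it has already accepted (hypothetical)."""

    def __init__(self) -> None:
        self._seen = set()
        self._pending = deque()

    def push(self, url: str) -> bool:
        """Queue the URL; return False if an equivalent URL was seen before."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to visit, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = DedupQueue()
queue.push("http://Example.com")       # accepted
queue.push("http://example.com/#top")  # rejected: same page after normalization
```

Keeping the seen-set separate from the pending queue means a URL stays deduplicated even after it has been popped and processed.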
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Firecrawl is a headless browser API and web crawling engine that extracts structured data with natural language commands, supporting markdown/JSON output and proxy rotation, but it does not explicitly offer CSS/XPath selectors or built-in scheduling.
Stagehand is an AI-native browser automation framework that enables developers to build reliable web automations using a hybrid of natural language instructions and deterministic TypeScript code.
Stagehand is an AI-driven browser automation framework built for programmatic data extraction using headless browsers and anti-detection, but it focuses on agent-based workflows rather than providing a ready-to-use scraping tool with scheduling, multiple export formats, or a visual interface.
DrissionPage is a Python library designed for web automation, data scraping, and testing. It functions as a browser automation framework that communicates directly with the browser engine via the Chrome DevTools Protocol, allowing for precise control over browser instances and page states. The library distinguishes itself by providing a unified interface that combines full browser automation with raw HTTP request capabilities. This hybrid approach allows users to switch between lightweight network requests and heavy browser-based interactions within a single workflow. By wrapping asynchronous
DrissionPage is a Python library that combines headless browser automation via Chrome DevTools Protocol with raw HTTP requests, making it a solid programmatic web scraping tool—it handles dynamic content and DOM selection, but lacks built-in scheduling, multiple export formats, and a visual interface, so it fits the category but not all requested features.
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Crawlee-python is a Python web scraping framework that provides headless browser automation via Playwright, anti-detection measures like proxy rotation and fingerprint impersonation, and structured data extraction—making it a solid fit for programmatic scraping, though it lacks a built-in visual interface and out-of-the-box scheduling.
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
AnyCrawl is an AI-powered web scraper and headless browser orchestrator that converts website content into structured JSON and markdown, with support for scheduling, proxy rotation, and pattern-based extraction, making it a genuine match for programmatic scraping, though it lacks explicit CSS/XPath selectors and a no-code interface.
MediaCrawler is an automated web scraping framework designed to extract public posts, comments, and creator metadata from various social media platforms. It functions as a headless browser automator, utilizing real browser instances to render dynamic content and execute the client-side scripts necessary for interacting with modern web interfaces. The system distinguishes itself through a focus on session persistence and network flexibility. It supports remote debugging to reuse active browser sessions and cookies, which helps minimize the risk of triggering platform security challenges. To ma
MediaCrawler is a Python framework that automates scraping of social media platforms using real browser instances, handling dynamic content and proxies, which aligns with programmatic data extraction, though its social-media specificity and lack of a visual/no-code interface narrow the fit.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Crawlee is a developer-focused web scraping framework that provides headless browser automation through Playwright and Puppeteer, CSS/XPath selectors, and scalable data extraction — fitting the programmatic scraping need, though it lacks the visual/no-code interface you mentioned.
Automa is a browser-based automation platform that enables users to build, schedule, and execute repetitive web tasks through a visual, no-code interface. By operating as a browser extension, it provides a canvas-based environment where users construct workflows by connecting functional blocks to interact with web elements, manage browser state, and process data. The platform distinguishes itself through its deep integration with the browser environment, allowing for complex orchestration such as event-driven triggers, cross-origin request handling, and the ability to package workflows as sta
Automa is a visual, no-code browser automation platform that can scrape and extract structured data from websites through its workflow builder, though it focuses on general browser automation rather than being a dedicated scraping tool.
OpenCLI is an AI browser automation framework designed to automate web navigation, data extraction, and repetitive browser tasks. It functions as a browser-based CLI generator that converts website interfaces into command-line interactions by controlling authenticated web browser sessions. The project features a web-to-CLI adapter platform for mapping web elements to programmatic command-line inputs and outputs. It includes a browser profile manager to organize and switch between isolated session profiles to maintain different user identities. The toolkit provides capabilities for web conten
OpenCLI is an AI-powered browser automation framework that can extract data from websites by turning them into CLI interactions, making it suitable for programmatic scraping—though it focuses on automation and CLI generation rather than offering a dedicated scraping UI, built-in scheduling, or explicit anti-detection features.
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
pydoll is a headless browser automation library built on Chrome DevTools Protocol with strong anti-detection and proxy support, making it a good fit for programmatic scraping if you're comfortable coding in Python—it covers headless browsing and stealth well but lacks built-in scheduling, multiple export formats, and a visual interface.
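To make the idea of Bezier-curve mouse movement concrete, here is a minimal sketch of how a curved pointer path between two points could be generated. It does not use or reproduce pydoll's API: `bezier_path`, its parameters, and the way the control points are randomized are all assumptions made for illustration.

```python
import random


def bezier_path(start, end, steps=20, jitter=80.0, rng=None):
    """Return points along a cubic Bezier curve from start to end.

    The two inner control points are placed at one third and two thirds of
    the straight line and then shifted randomly, so each path bends a little
    differently (an assumed, simplified model of a human-like movement).
    """
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula, evaluated per coordinate.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


path = bezier_path((0, 0), (300, 200), steps=10, rng=random.Random(42))
```

The path always starts and ends exactly on the requested coordinates; only the intermediate points vary with the random control points.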
Colly is a high-performance web scraping framework designed for the automated extraction of structured data from websites. It provides a programmable toolkit that manages the complexities of large-scale data collection, including concurrent request orchestration, automatic cookie handling, and robots.txt compliance. By utilizing an asynchronous execution model, the engine maintains high throughput while preventing resource exhaustion during recursive or distributed crawling tasks. The framework is distinguished by its modular, event-driven architecture, which allows developers to hook into sp
Colly is a high-performance web scraping framework in Go that provides a programmable API for extracting structured data, fitting the need for programmatic scraping, but it lacks built-in headless browser support and visual/no-code features.
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Firecrawl is a headless-browser-powered crawler that extracts structured data from dynamic websites and outputs LLM-ready formats like markdown or JSON, making it a strong fit for programmatic scraping, though it does not advertise a visual interface or built-in anti-detection.
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
Scrapy is a production-grade web scraping framework with built-in CSS/XPath selectors, modular pipelines, and scheduling. It is exactly the kind of programmatic tool you need, though it lacks built-in headless browser support and a visual interface.
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions. The project distinguishes itself through its ability to translate high-level intent into
Browser-use is a framework that orchestrates browser automation with LLMs to extract structured data programmatically, fitting the core need, though it lacks a visual interface and built-in anti-detection or scheduling features.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Colly is a Go-based web scraping framework that efficiently crawls and extracts structured data from static HTML/XML pages, making it a solid choice for straightforward scraping tasks, but it lacks built-in headless browser support for dynamic JavaScript-rendered content.
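The robots.txt compliance and per-domain rate limiting mentioned for Colly can be sketched with Python's standard library. This is an illustrative example and not Colly's Go API: the robots.txt content, the `demo-bot` user agent, and the `build_policy` and `wait_needed` helpers are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object that can answer queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def wait_needed(last_request_at: float, now: float, delay: float) -> float:
    """Seconds still to wait before the next request to the same domain."""
    return max(0.0, last_request_at + delay - now)


policy = build_policy(ROBOTS_TXT)
agent = "demo-bot"  # hypothetical user-agent name

allowed_public = policy.can_fetch(agent, "https://example.com/public/page")
allowed_private = policy.can_fetch(agent, "https://example.com/private/page")
delay = policy.crawl_delay(agent)  # value of Crawl-delay, or None if absent
```

A crawler would check `can_fetch` before queueing a URL and sleep for `wait_needed(...)` seconds between requests to the same domain.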
Huginn is a self-hosted automation platform that functions as an event-driven workflow engine. It allows users to build autonomous agents that monitor web services, scrape data, and execute complex tasks by propagating events through a directed graph. By running on your own server infrastructure, it provides a private environment for orchestrating workflows without relying on third-party automation services. The platform distinguishes itself through a modular, plugin-based architecture that enables the development of custom agents to handle specific data processing needs. Each agent maintains
Huginn is a self-hosted automation platform that can scrape and extract data from websites via configurable agents, with built-in scheduling and a visual workflow builder—making it a solid fit for programmatic scraping even without dedicated headless browser support.
Cheerio is an HTML and XML parsing library and server-side DOM implementation. It functions as a markup manipulation tool and CSS selector engine, allowing users to parse, query, and modify HTML or XML documents in non-browser environments. The project provides a DOM-like tree representation of markup strings, enabling programmatic addition, removal, and modification of elements and attributes. It features a prototype-based plugin system that allows the extension of core functionality by adding custom methods to the document prototype. The library covers a broad range of capabilities includi
Cheerio is a fast HTML parsing and DOM manipulation library with CSS selectors, making it a strong choice for extracting data from static web pages, but it lacks headless browser support, XPath, scheduling, and anti-detection features that this search may require.
Scrapegraph-ai is a Python framework that uses large language models to automate the extraction of structured data from websites and documents. It functions as an AI-driven data extraction pipeline that converts unstructured web content into structured formats using natural language processing and graph-based logic. The project utilizes graph-based task orchestration to model scraping workflows as interconnected nodes. It features a pluggable model interface for connecting to cloud or local artificial intelligence providers and can generate executable Python code on the fly to handle site-spe
ScrapeGraphAI is a Python framework that uses LLMs and graph-based orchestration to automate structured data extraction from websites, which is exactly the kind of programmatic scraping tool you're looking for, though it does not explicitly mention headless browser, CSS/XPath selectors, scheduling, or anti-detection features.