Web Scraping Data Extraction Tools

These open-source libraries and frameworks parse unstructured HTML content into clean, usable structured data formats.
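
As a rough illustration of what "parsing unstructured HTML into structured data" means in practice, here is a minimal sketch that uses only Python's standard-library `html.parser`. The markup, the class names (`name`, `price`), and the `ProductParser` / `extract_products` helpers are all hypothetical, made up for this example; none of the listed projects is implied to work this way internally.

```python
from html.parser import HTMLParser
import json


class ProductParser(HTMLParser):
    """Collects the text of <h2 class="name"> and <span class="price"> elements.

    The tag and class names are assumptions made for this example only.
    """

    def __init__(self):
        super().__init__()
        self.records = []     # finished or in-progress records
        self._field = None    # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "h2" and css_class == "name":
            self.records.append({})   # each name starts a new record
            self._field = "name"
        elif tag == "span" and css_class == "price" and self.records:
            self._field = "price"

    def handle_data(self, data):
        # Store text only while inside a field we care about.
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def extract_products(html):
    """Return a list of dicts, one per product found in the HTML string."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.records


SAMPLE_HTML = """
<div><h2 class="name">Widget</h2><span class="price">9.99</span></div>
<div><h2 class="name">Gadget</h2><span class="price">24.50</span></div>
"""

if __name__ == "__main__":
    print(json.dumps(extract_products(SAMPLE_HTML), indent=2))
```

Real pages are rarely this regular, which is part of why the tools below add selectors, rendering, and retry logic on top of plain parsing.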


gocolly/colly
25,101 stars
Colly is a high-performance web scraping framework designed for the automated extraction of structured data from websites. It provides a programmable toolkit that manages the complexities of large-scale data collection, including concurrent request orchestration, automatic cookie handling, and robots.txt compliance. By utilizing an asynchronous execution model, the engine maintains high throughput while preventing resource exhaustion during recursive or distributed crawling tasks. The framework is distinguished by its modular, event-driven architecture, which allows developers to hook into specific lifecycle stages of a network request to process content or control flow. It features a flexible middleware pipeline for handling proxy rotation, user agents, and rate limiting, alongside an interface-driven storage layer that supports swapping default in-memory state for persistent external databases. This design enables the coordination of multiple scraping instances and the maintenance of crawl history across application restarts. Beyond its core engine, the project offers extensive customization options for network transport, including support for custom round-trippers to manage connection pooling and timeouts. It also provides robust observability tools, allowing for the attachment of custom debuggers and logging observers to monitor internal state during execution. Developers can further extend functionality through a plugin system or by sharing request context and configuration across different collector instances to support complex, multi-stage data extraction workflows.
Colly is a high-performance web scraping framework that provides robust tools for automated crawling, proxy management, and structured data extraction, though it requires integration with external libraries for headless browser rendering.
Go, Proxy Rotation Services, Concurrent Crawling Engines
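
The event-driven design described for Colly, in which callbacks are attached to stages of a request's lifecycle, can be sketched conceptually as follows. This is a Python analogy only: Colly itself is a Go library, and `MiniCollector`, its `on` method, the event names, and the injected `fetch` function are hypothetical names invented for this sketch rather than Colly's API.

```python
from collections import defaultdict


class MiniCollector:
    """Hypothetical, simplified collector used to illustrate lifecycle hooks."""

    def __init__(self, fetch):
        self._fetch = fetch                  # callable: url -> body text
        self._handlers = defaultdict(list)   # event name -> list of callbacks

    def on(self, event, callback):
        """Register a callback for 'request', 'response', or 'error'."""
        self._handlers[event].append(callback)

    def _emit(self, event, *args):
        for callback in self._handlers[event]:
            callback(*args)

    def visit(self, url):
        self._emit("request", url)           # before the fetch happens
        try:
            body = self._fetch(url)
        except Exception as exc:
            self._emit("error", url, exc)    # the fetch failed
            return
        self._emit("response", url, body)    # the fetch succeeded


def run_demo():
    # An in-memory "site" stands in for the network in this sketch.
    pages = {"https://example.com/": "<h1>Home</h1>"}
    log = []
    collector = MiniCollector(fetch=lambda url: pages[url])
    collector.on("request", lambda url: log.append(("request", url)))
    collector.on("response", lambda url, body: log.append(("response", len(body))))
    collector.on("error", lambda url, exc: log.append(("error", type(exc).__name__)))
    collector.visit("https://example.com/")
    collector.visit("https://example.com/missing")
    return log
```

Injecting the fetch function keeps the sketch testable without network access; a real framework would place transport, rate limiting, and storage behind similar seams.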
steel-dev/steel-browser
6,450 stars
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA solving, browser fingerprint randomization, and human behavior simulation to evade bot detection systems. It maintains persistent browser state across sessions, preserving cookies, local storage, and authentication for multi-step workflows. Steel also offers natural-language browser automation, allowing AI agents to drive web interactions using plain-English instructions rather than low-level selectors. Beyond core automation, the platform provides session monitoring and debugging tools with live streaming and recorded replays, file transfer capabilities, and content extraction features that capture screenshots, PDFs, and Markdown from fully rendered web pages. It supports mobile browser emulation, geographic traffic routing, and serverless execution from edge environments. The platform can be deployed as a self-hosted runtime using Docker, giving teams full control over the browser infrastructure.
Steel is a cloud-based browser automation platform that provides the infrastructure for headless browsing, proxy management, and anti-bot evasion, making it a powerful tool for building custom web scraping and data extraction pipelines.
TypeScript, Anti-Bot Evasion, JavaScript Rendering, JavaScript-Rendered Content Extractors
browser-use/browser-use
100,229 stars
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions. The project distinguishes itself through its ability to translate high-level intent into specific browser primitives, supported by a serialization process that converts complex web page structures into simplified text for model processing. It includes robust support for stateful session persistence, allowing agents to maintain authenticated environments across long-running tasks. Furthermore, the framework facilitates remote browser orchestration, enabling the scaling of automation routines in cloud environments with integrated support for stealth configurations and proxy management. Beyond its core agent capabilities, the platform provides extensive tooling for structured data extraction and workflow integration. It supports a variety of model configurations and allows for the definition of custom tools to extend interaction logic. The project documentation includes quickstart guides for command-line execution and examples for integrating browser automation into broader software ecosystems.
This framework provides a sophisticated orchestration layer for headless browser automation and structured data extraction, making it a powerful tool for scraping complex, dynamic web interfaces using LLM-driven interaction.
Python, Browser Environment Configurations
oxylabs/how-to-scrape-amazon-product-data
2,511 stars
This project is an Amazon web scraper and e-commerce data extractor designed to retrieve product names, prices, and ratings. It functions as a headless browser crawler that converts unstructured web content from product listings into structured JSON and CSV formats. The tool incorporates anti-bot bypass capabilities to circumvent CAPTCHAs and security challenges. It achieves this through the use of residential proxy integration, automatic proxy rotation, and the modification of browser fingerprints to simulate human interaction patterns. The system provides broad web scraping capabilities, including server-side JavaScript rendering and automated browser interaction. It handles product listing traversal and pagination to discover deep web content, utilizing CSS selectors for product detail extraction and unique identification numbers for region-specific data retrieval. The project also includes utilities for localized web data access and automated ad verification to check display and delivery across different geographic locations.
This repository is a specialized, single-purpose application for scraping Amazon data rather than a general-purpose web scraping framework or library that you can use to build your own extraction pipelines.
Anti-Bot Evasion, Bot Detection Bypass, Proxy and Fingerprint Rotation
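
The final step mentioned above, turning extracted product fields into JSON and CSV, might look roughly like the sketch below. It uses only Python's standard `json` and `csv` modules; the field names (`name`, `price`, `rating`) come from the description, while the `to_json` and `to_csv` helpers and the sample values are hypothetical and not taken from the repository.

```python
import csv
import io
import json


def to_json(records):
    """Serialize a list of dicts as pretty-printed JSON text."""
    return json.dumps(records, indent=2)


def to_csv(records, fieldnames):
    """Serialize a list of dicts as CSV text with a header row.

    Missing keys are written as empty cells (csv.DictWriter's default).
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample data for illustration.
SAMPLE_RECORDS = [
    {"name": "Widget", "price": "9.99", "rating": "4.5"},
    {"name": "Gadget", "price": "24.50", "rating": "4.1"},
]

if __name__ == "__main__":
    print(to_json(SAMPLE_RECORDS))
    print(to_csv(SAMPLE_RECORDS, ["name", "price", "rating"]))
```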
docling-project/docling
61,674 stars
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures. The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It utilizes optional dependency management to allow for modular installation of specific features, such as media rendering or advanced processing capabilities, depending on the requirements of the application.
Docling is a powerful document parsing and extraction framework that excels at converting complex, unstructured layouts into structured formats, though it focuses more on document intelligence than on the automated crawling and anti-bot features typical of web-specific scrapers.
Python, Document and LLM Preparation, Document Layout Analyzers, Hierarchical Document Models
NanmiCoder/MediaCrawler
51,294 stars
MediaCrawler is an automated web scraping framework designed to extract public posts, comments, and creator metadata from various social media platforms. It functions as a headless browser automator, utilizing real browser instances to render dynamic content and execute the client-side scripts necessary for interacting with modern web interfaces. The system distinguishes itself through a focus on session persistence and network flexibility. It supports remote debugging to reuse active browser sessions and cookies, which helps minimize the risk of triggering platform security challenges. To maintain stable data collection at scale, the tool integrates proxy-based request routing, allowing users to distribute traffic across external IP services to bypass rate limits and geographic restrictions. The architecture is built for extensibility and modularity, employing a provider pattern that allows developers to integrate new platforms or custom storage backends through standardized interfaces. Users can manage complex scraping workflows via command-line configuration, enabling the definition of specific targets and storage formats—such as JSON, CSV, or various database systems—without modifying the core logic. The project also includes utilities for data visualization, such as generating word clouds from collected comments. Installation requires setting up the necessary runtime environments, including a JavaScript engine for handling complex client-side rendering and the appropriate browser automation drivers.
MediaCrawler is a specialized web scraping framework that provides headless browser automation, proxy support, and structured data export, making it a capable tool for extracting content from complex social media platforms.
Python, Web Scrapers, Web Scraping Frameworks, Browser Automation
Less-relevant matches (scored below the primary cut)
ultrafunkamsterdam/undetected-chromedriver
12,353 stars
Undetected-chromedriver is a framework for automated browser navigation designed to bypass anti-bot security measures. It functions by patching browser drivers at the binary level to obscure automation signals, allowing scripts to interact with protected websites without being flagged or blocked by security services. The project distinguishes itself through its ability to maintain stealth during automated sessions, including those executed in headless mode. It achieves this by injecting custom configurations to mimic human user behavior and by hooking into low-level browser debugging protocols to monitor internal states and network traffic in real time. Beyond its core security capabilities, the framework provides comprehensive tools for managing the browser lifecycle. This includes automated downloading and verification of driver binaries to ensure compatibility with current browser engine releases, as well as persistent profile mapping to maintain session state and cookies across multiple execution cycles. The software supports flexible deployment, including containerized environments and various hardware architectures.
This tool is a specialized browser automation driver designed to bypass anti-bot protections, serving as a foundational component for scraping rather than a complete framework for parsing, transforming, and managing data pipelines.
Python, Anti-Bot Evasion, Headless Browsers, Browser Environment Configurations
Guyungy/damaihelper
2,551 stars
Damaihelper is a ticketing automation bot and browser automation framework designed to monitor ticket availability and execute checkout processes. It utilizes a ticket purchasing script to automate the selection and purchase of tickets on web platforms based on predefined user criteria. The tool includes a graphical user interface for managing scripts and configuring automation parameters, allowing users to trigger tasks without using a command line. To maintain access, it employs browser session management to save and reuse authentication cookies, avoiding repetitive manual login procedures. To avoid security blocks, the system implements bot detection bypass techniques by configuring browser headers and disabling automation flags to mimic human behavior. It also includes execution activity logging to record operational results to a local directory for auditing and troubleshooting.
This is a specialized ticketing automation and purchase bot rather than a general-purpose web scraping or data extraction framework for transforming unstructured content into structured formats.
HTML, Anti-Bot Evasion, Bot Detection Bypass
proxifly/free-proxy-list
3,865 stars
This project is a public proxy aggregator and directory providing curated lists of validated HTTP and SOCKS proxy servers. It features a machine-readable API service and tools designed for anonymous network routing and the automated rotation of outgoing IP addresses. The system distinguishes itself through a proxy rotation tool used to bypass rate limits and prevent detection by automated security systems. It provides a programmatic interface for retrieving and filtering verified proxies by country and protocol, delivering this data in JSON and text formats for integration into custom applications. The platform covers broader capabilities including geographic traffic routing and web scraping infrastructure. It manages the aggregation and validation of network gateways, offering both automated API connectivity and manual switching options for operating systems and browsers.
This repository provides a directory and API for proxy servers to assist with network routing, but it is a supporting infrastructure component rather than a framework for scraping, parsing, and transforming HTML content.
Proxy and Fingerprint Rotation, Proxy Rotation Services
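
Filtering a proxy list by country and protocol and then rotating through the result can be sketched as below. The record layout (`protocol`, `host`, `port`, `country`) is an assumption made for illustration and is not the project's documented JSON schema; `filter_proxies` and `make_rotator` are likewise hypothetical helpers, and the addresses come from the reserved documentation range 192.0.2.0/24.

```python
from itertools import cycle


def filter_proxies(proxies, country=None, protocol=None):
    """Keep proxies matching the given country and/or protocol (None = any)."""
    return [
        p for p in proxies
        if (country is None or p["country"] == country)
        and (protocol is None or p["protocol"] == protocol)
    ]


def make_rotator(proxies):
    """Return a function that yields the next proxy in round-robin order."""
    if not proxies:
        raise ValueError("no proxies to rotate")
    pool = cycle(proxies)
    return lambda: next(pool)


# Made-up sample data; the field names are assumptions for this sketch.
SAMPLE_PROXIES = [
    {"protocol": "http", "host": "192.0.2.10", "port": 8080, "country": "US"},
    {"protocol": "socks5", "host": "192.0.2.11", "port": 1080, "country": "DE"},
    {"protocol": "http", "host": "192.0.2.12", "port": 3128, "country": "US"},
]

if __name__ == "__main__":
    next_proxy = make_rotator(filter_proxies(SAMPLE_PROXIES, "US", "http"))
    for _ in range(3):
        proxy = next_proxy()
        print(f'{proxy["protocol"]}://{proxy["host"]}:{proxy["port"]}')
```

Round-robin is only one possible rotation policy; random choice or health-weighted selection would slot into `make_rotator` in the same way.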
jsdom/jsdom
21,587 stars
jsdom is a Node.js DOM implementation that functions as a headless browser emulator and virtual browser environment. It provides a pure JavaScript implementation of web standards, acting as a web standards polyfill that simulates the window and document objects within a non-browser runtime. The project implements W3C and WHATWG specifications to provide a programmatic environment for parsing HTML and manipulating content. It serves as an HTML parser and serializer, allowing for the transformation of HTML strings into document structures and the export of those structures back into text. The system covers a broad range of browser emulation capabilities, including the execution of in-page and external scripts, the management of HTTP cookies, and the loading of external resources via network request interception. It also includes support for CSSOM mapping, canvas API integration, and virtual console log capture. Documents can be initialized using local files or remote URLs.
This is a DOM implementation and virtual browser environment that provides the underlying parsing and emulation capabilities needed to build a scraper, but it lacks the high-level crawling, proxy management, and data pipeline features required for a complete extraction framework.
JavaScript, Headless Browsers, HTML Parsing
daijro/camoufox
5,456 stars
Camoufox is a Firefox-based stealth automation browser designed to evade detection during automated browsing. It combines a fingerprint randomization engine that generates thousands of unique device attributes per session, native-level API interception to spoof WebRTC, WebGL, media, and other fingerprintable properties, and human behavior simulation that moves the cursor along natural, distance-aware trajectories. The browser is compiled from source with build-time stealth patches and runs headlessly via a lightweight virtual display buffer, making it suitable for web scraping, automated testing, and other tasks that require undetectable browser sessions. What sets Camoufox apart is its comprehensive anti-detection approach that operates at multiple layers. It integrates directly with automation frameworks like Playwright and Puppeteer as a drop-in replacement, and provides a Python API for generating device profiles and managing proxies. Each session generates a new, realistic fingerprint covering fonts, screen dimensions, WebGL parameters, navigator properties, and more. The system also spoofs location, timezone, locale, and HTTP headers to match a target region, while intercepting WebRTC ICE candidates at the protocol level to replace the real IP address. Human-like browser launch behaviors, such as automatic profile generation and geolocation setting, further reduce detection risk. Beyond core stealth, Camoufox includes features for content processing and privacy. It can block ads and tracking with custom filters, remove CSS animations and telemetry to produce a clean DOM, and support both main-world DOM modification and isolated DOM reading. The browser can be built from source for specific platforms via scripts and Docker, and exposes a remote WebSocket server for accessing browser instances from remote locations.
This is a specialized stealth browser and anti-detection engine designed to be integrated with automation frameworks, rather than a complete scraping and data extraction framework itself.
C++, Anti-Bot Evasion, Headless Browsers
projectdiscovery/nuclei
29,189 stars
Nuclei is a modular security scanning framework designed for automated vulnerability detection and infrastructure reconnaissance. It functions as a template-driven engine that executes security checks across diverse network protocols, allowing users to define custom detection logic to identify vulnerabilities, misconfigurations, and exposed assets. The platform distinguishes itself through its highly extensible architecture, which supports distributed scanning, headless browser automation for dynamic web content, and out-of-band interaction monitoring to detect blind vulnerabilities. It integrates advanced reconnaissance capabilities, including cloud infrastructure assessment, subdomain discovery, and technology fingerprinting, into a unified workflow that can be orchestrated via a command-line interface or programmatic API. Beyond core scanning, the project provides a comprehensive suite of tools for external attack surface management, including asset inventorying, visual evidence capture, and automated ticketing integration. It supports collaborative security operations through team workspaces, centralized template management, and real-time alerting, ensuring that vulnerability findings can be tracked, verified, and remediated within a single environment. The platform is distributed as a command-line utility and supports containerized execution, enabling integration into existing CI/CD pipelines and automated security workflows.
This is a security-focused vulnerability scanner and reconnaissance engine rather than a general-purpose web scraping framework for data extraction and transformation.
Go, Headless Browsers, Headless Browser Orchestrators, Web Crawling
opendatalab/MinerU
67,734 stars
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recognition, and formula extraction into a unified pipeline. It serializes all extracted features and spatial coordinates into a standardized format, ensuring that output remains consistent for downstream integration. To support verification, the tool includes a diagnostic suite that generates visual overlays, allowing users to inspect segmentation boundaries and reading order directly against the original source files. The software provides a comprehensive framework for automated data extraction, organizing parsed elements into a page-based structure suitable for large-scale information retrieval. It is distributed as a Python-based package, with documentation and installation instructions available in the repository.
This tool is designed for document parsing and layout analysis of PDFs and structured files rather than web scraping, making it a specialized document processing pipeline rather than a web-crawling framework.
Python, Structured Data Exporters
firecrawl/firecrawl-mcp-server
5,542 stars
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks — as callable MCP tools. It also provides LLM-guided structured extraction, allowing users to define a schema and have a language model parse unstructured web content into precise fields. Beyond scraping, the server supports live page interaction (clicking, typing, scrolling via natural language or code), web change monitoring with webhook notifications, and recursive crawling that discovers and indexes linked pages up to a configurable depth. The broader capability surface includes single and batch URL scraping with output in markdown, HTML, JSON, or screenshot format, parsing of non-HTML documents such as PDFs and Office files, web search that returns structured results, and site link mapping to reveal page structure. All of these are registered as MCP tools, enabling any compatible language model client to orchestrate web data collection and automation tasks through a unified interface. Setup requires installing the server (via npm or from source) and configuring it with a Firecrawl API key; the server then registers its tools with the MCP client, making each Firecrawl action available for use in prompts and agent workflows.
This is an MCP server designed to expose web scraping capabilities to LLMs rather than a standalone data extraction framework or library for developers to integrate into their own pipelines.
JavaScript, Headless Browser Orchestrators
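
Recursive crawling "up to a configurable depth", as mentioned above, is commonly implemented as a breadth-first traversal with a depth cut-off. The sketch below shows that general technique over an in-memory link graph; it is not Firecrawl's implementation, and `crawl`, `get_links`, and the `SITE` mapping are hypothetical names used only here.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Breadth-first crawl that follows links at most max_depth hops.

    get_links is any callable mapping a URL to the URLs it links to.
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not expand beyond the limit
        for link in get_links(url):
            if link not in seen:          # avoid revisiting pages and cycles
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# A made-up site structure standing in for real HTTP fetches.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": [],
    "/blog/post-1": [],
}

if __name__ == "__main__":
    print(crawl("/", lambda url: SITE.get(url, []), max_depth=1))
```

The `seen` set is what keeps the back-link from `/docs` to `/` from causing an infinite loop.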
