Crawlee Python

Crawlee Python is a web crawling framework for building scalable scrapers in Python. It automates web scraping and extracts structured data from websites using either lightweight HTTP requests or headless browser automation.

The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as those served by Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extraction, reducing the need for manual selector maintenance.
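To make the idea of tiered proxy rotation concrete, here is a minimal standard-library sketch. It is not Crawlee's API: the `TieredProxyRotator` class, the escalation threshold, and the proxy URLs are all hypothetical and chosen only for illustration. It rotates round-robin within the current tier and moves to the next tier after a run of consecutive failures.

```python
from dataclasses import dataclass


@dataclass
class TieredProxyRotator:
    """Hypothetical helper (not a Crawlee class): rotates proxies within a
    tier and escalates to the next tier after repeated errors."""

    tiers: list[list[str]]        # tiers[0] is the cheapest tier
    max_errors_per_tier: int = 2  # assumed escalation threshold
    _tier: int = 0
    _errors: int = 0
    _cursor: int = 0

    def next_proxy(self) -> str:
        """Return the next proxy URL from the current tier, round-robin."""
        tier = self.tiers[self._tier]
        proxy = tier[self._cursor % len(tier)]
        self._cursor += 1
        return proxy

    def report(self, ok: bool) -> None:
        """Record a request outcome; escalate after consecutive errors."""
        if ok:
            self._errors = 0
            return
        self._errors += 1
        has_higher_tier = self._tier < len(self.tiers) - 1
        if self._errors >= self.max_errors_per_tier and has_higher_tier:
            self._tier += 1
            self._errors = 0
            self._cursor = 0

    @property
    def current_tier(self) -> int:
        return self._tier


# Placeholder proxy URLs, for illustration only.
rotator = TieredProxyRotator(
    tiers=[
        ["http://cheap-1.example:8000", "http://cheap-2.example:8000"],
        ["http://premium-1.example:8000"],
    ]
)
first = rotator.next_proxy()
rotator.report(ok=False)
rotator.report(ok=False)  # second consecutive error triggers escalation
escalated = rotator.next_proxy()
```

A real implementation would probably also step back down to cheaper tiers once errors subside; this sketch only escalates.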

The system covers a broad range of capability areas, including headless browser orchestration, recursive crawling workflows, and persistent request queue management. It features automated data extraction using CSS selectors, adaptive concurrency scaling based on system load, and a unified storage interface for managing datasets and key-value stores. Monitoring and observability are handled through resource health tracking, error snapshot capture, and OpenTelemetry-compatible metrics.
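Adaptive concurrency scaling can be pictured as a small feedback rule: raise the number of parallel tasks while the host has headroom, and lower it when CPU or memory is close to saturation. The sketch below assumes a hypothetical `next_concurrency` function, a 0.9 overload threshold, and a step of 1; none of these names or values come from Crawlee itself.

```python
def next_concurrency(
    current: int,
    cpu_load: float,
    mem_load: float,
    *,
    min_concurrency: int = 1,
    max_concurrency: int = 20,
    overload_threshold: float = 0.9,  # assumed value, not from the library
    step: int = 1,
) -> int:
    """Return the concurrency level to use for the next interval.

    Loads are fractions in [0, 1]. The host counts as overloaded when
    either CPU or memory reaches the threshold.
    """
    overloaded = max(cpu_load, mem_load) >= overload_threshold
    proposed = current - step if overloaded else current + step
    # Clamp to the configured bounds.
    return max(min_concurrency, min(max_concurrency, proposed))


# Simulated (cpu, memory) samples; real code would read them from the OS.
level = 4
history = []
for cpu, mem in [(0.35, 0.40), (0.50, 0.45), (0.95, 0.60), (0.97, 0.92)]:
    level = next_concurrency(level, cpu, mem)
    history.append(level)
```

With these samples the level climbs while the load is low and backs off once either reading crosses the threshold.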

Users can accelerate project setup via a command-line interface for bootstrapping and deploy their crawlers using Docker or cloud environments.

Features

Structured Data Extraction - Identifies and collects specific information from page elements using CSS selectors to create structured datasets.
URL Crawl Queues - Manages a dynamic queue of URLs to discover and process new links during recursive website crawling.
Browser Impersonation - Generates randomized browser metadata and headers to mimic real users and bypass anti-bot systems.
Web Scraping and Automation - Provides a comprehensive system for automating browser interactions and crawling web content at scale.
HTML and XML Parsing - Provides libraries for extracting and processing structured data from HTML and XML markup documents.
HTML Parsing - Parses HTML structures using CSS selectors to identify and capture specific text or elements.
JavaScript Rendering - Uses headless browsers to execute client-side JavaScript and render dynamic content before data extraction.
Automated Data Extraction - Parses HTML and XML content using CSS selectors to convert raw web pages into structured digital formats.
Content Extraction - Retrieves HTML content and combined text from matched elements and their descendants for structured data collection.
Data Extraction - Saves scraped information into specified datasets in machine-readable formats for further analysis.
Atomic Duplicate Prevention - Uses atomic locks to ensure each unique URL is processed only once, even across parallel instances.
Shared State Persisters - Implements a persistent key-value store to maintain shared application state across multiple crawler executions.
Persistent State Management - Persists the internal state of a crawl to a storage backend to allow the process to resume after crashes.
Unified Storage Interfaces - Provides a consistent interface to persist datasets, key-value stores, and queues across memory, files, or SQL.
URL Filtering Strategies - Filters discovered URLs from the crawl queue based on protocol, domain, hostname, or origin.
Headless Browser Automation - Controls headless browser engines to interact with dynamic JavaScript content and perform complex user actions.
Crawl Depth Limiters - Restricts the number of recursive hops during web traversal to prevent infinite loops and manage resources.
Crawl Strategy Management - Controls whether the crawler follows a breadth-first or depth-first strategy to discover and visit pages.
Durable Crawl Queues - Provides a durable storage backend for request queues to ensure crawl progress survives process restarts.
DOM Traversers - Implements algorithms for navigating the live document object model to find descendants, ancestors, or siblings.
HTTP Request Configurations - Allows precise definition of target URLs, HTTP methods, headers, and payloads for automation.
HTTP Request Orchestrators - Executes standalone HTTP requests to external APIs independently of the browser navigation process.
HTTP Response Processors - Captures and stores the result of network requests, including status, headers, and body.
Batch Request Enqueueing - Adds single or multiple web requests to the processing queue with support for batch additions.
Crawl Request Enqueueing - Provides a mechanism to queue discovered URLs and request objects for subsequent processing during a crawl.
Request Processing Logic - Retrieves pending requests for processing and manages their state as handled or available for retry.
Proxy Rotation Services - Implements tiered proxy rotation to automatically replace blocked addresses and maintain connection stability.
Proxy and Fingerprint Rotation - Implements automated rotation of proxies and browser fingerprints to bypass anti-bot detection systems.
Proxy Routing - Provides automated rotation of requests through proxy servers to prevent IP-based rate limits and blacklisting.
Anti-Bot Evasion - Bypasses anti-bot protections by rotating proxies and mimicking browser fingerprints to avoid detection.
Browser Fingerprint Generators - Generates unique browser identity signatures to mimic real user behavior and avoid bot detection.
Concurrent Task Limiters - Manages system resource consumption by autoscaling simultaneous requests and browser instances.
Crawl Logic Orchestration - Uses a router and middleware system to manage complex navigation paths and keep scraping logic organized.
URL Request Tracking - Tracks which URLs have been successfully handled to prevent redundant requests and manage retries.
Browser Tab Concurrency Scaling - Optimizes speed by adjusting the number of open browser tabs based on available CPU and memory.
Dynamic Concurrency Tuning - Dynamically adjusts the number of simultaneous requests to prevent memory errors based on system resources.
API Request Configurations - Defines the method, URL, headers, and timeout behavior for outgoing HTTP requests.
Browser Automation - Programmatically controls headless browsers to visit URLs and interact with web pages.
Browser Session Persistence - Rotates and persists user-like browser sessions to bypass bot detection and security challenges.
Pagination Crawlers - Identifies and queues subsequent page links from response data to enable continuous data collection.
DOM Element Selectors - Provides utilities for targeting specific page elements using CSS selectors for data extraction.
Headless Rendering Engines - Uses a headless rendering engine to process web pages and extract data via plain HTTP requests.
Headless Browser Orchestrators - Manages the lifecycle and pooling of multiple headless browser instances to execute tasks at scale.
Web Crawling Frameworks - Manages the discovery and traversal of website links through persistent request queues and recursive crawling strategies.
Web Scraping - Parses HTML and extracts structured data using CSS selectors and a jQuery-like interface.
Extraction Element Filters - Filters matched elements using selectors and predicate functions to refine the set of extracted data.
Automatic Page Metadata Extraction - Collects structured metadata, statistics, and transcripts from web pages for downstream analysis.
Document Rendering - Outputs the current state of the document as HTML, XML, or plain text for data extraction.
Data Exporters - Exports stored scraping results from internal memory into machine-readable files such as JSON.
Tabular Data Exports - Saves extracted information into tabular CSV files for easy analysis.
Data Processing - Iterates over stored dataset entries to execute transformation functions on scraped information.
Request Source Integrators - Integrates custom data sources to feed the list of URLs into the crawling queue.
Dataset Iterators - Provides async iterators, paginated lists, and reduction functions to fetch stored records from datasets.
Functional Data Aggregation - Reduces collections of dataset entries into a single accumulated result using custom reduction functions.
Bulk Dataset Export - Saves the entire contents of a dataset into a single file in CSV or JSON format.
Key-Value Stores - Saves and retrieves data records or files using unique keys and MIME types on a local disk or cloud.
Append-Only Dataset Storage - Saves structured objects and arrays into an append-only store on the local disk or in the cloud.
URL Pattern Detectors - Uses regular expression patterns to identify and filter which URLs on a page should be followed.
Infinite Scrolling - Triggers continuous page scrolling to load dynamic content that only appears as the user moves down the page.
Storage Configuration - Allows configuration of unique identifiers, names, and storage backends for persisting crawled data.
Record Transformers - Processes raw scraped data through user-defined functions to clean, format, or restructure record content.
AI-Driven Interaction Agents - Uses natural language instructions and AI models to perform complex browser interactions and autonomous workflows.
Conditional Crawl Termination - Stops the crawling process immediately when a specific condition or target data point is found.
Crawl Scope Management - Caps the total number of pages visited, maximum recursive depth, and requests per minute.
Retry Suppression Policies - Signals the crawler to halt retries for non-recoverable fatal errors to optimize processing.
Cloud Deployment - Enables pushing local scraping code to remote platforms to run tasks in hosted cloud environments.
Docker Container Deployments - Provides pre-configured Docker setups to streamline the deployment of crawlers across different environments.
Route Middleware - Executes a sequence of middleware functions to perform shared setup or preprocessing before requests reach the final route handler.
Tiered Proxy Rotation - Rotates between different quality levels of proxies and automatically escalates to higher tiers when errors increase.
Multi-Instance Process Isolations - Runs separate crawler instances with isolated configurations to prevent interference and ensure stability.
Interactive Challenge Resolvers - Provides automated interaction logic to resolve Cloudflare checkboxes and security challenges.
HTTP Session Persisters - Saves and restores session cookies and custom metadata to a key-value store for persistence.
Crawler Lifecycle Hooks - Tracks lifecycle transitions and system notifications to trigger custom logic during the scraping process.
Failure Handling Policies - Implements automatic retry logic for failed requests based on configurable limits.
Rate Limiting - Limits the number of simultaneous requests to external servers to prevent overloading the target.
Concurrency Adjusters - Dynamically adjusts the number of simultaneous requests based on real-time CPU and memory load.
Request State Transfers - Transfers state and metadata between requests to track sequences or rankings.
Request Routing - Maps incoming URLs to specific handler functions using labels to organize extraction logic by page type.
Request Rate Limiting - Implements client-side pacing to limit the frequency of outgoing HTTP requests and avoid server overload.
State Persistence - Implements state values that automatically persist across restarts or sessions via simple assignment.
Crawler Operational Statistics - Logs and saves request metrics to a persistent key-value store for operational analysis.
Metric and Performance Monitors - Integrates with OpenTelemetry to collect standardized traces and performance metrics for requests.
Crawl - Collects request counts, durations, and failure rates to analyze the reliability of the crawl process.
Performance & Resource Management - Tracks CPU load and hardware resource consumption to trigger system throttling during overloads.
Overload Detectors - Creates specialized indicators for health checks, such as proxy stability, to trigger system-wide throttling.
Session Health Monitors - Tracks session error rates and usage counts to detect when a session is blocked or expired.
Element Iteration Utilities - Executes a custom function for each matched element in a set to perform repetitive extraction operations.
Browser Page Management - Manages the creation and provisioning of isolated browser pages and contexts.
Page Lifecycle Monitors - Tracks the creation and closure of browser pages using unique identifiers to manage page retrieval.
Dynamic Class Management - Includes utilities for conditionally adding, removing, or toggling CSS classes to modify element styling or state.
Inline Style Manipulations - Provides capabilities to directly modify an element's style object to change its appearance at runtime.
API Servers - Transforms a scraping process into a persistent server that listens for requests and returns real-time data.
Browser Automation Engines - Allows the selection of specific headless browser engines like Chrome, Firefox, Safari, or Edge for rendering.
Navigation Hooks - Implements asynchronous hooks before and after page navigation to modify browser state and verify page loads.
Element Node Wrapping - Allows users to surround elements with specific DOM structures or remove parent wrappers from matched elements.
DOM Manipulation - Provides methods for dynamically updating the structure and content of web pages during the scraping process.
Element Attributes - Implements a system for getting, setting, and removing HTML and data attributes on selected DOM elements.
External API Integrations - Extracts structured data by connecting directly to official service APIs instead of parsing HTML.
HTTP Cookie Managers - Automates the lifecycle of HTTP cookies to maintain session state across headless browser requests.
Client Switching Strategies - Toggles between browser automation and HTTP clients within the same project to optimize for speed and complexity.
Request-Browser Toggles - Toggles between lightweight HTTP requests and full headless browser rendering based on page requirements.
Label-Based - Maps specific request labels to dedicated handler functions to ensure pages are processed by the appropriate logic.
Robots Exclusion Compliance - Automatically checks and obeys robots.txt exclusion standards for ethical crawling.
Sitemap Crawlers - Traverses XML sitemaps to discover and index URLs for systematic data extraction.
Sitemap Discovery - Locates sitemap files by analyzing robots.txt and checking standard directory paths.
AI-Driven Extraction - Leverages artificial intelligence to extract structured data from websites, reducing the need for manual selector maintenance.
Browser Lifecycle Managers - Executes custom asynchronous logic via hooks during the startup, shutdown, and creation stages of browser processes.
Browser Automation - Integrates multiple browser automation engines within a unified pool for flexible web scraping.
Crawler Health Monitoring - Tracks client performance and error rates to detect when a crawler is overloaded.
Crawler Configuration Managers - Manages crawler behavior, concurrency, and resource usage through a hierarchy of configuration files and environment variables.
Data Extraction And Generation - Library for web scraping and browser automation.
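
Several of the features above (label-based request routing, duplicate prevention in the request queue, and crawl depth limits) fit together in one small loop. The following is a standard-library sketch over a hypothetical in-memory site, not Crawlee's actual API: the `SITE` mapping, the `route` decorator, and the `crawl` function are assumptions made for illustration.

```python
from collections import deque
from typing import Callable

# Hypothetical in-memory "site": URL -> (label, outgoing links).
SITE = {
    "https://example.com/": ("LIST", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("DETAIL", ["https://example.com/", "https://example.com/a/deep"]),
    "https://example.com/b": ("DETAIL", ["https://example.com/a"]),
    "https://example.com/a/deep": ("DETAIL", []),
}

Handler = Callable[[str, list], None]
handlers: dict[str, Handler] = {}
dataset: list[dict] = []  # stands in for an append-only dataset


def route(label: str):
    """Register a handler function for requests carrying this label."""
    def register(fn: Handler) -> Handler:
        handlers[label] = fn
        return fn
    return register


@route("LIST")
def handle_list(url: str, links: list) -> None:
    dataset.append({"url": url, "type": "list", "links": len(links)})


@route("DETAIL")
def handle_detail(url: str, links: list) -> None:
    dataset.append({"url": url, "type": "detail"})


def crawl(start_url: str, max_depth: int) -> list[str]:
    """Breadth-first crawl with duplicate prevention and a depth limit."""
    queue = deque([(start_url, 0)])
    seen = {start_url}  # each unique URL is enqueued only once
    visited = []
    while queue:
        url, depth = queue.popleft()
        label, links = SITE[url]
        handlers[label](url, links)  # dispatch by label
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


visited = crawl("https://example.com/", max_depth=1)
```

With `max_depth=1` the crawl processes the start page and its two direct links, and never reaches `https://example.com/a/deep`. The `seen` set keeps the back-link to the start page from being queued a second time.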

apify/crawlee

24,002 · View on GitHub

Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob

mendableai/firecrawl

139,399 · View on GitHub

Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi

lorien/web-scraping

7,931 · View on GitHub

This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us

NanmiCoder/CrawlerTutorial

4,262 · View on GitHub

CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
