Crawlee

Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture.

The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a robust session-based fingerprint isolation system that manages unique browser contexts, TLS fingerprints, and proxy rotation to mimic human behavior and bypass anti-bot protections. These capabilities are supported by a persistent request queueing system that ensures crawl operations can survive process restarts and resume from their last state.
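The scaling decision described above can be sketched as a small pure function. This is an illustrative sketch only, not Crawlee's actual API: the names SystemSnapshot, ScalingOptions, and nextConcurrency, and the threshold and step values, are assumptions made for this example.

```typescript
// Hypothetical types and names for illustration; not Crawlee's actual API.
interface SystemSnapshot {
  cpuRatio: number; // fraction of CPU in use, 0..1
  memRatio: number; // fraction of memory in use, 0..1
}

interface ScalingOptions {
  minConcurrency: number;
  maxConcurrency: number;
  overloadThreshold: number; // above this ratio the host counts as overloaded
  step: number; // task slots added or removed per adjustment
}

// Scale down when either resource is over the threshold, otherwise scale up,
// and always clamp the result to the configured bounds.
function nextConcurrency(
  current: number,
  snapshot: SystemSnapshot,
  options: ScalingOptions,
): number {
  const overloaded =
    snapshot.cpuRatio > options.overloadThreshold ||
    snapshot.memRatio > options.overloadThreshold;
  const proposed = overloaded ? current - options.step : current + options.step;
  return Math.min(
    options.maxConcurrency,
    Math.max(options.minConcurrency, proposed),
  );
}

const scalingOptions: ScalingOptions = {
  minConcurrency: 1,
  maxConcurrency: 10,
  overloadThreshold: 0.8,
  step: 2,
};

// Idle host: concurrency grows from 4 to 6.
const scaledUp = nextConcurrency(4, { cpuRatio: 0.3, memRatio: 0.5 }, scalingOptions);
// CPU above the threshold: concurrency shrinks from 4 to 2.
const scaledDown = nextConcurrency(4, { cpuRatio: 0.95, memRatio: 0.5 }, scalingOptions);
```

A controller built this way would call the function on a timer with a fresh snapshot and start or hold back tasks to match the returned value.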

The framework offers a comprehensive suite of tools for the entire scraping lifecycle, including event-driven lifecycle hooks for custom logic, a middleware-based request pipeline for handling authentication and data transformation, and a pluggable storage backend interface that decouples data persistence from application logic. It supports advanced automation tasks such as AI-driven navigation, sitemap discovery, and multi-engine browser orchestration, while providing extensive observability through performance metrics, error snapshots, and configurable logging.
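As a rough illustration of a middleware-based request pipeline, the sketch below passes a request through an ordered list of functions, each returning a modified copy. The names PipelineRequest, RequestMiddleware, and runPipeline, and the placeholder token, are assumptions for this example and do not reflect Crawlee's own interfaces.

```typescript
// Hypothetical request shape for illustration; not Crawlee's actual API.
interface PipelineRequest {
  url: string;
  headers: Record<string, string>;
}

// Each middleware receives a request and returns a (possibly modified) copy.
type RequestMiddleware = (req: PipelineRequest) => PipelineRequest;

// Adds an authorization header; the token value is a placeholder.
const addAuth: RequestMiddleware = (req) => ({
  ...req,
  headers: { ...req.headers, authorization: "Bearer <token>" },
});

// Strips trailing slashes so equivalent URLs compare equal later on.
const normalizeUrl: RequestMiddleware = (req) => ({
  ...req,
  url: req.url.replace(/\/+$/, ""),
});

// Applies the middlewares in order, feeding each one the previous result.
function runPipeline(
  req: PipelineRequest,
  middlewares: RequestMiddleware[],
): PipelineRequest {
  return middlewares.reduce((current, middleware) => middleware(current), req);
}

const prepared = runPipeline(
  { url: "https://example.com/products/", headers: {} },
  [addAuth, normalizeUrl],
);
```

Because every step returns a new object, the original request is left untouched and steps can be reordered or removed independently.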

The project is implemented in TypeScript and provides a command-line interface for scaffolding, managing, and deploying scraping projects to cloud or serverless environments.

Features

Web Crawling - Provides a systematic framework for discovering, navigating, and extracting data from web pages at scale.

Web Scraping Frameworks - Provides a comprehensive framework for building scalable web crawlers that support both lightweight HTTP requests and headless browser automation.

Resource-Aware Scaling Controllers - Dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion.

Web Data Extraction - Automates the parsing and collection of structured data from websites into standardized formats.

Headless Browser Automation - Controls automated browser instances to render dynamic JavaScript content and interact with complex web interfaces.

Asynchronous Crawl Queues - Provides a persistent, asynchronous queueing system to manage and process large-scale web crawling tasks.

Distributed Crawling Engines - Manages large-scale data extraction tasks with automatic request queuing, proxy rotation, and persistent state management.

Browser Automation - Controls headless browsers to navigate pages, scroll, and interact with dynamic elements for data extraction.

Headless Rendering Engines - Executes client-side scripts using headless browsers to scrape dynamic content.

Browser Session Managers - Manages isolated browser sessions, cookies, and proxy rotation to maintain state across scraping requests.

Large-Scale Domain Crawlers - Builds and manages high-performance, distributed web crawlers that extract structured data from thousands of pages.

Autonomous Web Browsing Agents - Enables AI-driven interaction with web pages using natural language instructions instead of manual selectors.

Cloud Deployment Platforms - Publishes automation scripts to a managed platform to execute tasks remotely and scale data collection.

Durable Crawl Queues - Maintains a durable record of URLs to be crawled, allowing the process to resume or scale without losing progress.

Proxy and Fingerprint Rotation - Applies randomized browser fingerprints and proxy configurations by default to bypass anti-scraping protections and prevent IP blocking during data collection.

Anti-Bot Evasion - Provides specialized HTTP clients that mimic browser TLS fingerprints and headers to evade detection by security services like Cloudflare.

Browser Automation Interfaces - Provides a consistent interface for common browser operations across different automation engines.

Concurrent Crawling Engines - Dynamically scales concurrency and resource usage based on system health to maximize throughput.

Headless Browser Orchestrators - Orchestrates headless browser instances to render dynamic JavaScript and interact with web elements during scraping.

Browser Automation - Provides a unified interface for managing and scaling headless browser automation instances.

Crawling Optimization - Manages concurrency, request timeouts, browser types, and proxy settings to optimize performance and minimize the risk of being blocked by target servers.

Crawler Configuration Managers - Manages concurrency, retries, and browser impersonation to minimize blocking during web scraping.

Web Crawling Orchestrators - Orchestrates recursive crawling by processing URLs from queues to discover and visit linked pages automatically.

Web Scraping Engines - Integrates multiple scraping and browser automation tools through a unified interface.

Crawl Progress Persisters - Saves the progress of a URL list to storage automatically, allowing crawlers to resume interrupted tasks.

Content Extraction - Extracts structured data from HTML pages using CSS selectors to isolate specific content.

Web Content Scrapers - Extracts information from web pages and converts retrieved content into structured formats.

Data Pipeline Orchestration - Orchestrates modular, scalable workflows that discover, queue, process, and export web content into structured datasets.

Distributed Crawling Systems - Persists crawl progress to allow resuming interrupted jobs from the last processed state.

Persistent Storage Backends - Saves extracted information into structured formats and storage backends to ensure reliable data capture.

State Persistence - Maintains mutable state across crawler executions to track progress and share information.

Structured Data Extraction - Parses raw HTML or JSON responses using selectors to transform unstructured content into clean data.

AI-Driven Interaction Agents - Performs page actions and extracts structured data using AI-driven navigation without manual selector maintenance.

Parallel Execution Strategies - Scales concurrent task execution dynamically based on CPU, memory, and event loop health to maximize throughput.

Web Interaction Agents - Interprets page elements using AI to navigate complex interfaces and extract data like a human user.

Crawling - Saves the progress of a crawl to storage so that interrupted tasks can resume from the last processed URL.

Serverless Deployment - Packages scraping logic into isolated, cloud-ready units with managed infrastructure, storage, and proxy support.

Task Queues - Organizes URLs into dynamic queues to facilitate systematic site traversal and prevent duplicate processing.

Middleware-Based Request Pipelines - Processes network requests through a sequence of modular functions to handle authentication, transformation, and proxy rotation.

Crawl Queue Batchers - Adds individual or batched URLs to the crawl queue while automatically deduplicating requests.
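One way to picture batched adds with deduplication is a queue that derives a key from each URL and skips keys it has already seen. The following is a minimal in-memory sketch; DedupQueue, uniqueKey, and the normalization rule are assumptions for illustration, not Crawlee's implementation.

```typescript
// Hypothetical key rule: drop the fragment and any trailing slashes.
function uniqueKey(url: string): string {
  return url.replace(/#.*$/, "").replace(/\/+$/, "");
}

// Minimal in-memory queue that ignores URLs whose key was already added.
class DedupQueue {
  private readonly seenKeys = new Set<string>();
  private readonly pending: string[] = [];

  // Returns true if the URL was new and has been queued.
  add(url: string): boolean {
    const key = uniqueKey(url);
    if (this.seenKeys.has(key)) return false;
    this.seenKeys.add(key);
    this.pending.push(url);
    return true;
  }

  // Adds many URLs at once and reports how many were actually queued.
  addBatch(urls: string[]): number {
    let added = 0;
    for (const url of urls) {
      if (this.add(url)) added += 1;
    }
    return added;
  }

  // Removes and returns the oldest pending URL, if any.
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

const crawlQueue = new DedupQueue();
crawlQueue.add("https://example.com/a");
// Only "https://example.com/b" is new; the other two map to existing keys.
const addedCount = crawlQueue.addBatch([
  "https://example.com/a/",
  "https://example.com/b",
  "https://example.com/b#reviews",
]);
```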

Proxy Rotation Services - Distributes network traffic across multiple proxy servers to maintain connectivity and bypass rate limits during large-scale scraping.

Proxy Configurations - Routes all browser connections through a pool of proxies to bypass restrictions and distribute traffic across different IP addresses.

Multi-Instance Process Isolations - Supports running multiple isolated crawler instances with unique proxy and session configurations to prevent cross-request interference.

Anti-Abuse Systems - Implements advanced techniques like proxy rotation and fingerprinting to bypass security challenges and anti-scraping protections.

Circumvention Strategies - Detects and attempts to circumvent common anti-bot measures like rate limiting or challenge pages to ensure successful data extraction.

Stateful Session Persistence - Maintains browser context and authentication state across multi-step web interactions to ensure reliable scraping sessions.

Browser Impersonation - Masks HTTP requests with browser-specific fingerprints to bypass automated access protections and security challenges during web scraping.

Challenge Resolution - Detects and resolves common security challenges like Cloudflare to maintain uninterrupted access to protected web content.

Session & Cookie Handlers - Extracts and injects session cookies to maintain authentication state across multiple web scraping requests.

Rendering Strategy Automation - Switches dynamically between lightweight HTTP requests and full browser rendering based on page content to optimize speed and resource usage.

Queue Injection Utilities - Adds discovered URLs to a request queue for processing, supporting filtering by patterns or selectors.

Retry Policies - Implements automated retry policies to handle transient network or server failures during data extraction.

Performance & Resource Management - Dynamically adjusts active browser instances based on system capacity to prevent resource exhaustion.

Link Discovery Engines - Automatically identifies and adds links from pages to the crawl queue using pattern-based filtering.

Isolated Browser Contexts - Creates isolated browser contexts with unique cookies, proxies, and fingerprints to mimic human behavior and bypass anti-bot protections.

Browser Session Persistence - Stores cookies, local storage, and cache to maintain user state across multiple scraping sessions.

Custom Page Frameworks - Executes custom logic on each visited page to extract data and perform navigation tasks.

Request Handling - Tracks and manages the lifecycle of web requests to ensure all target pages are processed.

Request Interception Middleware - Provides middleware for intercepting and modifying network traffic during the crawling process.

Crawl Task Managers - Fetches pending URLs for execution, tracks completion status, and allows for the reclamation of failed tasks.

Browser Isolation Strategies - Creates isolated, ephemeral browser sessions to ensure clean states and prevent data leakage between scraping tasks.

Browser Lifecycle Managers - Orchestrates the launching, retirement, and teardown of browser instances for efficient resource management.

Crawler Health Monitoring - Logs runtime metrics like throughput and success rates to monitor the health of data extraction tasks.

JavaScript Crawling Frameworks - Provides a browser automation and scraping library for JavaScript and TypeScript projects.

Caching and Performance - Switches between lightweight requests and browser rendering to minimize resource consumption during data collection.

Data Exporters - Saves collected information into structured formats like JSON or CSV for external analysis.

Persistent Application State - Maintains and automatically saves data across crawler executions by storing values in a persistent key-value store.

Persistent Storage Management - Defines storage identifiers and persistence intervals to ensure scraped data is saved reliably.

Browser Impersonators - Mimics browser behavior and headers to reduce the likelihood of being blocked by anti-bot systems.

Request Retries - Marks crawl tasks as handled or failed to ensure reliable retries during subsequent processing cycles.

Cloud Deployment - Packages and uploads local automation scripts to a managed infrastructure for remote execution and scheduling.

Traffic Routing Proxies - Directs network traffic through intermediary proxy servers to manage connection paths and bypass geographic restrictions.

Fingerprint Configuration - Generates realistic browser headers, TLS fingerprints, and rendering characteristics to prevent detection by modern anti-bot security systems.

Fingerprint Randomization - Creates randomized browser fingerprints including headers, user agents, and screen resolutions to help automated scrapers mimic human behavior and avoid detection by anti-bot systems.

Stealth Navigation - Employs browser fingerprinting and stealth techniques to mimic human behavior and prevent detection by anti-scraping systems.

Cross-Browser Abstractions - Provides a consistent interface for managing multiple headless browser engines to enable seamless switching between rendering environments.

Crawler Lifecycle Hooks - Provides event-driven hooks to manage crawler state changes and lifecycle events.

Request Context Managers - Maintains state and metadata across the request lifecycle to facilitate navigation and data parsing.

Automated Retry Strategies - Forces automatic retries for failed requests to ensure data extraction succeeds despite transient errors.

Crawling Request Throttlers - Limits the number of concurrent tasks and requests per minute to ensure stable data collection and prevent server overloading.
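A requests-per-minute limit can be sketched with a sliding window over recent request timestamps. The class name RequestsPerMinuteLimiter and the explicit clock argument are assumptions chosen to keep this example deterministic; they are not taken from Crawlee.

```typescript
// Hypothetical sliding-window limiter; the caller supplies the current time
// in milliseconds so the behavior is easy to reason about and test.
class RequestsPerMinuteLimiter {
  private timestamps: number[] = [];
  private readonly maxPerMinute: number;

  constructor(maxPerMinute: number) {
    this.maxPerMinute = maxPerMinute;
  }

  // Returns true and records the request if fewer than maxPerMinute
  // requests were recorded in the last 60 seconds.
  tryAcquire(nowMs: number): boolean {
    const windowStart = nowMs - 60000;
    this.timestamps = this.timestamps.filter((t) => t > windowStart);
    if (this.timestamps.length >= this.maxPerMinute) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

const limiter = new RequestsPerMinuteLimiter(2);
const firstAllowed = limiter.tryAcquire(0); // true
const secondAllowed = limiter.tryAcquire(1000); // true
const thirdAllowed = limiter.tryAcquire(2000); // false, window is full
const laterAllowed = limiter.tryAcquire(60001); // true, the request at t=0 has expired
```

A crawler loop would call tryAcquire with the real clock before each request and wait briefly when it returns false.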

Error Snapshots - Saves a screenshot and the HTML content of a web page when an error occurs to assist with debugging.

Links - Adds discovered URLs to the crawl queue with support for pattern-based filtering.

Browser Automation Engines - Manages multiple browser engines through a unified interface to switch between rendering environments.

Browser Cookie Management - Retrieves and injects session cookies to maintain authentication states across automated scraping tasks.

Navigation Hooks - Enables custom logic execution before or after page navigation to handle anti-bot challenges or state modification.

Pagination Crawlers - Extracts links from web pages using selectors to traverse multi-page search results and site structures.

DOM Element Selectors - Selects and traverses elements within a document using CSS-style selectors to extract data or manipulate the DOM.

Rendering Strategies - Switches between plain network requests and headless browsers to improve load speeds.

Request Routing - Directs crawling requests to specific processing logic based on page type to handle multi-step workflows.

Adaptive Crawling Engines - Dynamically adjusts crawling strategies based on website structure and content requirements to improve navigation effectiveness.

State Persistence - Maintains persistent crawl state to allow continuous monitoring and task injection even after the initial queue is exhausted.

Crawler Identity Masking - Randomizes browser fingerprints and HTTP headers to simulate human behavior and bypass anti-bot detection mechanisms during web scraping sessions.

Collection Lifecycle Management - Provides utilities for opening, inspecting, and managing the lifecycle of data collections.

Data Persistence and Storage - Saves arbitrary data, files, or crawler states to local or cloud storage using unique keys.

Storage Adapters - Persists extracted tabular data and binary files to local disk or cloud storage backends through a unified interface.

Shared State Persisters - Maintains mutable data across multiple crawler executions by storing it in a persistent key-value store.

Key-Value Stores - Manages persistent key-value storage for configuration and state associated with crawling tasks.

Pluggable Storage Drivers - Decouples data persistence from application logic to allow swapping between local, memory, or cloud-based storage backends.

Task Result Storage - Saves extracted data to internal storage during execution and exports the final collection to standard file formats.

Lifecycle Event Hooks - Executes custom logic at specific stages of the browser and page lifecycle to manage initialization and cleanup.

Containerized Deployments - Includes pre-configured container settings to simplify the packaging and deployment of crawling tasks.

Execution Environment Configurations - Adjusts resource limits, logging verbosity, and browser automation settings through configuration objects to control how scraping tasks run.

Route Middleware - Executes registered functions sequentially before request handlers to perform logging or data transformation.

Fingerprint Injection - Injects realistic device signals and browser attributes into automated sessions to prevent detection by anti-bot systems that monitor for headless browser patterns.

Device Fingerprinting - Generates realistic browser headers and TLS fingerprints to mimic human behavior and evade detection by security services.

Request Limiters - Sets a hard limit on the number of pages processed during a crawl to prevent infinite loops and manage resource consumption.

Callback-Based Bypass Logic - Detects and solves automated bot protection challenges with configurable callbacks for custom detection logic and interaction behavior.

Session Authentication - Automates the retrieval and storage of verification headers or tokens from web pages to maintain authenticated state across subsequent API requests.

Browser Task Limiters - Scales the number of active browser pages based on available system resources to prevent memory exhaustion.

Execution Flow Control - Manages crawler execution flow by allowing graceful or immediate start, pause, and stop operations.

Overload Signal Handlers - Defines custom logic to report resource pressure and manage crawler concurrency based on system health.

Robots Policy Enforcers - Checks and adheres to website robots.txt files automatically to ensure compliance with site crawling policies.

Workflow Input Schemas - Creates structured interfaces for crawler configuration, allowing users to provide dynamic parameters like target URLs and limits at runtime.

Metric and Performance Monitors - Tracks and logs performance metrics and request status to provide visibility into crawl progress.

Session Health Monitors - Monitors session health by tracking usage and error scores to automatically retire blocked or unreliable sessions.
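The idea of retiring sessions by usage and error score can be sketched as follows. The class names, the scoring rule (plus 1 per failure, minus 0.5 per success), and the default limits are assumptions for illustration and may differ from how Crawlee scores sessions.

```typescript
// Hypothetical session with a usage counter and an error score.
class CrawlSession {
  readonly id: string;
  private readonly maxErrorScore: number;
  private readonly maxUsageCount: number;
  private errorScore = 0;
  private usageCount = 0;

  constructor(id: string, maxErrorScore = 3, maxUsageCount = 50) {
    this.id = id;
    this.maxErrorScore = maxErrorScore;
    this.maxUsageCount = maxUsageCount;
  }

  // A successful request counts as usage and slowly heals the error score.
  markGood(): void {
    this.usageCount += 1;
    this.errorScore = Math.max(0, this.errorScore - 0.5);
  }

  // A blocked or failed request counts as usage and raises the error score.
  markBad(): void {
    this.usageCount += 1;
    this.errorScore += 1;
  }

  isUsable(): boolean {
    return (
      this.errorScore < this.maxErrorScore &&
      this.usageCount < this.maxUsageCount
    );
  }
}

// Hypothetical pool that drops unusable sessions and creates replacements.
class CrawlSessionPool {
  private sessions: CrawlSession[] = [];
  private created = 0;

  getSession(): CrawlSession {
    this.sessions = this.sessions.filter((s) => s.isUsable());
    const existing = this.sessions[0];
    if (existing !== undefined) return existing;
    this.created += 1;
    const fresh = new CrawlSession(`session-${this.created}`);
    this.sessions.push(fresh);
    return fresh;
  }
}

const sessionPool = new CrawlSessionPool();
const firstSession = sessionPool.getSession();
firstSession.markBad();
firstSession.markBad();
firstSession.markBad(); // error score reaches the limit, session is retired
const replacement = sessionPool.getSession();
```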

Task Status Monitors - Provides real-time metrics on pending and handled requests to track the progress of crawling tasks.

Device and Network Emulators - Configures browser automation to mimic specific hardware profiles like desktop or mobile for accurate content rendering.

Browser Page Management - Spawns new pages within browser instances to handle concurrent web navigation tasks.

URL Pattern Matchers - Restricts crawling to specific URL patterns to ensure the crawler stays within defined domain boundaries.

Cross-Browser Execution Engines - Opens pages across multiple browser engines simultaneously to facilitate cross-browser testing or parallel data extraction.

Element Availability Synchronizers - Pauses execution until a specific element appears in the document to ensure content is fully loaded before extraction.

API Servers - Transforms scraping tasks into persistent server processes that expose extracted data via HTTP endpoints.

Remote Browser Infrastructure Management - Scales the number of parallel browser instances based on system resources to optimize performance.

Identity Customization - Sets custom user agent strings and persistent user data directories to mimic human browsing behavior and maintain state across multiple scraping sessions.

Crawl Request Metadata Trackers - Analyzes and summarizes failed requests to help identify and resolve issues with target websites.

Route Organization Patterns - Maps specific URL patterns or labels to dedicated handler functions for modular and maintainable data extraction.
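Mapping labels to dedicated handlers might look like the sketch below. LabelRouter, RouteContext, and the example labels are illustrative assumptions; handlers return a string here only so the dispatch result is easy to inspect.

```typescript
// Hypothetical context passed to each handler.
interface RouteContext {
  url: string;
  label?: string;
}

type RouteHandler = (ctx: RouteContext) => string;

// Dispatches a request to the handler registered for its label,
// falling back to a default handler when no label matches.
class LabelRouter {
  private readonly handlers = new Map<string, RouteHandler>();
  private defaultHandler: RouteHandler | undefined;

  addHandler(label: string, handler: RouteHandler): void {
    this.handlers.set(label, handler);
  }

  addDefaultHandler(handler: RouteHandler): void {
    this.defaultHandler = handler;
  }

  dispatch(ctx: RouteContext): string {
    const labeled =
      ctx.label !== undefined ? this.handlers.get(ctx.label) : undefined;
    const handler = labeled ?? this.defaultHandler;
    if (handler === undefined) {
      throw new Error(`No handler registered for ${ctx.url}`);
    }
    return handler(ctx);
  }
}

const router = new LabelRouter();
router.addHandler("DETAIL", (ctx) => `detail page: ${ctx.url}`);
router.addHandler("LISTING", (ctx) => `listing page: ${ctx.url}`);
router.addDefaultHandler((ctx) => `start page: ${ctx.url}`);

const detailResult = router.dispatch({ url: "https://example.com/p/1", label: "DETAIL" });
const defaultResult = router.dispatch({ url: "https://example.com/" });
```

Keeping one handler per page type lets each extraction step be read and changed on its own.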

Pattern-Matching Routers - Excludes specific URLs from the crawl queue by matching them against patterns and triggering custom skip logic.

Browser Environment Configurations - Configures browser initialization parameters including proxy routing and stealth headers for automated sessions.

Crawler Lifecycle Controllers - Provides programmatic control to shut down crawling processes immediately upon encountering critical failures.

Sitemap Generators - Parses website sitemaps to automatically discover and queue target pages for large-scale extraction.

Collection Iterators - Supports asynchronous iteration over large datasets to process records efficiently without memory exhaustion.

Sequential Iterators - Provides sequential iteration methods for processing stored records with mapping and reduction support.

Data Collections & Datasets - Organizes extracted information into structured collections that support separate storage for different data types.

Dataset Processors - Executes a custom function for every item in a collection, providing access to the data and its index.

Request Source Integrators - Integrates external data sources with internal queues to control how URLs are accessed and processed during a crawl.

Storage Backend Adapters - Provides a consistent interface for datasets and queues, allowing data to be stored in memory, local files, or databases.

Storage Lifecycle Management - Provides lifecycle management for data stores to maintain clean persistence for crawler runs.

Browser-Simulated Parsers - Simulates a browser environment using a lightweight DOM implementation to extract data from web pages.

Project Scaffolding and Configuration - Provides a command-line interface to initialize, scaffold, and execute crawling projects, simplifying the development workflow.

Targeting Utilities - Configures target URLs with custom HTTP methods, headers, and payloads for specific scraping tasks.

Task Scheduling - Configures automated execution intervals for scripts to perform periodic data collection.

Execution Flow Controls - Provides controls to start, pause, resume, or abort task processing during long-running scraping operations.

Queue State Configurations - Allows configuring whether to clear request history or resume from previous states.

Request Execution - Provides tools for configuring and executing network requests to fetch data from target URLs.

Request Locking Mechanisms - Prevents concurrent processing of the same request by locking it during execution to maintain data integrity.

Proxy Management - Routes traffic through specified proxy servers and isolates browser instances to improve anonymity.

Fingerprint Caching - Links specific browser fingerprints to individual sessions to ensure consistent identity across multiple requests and improve the reliability of automated scraping tasks.

Event-Driven Hooks - Executes custom user logic at specific stages of the crawling process, such as navigation or browser launch.

Service Configuration Management - Swaps core infrastructure components like storage clients or event managers to adapt the crawler to different execution environments.

Error Reporting - Captures and reports application-level runtime errors and stack traces during scraping operations.

Error Tracking - Aggregates and summarizes runtime errors to identify failure patterns during automated scraping tasks.

System Usage Monitoring - Monitors memory consumption and triggers alerts to prevent process crashes during data extraction.

Event Loop Latency Monitors - Tracks event loop latency to trigger overload signals and prevent system instability.

Page Lifecycle Trackers - Assigns unique identifiers to browser pages to monitor their state and retrieve specific instances during scraping tasks.

Performance Monitoring - Monitors browser instance resource usage to detect overload and trigger automated recovery.

Rate Limit Overload Monitors - Tracks HTTP 429 error frequency to trigger overload signals and manage request flow.

Resource Monitoring Tools - Captures system resource snapshots to identify potential overload states during automated data collection.

System Load Monitors - Tracks processor utilization and triggers signals to prevent performance degradation during high-load scraping.

Web Performance Monitoring - Records diagnostic traces of browser activity to monitor execution and debug performance issues.

Page Lifecycle Monitors - Executes callbacks when browser pages are created or closed to track activity and manage sessions.

Element Property Inspection - Retrieves or updates specific attributes, data properties, and input values from matched DOM elements to extract or modify page content.

Infinite Scroll Components - Automates repeated scrolling to the bottom of webpages to capture all dynamic content during a crawl.

Remote Data Fetching - Provides utilities for retrieving and parsing data from remote network resources.

Request Lifecycle Hooks - Monitors the progress of individual web requests through various stages to ensure reliable data extraction.

Robots Exclusion Compliance - Checks and adheres to site-specific crawling rules defined in robots.txt files to ensure ethical and compliant automated data collection.
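A robots.txt check can be approximated as shown below. This is a deliberately simplified sketch: it reads only Disallow rules from groups that apply to the wildcard user agent and uses plain prefix matching, ignoring Allow rules, wildcards, and agent-specific groups. The function names are assumptions for illustration.

```typescript
// Collects Disallow path prefixes from groups that apply to "User-agent: *".
// Simplification: Allow rules, wildcards, and specific agents are ignored.
function parseWildcardDisallows(robotsTxt: string): string[] {
  const rules: string[] = [];
  let applies = false; // whether the current group targets "*"
  let inAgentBlock = false; // true while reading consecutive User-agent lines

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();
    const separator = line.indexOf(":");
    if (separator === -1) continue;

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === "user-agent") {
      const matches = value === "*";
      // Consecutive User-agent lines belong to the same group.
      applies = inAgentBlock ? applies || matches : matches;
      inAgentBlock = true;
    } else {
      inAgentBlock = false;
      if (field === "disallow" && applies && value !== "") {
        rules.push(value);
      }
    }
  }
  return rules;
}

// A path is allowed when no Disallow prefix matches it.
function isPathAllowed(path: string, disallowRules: string[]): boolean {
  return !disallowRules.some((rule) => path.startsWith(rule));
}

const sampleRobots = [
  "User-agent: *",
  "Disallow: /admin",
  "Disallow: /cart # checkout pages",
  "",
  "User-agent: otherbot",
  "Disallow: /",
].join("\n");

const disallowRules = parseWildcardDisallows(sampleRobots);
const adminAllowed = isPathAllowed("/admin/users", disallowRules); // false
const productAllowed = isPathAllowed("/products/1", disallowRules); // true
```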

apify/crawlee


Open-source alternatives to Crawlee

apify/crawlee-python

omkarcloud/botasaurus

camel-ai/camel

andeya/pholcus
