Crawlee | Awesome Repository

Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture.

The project distinguishes itself through its resource-aware concurrency controller, which scales task execution up or down based on real-time CPU and memory usage so that the host machine is not exhausted. It also features a session-based fingerprint isolation system that manages separate browser contexts, TLS fingerprints, and proxy rotation to mimic human behavior and bypass anti-bot protections. A persistent request queue backs these capabilities, so crawl operations can survive process restarts and resume from their last state.
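
The resource-aware scaling idea described above can be sketched as a small pure function. This is a hypothetical illustration, not Crawlee's actual controller: the names `ResourceSnapshot`, `ScalingLimits`, and `nextConcurrency`, the threshold values, and the "halve on overload, add one otherwise" policy are all assumptions chosen for the example.

```typescript
// Hypothetical sketch of resource-aware concurrency scaling.
// None of these names come from Crawlee; they are illustrative only.

interface ResourceSnapshot {
  cpuRatio: number; // fraction of CPU in use, 0..1
  memRatio: number; // fraction of memory in use, 0..1
}

interface ScalingLimits {
  minConcurrency: number;
  maxConcurrency: number;
  maxCpuRatio: number; // above this, the host counts as overloaded
  maxMemRatio: number; // above this, the host counts as overloaded
}

// Decide how many tasks may run in parallel on the next tick.
function nextConcurrency(
  current: number,
  snapshot: ResourceSnapshot,
  limits: ScalingLimits,
): number {
  const overloaded =
    snapshot.cpuRatio > limits.maxCpuRatio ||
    snapshot.memRatio > limits.maxMemRatio;

  // Assumed policy: back off quickly when overloaded, probe upward slowly otherwise.
  const proposed = overloaded ? Math.floor(current / 2) : current + 1;

  // Clamp the result to the configured bounds.
  return Math.min(
    limits.maxConcurrency,
    Math.max(limits.minConcurrency, proposed),
  );
}

// Example limits (assumed values).
const limits: ScalingLimits = {
  minConcurrency: 1,
  maxConcurrency: 10,
  maxCpuRatio: 0.9,
  maxMemRatio: 0.8,
};
```

With these limits, `nextConcurrency(4, { cpuRatio: 0.3, memRatio: 0.4 }, limits)` returns 5, while `nextConcurrency(8, { cpuRatio: 0.95, memRatio: 0.4 }, limits)` returns 4. A real controller would sample the host periodically and feed each snapshot into a decision step of this kind.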

The framework covers the full scraping workflow: event-driven lifecycle hooks for custom logic, a middleware-based request pipeline for handling authentication and data transformation, and a pluggable storage backend interface that decouples data persistence from application logic. It supports advanced automation tasks such as AI-driven navigation, sitemap discovery, and multi-engine browser orchestration, and it provides observability through performance metrics, error snapshots, and configurable logging.

The project is implemented in TypeScript and provides a command-line interface for scaffolding, managing, and deploying scraping projects to cloud or serverless environments.

Features

Web Crawling - Provides a systematic framework for discovering, navigating, and extracting data from web pages at scale.
Web Scraping Frameworks - Provides a comprehensive framework for building scalable web crawlers that support both lightweight HTTP requests and headless browser automation.
Resource-Aware Scaling Controllers - Dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion.
Web Data Extraction - Automates the parsing and collection of structured data from websites into standardized formats.
