Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture.
The project distinguishes itself through its resource-aware concurrency controller, which scales task execution up or down based on real-time CPU and memory usage so that a crawl does not exhaust the host machine. Its session-based fingerprint isolation system manages unique browser contexts, TLS fingerprints, and proxy rotation to mimic human behavior and bypass anti-bot protections. A persistent request queue lets crawl operations survive process restarts and resume from their last state.
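As a rough illustration of resource-aware scaling, the sketch below picks the next concurrency level from a single CPU and memory reading. It is a language-agnostic idea written in Python, not Crawlee's actual API: the names `LoadSnapshot`, `AutoscalePolicy`, `overload_threshold`, and `step_ratio`, along with their default values, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """One reading of host load; both values are fractions in [0, 1]."""
    cpu: float
    memory: float


@dataclass
class AutoscalePolicy:
    """Hypothetical policy for illustration; not Crawlee's configuration."""
    min_concurrency: int = 1
    max_concurrency: int = 50
    overload_threshold: float = 0.85  # above this, the host counts as overloaded
    step_ratio: float = 0.1           # fraction of current concurrency to add or remove

    def next_concurrency(self, current: int, load: LoadSnapshot) -> int:
        # Step by at least one task so small pools can still change size.
        step = max(1, round(current * self.step_ratio))
        # The busier of the two resources decides whether to back off.
        overloaded = max(load.cpu, load.memory) > self.overload_threshold
        desired = current - step if overloaded else current + step
        # Clamp the result to the configured bounds.
        return max(self.min_concurrency, min(self.max_concurrency, desired))
```

A caller would sample host load on a timer (for example with a system metrics library) and feed each reading into `next_concurrency`. Keeping the decision in a pure function makes the policy easy to test without touching real system metrics.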
The framework provides tools for the entire scraping lifecycle: event-driven hooks for custom logic, a middleware-based request pipeline for handling authentication and data transformation, and a pluggable storage backend interface that decouples data persistence from application logic. It also supports advanced automation tasks such as AI-driven navigation, sitemap discovery, and multi-engine browser orchestration, and provides observability through performance metrics, error snapshots, and configurable logging.
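A middleware-based request pipeline of the kind described above can be sketched as a chain of functions, each of which receives the request and a handle to the next step. This is a generic pattern shown in Python, not Crawlee's own interface: `build_pipeline`, `add_auth`, `normalize_url`, `fetch_stub`, and the placeholder token are all hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List

Request = Dict[str, object]
Handler = Callable[[Request], Request]
Middleware = Callable[[Request, Handler], Request]


def _bind(mw: Middleware, nxt: Handler) -> Handler:
    # Separate function so each closure captures its own `mw` and `nxt`.
    return lambda request: mw(request, nxt)


def build_pipeline(middlewares: List[Middleware], final: Handler) -> Handler:
    """Wrap `final` so the first middleware in the list runs outermost."""
    handler = final
    for mw in reversed(middlewares):
        handler = _bind(mw, handler)
    return handler


def add_auth(request: Request, nxt: Handler) -> Request:
    # Authentication step: attach a placeholder bearer token.
    headers = dict(request.get("headers", {}))
    headers["Authorization"] = "Bearer <token>"
    return nxt({**request, "headers": headers})


def normalize_url(request: Request, nxt: Handler) -> Request:
    # Transformation step: strip trailing slashes from the URL.
    return nxt({**request, "url": str(request["url"]).rstrip("/")})


def fetch_stub(request: Request) -> Request:
    # Stand-in for a real fetch; it only marks the request as handled.
    return {**request, "fetched": True}


pipeline = build_pipeline([add_auth, normalize_url], fetch_stub)
result = pipeline({"url": "https://example.com/"})
```

Each middleware returns a new dictionary instead of mutating its input, so one step cannot silently change what an earlier step saw.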
The project is implemented in TypeScript and provides a command-line interface for scaffolding, managing, and deploying scraping projects to cloud or serverless environments.
Crawlee-python is a web crawling framework for building scalable scrapers in Python. It extracts structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as those served by Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Botasaurus is a Python web scraping framework and headless browser automation system used to build scalable data extraction tools. It functions as a web data extraction tool and OCR document parser, converting website content, images, and PDF files into structured formats such as JSON, CSV, and Excel. The framework distinguishes itself by providing a scraper management interface that allows Python functions to be wrapped in a web-based UI or deployed as standalone desktop applications. This enables non-technical users to trigger extraction jobs and manage tasks via a graphical interface or RE
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The