2 repositories
Tools for connecting external data sources to internal crawling queues.
Distinct from Data Sources: focuses on the integration logic for feeding URLs into a crawler rather than on the data source itself.
Explore 2 GitHub repositories matching data & databases · Request Source Integrators.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Integrates external data sources with internal queues to control how URLs are accessed and processed during a crawl.
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Integrates custom data sources to feed the list of URLs into the crawling queue.
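As a rough illustration of what "feeding URLs from a custom data source into a crawling queue" can look like, here is a minimal, library-independent sketch in plain Python. It is not the Crawlee API: the names `CrawlQueue`, `normalize`, and `feed` are hypothetical and exist only for this example. It assumes the external source can be exposed as any iterable of URL strings (lines of a file, rows from a database cursor, an API response).

```python
from collections import deque
from typing import Iterable, Optional
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    """Trim whitespace and drop the #fragment so near-identical URLs dedupe."""
    return urldefrag(url.strip()).url


class CrawlQueue:
    """Hypothetical in-memory FIFO queue that ignores URLs it has already seen."""

    def __init__(self) -> None:
        self._pending = deque()  # URLs waiting to be crawled, in arrival order
        self._seen = set()       # every URL ever accepted, for deduplication

    def add(self, url: str) -> bool:
        """Enqueue a URL; return True only if it was new and non-empty."""
        key = normalize(url)
        if not key or key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self) -> int:
        return len(self._pending)


def feed(queue: CrawlQueue, source: Iterable[str]) -> int:
    """Push every URL from an external source into the queue.

    Returns how many URLs were actually accepted (new and non-empty).
    """
    return sum(queue.add(url) for url in source)


if __name__ == "__main__":
    # Assumed example source: a plain list standing in for a file or database.
    external_source = [
        "https://example.com/a",
        "https://example.com/a#top",   # same page, differs only by fragment
        " https://example.com/b ",
    ]
    queue = CrawlQueue()
    print(feed(queue, external_source), "URLs queued")
    while (url := queue.pop()) is not None:
        print("crawl:", url)
```

Keeping the source as a plain iterable means the queue never needs to know where the URLs came from, which is the separation between data source and integration logic described above. A real framework would typically add persistence, retries, and concurrency control on top of this.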