Colly | Awesome Repository

Colly is a web scraping framework for Go, designed for the automated extraction of structured data from websites. It provides a programmable toolkit that handles the mechanics of large-scale data collection, including concurrent request orchestration, automatic cookie handling, and robots.txt compliance. With its asynchronous execution mode and configurable parallelism limits, the engine can sustain high throughput without exhausting resources during recursive or distributed crawls.
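The idea of keeping throughput high while capping resource use can be illustrated with a small standard-library sketch. This is not Colly's implementation or API: `fetchAll`, `demo`, and the `parallelism` parameter are hypothetical names chosen for illustration, and a local test server stands in for real websites.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchAll (hypothetical helper) downloads every URL concurrently but never
// runs more than `parallelism` requests at the same time. It returns the
// body size of each successfully fetched URL.
func fetchAll(client *http.Client, urls []string, parallelism int) map[string]int {
	sem := make(chan struct{}, parallelism) // counting semaphore
	var mu sync.Mutex                       // guards sizes
	var wg sync.WaitGroup
	sizes := make(map[string]int, len(urls))

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done

			resp, err := client.Get(u)
			if err != nil {
				return // a real crawler would report or retry here
			}
			defer resp.Body.Close()

			body, err := io.ReadAll(resp.Body)
			if err != nil {
				return
			}
			mu.Lock()
			sizes[u] = len(body)
			mu.Unlock()
		}(u)
	}
	wg.Wait() // block until every request has finished
	return sizes
}

// demo runs fetchAll against a local test server so that the sketch works
// without network access.
func demo() map[string]int {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "page %s", r.URL.Path)
	}))
	defer srv.Close()

	urls := []string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/c"}
	return fetchAll(srv.Client(), urls, 2)
}

func main() {
	fmt.Println("fetched", len(demo()), "pages")
}
```

The buffered channel acts as a semaphore: goroutines are cheap to start, but only two requests are in flight at once, which is the same trade-off a parallelism limit in a crawler is meant to express.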

The framework is distinguished by its modular, event-driven architecture, which allows developers to hook into specific lifecycle stages of a network request to process content or control flow. It features a flexible middleware pipeline for handling proxy rotation, user agents, and rate limiting, alongside an interface-driven storage layer that supports swapping default in-memory state for persistent external databases. This design enables the coordination of multiple scraping instances and the maintenance of crawl history across application restarts.
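The event-driven design described above can be sketched with the standard library alone. Colly's collector exposes callbacks such as `OnRequest`, `OnResponse`, and `OnError`; the sketch below borrows those names, but the `Fetcher` type, its fields, and the `demo` helper are hypothetical and make no claim about how Colly is implemented.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Fetcher is a hypothetical type (not Colly's API) that runs registered
// callbacks at fixed points in the lifecycle of a request.
type Fetcher struct {
	client     *http.Client
	onRequest  []func(req *http.Request)
	onResponse []func(status int, body []byte)
	onError    []func(err error)
}

func (f *Fetcher) OnRequest(fn func(req *http.Request)) { f.onRequest = append(f.onRequest, fn) }
func (f *Fetcher) OnResponse(fn func(status int, body []byte)) {
	f.onResponse = append(f.onResponse, fn)
}
func (f *Fetcher) OnError(fn func(err error)) { f.onError = append(f.onError, fn) }

// fail notifies every error callback.
func (f *Fetcher) fail(err error) {
	for _, fn := range f.onError {
		fn(err)
	}
}

// Visit performs one GET request and fires the hooks in order:
// request hooks before sending, then response hooks or error hooks.
func (f *Fetcher) Visit(target string) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		f.fail(err)
		return
	}
	for _, fn := range f.onRequest {
		fn(req) // hooks may mutate the request, e.g. set headers
	}
	resp, err := f.client.Do(req)
	if err != nil {
		f.fail(err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		f.fail(err)
		return
	}
	for _, fn := range f.onResponse {
		fn(resp.StatusCode, body)
	}
}

// demo wires hooks to a local test server that echoes the User-Agent header
// and returns the sequence of events the hooks recorded.
func demo() []string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent())
	}))
	defer srv.Close()

	var events []string
	f := &Fetcher{client: srv.Client()}
	f.OnRequest(func(req *http.Request) {
		req.Header.Set("User-Agent", "sketch-bot/0.1") // hypothetical agent string
		events = append(events, "request")
	})
	f.OnResponse(func(status int, body []byte) {
		events = append(events, fmt.Sprintf("response %d %s", status, body))
	})
	f.OnError(func(err error) { events = append(events, "error") })

	f.Visit(srv.URL)
	return events
}

func main() {
	fmt.Println(demo())
}
```

Because the request hook runs before the request is sent, it is a natural place for the kind of middleware the text mentions, such as setting a user agent or choosing a proxy, while the response hook is where parsing would happen.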

Beyond its core engine, the project offers extensive customization options for network transport, including support for custom round-trippers to manage connection pooling and timeouts. It also provides robust observability tools, allowing for the attachment of custom debuggers and logging observers to monitor internal state during execution. Developers can further extend functionality through a plugin system or by sharing request context and configuration across different collector instances to support complex, multi-stage data extraction workflows.
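A custom round-tripper is a standard `net/http` concept, so the transport customization mentioned above can be sketched without any scraping library. The example wraps an `http.Transport` that carries connection-pool and timeout settings and counts the requests passing through it. The `countingTransport` type, the `demo` helper, and the specific limits are assumptions made for illustration; in Colly, a transport like this would be handed to a collector through its `WithTransport` method, which takes an `http.RoundTripper`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// countingTransport (hypothetical) wraps another RoundTripper and counts
// every request that passes through it.
type countingTransport struct {
	next  http.RoundTripper
	count int64
}

// RoundTrip satisfies http.RoundTripper: record the request, then delegate.
func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	atomic.AddInt64(&t.count, 1)
	return t.next.RoundTrip(req)
}

// demo sends three requests to a local test server through the wrapped
// transport and returns the request count and the last status code seen.
func demo() (int64, int) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	// Illustrative pool and timeout settings, not recommended values.
	base := &http.Transport{
		MaxIdleConns:          10,
		MaxIdleConnsPerHost:   2,
		IdleConnTimeout:       30 * time.Second,
		ResponseHeaderTimeout: 5 * time.Second,
	}
	defer base.CloseIdleConnections()

	ct := &countingTransport{next: base}
	client := &http.Client{Transport: ct, Timeout: 10 * time.Second}

	lastStatus := 0
	for i := 0; i < 3; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return atomic.LoadInt64(&ct.count), 0
		}
		resp.Body.Close()
		lastStatus = resp.StatusCode
	}
	return atomic.LoadInt64(&ct.count), lastStatus
}

func main() {
	n, status := demo()
	fmt.Println("requests:", n, "last status:", status)
}
```

Wrapping rather than replacing the base transport keeps pooling and timeouts in one place while leaving room for extra behavior such as logging, metrics, or retries in the outer layer.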

Features

Web Scraping Engines - Extracts web content using a high-performance engine that manages concurrency, caching, and robots.txt compliance.
Web Scraping Frameworks - Provides a programmable toolkit for extracting structured data through automated request handling and parsing workflows.
Concurrent Crawling Engines - Manages high-performance asynchronous network requests and distributed state across parallel scraping tasks.
Web Data Extractors - Automates the retrieval and parsing of structured information from websites to build datasets.
