# gocolly/colly

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/gocolly-colly).**

25,101 stars · 1,837 forks · Go · apache-2.0

## Links

- GitHub: https://github.com/gocolly/colly
- Homepage: https://go-colly.org/
- awesome-repositories: https://awesome-repositories.com/repository/gocolly-colly.md

## Topics

`crawler` `crawling` `framework` `go` `golang` `scraper` `scraping` `spider`

## Description

Colly is a fast web scraping framework for Go, built for automated extraction of structured data from websites. It handles the mechanics of large-scale collection, including concurrent request orchestration, automatic cookie handling, and optional robots.txt compliance. Collectors can run asynchronously, which keeps throughput high, while limits on parallelism and request rate guard against resource exhaustion during recursive or distributed crawls.
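
The idea of staying fast without exhausting resources can be pictured with a small standard-library sketch. It does not use Colly and is not Colly's implementation: `fetchAll`, `fetchFunc`, `httpFetch`, and `maxParallel` are hypothetical names, and the buffered-channel semaphore is just one common way to cap the number of requests in flight.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// fetchFunc retrieves one URL and returns the size of its body in bytes.
type fetchFunc func(url string) (int, error)

// fetchAll calls fetch for every URL concurrently while keeping at most
// maxParallel calls in flight. URLs whose fetch fails are left out of the result.
func fetchAll(urls []string, maxParallel int, fetch fetchFunc) map[string]int {
	if maxParallel < 1 {
		maxParallel = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, maxParallel) // counting semaphore
	sizes := make(map[string]int, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // wait for a free slot
			defer func() { <-sem }() // give the slot back

			n, err := fetch(u)
			if err != nil {
				return
			}
			mu.Lock()
			sizes[u] = n
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return sizes
}

// httpFetch is a fetchFunc backed by the default net/http client.
func httpFetch(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return len(body), err
}

func main() {
	// A local test server stands in for a real website.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "page %s", r.URL.Path)
	}))
	defer srv.Close()

	urls := []string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/c"}
	for u, n := range fetchAll(urls, 2, httpFetch) {
		fmt.Println(u, n, "bytes")
	}
}
```

Passing the fetch step as a function keeps the concurrency limit separate from the transport, so the same loop can be exercised with a stub instead of real network traffic.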

The framework has a modular, event-driven architecture: developers register callbacks that run at specific stages of a request's lifecycle to process content or control flow. A flexible middleware pipeline handles proxy rotation, user-agent selection, and rate limiting, and an interface-driven storage layer lets the default in-memory state be swapped for a persistent external database. This design makes it possible to coordinate multiple scraping instances and to keep crawl history across application restarts.

Network transport is customizable: custom round-trippers can be supplied to manage connection pooling and timeouts. For observability, custom debuggers and logging observers can be attached to monitor internal state during execution. Functionality can be extended further through pluggable extensions, or by sharing request context and configuration across collector instances to support multi-stage data extraction workflows.
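
As a rough illustration of a custom round-tripper, the sketch below uses only Go's standard library. `headerTransport`, `newTransport`, `roundTripFunc`, and `sentUserAgent` are hypothetical names, and the timeout and pool values are arbitrary. Colly's collector is understood to accept an `http.RoundTripper` through a `WithTransport` method; treat that as an assumption to confirm against the Colly configuration docs.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// headerTransport wraps another RoundTripper and sets a User-Agent on every request.
type headerTransport struct {
	base      http.RoundTripper
	userAgent string
}

func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper should not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set("User-Agent", t.userAgent)
	return t.base.RoundTrip(clone)
}

// newTransport returns a headerTransport over an http.Transport with
// explicit connection-pool and timeout settings (illustrative values).
func newTransport(userAgent string) http.RoundTripper {
	base := &http.Transport{
		DialContext:           (&net.Dialer{Timeout: 10 * time.Second}).DialContext,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   10,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 15 * time.Second,
	}
	return &headerTransport{base: base, userAgent: userAgent}
}

// roundTripFunc lets a plain function act as a RoundTripper, which is handy for stubs.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// sentUserAgent reports which User-Agent leaves headerTransport.
// A stub base transport stands in for the network, so nothing is dialed.
func sentUserAgent(userAgent string) (string, error) {
	var seen string
	stub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("User-Agent")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       http.NoBody,
			Request:    r,
		}, nil
	})
	client := &http.Client{Transport: &headerTransport{base: stub, userAgent: userAgent}}
	resp, err := client.Get("http://example.invalid/")
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	// For real traffic, hand the tuned transport to an http.Client.
	client := &http.Client{Transport: newTransport("example-bot/1.0"), Timeout: 30 * time.Second}
	fmt.Printf("client transport: %T\n", client.Transport)

	ua, err := sentUserAgent("example-bot/1.0")
	fmt.Println("user agent seen by stub:", ua, err)
}
```

Wrapping rather than replacing the base transport means pooling and timeouts stay in one place while per-request behaviour, such as headers, lives in the wrapper.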

## Tags

### Web Development

- [Web Scraping Engines](https://awesome-repositories.com/f/web-development/web-scraping-engines.md) — Extracts web content using a high-performance engine that manages concurrency, caching, and robots.txt compliance. ([source](https://github.com/gocolly/colly/blob/master/README.md))
- [Web Scraping Frameworks](https://awesome-repositories.com/f/web-development/web-scraping-frameworks.md) — Provides a programmable toolkit for extracting structured data through automated request handling and parsing workflows.
- [Concurrent Crawling Engines](https://awesome-repositories.com/f/web-development/concurrent-crawling-engines.md) — Manages high-performance asynchronous network requests and distributed state across parallel scraping tasks.
- [Web Data Extractors](https://awesome-repositories.com/f/web-development/web-data-extractors.md) — Automates the retrieval and parsing of structured information from websites to build datasets.
- [High-Volume Web Scrapers](https://awesome-repositories.com/f/web-development/high-volume-web-scrapers.md) — Manages large-scale data collection tasks that require proxy rotation, rate limiting, and persistent state.
- [Web Crawling Orchestrators](https://awesome-repositories.com/f/web-development/web-crawling-orchestrators.md) — Manages asynchronous network execution to maintain high throughput during large-scale data extraction.
- [Asynchronous Request Runners](https://awesome-repositories.com/f/web-development/asynchronous-request-runners.md) — Performs network requests in the background to keep the application responsive during large-scale extraction. ([source](https://go-colly.org/docs/best_practices/crawling/))
- [Distributed Crawler Orchestrators](https://awesome-repositories.com/f/web-development/distributed-crawler-orchestrators.md) — Coordinates multiple scraping instances across different environments by sharing state and configuration.
- [Event-Driven Data Extractors](https://awesome-repositories.com/f/web-development/event-driven-data-extractors.md) — Executes user-defined functions at specific lifecycle stages to process content and control network flow.
- [Request Middleware Pipelines](https://awesome-repositories.com/f/web-development/request-middleware-pipelines.md) — Processes outgoing requests through a chain of configurable components to handle proxy rotation and rate limiting.
- [Event-Driven Scraping Hooks](https://awesome-repositories.com/f/web-development/event-driven-scraping-hooks.md) — Executes user-defined functions at specific lifecycle stages of a network request to process data.
- [Collector Lifecycle Managers](https://awesome-repositories.com/f/web-development/collector-lifecycle-managers.md) — Creates a collector object to manage network communication and trigger registered callback functions. ([source](https://go-colly.org/docs/introduction/start/))
- [Crawler Extensions](https://awesome-repositories.com/f/web-development/crawler-extensions.md) — Provides pluggable extensions to automate user agent rotation and referrer management during web data collection. ([source](https://go-colly.org/docs/best_practices/extensions/))
- [Request Context Propagation](https://awesome-repositories.com/f/web-development/request-context-propagation.md) — Carries metadata and state across asynchronous network calls by embedding values within the request object.
- [Scraping Callback Registrars](https://awesome-repositories.com/f/web-development/scraping-callback-registrars.md) — Attaches callback functions to monitor request lifecycles and process retrieved content. ([source](https://go-colly.org/docs/introduction/start/))

### Networking & Communication

- [Proxy Rotation Services](https://awesome-repositories.com/f/networking-communication/proxy-rotation-services.md) — Distributes network traffic across multiple proxy servers to avoid IP-based blocking during data collection. ([source](https://go-colly.org/docs/best_practices/distributed/))
- [HTTP Transport Configurations](https://awesome-repositories.com/f/networking-communication/http-transport-configurations.md) — Customizes the networking layer to manage proxies, timeouts, and connection pooling for outgoing requests. ([source](https://go-colly.org/docs/introduction/configuration/))
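
Proxy rotation as described in the entries above can be sketched with the standard library alone. `roundRobinProxy` is a hypothetical helper and the proxy addresses are placeholders. The returned function has the same shape as the `Proxy` field of `http.Transport`; Colly is understood to accept a function of this shape through `SetProxyFunc` and to ship its own round-robin switcher, but both points are assumptions to check against the linked Colly docs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy returns a selector that cycles through the given proxy
// URLs, handing out the next one on every call.
func roundRobinProxy(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, fmt.Errorf("at least one proxy URL is required")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", raw, err)
		}
		proxies = append(proxies, u)
	}

	var next uint32
	return func(*http.Request) (*url.URL, error) {
		// AddUint32 returns the incremented value, so subtract 1 to start at index 0.
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

func main() {
	choose, err := roundRobinProxy("http://proxy-a.example:8080", "http://proxy-b.example:8080")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ {
		u, _ := choose(nil) // this selector ignores the request argument
		fmt.Println(u.Host)
	}

	// The selector plugs straight into the standard transport.
	client := &http.Client{Transport: &http.Transport{Proxy: choose}}
	_ = client // use client.Get(...) to send traffic through the rotating proxies
}
```

The atomic counter keeps the selector safe to call from many goroutines at once, which matters when requests are issued in parallel.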

### Software Engineering & Architecture

- [Extensible Scraping Frameworks](https://awesome-repositories.com/f/software-engineering-architecture/extensible-scraping-frameworks.md) — Provides a modular framework that allows developers to inject custom storage, proxy logic, and processing callbacks.

### Data & Databases

- [Distributed State Persistence](https://awesome-repositories.com/f/data-databases/distributed-state-persistence.md) — Maintains shared cookie and URL history state across multiple independent instances in distributed environments. ([source](https://go-colly.org/docs/best_practices/distributed/))
- [Storage Abstraction Layers](https://awesome-repositories.com/f/data-databases/storage-abstraction-layers.md) — Decouples state management from the core engine to allow swapping in-memory storage for persistent databases.
- [Persistent Application State](https://awesome-repositories.com/f/data-databases/persistent-application-state.md) — Maintains cookies and visited URL history across application restarts and long-running data extraction sessions. ([source](https://go-colly.org/docs/best_practices/crawling/))
- [Persistent Storage Backends](https://awesome-repositories.com/f/data-databases/persistent-storage-backends.md) — Swaps in-memory state management for persistent databases to maintain crawl history.
- [Storage Backend Adapters](https://awesome-repositories.com/f/data-databases/storage-backend-adapters.md) — Overrides default memory storage by assigning custom backends to manage persistent data. ([source](https://go-colly.org/docs/best_practices/storage/))
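
To make the idea of a persistent backend concrete, here is a minimal standard-library sketch that keeps visited request IDs in a JSON file. `fileStore` and `survivesRestart` are hypothetical names. The `Init`, `Visited`, and `IsVisited` methods are written from memory to resemble part of Colly's storage interface, which is understood to cover cookies as well; cookie handling is left out here, so this type is an illustration of the pattern rather than a drop-in backend.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// fileStore remembers visited request IDs in a JSON file so that crawl
// history survives a restart. Hypothetical type, not part of Colly.
type fileStore struct {
	path    string
	mu      sync.Mutex
	visited map[uint64]bool
}

// Init loads earlier state from disk, or starts empty on the first run.
func (s *fileStore) Init() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited = make(map[uint64]bool)

	data, err := os.ReadFile(s.path)
	if errors.Is(err, os.ErrNotExist) {
		return nil // nothing saved yet
	}
	if err != nil {
		return err
	}
	var ids []uint64
	if err := json.Unmarshal(data, &ids); err != nil {
		return err
	}
	for _, id := range ids {
		s.visited[id] = true
	}
	return nil
}

// Visited records a request ID and rewrites the file. Rewriting everything
// on each call is only acceptable in a sketch; a real backend would use a database.
func (s *fileStore) Visited(requestID uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited[requestID] = true

	ids := make([]uint64, 0, len(s.visited))
	for id := range s.visited {
		ids = append(ids, id)
	}
	data, err := json.Marshal(ids)
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, data, 0o600)
}

// IsVisited reports whether a request ID was recorded earlier.
func (s *fileStore) IsVisited(requestID uint64) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.visited[requestID], nil
}

// survivesRestart records a visit, reopens the same file with a fresh store
// (standing in for a new process), and reports whether the visit is still known.
func survivesRestart() (bool, error) {
	dir, err := os.MkdirTemp("", "crawl-state")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "visited.json")

	first := &fileStore{path: path}
	if err := first.Init(); err != nil {
		return false, err
	}
	if err := first.Visited(42); err != nil {
		return false, err
	}

	second := &fileStore{path: path}
	if err := second.Init(); err != nil {
		return false, err
	}
	return second.IsVisited(42)
}

func main() {
	ok, err := survivesRestart()
	fmt.Println("visit remembered after restart:", ok, err)
}
```

Because the crawler only talks to the method set, the file could later be replaced by a shared database without touching the crawl logic, which is what lets several instances share one history.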
