# bda-research/node-crawler

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/bda-research-node-crawler).**

6,785 stars · 877 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/bda-research/node-crawler
- awesome-repositories: https://awesome-repositories.com/repository/bda-research-node-crawler.md

## Topics

`cheerio` `crawler` `extract-data` `javascript` `jquery` `nodejs` `spider`

## Description

node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It combines a rate-limited HTTP client with a headless HTML parser, so it can visit large sets of URLs asynchronously and deduplicate tasks so that no URL is processed twice.
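
The deduplicating queue described above can be sketched in a few lines of TypeScript. This is an illustration of the general technique, not node-crawler's implementation: `DedupQueue` and `normalizeUrl` are hypothetical names, and treating URLs that differ only by fragment as duplicates is an assumption made for the example.

```typescript
// Illustrative sketch only: a FIFO URL queue that rejects duplicates.
// `DedupQueue` and `normalizeUrl` are hypothetical names, not node-crawler APIs.

/** Canonicalise a URL so trivially different spellings compare equal. */
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = ""; // assumption: fragments are not sent to the server, so ignore them
  return url.href;
}

class DedupQueue {
  private readonly seen = new Set<string>();
  private readonly pending: string[] = [];

  /** Returns true if the URL was queued, false if it had been seen before. */
  add(raw: string): boolean {
    const key = normalizeUrl(raw);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  /** Removes and returns the oldest pending URL, or undefined when empty. */
  next(): string | undefined {
    return this.pending.shift();
  }

  /** Number of URLs still waiting to be fetched. */
  get size(): number {
    return this.pending.length;
  }
}

const queue = new DedupQueue();
queue.add("https://example.com/a");
queue.add("https://example.com/a#section"); // rejected: same URL after normalisation
queue.add("https://example.com/b");
```

Keeping the `seen` set separate from the `pending` list means a URL stays deduplicated even after it has been dequeued and fetched.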

The project stands out for its proxy rotation manager, which cycles through user agents and proxy servers to work around access restrictions. It also supports HTTP/2, which improves request performance and server compatibility during large-scale scraping.
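
As a rough illustration of rotation, the sketch below cycles proxies and user agents in round-robin order. It assumes the simplest possible policy; the page does not say which policy node-crawler uses. `RoundRobin`, `nextIdentity`, the proxy addresses, and the user agent strings are all placeholders invented for the example.

```typescript
// Illustrative sketch only: round-robin rotation over proxies and user agents.
// All names and values here are hypothetical placeholders.

class RoundRobin<T> {
  private index = 0;
  private readonly items: readonly T[];

  constructor(items: readonly T[]) {
    if (items.length === 0) throw new Error("RoundRobin needs at least one item");
    this.items = items;
  }

  /** Returns the current item and advances, wrapping around at the end. */
  next(): T {
    const item = this.items[this.index];
    this.index = (this.index + 1) % this.items.length;
    return item;
  }
}

interface RequestIdentity {
  proxy: string;
  userAgent: string;
}

const proxies = new RoundRobin(["http://proxy-a.invalid:8080", "http://proxy-b.invalid:8080"]);
const userAgents = new RoundRobin(["ExampleBot/1.0", "ExampleBot/2.0", "ExampleBot/3.0"]);

/** Picks the proxy and user agent to attach to the next outgoing request. */
function nextIdentity(): RequestIdentity {
  return { proxy: proxies.next(), userAgent: userAgents.next() };
}

const identities = [nextIdentity(), nextIdentity(), nextIdentity()];
```

Because the two pools have different lengths, the proxy and user agent pairings drift over time instead of repeating the same fixed combinations.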

Other capabilities include independent rate limits per target, automatic retries of failed requests, XML and HTML parsing via CSS selectors, binary file downloads, and normalization of character encodings to UTF-8.

## Tags

### Web Development

- [Web Crawling](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-crawling.md) — Queues and visits large sets of URLs asynchronously while managing request retries and preventing duplicate processing.
- [Web Crawlers](https://awesome-repositories.com/f/web-development/web-crawlers.md) — Provides a programmable Node.js framework for managing request queues and automating data extraction.
- [Duplicate Prevention](https://awesome-repositories.com/f/web-development/task-execution-engines/crawl-task-managers/duplicate-prevention.md) — Prevents the same URL or task from entering the queue multiple times to avoid redundant processing. ([source](https://github.com/bda-research/node-crawler/blob/master/README.md))

### Part of an Awesome List

- [HTML Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-parsing.md) — Extracts data from HTML responses using a server-side DOM implementation and CSS-style selectors. ([source](https://github.com/bda-research/node-crawler#readme))
- [XML Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-and-xml-parsing/xml-parsing.md) — Processes page content by recognizing MIME types and using DOM manipulation to extract data from XML. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
- [JavaScript Crawling Frameworks](https://awesome-repositories.com/f/awesome-lists/devtools/javascript-crawling-frameworks.md) — Simple API-driven crawler for Node.js.
- [Web Scraping](https://awesome-repositories.com/f/awesome-lists/devtools/web-scraping.md) — Web crawler with jQuery-like parsing.

### Data & Databases

- [Web Data Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction.md) — Implements programmatic scraping and processing of web content to extract structured data.

### DevOps & Infrastructure

- [Asynchronous Crawl Queues](https://awesome-repositories.com/f/devops-infrastructure/scheduling/asynchronous-crawl-queues.md) — Manages asynchronous crawl queues for long-running data extraction jobs. ([source](https://github.com/bda-research/node-crawler#readme))
- [Request Retries](https://awesome-repositories.com/f/devops-infrastructure/api-service-management/api-resilience/request-retries.md) — Automatically attempts to refetch failed pages to increase the success rate of data collection. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
- [JSON Response Parsers](https://awesome-repositories.com/f/devops-infrastructure/response-parsing-utilities/json-response-parsers.md) — Includes utilities to treat response bodies as JSON and disable HTML parsing for simplified API data extraction. ([source](https://github.com/bda-research/node-crawler#readme))

### Graphics & Multimedia

- [HTML Parsers](https://awesome-repositories.com/f/graphics-multimedia/media-production-suites/media-management-production/media-management-systems/data-parsing-conversion/html-parsers.md) — Includes a headless parser that converts HTML responses into a DOM structure for data extraction.

### Networking & Communication

- [High Performance Scraping](https://awesome-repositories.com/f/networking-communication/high-performance-scraping.md) — Uses HTTP/2 and concurrent connection management to collect data quickly while respecting target server load limits.
- [Multi-Protocol Handlers](https://awesome-repositories.com/f/networking-communication/http-2-protocol-implementations/multi-protocol-handlers.md) — Supports both HTTP/1.1 and HTTP/2 protocols to optimize connection performance and ensure server compatibility.
- [HTTP/2 Support](https://awesome-repositories.com/f/networking-communication/http-2-support.md) — Implements HTTP/2 support to enhance request performance and ensure compatibility with modern servers during large-scale scraping. ([source](https://github.com/bda-research/node-crawler/blob/master/README.md))
- [Proxy Rotation Services](https://awesome-repositories.com/f/networking-communication/proxy-rotation-services.md) — Distributes network traffic across a pool of proxy servers to bypass rate limits. ([source](https://github.com/bda-research/node-crawler/blob/master/README.md))
- [Proxy and Fingerprint Rotation](https://awesome-repositories.com/f/networking-communication/proxy-rotation-services/proxy-and-fingerprint-rotation.md) — Distributes outgoing network traffic across a pool of proxy servers and user agents to bypass access restrictions.
- [User Agent Rotation](https://awesome-repositories.com/f/networking-communication/user-agent-rotation.md) — Cycles through different user agent strings to mimic various browser environments and avoid detection. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
- [Pre-Request Logic Hooks](https://awesome-repositories.com/f/networking-communication/communication-protocols-architectures/request-processing-architectures/request-execution/pre-request-logic-hooks.md) — Executes custom synchronous or asynchronous functions before each queued request to modify options or prepare state. ([source](https://github.com/bda-research/node-crawler#readme))
- [Crawlers](https://awesome-repositories.com/f/networking-communication/http-2-protocol-implementations/crawlers.md) — Utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping.
- [MIME-Aware Content Parsers](https://awesome-repositories.com/f/networking-communication/mime-aware-content-parsers.md) — Detects response types to switch between raw binary downloads, JSON extraction, or DOM-based HTML and XML processing.
- [Remote File Downloads](https://awesome-repositories.com/f/networking-communication/remote-file-downloads.md) — Retrieves raw response bodies without string conversion to save images, PDFs, and other non-text assets. ([source](https://github.com/bda-research/node-crawler/blob/master/README.md))
- [Request Header Configuration](https://awesome-repositories.com/f/networking-communication/request-header-configuration.md) — Allows the configuration of custom HTTP headers globally or per request to manage client identification. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
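
MIME-aware dispatch, as described in the list above, amounts to mapping a `Content-Type` header to a processing path. The sketch below is one possible mapping under stated assumptions: `pickHandler` and the handler labels are hypothetical, and the extra `"text"` branch for plain text is an addition for completeness rather than something the page describes.

```typescript
// Illustrative sketch only: choose a processing path from a Content-Type header.
// `pickHandler` and the handler labels are hypothetical, not node-crawler APIs.

type BodyHandler = "json" | "dom" | "text" | "binary";

function pickHandler(contentType: string | undefined): BodyHandler {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mime = (contentType ?? "").split(";")[0].trim().toLowerCase();

  if (mime === "application/json" || mime.endsWith("+json")) return "json";
  if (mime === "text/html") return "dom";
  if (mime === "text/xml" || mime === "application/xml" || mime.endsWith("+xml")) return "dom";
  if (mime.startsWith("text/")) return "text";

  // Images, PDFs, archives, and unknown types are kept as raw bytes.
  return "binary";
}

const handlers = [
  pickHandler("application/json; charset=utf-8"), // "json"
  pickHandler("text/html"),                       // "dom"
  pickHandler("application/rss+xml"),             // "dom"
  pickHandler("image/png"),                       // "binary"
];
```

Defaulting to `"binary"` for missing or unrecognised types avoids corrupting non-text assets through an unwanted string conversion.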

### Software Engineering & Architecture

- [Asynchronous Task Queues](https://awesome-repositories.com/f/software-engineering-architecture/asynchronous-task-queues.md) — Manages a list of URLs for asynchronous retrieval while preventing duplicate entries through task deduplication.
- [Request Rate Limiting](https://awesome-repositories.com/f/software-engineering-architecture/traffic-management/request-rate-limiting.md) — Limits the frequency of outgoing requests and concurrent connections to prevent server overload. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
- [ID-Based Rate Limiting](https://awesome-repositories.com/f/software-engineering-architecture/traffic-management/request-rate-limiting/id-based-rate-limiting.md) — Regulates request frequency and concurrency by grouping outgoing traffic into independent buckets mapped to specific target IDs.
- [Crawler Lifecycle Hooks](https://awesome-repositories.com/f/software-engineering-architecture/application-lifecycle-management/lifecycle-event-systems/crawler-lifecycle-hooks.md) — Provides event-driven hooks to trigger custom logic during task scheduling, request dispatch, and queue drainage.
- [URL Request Tracking](https://awesome-repositories.com/f/software-engineering-architecture/execution-tracking-caches/url-request-tracking.md) — Tracks visited URLs in a local registry to prevent redundant network requests. ([source](https://github.com/bda-research/node-crawler/blob/master/CHANGELOG.md))
- [Target-Based](https://awesome-repositories.com/f/software-engineering-architecture/request-throttling/rate-limiting/target-based.md) — Groups requests by ID so that different targets or proxies maintain distinct rate limits and connection caps. ([source](https://github.com/bda-research/node-crawler/blob/master/README.md))
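
ID-based rate limiting, where each target keeps its own bucket, can be sketched as a map from ID to the next free time slot. This is a simplified model that assumes a fixed minimum interval per ID and ignores connection caps; `KeyedRateLimiter` and `reserve` are hypothetical names, not node-crawler APIs.

```typescript
// Illustrative sketch only: independent rate limits keyed by a target ID.
// `KeyedRateLimiter` is a hypothetical name, not a node-crawler API.

class KeyedRateLimiter {
  private readonly nextFreeAt = new Map<string, number>();
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  /**
   * Reserves the next slot for `id` and returns how long (in ms) the caller
   * should wait before sending. `now` is passed in so the clock is explicit.
   */
  reserve(id: string, now: number): number {
    const freeAt = this.nextFreeAt.get(id) ?? now;
    const startAt = Math.max(freeAt, now);
    this.nextFreeAt.set(id, startAt + this.minIntervalMs);
    return startAt - now;
  }
}

const limiter = new KeyedRateLimiter(1000);
const waits = [
  limiter.reserve("site-a", 0),   // 0: first request to site-a goes out at once
  limiter.reserve("site-a", 0),   // 1000: queued behind the first one
  limiter.reserve("site-b", 0),   // 0: site-b has its own bucket
  limiter.reserve("site-a", 500), // 1500: the next free site-a slot is at t=2000
];
```

In a real crawler the caller would pass `Date.now()` and sleep for the returned delay before dispatching the request.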

### System Administration & Monitoring

- [Rate-Limited Clients](https://awesome-repositories.com/f/system-administration-monitoring/rate-limited-clients.md) — Ships a request client with built-in concurrency controls and per-target rate limiting.

### User Interface & Experience

- [Encoding Normalizers](https://awesome-repositories.com/f/user-interface-experience/character-encoding-support/encoding-normalizers.md) — Analyzes response headers and meta tags to normalize diverse character sets into a standard UTF-8 format.
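
Encoding normalization of the kind described above can be sketched with the standard `TextDecoder` API. The example assumes the header takes priority over a `<meta>` tag and that UTF-8 is the fallback; both are assumptions, as are the names `detectCharset` and `decodeBody`. It also relies on the runtime supporting the detected charset label.

```typescript
// Illustrative sketch only: detect a charset and decode the body to a string.
// `detectCharset` and `decodeBody` are hypothetical names, not node-crawler APIs.

/** Looks for a charset in the Content-Type header, then in a <meta> tag. */
function detectCharset(contentType: string | undefined, htmlHead: string): string {
  const fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType ?? "");
  if (fromHeader) return fromHeader[1].toLowerCase();

  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(htmlHead);
  if (fromMeta) return fromMeta[1].toLowerCase();

  return "utf-8"; // assumption: fall back to UTF-8 when nothing is declared
}

/** Decodes raw bytes using the detected charset, falling back to UTF-8. */
function decodeBody(body: Uint8Array, contentType?: string): string {
  // Read the first bytes one-to-one as characters so the <meta> tag can be sniffed.
  const head = Array.from(body.subarray(0, 1024), (byte) => String.fromCharCode(byte)).join("");
  const charset = detectCharset(contentType, head);
  try {
    return new TextDecoder(charset).decode(body);
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognise.
    return new TextDecoder("utf-8").decode(body);
  }
}

// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
const text = decodeBody(latin1Bytes, "text/html; charset=ISO-8859-1");
```

The decoded string can be re-encoded as UTF-8 bytes with `new TextEncoder().encode(text)` if a byte representation is needed downstream.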
