node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It combines a rate-limited HTTP client with a headless HTML parser, so large sets of URLs can be visited asynchronously, and it deduplicates tasks so that the same URL is not processed twice.
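To make the deduplication idea concrete, here is a minimal sketch of a queue that skips URLs it has already seen. It is an illustration of the general technique, not node-crawler's own implementation: the `DedupQueue` class, its `normalize` rule (dropping the fragment), and the example URLs are all assumptions made for this example.

```javascript
// Hypothetical helper, not part of node-crawler's API.
class DedupQueue {
  constructor() {
    this.seen = new Set(); // normalized URLs that were already queued
    this.pending = [];     // URLs still waiting to be fetched
  }

  // Normalize so that trivially different spellings share one key.
  // Assumption: fragments are ignored because they are never sent to the server.
  static normalize(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = '';
    return url.href; // the URL parser also lowercases the host
  }

  // Returns true if the URL was queued, false if it was a duplicate.
  add(rawUrl) {
    const key = DedupQueue.normalize(rawUrl);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }
}

const queue = new DedupQueue();
queue.add('https://example.com/a');
queue.add('https://example.com/a#section'); // duplicate after normalization
queue.add('https://EXAMPLE.com/b');
console.log(queue.pending); // two entries remain
```

A `Set` keeps the membership check constant-time, which matters once the crawl covers a large number of URLs. A real crawler may want a stricter or looser normalization rule than the one assumed here.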
The project stands out for its proxy rotation manager, which cycles through user agents and proxy servers to work around access restrictions. It also supports HTTP/2, which can improve request performance and server compatibility during large-scale scraping.
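Rotation of this kind can be sketched as a simple round-robin over two lists. The example below is only an illustration of the idea and does not reflect how node-crawler implements or configures rotation. The `makeRotator` and `buildRequestOptions` helpers, the proxy addresses, and the user agent strings are all hypothetical.

```javascript
// Hypothetical rotation helper; names and values are illustrative only.
function makeRotator(items) {
  if (items.length === 0) throw new Error('need at least one item');
  let index = 0;
  // Each call returns the next item and wraps around at the end of the list.
  return () => {
    const item = items[index];
    index = (index + 1) % items.length;
    return item;
  };
}

const nextProxy = makeRotator([
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080',
]);
const nextUserAgent = makeRotator([
  'agent-one/1.0',
  'agent-two/1.0',
  'agent-three/1.0',
]);

// Attach the next proxy and user agent to a request description.
function buildRequestOptions(url) {
  return {
    url,
    proxy: nextProxy(),
    headers: { 'User-Agent': nextUserAgent() },
  };
}

const picked = [1, 2, 3].map((i) =>
  buildRequestOptions(`https://example.com/page/${i}`)
);
console.log(picked.map((p) => p.proxy));
```

Because the two lists have different lengths, the proxy and user agent pairings shift over time instead of repeating in lockstep.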
The remaining capabilities fall into two groups. Traffic management covers independent rate limiting and automatic request retries. Content processing covers XML and HTML parsing via CSS selectors, binary file downloads, and normalization of character encodings to UTF-8.
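As an illustration of encoding normalization, the sketch below decodes a response body using the charset declared in its `Content-Type` header and re-encodes it as UTF-8. It relies only on the standard `TextDecoder` and `TextEncoder` globals and is not taken from node-crawler. The `charsetFromContentType` and `toUtf8` helpers are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption.

```javascript
// Hypothetical helper: read the charset from a Content-Type header value.
// Assumption: fall back to UTF-8 when no charset is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Decode raw bytes with the declared charset, then re-encode them as UTF-8.
// TextDecoder throws a RangeError if the charset label is not supported.
function toUtf8(bytes, contentType) {
  const label = charsetFromContentType(contentType);
  const text = new TextDecoder(label).decode(bytes);
  return new TextEncoder().encode(text);
}

// "café" in ISO-8859-1: the é is the single byte 0xE9.
const latin1Body = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
const utf8Body = toUtf8(latin1Body, 'text/html; charset=ISO-8859-1');

// In UTF-8 the é takes two bytes (0xC3 0xA9), so the body grows from 4 to 5 bytes.
console.log(utf8Body.length, new TextDecoder('utf-8').decode(utf8Body));
```

Real pages sometimes declare the charset only in a `<meta>` tag or declare it incorrectly, so a production crawler would likely need detection logic beyond this header check.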