What are the best GitHub repositories for JavaScript crawling frameworks?

Node.js libraries for web scraping, browser automation, and crawling. This page lists 16 GitHub repositories from the JavaScript Crawling Frameworks section of an awesome list. Top picks: apify/crawlee, bda-research/node-crawler, lapwinglabs/x-ray, projectdiscovery/naabu, yujiosaka/headless-chrome-crawler, hakluke/hakrawler, GerbenJavado/LinkFinder, rchipka/node-osmosis, IonicaBizau/scrape-it, ruipgil/scraperjs.

Why is apify/crawlee a recommended repository for JavaScript crawling frameworks?

Reliable browser automation and scraping library.

Why is bda-research/node-crawler a recommended repository for JavaScript crawling frameworks?

Simple API-driven crawler for Node.js.

Why is lapwinglabs/x-ray a recommended repository for JavaScript crawling frameworks?

Web scraper with pagination and crawler support.

Why is projectdiscovery/naabu a recommended repository for JavaScript crawling frameworks?

Port scanner and Go library that probes hosts for open ports using SYN, CONNECT, and UDP methods.

Why is yujiosaka/headless-chrome-crawler a recommended repository for JavaScript crawling frameworks?

Headless Chrome crawler with jQuery support.

Why is hakluke/hakrawler a recommended repository for JavaScript crawling frameworks?

Extracts JavaScript file locations from web pages to find potential endpoints or hidden functionality.

Why is GerbenJavado/LinkFinder a recommended repository for JavaScript crawling frameworks?

Extracts URLs and routes from JavaScript code using regular expressions to uncover hidden API endpoints.

Why is rchipka/node-osmosis a recommended repository for JavaScript crawling frameworks?

HTML/XML parser and scraper for Node.js.

Why is IonicaBizau/scrape-it a recommended repository for JavaScript crawling frameworks?

Human-friendly scraper for Node.js.

Why is ruipgil/scraperjs a recommended repository for JavaScript crawling frameworks?

Versatile web scraper for Node.js.


apify/crawlee
Stars: 24,002
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob…
Reliable browser automation and scraping library.
Language: TypeScript. Topics: apify, automation, crawler.
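The resource-aware scaling mentioned in the Crawlee description can be illustrated with a small pure function. This is a minimal sketch under stated assumptions, not Crawlee's actual implementation: the name `nextConcurrency`, the snapshot shape, and the thresholds are all hypothetical.

```javascript
// Hypothetical sketch of resource-aware concurrency scaling.
// Not Crawlee's API: names and thresholds are illustrative assumptions.

/**
 * Decide the next concurrency level from a resource snapshot.
 * @param {number} current - concurrency currently in use
 * @param {{cpu: number, mem: number}} usage - usage ratios in [0, 1]
 * @param {{min: number, max: number, limit: number}} opts - bounds and overload threshold
 * @returns {number} concurrency to use for the next interval
 */
function nextConcurrency(current, usage, opts = { min: 1, max: 32, limit: 0.8 }) {
  const overloaded = usage.cpu > opts.limit || usage.mem > opts.limit;
  // Back off quickly when overloaded, grow by one task otherwise.
  const proposed = overloaded ? Math.floor(current / 2) : current + 1;
  // Clamp the result to the configured bounds.
  return Math.min(opts.max, Math.max(opts.min, proposed));
}

// In Node.js the snapshot could be derived from the built-in `os` module,
// for example mem = 1 - os.freemem() / os.totalmem().
console.log(nextConcurrency(8, { cpu: 0.95, mem: 0.4 })); // 4
console.log(nextConcurrency(8, { cpu: 0.3, mem: 0.4 }));  // 9
```

Halving on overload and growing by one otherwise keeps the sketch conservative: it sheds load faster than it adds it.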
bda-research/node-crawler
Stars: 6,785
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst…
Simple API-driven crawler for Node.js.
Language: TypeScript. Topics: cheerio, crawler, extract-data.
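The task deduplication described for node-crawler can be sketched with a set of normalized URLs. This is an illustrative example only, not node-crawler's API: the `DedupQueue` class and its methods are hypothetical, and the normalization rules (drop the fragment, sort query parameters) are assumptions.

```javascript
// Hypothetical sketch of URL deduplication for a crawl queue.
// Not node-crawler's API: class and method names are assumptions.

class DedupQueue {
  constructor() {
    this.seen = new Set(); // normalized URLs already queued
    this.pending = [];     // URLs waiting to be fetched, in FIFO order
  }

  // Normalize so trivially different spellings map to the same key.
  static key(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = '';           // fragments are never sent to the server
    url.searchParams.sort(); // make query parameter order irrelevant
    return url.toString();
  }

  // Returns true if the URL was new and has been queued.
  add(rawUrl) {
    const key = DedupQueue.key(rawUrl);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  // Returns the next URL to fetch, or undefined when the queue is empty.
  next() {
    return this.pending.shift();
  }
}

const queue = new DedupQueue();
queue.add('https://example.com/list?b=2&a=1');
queue.add('https://example.com/list?a=1&b=2#top'); // duplicate, ignored
console.log(queue.pending.length); // 1
```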
lapwinglabs/x-ray
Stars: 5,904
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-like selectors. The project implements a headless browser crawler that can execute JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination detection to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate control to regulate outgoing network calls. Extracted results are processed through stream-based data persistence to handle large datasets without overloading system memory.
Web scraper with pagination and crawler support.
Language: JavaScript
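A breadth-first crawling strategy of the kind the x-ray description mentions can be sketched over an in-memory link graph. This is not x-ray's API: `crawlBreadthFirst`, the `getLinks` callback, and the `site` object are hypothetical stand-ins so the example runs without network access.

```javascript
// Hypothetical sketch of breadth-first crawling with a depth limit.
// Not x-ray's API: `crawlBreadthFirst` and `getLinks` are assumptions.

/**
 * Visit pages level by level, starting at `start`.
 * @param {string} start - first URL to visit
 * @param {(url: string) => string[]} getLinks - returns the links found on a page
 * @param {number} maxDepth - how many link hops to follow from the start page
 * @returns {string[]} URLs in visiting order
 */
function crawlBreadthFirst(start, getLinks, maxDepth) {
  const visited = new Set([start]);
  const order = [];
  let frontier = [start];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const nextFrontier = [];
    for (const url of frontier) {
      order.push(url);
      for (const link of getLinks(url)) {
        // Queue each page once, at the shallowest depth it is seen.
        if (!visited.has(link)) {
          visited.add(link);
          nextFrontier.push(link);
        }
      }
    }
    frontier = nextFrontier;
  }
  return order;
}

// In-memory stand-in for a paginated site.
const site = {
  '/': ['/page/1', '/about'],
  '/page/1': ['/page/2', '/'],
  '/page/2': ['/page/3'],
  '/about': [],
};
const visitOrder = crawlBreadthFirst('/', (url) => site[url] ?? [], 1);
console.log(visitOrder); // [ '/', '/page/1', '/about' ]
```

Raising `maxDepth` lets the same loop follow the "next page" links further, which is one simple way to traverse multi-page result sets.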
projectdiscovery/naabu
Stars: 5,766
Naabu is a port scanner library and tool that probes hosts for open ports using SYN, CONNECT, and UDP methods to identify active services. It functions as a Go library for embedding port scanning into programs, and as a standalone tool that accepts targets as hostnames, IP addresses, CIDR ranges, or ASN numbers. The tool discovers live hosts before scanning, filters ports by range or top lists, and can integrate with Nmap for service version detection. The project distinguishes itself through its SYN-based port probing approach that sends TCP SYN packets and analyzes responses without complet…
Port scanner and Go library that probes hosts for open ports using SYN, CONNECT, and UDP methods.
Language: Go. Topics: cdn-exclusion, hacktoberfest, nmap.
yujiosaka/headless-chrome-crawler
Stars: 5,643
This project is a distributed headless Chrome web crawler and data extraction framework. It functions as a JavaScript rendering engine that uses a headless browser to process dynamic pages, extracting structured data from websites that require JavaScript execution. The system is designed for scalable data collection across multiple nodes, using distributed task synchronization and shared caches to prevent duplicate work. It distinguishes itself through the ability to emulate specific client environments by configuring user agents and viewport dimensions, while capturing visual evidence such a…
Headless Chrome crawler with jQuery support.
Language: JavaScript
hakluke/hakrawler
Stars: 4,993
Hakrawler is a command-line web spider tool designed for security reconnaissance, built to crawl target websites and extract hyperlinks along with JavaScript file references. As a focused reconnaissance utility, it collects every discoverable URL and script source from a given domain, mapping the attack surface for penetration testing and vulnerability assessment. The tool differentiates itself through its concurrent architecture: a fixed-size goroutine pool fetches pages in parallel, while CSS selectors parse HTML to extract anchor and script references. A depth-aware recursion limiter preve…
Extracts JavaScript file locations from web pages to find potential endpoints or hidden functionality.
Language: Go. Topics: bugbounty, crawling, hacking.
GerbenJavado/LinkFinder
Stars: 4,390
LinkFinder is a security reconnaissance and static analysis tool designed for JavaScript endpoint discovery. It extracts absolute and relative URLs and parameters from JavaScript files to map the attack surface of web applications and identify hidden API routes. The tool operates through static code analysis and regular expression pattern matching to find endpoints without executing the source code. It includes a data processor for importing exported files from Burp Suite, enabling the batch analysis of multiple JavaScript assets in a single execution. The system provides capabilities for do…
Extracts URLs and routes from JavaScript code using regular expressions to uncover hidden API endpoints.
Language: Python
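The regular-expression approach described for LinkFinder can be illustrated with a much simpler pattern. This is a sketch, not LinkFinder's code (LinkFinder itself is written in Python): `ENDPOINT_PATTERN` and `findEndpoints` are hypothetical, and the pattern only catches quoted absolute URLs and rooted paths.

```javascript
// Hypothetical sketch of regex-based endpoint discovery in JavaScript source.
// The pattern is a simplified assumption, not LinkFinder's actual regex.

// Match quoted strings that look like absolute URLs or rooted paths.
const ENDPOINT_PATTERN = /["'`]((?:https?:\/\/|\/)[^"'`\s]+)["'`]/g;

/**
 * Scan source text without executing it and collect unique endpoint strings.
 * @param {string} source - JavaScript source code as plain text
 * @returns {string[]} endpoints in order of first appearance
 */
function findEndpoints(source) {
  const found = new Set();
  for (const match of source.matchAll(ENDPOINT_PATTERN)) {
    found.add(match[1]);
  }
  return [...found];
}

const sample = `
  fetch("/api/v1/users?id=" + id);
  const base = 'https://example.com/graphql';
  fetch("/api/v1/users?id=" + other);
`;
console.log(findEndpoints(sample));
// [ '/api/v1/users?id=', 'https://example.com/graphql' ]
```

Because the source is only pattern-matched and never run, the approach is safe on untrusted scripts but will miss endpoints that are assembled at runtime.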
rchipka/node-osmosis
Stars: 4,110
This project is a Node.js web scraping framework for automating data extraction through a programmatic workflow of requests, parsing, and document interaction. It functions as a headless web crawler, HTTP request manager, and DOM parser and extractor. The framework is distinguished by its combination of a JavaScript execution engine for interacting with dynamic content and a hybrid selection system that uses both CSS and XPath selectors. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated state and manage automated traffic. Its broader features include recursive link crawling, pagination handling, and web form automation. The tool also offers traffic management features such as request rate limiting through timed delays and custom HTTP header configuration.
HTML/XML parser and scraper for Node.js.
Language: JavaScript
IonicaBizau/scrape-it
Stars: 4,074
scrape-it is a Node.js web scraper and HTML parser designed to extract structured data from websites and HTML files. It functions as a web data extraction tool that retrieves specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and schema-driven data mapping to bring unstructured web text into a consistent format. It supports custom transformations to convert extracted raw strings into specific data formats. The system provides features for web data extraction and automated content mapping. It can parse HTML content from URLs, raw HTML strings, or the local file system, and can be integrated with headless browsers to process content from dynamic web pages.
Human-friendly scraper for Node.js.
Language: JavaScript. Topics: hacktoberfest, node-scraper, scraper.
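Schema-driven mapping with custom transformations, as described for scrape-it, can be sketched independently of any HTML parser. This is not scrape-it's implementation: `mapSchema`, the `select` callback, and the `pageText` map are hypothetical, with the map standing in for real CSS-selector lookups.

```javascript
// Hypothetical sketch of schema-driven data mapping with custom transforms.
// Not scrape-it's implementation: `mapSchema` and `select` are assumptions.

/**
 * Build a result object by resolving each field's selector and converting it.
 * @param {Record<string, {selector: string, convert?: (raw: string) => unknown}>} schema
 * @param {(selector: string) => string} select - returns the text for a selector
 * @returns {Record<string, unknown>} one value per schema field
 */
function mapSchema(schema, select) {
  const result = {};
  for (const [field, rule] of Object.entries(schema)) {
    const raw = select(rule.selector).trim();
    // Apply the optional transformation, otherwise keep the raw string.
    result[field] = rule.convert ? rule.convert(raw) : raw;
  }
  return result;
}

// Stand-in for a parsed page: selector -> text content.
const pageText = new Map([
  ['h1.title', '  Example Product '],
  ['span.price', '19.99'],
]);

const product = mapSchema(
  {
    title: { selector: 'h1.title' },
    price: { selector: 'span.price', convert: (raw) => Number(raw) },
  },
  (selector) => pageText.get(selector) ?? ''
);
console.log(product); // { title: 'Example Product', price: 19.99 }
```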
ruipgil/scraperjs
Stars: 3,718
Scraperjs is a web scraper module that makes scraping the web an easy job.
Versatile web scraper for Node.js.
Language: JavaScript
cgiffard/node-simplecrawler
Stars: 2,133
simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
Event-driven web crawler for Node.js.
Language: JavaScript
martinsbalodis/web-scraper-chrome-extension
Stars: 1,364
Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data.…
Browser-based data extraction tool.
Language: JavaScript
zhuyingda/webster
Stars: 559
Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.
Framework for scraping AJAX and JavaScript-rendered content.
Language: JavaScript
brendonboshell/supercrawler
Stars: 381
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
Crawler with custom handlers and rate limiting.
Language: JavaScript
antivanov/js-crawler
Stars: 257
Node.js crawler supporting HTTP and HTTPS.
Language: TypeScript
n0tan3rd/squidwarc
Stars: 176
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.
High-fidelity archival crawler using Chrome.
Language: JavaScript