16 Repos
Node.js libraries for web scraping, browser automation, and crawling.
Explore 16 awesome GitHub repositories matching part of an awesome list · JavaScript Crawling Frameworks.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Reliable browser automation and scraping library.
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
Simple API-driven crawler for Node.js.
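The task deduplication and rate limiting described for node-crawler can be sketched in a language-neutral way. The following is a minimal Python illustration, not node-crawler's API: the `DedupQueue` class, its `min_interval` parameter, and the choice to treat URLs that differ only in their fragment as duplicates are all assumptions made for the example.

```python
import time
from collections import deque
from urllib.parse import urldefrag


class DedupQueue:
    """FIFO URL queue that skips already-seen URLs and spaces out requests.

    Hypothetical helper for illustration; not part of any listed library.
    """

    def __init__(self, min_interval=0.0, clock=time.monotonic, sleep=time.sleep):
        self._pending = deque()
        self._seen = set()
        self._min_interval = min_interval  # minimum seconds between releases
        self._clock = clock
        self._sleep = sleep
        self._last_release = None

    def add(self, url):
        """Queue a URL unless an equivalent one was queued before."""
        key = urldefrag(url).url  # assumption: page#a and page#b are one task
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def next(self):
        """Return the next URL, waiting if the previous release was too recent."""
        if not self._pending:
            return None
        if self._last_release is not None:
            wait = self._min_interval - (self._clock() - self._last_release)
            if wait > 0:
                self._sleep(wait)
        self._last_release = self._clock()
        return self._pending.popleft()


queue = DedupQueue(min_interval=0.0)
added = [
    queue.add(u)
    for u in (
        "https://example.com/a",
        "https://example.com/a#top",  # duplicate once the fragment is dropped
        "https://example.com/b",
    )
]
```

Injecting `clock` and `sleep` keeps the sketch testable without real delays; a production crawler would more likely combine this with per-host limits and concurrency control.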
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler that can execute JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination detection to traverse multi-page result sets. The framework manages web data pipelines with a concurrency-limited request queue and request rate control to regulate outgoing network calls. Extracted results are processed through stream-based data persistence to handle large datasets without exhausting system memory.
Web scraper with pagination and crawler support.
Naabu is a port scanner library and tool that probes hosts for open ports using SYN, CONNECT, and UDP methods to identify active services. It functions as a Go library for embedding port scanning into programs, and as a standalone tool that accepts targets as hostnames, IP addresses, CIDR ranges, or ASN numbers. The tool discovers live hosts before scanning, filters ports by range or top lists, and can integrate with Nmap for service version detection. The project distinguishes itself through its SYN-based port probing approach that sends TCP SYN packets and analyzes responses without complet
Parses JavaScript files during crawling to discover hidden API endpoints and routes.
This project is a distributed headless Chrome web crawler and data extraction framework. It functions as a JavaScript rendering engine that uses a headless browser to process dynamic pages, extracting structured data from websites that require JavaScript execution. The system is designed for scalable data collection across multiple nodes, using distributed task synchronization and shared caches to prevent duplicate work. It distinguishes itself through the ability to emulate specific client environments by configuring user agents and viewport dimensions, while capturing visual evidence such a
Headless Chrome crawler with jQuery support.
Hakrawler is a command-line web spider tool designed for security reconnaissance, built to crawl target websites and extract hyperlinks along with JavaScript file references. As a focused reconnaissance utility, it collects every discoverable URL and script source from a given domain, mapping the attack surface for penetration testing and vulnerability assessment. The tool differentiates itself through its concurrent architecture: a fixed-size goroutine pool fetches pages in parallel, while CSS selectors parse HTML to extract anchor and script references. A depth-aware recursion limiter preve
Extracts JavaScript file locations from web pages to find potential endpoints or hidden functionality.
LinkFinder is a security reconnaissance and static analysis tool designed for JavaScript endpoint discovery. It extracts absolute and relative URLs and parameters from JavaScript files to map the attack surface of web applications and identify hidden API routes. The tool operates through static code analysis and regular expression pattern matching to find endpoints without executing the source code. It includes a data processor for importing exported files from Burp Suite, enabling the batch analysis of multiple JavaScript assets in a single execution. The system provides capabilities for do
Extracts URLs and routes from JavaScript code using regular expressions to uncover hidden API endpoints.
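To make the regex-based approach concrete, here is a minimal Python sketch of static endpoint extraction. The pattern is a deliberately simplified assumption and is not LinkFinder's actual expression; `ENDPOINT_RE`, `find_endpoints`, and the sample script are hypothetical names and data introduced for illustration.

```python
import re

# Simplified pattern (an assumption, not LinkFinder's real regex): a quoted
# string that is either an absolute http(s) URL or a path starting with "/".
ENDPOINT_RE = re.compile(
    r"""["'`]((?:https?://[^"'`\s]+)|(?:/[A-Za-z0-9_\-./]+(?:\?[^"'`\s]*)?))["'`]"""
)


def find_endpoints(js_source):
    """Return unique endpoint-like strings in order of first appearance.

    The source is only scanned as text; it is never executed.
    """
    found = []
    for match in ENDPOINT_RE.finditer(js_source):
        value = match.group(1)
        if value not in found:
            found.append(value)
    return found


sample_js = '''
fetch("/api/v1/users?id=" + id);
const base = 'https://example.com/graphql';
axios.get("/api/v1/users?id=");
const label = "hello world";
'''
endpoints = find_endpoints(sample_js)
```

A pattern this small misses endpoints built by string concatenation or template interpolation and will report some false positives, which is the usual trade-off of matching text instead of executing the code.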
This project is a Node.js web scraping framework for automating data extraction through a programmatic workflow of requests, parsing, and document interaction. It functions as a headless web crawler, HTTP request manager, and DOM parser and extractor. The framework is distinguished by combining a JavaScript execution engine for interacting with dynamic content with a hybrid selection system that uses both CSS and XPath selectors. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated state and manage automated traffic. Its broader features include recursive link crawling, pagination handling, and web form automation. The tool also offers traffic management features such as request rate limiting through time delays and custom HTTP header configuration.
HTML/XML parser and scraper for Node.js.
scrape-it is a Node.js web scraper and HTML parser designed to extract structured data from websites and HTML files. It functions as a web data extraction tool that retrieves specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and schema-driven data mapping to bring unstructured web text into a consistent format. It supports custom transformations to convert extracted raw strings into specific data formats. The system provides web data extraction and automated content mapping. It can parse HTML content from URLs, raw HTML strings, or the local file system, and can be integrated with headless browsers to process content from dynamic web pages.
Human-friendly scraper for Node.js.
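The idea of schema-driven mapping with per-field transformations can be shown with a small Python sketch built on the standard library. This is not scrape-it's API: the `scrape` function, the schema shape with `selector` and `convert` keys, and the restriction to `tag`, `.class`, and `tag.class` selectors are all assumptions chosen to keep the example short.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class _TextCollector(HTMLParser):
    """Collects the text of the first element matching each simple selector."""

    def __init__(self, selectors):
        super().__init__()
        self._selectors = selectors  # field -> (tag or None, class or None)
        self._open = {}              # field -> nesting depth inside its match
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        for field in self._open:
            self._open[field] += 1
        for field, (want_tag, want_class) in self._selectors.items():
            if field in self.result:
                continue  # only the first match per field is kept
            if want_tag not in (None, tag):
                continue
            if want_class is not None and want_class not in classes:
                continue
            self._open[field] = 1
            self.result[field] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        for field in list(self._open):
            self._open[field] -= 1
            if self._open[field] == 0:
                del self._open[field]

    def handle_data(self, data):
        for field in self._open:
            self.result[field] += data


def parse_selector(selector):
    """Split 'tag.class', 'tag', or '.class' into (tag, class)."""
    tag, _, cls = selector.partition(".")
    return (tag or None, cls or None)


def scrape(html, schema):
    """Map HTML to a dict; schema is field -> {'selector', optional 'convert'}."""
    collector = _TextCollector(
        {field: parse_selector(spec["selector"]) for field, spec in schema.items()}
    )
    collector.feed(html)
    collector.close()
    record = {}
    for field, spec in schema.items():
        raw = collector.result.get(field)
        if raw is None:
            record[field] = None  # selector matched nothing
            continue
        text = " ".join(raw.split())  # collapse whitespace
        record[field] = spec.get("convert", lambda value: value)(text)
    return record


page = """
<article>
  <h1 class="title">Example <em>post</em></h1>
  <span class="price">19.50</span>
</article>
"""
record = scrape(page, {
    "title": {"selector": "h1.title"},
    "price": {"selector": ".price", "convert": float},
    "author": {"selector": ".author"},
})
```

Here `record` holds a string for `title`, a float for `price`, and `None` for the unmatched `author` field, which mirrors the general pattern of turning loosely structured markup into a fixed set of typed fields.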
Scraperjs is a web scraper module that makes scraping the web an easy job.
Versatile web scraper for Node.js.
simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
Event-driven web crawler for Node.js.
Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data.…
Browser-based data extraction tool.
Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.
Framework for scraping AJAX and JavaScript-rendered content.
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
Crawler with custom handlers and rate limiting.
js-crawler
Node.js crawler supporting HTTP and HTTPS.
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.
High-fidelity archival crawler using Chrome.