What are the best GitHub repositories for JavaScript Crawling Frameworks?

Node.js libraries for web scraping, browser automation, and crawling. Explore 16 GitHub repositories from the JavaScript Crawling Frameworks section of an awesome list. Refine with filters or upvote what's useful. Top picks: apify/crawlee, bda-research/node-crawler, lapwinglabs/x-ray, projectdiscovery/naabu, yujiosaka/headless-chrome-crawler, hakluke/hakrawler, gerbenjavado/linkfinder, rchipka/node-osmosis, ionicabizau/scrape-it, ruipgil/scraperjs.

Why is apify/crawlee a recommended repository in JavaScript Crawling Frameworks?

Reliable browser automation and scraping library.

Why is bda-research/node-crawler a recommended repository in JavaScript Crawling Frameworks?

Simple API-driven crawler for Node.js.

Why is lapwinglabs/x-ray a recommended repository in JavaScript Crawling Frameworks?

Web scraper with pagination and crawler support.

Why is projectdiscovery/naabu a recommended repository in JavaScript Crawling Frameworks?

Port scanner, usable as a Go library or standalone tool, that probes hosts for open ports using SYN, CONNECT, and UDP methods.

Why is yujiosaka/headless-chrome-crawler a recommended repository in JavaScript Crawling Frameworks?

Headless Chrome crawler with jQuery support.

Why is hakluke/hakrawler a recommended repository in JavaScript Crawling Frameworks?

Extracts JavaScript file locations from web pages to find potential endpoints or hidden functionality.

Why is gerbenjavado/linkfinder a recommended repository in JavaScript Crawling Frameworks?

Extracts URLs and routes from JavaScript code using regular expressions to uncover hidden API endpoints.

Why is rchipka/node-osmosis a recommended repository in JavaScript Crawling Frameworks?

HTML/XML parser and scraper for Node.js.

Why is ionicabizau/scrape-it a recommended repository in JavaScript Crawling Frameworks?

Human-friendly scraper for Node.js.

Why is ruipgil/scraperjs a recommended repository in JavaScript Crawling Frameworks?

Versatile web scraper for Node.js.

16 repositories

apify/crawlee
24,002 · View on GitHub
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Reliable browser automation and scraping library.
TypeScript · apify, automation, crawler
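
The resource-aware concurrency idea in the Crawlee description above can be illustrated with a small sketch. This is not Crawlee's implementation or API: the function `nextConcurrency`, the shape of the `limits` object, and the policy (halve on overload, otherwise add one) are all assumptions chosen for illustration.

```javascript
// Hypothetical helper, not part of Crawlee.
// Decides how many tasks to run next, given current CPU and memory load
// expressed as fractions between 0 and 1.
function nextConcurrency(current, usage, limits) {
  const { min, max, maxCpu, maxMemory } = limits;
  const overloaded = usage.cpu > maxCpu || usage.memory > maxMemory;
  // Back off quickly when the host is overloaded, grow slowly otherwise.
  const desired = overloaded ? Math.floor(current / 2) : current + 1;
  // Never leave the configured [min, max] range.
  return Math.min(max, Math.max(min, desired));
}

// Assumed thresholds, for illustration only.
const limits = { min: 1, max: 16, maxCpu: 0.9, maxMemory: 0.8 };

console.log(nextConcurrency(8, { cpu: 0.95, memory: 0.4 }, limits)); // 4
console.log(nextConcurrency(8, { cpu: 0.5, memory: 0.5 }, limits)); // 9
```

A real controller would sample load over a time window rather than react to a single reading; the sketch only shows the scale-up and scale-down decision.
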
bda-research/node-crawler
6,785 · View on GitHub
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
Simple API-driven crawler for Node.js.
TypeScript · cheerio, crawler, extract-data
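
Task deduplication combined with rate limiting, as mentioned in the node-crawler description above, might look roughly like the following. This is a sketch under stated assumptions, not node-crawler's API: the `DedupQueue` class, its method names, and the fixed minimum interval are hypothetical, and the caller passes the current time in explicitly so the behavior is easy to follow.

```javascript
// Hypothetical class, not part of node-crawler.
class DedupQueue {
  constructor(minIntervalMs) {
    this.minIntervalMs = minIntervalMs; // minimum gap between dispatches
    this.seen = new Set();              // every URL ever enqueued
    this.pending = [];                  // URLs waiting to be fetched
    this.lastDispatch = -Infinity;      // time of the previous dispatch
  }

  // Normalize so that spellings differing only in case of the host
  // or in the #fragment count as the same URL.
  static normalize(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = "";
    return url.href;
  }

  // Returns true if the URL was added, false if it was a duplicate.
  enqueue(rawUrl) {
    const key = DedupQueue.normalize(rawUrl);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(key);
    return true;
  }

  // Returns the next URL if the rate limit allows it, otherwise null.
  dequeue(nowMs) {
    if (this.pending.length === 0) return null;
    if (nowMs - this.lastDispatch < this.minIntervalMs) return null;
    this.lastDispatch = nowMs;
    return this.pending.shift();
  }
}

const queue = new DedupQueue(1000);
console.log(queue.enqueue("https://example.com/a"));     // true
console.log(queue.enqueue("https://example.com/a#top")); // false
console.log(queue.dequeue(0));                           // https://example.com/a
```

In a running crawler the `nowMs` argument would come from `Date.now()`, and a `null` result would mean "try again later".
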
lapwinglabs/x-ray
5,904 · View on GitHub
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It works as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless-browser crawler capable of executing JavaScript to render dynamic content. It handles site content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a request queue with limited concurrency and request rate control to throttle outgoing network calls. Extracted results are handled through stream-based data persistence so that large datasets can be processed without overloading system memory.
Web scraper with pagination and crawler support.
JavaScript
projectdiscovery/naabu
5,766 · View on GitHub
Naabu is a port scanner library and tool that probes hosts for open ports using SYN, CONNECT, and UDP methods to identify active services. It functions as a Go library for embedding port scanning into programs, and as a standalone tool that accepts targets as hostnames, IP addresses, CIDR ranges, or ASN numbers. The tool discovers live hosts before scanning, filters ports by range or top lists, and can integrate with Nmap for service version detection. The project distinguishes itself through its SYN-based port probing approach that sends TCP SYN packets and analyzes responses without complet
Port scanner that probes hosts for open ports using SYN, CONNECT, and UDP methods.
Go · cdn-exclusion, hacktoberfest, nmap
yujiosaka/headless-chrome-crawler
5,643 · View on GitHub
This project is a distributed headless Chrome web crawling and data extraction framework. It works as a JavaScript rendering engine that uses a headless browser to process dynamic pages, extracting structured data from websites that require JavaScript execution. The system is designed for scalable data collection across multiple nodes, using distributed task synchronization and shared caches to prevent duplicate work. It stands out for its ability to emulate specific client environments by configuring user agents and viewport dimensions, while also capturing visual evidence such as page screenshots. The framework covers comprehensive crawl management, including request scheduling in priority queues, depth-first and breadth-first traversal, and compliance with robots.txt and sitemap.xml files. It provides tools for limiting concurrency, monitoring events, and streaming extracted data in CSV or JSON formats.
Headless Chrome crawler with jQuery support.
JavaScript
hakluke/hakrawler
4,993 · View on GitHub
Hakrawler is a command-line web spider tool designed for security reconnaissance, built to crawl target websites and extract hyperlinks along with JavaScript file references. As a focused reconnaissance utility, it collects every discoverable URL and script source from a given domain, mapping the attack surface for penetration testing and vulnerability assessment. The tool differentiates itself through its concurrent architecture: a fixed-size goroutine pool fetches pages in parallel, while CSS selectors parse HTML to extract anchor and script references. A depth-aware recursion limiter preve
Extracts JavaScript file locations from web pages to find potential endpoints or hidden functionality.
Go · bugbounty, crawling, hacking
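
The depth-limited crawl mentioned in the hakrawler description above can be sketched as follows. Hakrawler itself is written in Go and fetches pages in parallel; this sketch is in JavaScript for consistency with the rest of the listing, runs sequentially, and replaces network fetching with an in-memory link graph. The function `crawl`, the `getLinks` callback, and the `site` object are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: breadth-first crawl that stops expanding links
// once maxDepth is reached. getLinks(url) stands in for "fetch the page
// and extract its links".
function crawl(startUrl, maxDepth, getLinks) {
  const visited = new Set([startUrl]);
  const order = [];
  let frontier = [startUrl];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      order.push(url);
      if (depth === maxDepth) continue; // record the page but do not go deeper
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // mark on discovery so each URL is queued once
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return order;
}

// Made-up site structure standing in for real pages.
const site = {
  "/": ["/a", "/b"],
  "/a": ["/b", "/c"],
  "/b": ["/"],
  "/c": ["/d"],
  "/d": [],
};
const linksOf = (url) => site[url] ?? [];

console.log(crawl("/", 1, linksOf)); // [ '/', '/a', '/b' ]
console.log(crawl("/", 2, linksOf)); // [ '/', '/a', '/b', '/c' ]
```

The `visited` set is what keeps the cycle between `/` and `/b` from causing repeated visits; the depth check is what bounds the crawl on large sites.
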
GerbenJavado/LinkFinder
4,390 · View on GitHub
LinkFinder is a security reconnaissance and static analysis tool designed for discovering JavaScript endpoints. It extracts absolute and relative URLs and parameters from JavaScript files to map the attack surface of web applications and identify hidden API routes. The tool works through static code analysis and regular-expression pattern matching to find endpoints without executing the source code. It includes a data processor for importing files exported from Burp Suite, allowing batch analysis of multiple JavaScript assets in a single run. The system offers domain-level analysis and domain-specific filtering to focus discovery on intended targets. It also has keyword detection notifications to alert users when specific strings appear in the results, and supports exporting discovered data in plaintext or HTML formats.
Extracts URLs and routes from JavaScript code using regular expressions to uncover hidden API endpoints.
Python
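
Regex-based endpoint discovery, as described for LinkFinder above, can be sketched in a few lines. LinkFinder is a Python tool with its own, more thorough pattern; the simplified `ENDPOINT_PATTERN` and the `extractEndpoints` function below are assumptions made for illustration, written in JavaScript for consistency with the rest of the listing. The code only reads the source as text and never executes it.

```javascript
// Simplified, hypothetical pattern (not LinkFinder's): a quoted string that
// is either an absolute http(s) URL or a path starting with "/".
const ENDPOINT_PATTERN = /["'`]((?:https?:\/\/|\/)[^"'`\s<>]+)["'`]/g;

// Returns the unique endpoint-like strings found in a piece of JavaScript source.
function extractEndpoints(source) {
  const found = new Set();
  for (const match of source.matchAll(ENDPOINT_PATTERN)) {
    found.add(match[1]); // capture group 1 is the string without its quotes
  }
  return [...found];
}

// Made-up input standing in for a downloaded script file.
const sample = `
  fetch("/api/v1/users?id=" + id);
  const base = 'https://example.com/graphql';
  axios.get("/api/v1/users?id=");
  const label = "not a path";
`;

console.log(extractEndpoints(sample));
// [ '/api/v1/users?id=', 'https://example.com/graphql' ]
```

A pattern this loose will also match strings that merely look like paths, so results from this kind of static matching are candidates to verify rather than confirmed endpoints.
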
rchipka/node-osmosis
4,110 · View on GitHub
This project is a Node.js web scraping framework designed to automate data extraction through a programmatic workflow of requests, parsing, and document interaction. It works as a headless web crawler, an HTTP request manager, and a DOM parser and extractor. The framework stands out by combining a JavaScript execution engine for interacting with dynamic content with a hybrid selection system that uses both CSS selectors and XPath. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated states and handle automated traffic. Its broader capabilities cover recursive link crawling, pagination handling, and web form automation. The tool also offers traffic management features such as request rate limiting through timed delays and configuration of custom HTTP headers.
HTML/XML parser and scraper for Node.js.
JavaScript
IonicaBizau/scrape-it
4,074 · View on GitHub
scrape-it is a web scraper and HTML parser for Node.js, designed to extract structured data from websites and HTML files. It works as a web data extraction tool that pulls specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and schema-based data mapping to organize unstructured web text into a consistent format. It supports custom value transformation to convert raw extracted strings into specific data formats. The system offers capabilities for web data extraction and automatic content mapping. It can parse HTML content from URLs, raw HTML strings, or local storage, and integrates with headless browsers to process content from dynamic web pages.
Human-friendly scraper for Node.js.
JavaScript · hacktoberfest, node-scraper, scraper
ruipgil/scraperjs
3,718 · View on GitHub
Scraperjs is a web scraper module that makes scraping the web an easy job.
Versatile web scraper for Node.js.
JavaScript
cgiffard/node-simplecrawler
2,133 · View on GitHub
simplecrawler is designed to provide a basic, flexible and robust API for crawling websites. It was written to archive, analyse, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
Event-driven web crawler for Node.js.
JavaScript
martinsbalodis/web-scraper-chrome-extension
1,364 · View on GitHub
Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data.…
Browser-based data extraction tool.
JavaScript
zhuyingda/webster
559 · View on GitHub
Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.
Framework for scraping AJAX and JavaScript-rendered content.
JavaScript
brendonboshell/supercrawler
381 · View on GitHub
Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use.
Crawler with custom handlers and rate limiting.
JavaScript
antivanov/js-crawler
257 · View on GitHub
js-crawler
Node.js crawler supporting HTTP and HTTPS.
TypeScript
n0tan3rd/squidwarc
176 · View on GitHub
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.
High-fidelity archival crawler using Chrome.
JavaScript