Sparkler

A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information…

CrawlScript/WebCollector

3,091 GitHub stars

WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.

DigitalPebble/storm-crawler

980 GitHub stars

A scalable, mature and versatile web crawler based on Apache Storm

internetarchive/heritrix3

3,246 GitHub stars

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…

code4craft/webmagic

11,680 GitHub stars

Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col…

USCDataScience/sparkler

Open-source alternatives to Sparkler

CrawlScript/WebCollector

DigitalPebble/storm-crawler

internetarchive/heritrix3

code4craft/webmagic