What are the main features of crawlscript/webcollector?

crawlscript/webcollector is listed under a single feature category: Java Crawling Frameworks.

What are some open-source alternatives to crawlscript/webcollector?

Open-source alternatives to crawlscript/webcollector include:

code4craft/webmagic: Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process…

digitalpebble/storm-crawler: A scalable, mature and versatile web crawler based on Apache Storm.

internetarchive/heritrix3: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix…

norconex/collector-http: Norconex HTTP Collector.

pkwenda/webbee: 🐝 Web vertical crawler framework for fun.

ssssssss-team/spider-flow: Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for…

WebCollector | Awesome Repos

WebCollector is an open-source web crawler framework based on Java. It provides simple interfaces for crawling the Web, and you can set up a multi-threaded web crawler in less than 5 minutes.

DigitalPebble/storm-crawler

980 stars

A scalable, mature and versatile web crawler based on Apache Storm

internetarchive/heritrix3

3,246 stars

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…

Norconex/collector-http

202 stars

Norconex HTTP Collector

code4craft/webmagic

11,680 stars

Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
