A scalable, mature and versatile web crawler based on Apache Storm
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
WebCollector is an open source web crawler framework based on Java. It provides simple interfaces for crawling the Web, and you can set up a multi-threaded web crawler in less than 5 minutes.
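To make the idea of a multi-threaded crawler concrete, here is a minimal sketch that uses only the JDK. It is not WebCollector's API: the names `MiniCrawler`, `LinkFetcher`, and `crawl` are hypothetical, and the "site" is an in-memory map rather than real HTTP fetching. It assumes a breadth-first crawl in which each level of the frontier is fetched in parallel by a fixed thread pool.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MiniCrawler {

    /** Hypothetical fetcher: returns the outgoing links of one page. */
    public interface LinkFetcher {
        List<String> fetchLinks(String url) throws Exception;
    }

    /**
     * Breadth-first crawl. Pages at depth 0..maxDepth are fetched, each level
     * in parallel; the result holds every URL seen, including links found on
     * the deepest fetched level.
     */
    public static Set<String> crawl(String seed, LinkFetcher fetcher, int threads, int maxDepth)
            throws Exception {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(seed);
        List<String> frontier = List.of(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                // One fetch task per URL in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetcher.fetchLinks(url));
                }
                // invokeAll blocks until every task in this level has finished.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        // Deduplication happens on the calling thread only.
                        if (seen.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a website: page path -> links found on that page.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.of());
        Set<String> pages = crawl("/", url -> site.getOrDefault(url, List.of()), 4, 10);
        System.out.println(pages.size() + " pages seen: " + pages);
    }
}
```

In a real crawler, the `LinkFetcher` would download the page and parse its links, and politeness controls such as per-host delays and robots.txt handling would be added. The frameworks listed here provide those pieces out of the box.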
The main features of crawlscript/webcollector are: Java Crawling Frameworks.
Open-source alternatives to crawlscript/webcollector include:
code4craft/webmagic: Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process…
digitalpebble/storm-crawler: A scalable, mature and versatile web crawler based on Apache Storm.
internetarchive/heritrix3: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix…
norconex/collector-http: Norconex HTTP Collector.
pkwenda/webbee: 🐝 Web vertical crawler framework for fun.
ssssssss-team/spider-flow: Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for…