WebCollector is an open source web crawler framework based on Java.It provides some simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
A scalable, mature and versatile web crawler based on Apache Storm
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col