13 Repos
Scalable and extensible web crawling solutions for the Java ecosystem.
13 GitHub repositories from an awesome list, category: Java Crawling Frameworks.
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
Scalable crawler framework for Java.
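As a rough illustration of XPath-based extraction in general, the sketch below uses the JDK's built-in `javax.xml.xpath` API on a small, well-formed XHTML string. It is not Webmagic's own API; the class name `XPathExtractSketch`, the `extract` helper, and the sample markup are all hypothetical. Real-world HTML is rarely well-formed XML, so a crawler would normally put a lenient HTML parser in front of this step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {

    /**
     * Evaluates an XPath expression against well-formed XHTML and
     * returns the trimmed text content of every matching node.
     * (Hypothetical helper, JDK classes only.)
     */
    static List<String> extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample page invented for this sketch.
        String page = "<html><body><h1 class=\"title\">Example</h1>"
                + "<ul><li><a href=\"/a\">A</a></li><li><a href=\"/b\">B</a></li></ul>"
                + "</body></html>";

        System.out.println(extract(page, "//h1[@class='title']")); // heading text
        System.out.println(extract(page, "//a/@href"));            // link targets
    }
}
```

The same expression style (`//a/@href`, `//h1[@class='title']`) is what an XPath-driven extractor typically exposes, whatever parser sits underneath.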
Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for managing automated information gathering. It functions as a no-code tool, allowing users to define complex data collection pipelines through a visual, drag-and-drop interface rather than manual programming. The platform distinguishes itself through a graph-based workflow orchestration system where users link discrete nodes to define navigation and parsing logic. It supports dynamic content crawling by integrating headless browsers to execute JavaScript and render page content that
Visual spider framework requiring no coding.
Crawler4j is a multi-threaded Java web crawler and spider for high-volume web traversal and content extraction. It acts as a "polite" crawling framework that enables the discovery and indexing of HTML and binary content across multiple websites. The project is distinguished by a persistent crawling model that serializes session state to local storage, so the engine can resume indexing after a crash or interruption. It includes a politeness controller that regulates request frequency and delays to avoid server overload and IP bans. The system covers a broad range of traversal features, including depth-limited scope management, target filtering, and request interception for custom user agents and proxy routing. Data is stored through a repository pattern that decouples the crawling logic from the persistence of page metadata in relational databases.
Simple and lightweight web crawler.
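The idea of a politeness controller combined with a depth limit can be sketched in a few lines of plain Java. This is a minimal illustration of the concept, not Crawler4j's implementation: the class `PolitenessSketch`, its methods, and the 200 ms delay are assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

public class PolitenessSketch {
    private final long delayMillis;  // minimum gap between requests to one host
    private final int maxDepth;      // links deeper than this are not followed
    private final Map<String, Long> nextAllowedAt = new HashMap<>();

    PolitenessSketch(long delayMillis, int maxDepth) {
        this.delayMillis = delayMillis;
        this.maxDepth = maxDepth;
    }

    /**
     * Reserves the next request slot for a host and returns how many
     * milliseconds the caller should wait before sending the request.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowedAt.getOrDefault(host, nowMillis);
        long startAt = Math.max(earliest, nowMillis);
        nextAllowedAt.put(host, startAt + delayMillis);
        return startAt - nowMillis;
    }

    /** True if a page at the given link depth is still inside the crawl scope. */
    boolean withinDepth(int depth) {
        return depth <= maxDepth;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(200, 2);
        long now = 1_000;

        System.out.println(politeness.reserve("example.org", now)); // first request: no wait
        System.out.println(politeness.reserve("example.org", now)); // same host: must wait
        System.out.println(politeness.reserve("example.net", now)); // other host: no wait
        System.out.println(politeness.withinDepth(3));              // beyond the depth limit
    }
}
```

Tracking the delay per host rather than globally keeps a multi-threaded crawl fast across many sites while still throttling requests to any single server.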
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…
Extensible, web-scale, archival-quality crawler.
WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
Multi-threaded crawler with simple interfaces.
Anthelion is a Nutch plugin for focused crawling of semantic data. The project is an open-source project released under the Apache License 2.0.
Plugin for Nutch to crawl semantic HTML annotations.
Gecco is an easy-to-use, lightweight web crawler developed in Java. Gecco integrates the excellent frameworks jsoup, httpclient, fastjson, spring, htmlunit, and redission, so you only need to configure a few jQuery-style selectors to write a crawler very quickly. Gecco framework…
Easy-to-use lightweight web crawler.
SeimiCrawler
Agile, distributed crawler framework.
A scalable, mature and versatile web crawler based on Apache Storm
Low-latency, scalable crawler built on Apache Storm.
ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant…
Domain-specific web crawler for focused search.
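The two criteria mentioned for focused crawling, domain membership and a user-specified pattern, can be modeled as simple page classifiers. The sketch below is an assumption-laden illustration in plain Java, not ACHE's API or configuration format: `FocusedCrawlSketch`, `Page`, `onDomain`, `containsPattern`, and the sample pages are all hypothetical.

```java
import java.net.URI;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FocusedCrawlSketch {

    /** Minimal stand-in for a fetched page: its URL and extracted text. */
    static final class Page {
        final String url;
        final String text;

        Page(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /** Classifier: the page text contains a user-specified regex (case-insensitive). */
    static Predicate<Page> containsPattern(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return page -> pattern.matcher(page.text).find();
    }

    /** Classifier: the page is hosted on the given domain or one of its subdomains. */
    static Predicate<Page> onDomain(String domain) {
        return page -> {
            String host = URI.create(page.url).getHost();
            return host != null && (host.equals(domain) || host.endsWith("." + domain));
        };
    }

    /** Keeps only the URLs of pages the classifier marks as relevant. */
    static List<String> relevantUrls(List<Page> pages, Predicate<Page> classifier) {
        return pages.stream()
                .filter(classifier)
                .map(page -> page.url)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Page> fetched = List.of(
                new Page("https://docs.example.org/intro", "Getting started"),
                new Page("https://news.example.net/a", "A new web crawler was released"),
                new Page("https://shop.example.com/b", "Discount shoes"));

        Predicate<Page> relevant = onDomain("example.org").or(containsPattern("crawler"));
        System.out.println(relevantUrls(fetched, relevant));
    }
}
```

A focused crawler would typically follow links only from pages that pass the classifier, which is what keeps the crawl on topic; a trained model could replace the regex predicate without changing the surrounding logic.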
A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information…
Apache Nutch implementation running on Spark.
Norconex HTTP Collector
Full-featured HTTP crawler with data storage capabilities.