13 repositories
Scalable and extensible web crawling solutions for the Java ecosystem.
13 GitHub repositories from the Java Crawling Frameworks section of an awesome list.
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
Scalable crawler framework for Java.
Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for managing automated information gathering. It functions as a no-code tool, allowing users to define complex data collection pipelines through a visual, drag-and-drop interface rather than manual programming. The platform distinguishes itself through a graph-based workflow orchestration system where users link discrete nodes to define navigation and parsing logic. It supports dynamic content crawling by integrating headless browsers to execute JavaScript and render page content that
Visual spider framework requiring no coding.
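Crawler4j is a multi-threaded Java web crawler designed for high-volume web traversal and content extraction. As a "polite" crawling framework, it can discover and index HTML and binary content across multiple websites. The project stands out for its persistent crawling model, which serializes session state to local storage so that the engine can resume indexing after a crash or interruption. It includes a politeness controller that regulates request frequency and delay to prevent server overload and IP bans. The system covers a broad range of traversal features, including depth-limited scope management, target filtering, and request interception for custom user agents and proxy routing. Data storage is handled through a repository pattern that decouples crawling logic from the persistence of page metadata in a relational database.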
Simple and lightweight web crawler.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect…
Extensible, web-scale, archival-quality crawler.
WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
Multi-threaded crawler with simple interfaces.
Anthelion is a Nutch plugin for focused crawling of semantic data. The project is an open-source project released under the Apache License 2.0.
Plugin for Nutch to crawl semantic HTML annotations.
Gecco is an easy-to-use lightweight web crawler developed in Java. Gecco integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redission, so you only need to configure a few jQuery-style selectors to write a crawler very quickly. Gecco framework…
Easy-to-use lightweight web crawler.
SeimiCrawler
Agile, distributed crawler framework.
A scalable, mature and versatile web crawler based on Apache Storm
Low-latency, scalable crawler built on Apache Storm.
ACHE is a focused web crawler. It collects web pages that satisfy specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant…
Domain-specific web crawler for focused search.
A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information…
Apache Nutch implementation running on Spark.
Norconex HTTP Collector
Full-featured HTTP crawler with data storage capabilities.
🐝 Web vertical crawler framework for fun
DFS-based web spider.
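For readers unfamiliar with the term, the "DFS-based" traversal mentioned in the last entry can be sketched in a few lines of plain Java. This is a generic illustration of depth-first crawling with a depth limit, not code from any repository listed here. The class name `DfsCrawlSketch`, the `crawl` method, and the in-memory link map that stands in for real HTTP fetching and link extraction are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: depth-first traversal over an in-memory link graph.
// A real spider would fetch each URL and parse its outgoing links instead.
class DfsCrawlSketch {

    // A URL paired with its link distance from the seed.
    private static final class Frame {
        final String url;
        final int depth;

        Frame(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Returns URLs in the order they are visited, following links at most
    // maxDepth hops away from the seed.
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(seed, 0));

        while (!stack.isEmpty()) {
            Frame frame = stack.pop();
            if (!seen.add(frame.url)) {
                continue; // already visited via another path
            }
            visited.add(frame.url);
            if (frame.depth >= maxDepth) {
                continue; // depth limit reached: do not expand this page
            }
            List<String> outgoing = links.getOrDefault(frame.url, List.of());
            // Push in reverse so the first link on the page is explored first.
            for (int i = outgoing.size() - 1; i >= 0; i--) {
                stack.push(new Frame(outgoing.get(i), frame.depth + 1));
            }
        }
        return visited;
    }
}
```

With a link map of `a -> [b, c]`, `b -> [d]`, `c -> [d]`, `d -> [e]` and a depth limit of 2, the sketch visits `a`, `b`, `d`, `c` and never reaches `e`. The explicit stack is what makes the traversal depth-first; swapping it for a FIFO queue would turn the same loop into a breadth-first crawl.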