30 open-source projects similar to code4craft/webmagic, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Webmagic alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
DotnetSpider is a .NET web crawling framework and C# data extraction tool designed for automated web page discovery and the retrieval of structured data from the internet at scale. It functions as a high-level web scraping library for collecting information from various websites. The framework provides capabilities for automated web crawling and large-scale data scraping. It enables web content extraction to facilitate the creation of local databases or the analysis of online information through programmatic web automation within the .NET ecosystem. The system utilizes a pipeline-based data
Crawler4j is a multi-threaded Java web crawler and spider designed for high-volume web traversal and content extraction. It functions as a polite crawling framework that enables the discovery and indexing of HTML and binary content across multiple websites. The project distinguishes itself through a persistent crawling model that serializes session state to local storage, allowing the engine to resume indexing after a crash or interruption. It includes a politeness controller to regulate request frequency and delays, preventing server overloading and IP blocking. The system covers a broad ra
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
X-ray is a headless browser web scraper and HTML content crawler designed to extract structured data from websites. It functions as a stream-based data scraper and structured data extractor, using selectors to retrieve text and attributes from HTML as nested objects or arrays. The project includes a request rate controller to manage network traffic through concurrency limits, throttles, and timeouts. It handles dynamic website scraping by rendering JavaScript via a headless browser and performs automated website crawling using breadth-first link following and pagination management. The syste
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
This project is a distributed headless Chrome web crawler and data extraction framework. It functions as a JavaScript rendering engine that uses a headless browser to process dynamic pages, extracting structured data from websites that require JavaScript execution. The system is designed for scalable data collection across multiple nodes, using distributed task synchronization and shared caches to prevent duplicate work. It distinguishes itself through the ability to emulate specific client environments by configuring user agents and viewport dimensions, while capturing visual evidence such a
requests-html is a Python HTML parsing library and web scraping framework. It functions as an asynchronous HTTP client and a JavaScript rendering engine designed to fetch and parse web pages for structured data extraction. The project integrates a headless browser to execute JavaScript, allowing it to retrieve dynamically generated content that standard HTML parsers cannot see. It provides tools for automated data extraction using CSS selectors and XPath expressions to isolate specific text or attributes from HTML structures. The framework covers network operations including asynchronous pag
Spider-flow is a Java-based web crawling and data extraction platform that provides a centralized environment for managing automated information gathering. It functions as a no-code tool, allowing users to define complex data collection pipelines through a visual, drag-and-drop interface rather than manual programming. The platform distinguishes itself through a graph-based workflow orchestration system where users link discrete nodes to define navigation and parsing logic. It supports dynamic content crawling by integrating headless browsers to execute JavaScript and render page content that
This project is a Python-based automation toolkit designed to manage programmatic authentication and session persistence across web services. It provides a framework for executing automated login sequences, including the handling of interactive security challenges such as QR code verification and captcha resolution. The toolkit distinguishes itself by simulating native mobile application environments, allowing for the execution of scripts that require specific device-level headers and behaviors. It also incorporates hook-based interception to monitor workflow states and manage exceptions duri
Subfinder is a security reconnaissance framework designed for subdomain enumeration and attack surface management. It functions as a discovery engine that identifies and maps internet-exposed infrastructure, cloud-hosted assets, and network ranges to maintain a comprehensive inventory of an organization's digital footprint. The project distinguishes itself through a modular, template-driven scanning engine that executes security checks against discovered assets. It leverages cloud-native asset discovery to query provider APIs and infrastructure metadata, while supporting distributed agent orc
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
DotnetSpider is a .NET web crawler framework and programmable tool designed for traversing websites and capturing structured data from web pages. It functions as a distributed crawling engine that enables the automation of web crawling to discover and extract data. The framework is designed for distributed data extraction, allowing crawling tasks to be spread across multiple servers to process large volumes of web content. This architecture supports high-performance web scraping and enterprise data collection workflows for gathering structured information.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
This project is an LLM-powered web crawler and data extractor that uses large language models to navigate websites and parse content into structured JSON or Markdown formats. It functions as an automated browser orchestrator and domain discovery engine, interpreting plain English instructions to identify relevant pages and extract specific information. The system distinguishes itself through agentic browser automation, allowing it to perform human-like interactions such as clicking buttons and scrolling based on natural language commands. It employs goal-oriented crawling to analyze website s
Jsoup is a Java library designed for parsing, extracting, and manipulating HTML and XML content. It provides a document object model that represents web content as a hierarchical tree, allowing for programmatic navigation and modification of elements, attributes, and text. The library functions as a toolkit for web scraping, enabling the retrieval of remote content via standard web protocols and the management of HTTP sessions for automated form interaction. The library distinguishes itself through its fault-tolerant tokenization, which reconstructs valid document structures from malformed or
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager to interact with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing for programmatic website navigation and the simulation of complex browser interactions. Its capabilities cover automated form population and submission using CSS selectors
Splash is a headless browser HTTP API and JavaScript rendering engine designed to convert dynamic web content into static HTML or images. It functions as a Lua-scriptable browser service that exposes browser automation and rendering capabilities through a RESTful interface for programmatic data extraction. The service distinguishes itself by allowing the execution of custom Lua scripts to automate complex user interaction sequences and page navigation. It provides the ability to switch rendering engines on a per-request basis to verify cross-browser compatibility and visual consistency. The
This project is a reference library of architectural blueprints, study materials, and design patterns for building scalable, high-availability distributed systems. It serves as a technical guide for scalability engineering, providing structural solutions for common engineering challenges. The repository focuses on distributed systems design, covering essential patterns for data replication, consensus algorithms, and transaction management. It distinguishes itself by offering detailed blueprints for specialized domains, including real-time data streaming, large-scale data storage, and high-ava
Karakeep is a self-hosted, open-source platform designed for personal knowledge management and web content archiving. It functions as a centralized repository where users can capture, organize, and preserve bookmarks, notes, and media files, ensuring long-term access to digital information even if original sources are removed or modified. The system distinguishes itself through its automated content processing and security-focused architecture. It utilizes headless browser crawling and optical character recognition to ingest and index web content, while a modular artificial intelligence pipel
This project is a high-performance headless browser engine designed for scalable web automation, data extraction, and AI agent integration. It provides a specialized environment that allows autonomous agents and testing frameworks to interact with web content through standardized remote control protocols. By executing pages in a lightweight, headless state, the engine minimizes resource consumption while maintaining the ability to perform complex navigation and dynamic content rendering. The platform distinguishes itself through deep integration with AI-centric communication layers and advanc
pipet is a command-line tool that turns web scraping into a piped data flow through Unix filters. It provides a set of specialized scrapers — for CSS selector extraction, headless browser JavaScript rendering, JSON API querying, and change monitoring — each outputting structured data that can be transformed by chaining additional commands. The tool uses declarative selectors (CSS and JSON path expressions) to define what to extract, automatically follows pagination links to collect data across multiple pages, and serializes results into JSON, custom-delimited text, or rendered templates. It c
This project is a machine learning data pipeline designed to automate the collection, curation, and preparation of large-scale image datasets. It functions as an image dataset scraper and computer vision curator, providing the necessary infrastructure to aggregate categorized files from web sources and organize them into structured directories for model development. The system distinguishes itself through a batch-processing architecture that integrates data acquisition with automated integrity validation. By scanning files to remove corrupted or invalid images and applying deterministic parti
Skill Seekers is a toolset for generating large language model knowledge bases, featuring a multi-source content scraper and a dedicated RAG data pipeline. It extracts technical data from documentation, code, and video to create structured assets and configuration files for AI-powered IDE extensions. The project distinguishes itself through the ability to transform raw data into polished tutorials and specialized skills for AI plugin marketplaces. It utilizes abstract syntax tree parsing and optical character recognition to analyze GitHub repositories, PDFs, and video frames, converting these