30 open-source projects similar to nanmicoder/crawlertutorial, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best CrawlerTutorial alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Damaihelper is a ticketing automation bot and browser automation framework designed to monitor ticket availability and execute checkout processes. It uses a ticket purchasing script to select and buy tickets on web platforms according to predefined user criteria. The tool includes a graphical user interface for managing scripts and configuring automation parameters, so users can trigger tasks without a command line. To stay logged in, it saves and reuses authentication cookies through browser session management, avoiding repeated manual logins.
This project is a Python web scraping tutorial and framework designed for building automated data extraction tools and web crawlers. It provides a structured approach to navigating websites and persisting scraped data to databases. The project includes a toolset for web API analysis, focusing on reverse engineering obfuscated API requests and inspecting network traffic to extract structured data. It also covers optical character recognition workflows to convert visual text within images into machine-readable strings. The framework covers capabilities for headless browser automation to handle
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
OpenCLI is an AI browser automation framework designed to automate web navigation, data extraction, and repetitive browser tasks. It functions as a browser-based CLI generator that converts website interfaces into command-line interactions by controlling authenticated web browser sessions. The project features a web-to-CLI adapter platform for mapping web elements to programmatic command-line inputs and outputs. It includes a browser profile manager to organize and switch between isolated session profiles to maintain different user identities. The toolkit provides capabilities for web conten
nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping capabilities. It functions as a Chrome DevTools Protocol client, allowing for granular engine control by attaching directly to the browser's debug port without the need for external driver binaries. The framework is specifically designed as an anti-bot detection bypass tool. It modifies browser fingerprints and protocol headers to evade automated security systems, handle security warnings, and bypass common obstacles like insecure connection alerts. The system covers a broad rang
CyberScraper-2077 is an AI-powered web scraping tool that uses large language models to extract and structure data from websites into organized formats. It functions as an LLM web scraper and AI content parser, transforming unstructured raw web text into specific data schemas. The project distinguishes itself through a suite of anonymity and evasion tools, including proxy rotation, SOCKS-based identity masking, and the ability to route traffic through the Tor network to access hidden onion services. It further includes a bot detection bypass system that employs stealth parameters and custom n
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
Pholcus is a distributed web crawler framework written in Go designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
This project is a ticket purchase automation tool and browser automation bot designed to secure high-demand event tickets. It functions as a web scraping purchase script that monitors availability and executes checkout transactions programmatically. The tool utilizes a hybrid execution model that combines headless browser automation for authentication and session management with direct HTTP requests to ticketing server APIs. This approach is used to bypass user interface latency and handle high-speed request processing during flash sales. The system includes capabilities for automated availa
Maxun is an open-source web scraping and automation platform designed to transform dynamic website content into structured data. By leveraging artificial intelligence to interpret natural language prompts, the system identifies page elements and extracts information without requiring manual selector configuration. It serves as a bridge between raw web content and intelligent workflows, providing structured outputs in formats optimized for large language model ingestion and agent-based applications. The platform distinguishes itself through its ability to handle complex, authenticated, and dyn
scrape-it is a Node.js web scraper and HTML parser designed to extract structured data from websites and HTML files. It functions as a web data extraction tool that retrieves specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and employs schema-driven data mapping to organize unstructured web text into a consistent format. It supports custom value transformation to convert raw extracted strings into specific data formats. The system provides capabilities for web data extraction and automated cont
Automa is a browser-based automation platform that enables users to build, schedule, and execute repetitive web tasks through a visual, no-code interface. By operating as a browser extension, it provides a canvas-based environment where users construct workflows by connecting functional blocks to interact with web elements, manage browser state, and process data. The platform distinguishes itself through its deep integration with the browser environment, allowing for complex orchestration such as event-driven triggers, cross-origin request handling, and the ability to package workflows as sta
Obscura is a web scraping infrastructure and headless browser server designed for AI agents. It provides a system for AI models to control browser sessions, interact with websites, and extract web data using a WebSocket implementation of the Chrome DevTools Protocol. The project focuses on bot detection evasion by randomizing browser fingerprints, masking native functions, and blocking tracking scripts to mimic human behavior. It further secures identities through a traffic layer that routes network requests via HTTP or SOCKS5 proxies. The system supports large-scale data extraction through
Pyppeteer is a Python library for controlling Chromium-based browsers using the Chrome DevTools Protocol. It functions as a headless browser automation tool, allowing for the programmatic navigation of web pages and the extraction of data from dynamic websites. The project provides low-level browser control through direct communication with the Chrome DevTools Protocol, enabling the interception and modification of network traffic. It differentiates itself by offering specialized performance profiling capabilities, including the measurement of JavaScript and CSS code coverage and the capture
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
This project is an Amazon web scraper and e-commerce data extractor designed to retrieve product names, prices, and ratings. It functions as a headless browser crawler that converts unstructured web content from product listings into structured JSON and CSV formats. The tool incorporates anti-bot bypass capabilities to circumvent CAPTCHAs and security challenges. It achieves this through the use of residential proxy integration, automatic proxy rotation, and the modification of browser fingerprints to simulate human interaction patterns. The system provides broad web scraping capabilities, i
Playwright-cli is a command line interface for executing web tasks and automating browser interactions using the Playwright framework. It serves as a browser binary manager for downloading and installing specific browser engines and their required system dependencies, as well as a tool for running automated test suites across multiple engines to verify application behavior. The utility functions as a browser session controller, managing browser profiles and persistent storage states via the command line. It enables the execution of automation suites across different browser engines and config
This repository is a comprehensive collection of instructional guides and practical examples for Python development, focusing on machine learning, data science, and web scraping. It provides implementations for neural networks, reinforcement learning algorithms, and deep learning architectures using PyTorch, alongside detailed manuals for scientific computing and data visualization. The project distinguishes itself by offering specialized tutorials on concurrent programming to optimize CPU performance and guides for setting up Linux development environments. It covers the implementation of ad
Open Deep Research is an AI-powered web research agent that combines a reasoning model with live web search and data extraction to perform deep, multi-source investigations on any topic. It operates through a dual interface, offering both a command-line tool and a Model Context Protocol server, allowing developers to integrate web capabilities directly into AI agents and coding assistants. The project distinguishes itself by orchestrating an iterative research loop where a reasoning model plans steps, interprets search results, and guides subsequent web interactions. It uses Firecrawl for scr
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
X-ray is a headless browser web scraper and HTML content crawler designed to extract structured data from websites. It functions as a stream-based data scraper and structured data extractor, using selectors to retrieve text and attributes from HTML as nested objects or arrays. The project includes a request rate controller to manage network traffic through concurrency limits, throttles, and timeouts. It handles dynamic website scraping by rendering JavaScript via a headless browser and performs automated website crawling using breadth-first link following and pagination management. The syste
X-crawl is a Node.js-based web scraping framework designed to automate data collection from both static and dynamic websites. It integrates artificial intelligence to perform semantic parsing, allowing it to transform unstructured HTML into structured data formats that remain accurate even when website layouts or class names change. The project distinguishes itself through a comprehensive suite of stealth and reliability features. It manages crawler identity by randomizing device fingerprints and rotating proxy servers to bypass access restrictions. To handle complex, JavaScript-heavy interfa
This project is a high-performance headless browser engine designed for scalable web automation, data extraction, and AI agent integration. It provides a specialized environment that allows autonomous agents and testing frameworks to interact with web content through standardized remote control protocols. By executing pages in a lightweight, headless state, the engine minimizes resource consumption while maintaining the ability to perform complex navigation and dynamic content rendering. The platform distinguishes itself through deep integration with AI-centric communication layers and advanc
Browserless is a service-oriented platform designed for remote browser automation and headless execution. It provides a distributed infrastructure that manages browser sessions through containerized isolation, allowing users to execute scripts and interact with web content without maintaining local browser state or infrastructure. The platform functions as a remote API and WebSocket-based control layer, enabling stateless HTTP requests for tasks like document generation and real-time browser interaction. It incorporates proxy-based routing to manage traffic signatures and supports the integra
gstack is an AI agent framework and development workflow system designed to automate the software development lifecycle. It coordinates specialized AI personas to manage tasks across product design, engineering management, and quality assurance, transforming product intent into technical specifications and final releases. The project is distinguished by its deep integration of headless browser automation and semantic code memory. It utilizes a persistent Chromium daemon for web scraping and visual auditing, and implements a searchable knowledge base that logs architectural decisions and repos
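Several of the entries above (node-crawler, X-Ray) describe breadth-first crawling combined with task deduplication. The following is a minimal sketch of that general idea, not the implementation used by any listed project. The link graph, `fetch_links`, and `crawl_bfs` are hypothetical names introduced for illustration, and an in-memory dictionary stands in for real HTTP fetches so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
LINK_GRAPH = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": [],
}


def fetch_links(url):
    """Stand-in for an HTTP fetch plus link extraction (assumption: no network)."""
    return LINK_GRAPH.get(url, [])


def crawl_bfs(start, max_pages=100):
    """Visit pages breadth-first, never queueing the same URL twice."""
    seen = {start}          # deduplication set: every URL ever queued
    queue = deque([start])  # FIFO queue gives breadth-first order
    order = []              # pages in the order they were visited
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    print(crawl_bfs("/"))
```

Marking a URL as seen when it is queued, rather than when it is visited, keeps duplicates out of the queue even when many pages link to the same target. A real crawler would replace `fetch_links` with an HTTP client and add rate limiting around it.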