30 open-source projects similar to gocolly/colly, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Colly alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Pholcus is a distributed web crawler framework written in Go and designed for high-concurrency data extraction. It functions as a distributed crawling orchestrator and dynamic data extraction engine, utilizing a server-client architecture to coordinate tasks across multiple nodes. The system integrates a headless browser engine to render dynamic content and execute JavaScript, allowing it to extract data from single-page applications. It features a web-based management interface for configuring spider parameters and monitoring execution progress, alongside the ability to update extraction rules v
Go Spider is a modular framework designed for building concurrent web scrapers and data extraction workflows. It provides a structured engine for orchestrating automated crawling tasks, managing request scheduling, and processing web content through a unified pipeline. The framework distinguishes itself through a highly configurable architecture that allows developers to inject custom logic for downloaders, schedulers, and storage components via interface-driven contracts. It manages network interactions using middleware-based request throttling and URL deduplication, ensuring that crawling o
Botasaurus is a Python web scraping framework and headless browser automation system used to build scalable data extraction tools. It functions as a web data extraction tool and OCR document parser, converting website content, images, and PDF files into structured formats such as JSON, CSV, and Excel. The framework distinguishes itself by providing a scraper management interface that allows Python functions to be wrapped in a web-based UI or deployed as standalone desktop applications. This enables non-technical users to trigger extraction jobs and manage tasks via a graphical interface or RE
Crawlab is a distributed web scraping platform designed to centralize the management, deployment, and execution of large-scale data extraction tasks. It functions as a control plane that orchestrates scraping scripts and automated workflows across multiple nodes, providing a unified environment for managing complex data collection operations. The platform distinguishes itself through a distributed architecture that coordinates worker nodes via a central master, utilizing real-time communication to maintain oversight of all active processes. It ensures operational consistency by isolating task
LMCache is a distributed key-value cache manager and tiering system designed to accelerate large language model inference. It functions as a tiered storage layer that offloads tensors from GPU memory to CPU RAM, local disks, or remote object stores, enabling the reuse of cached prefixes across different inference sessions and serving engines. The system differentiates itself through a disaggregated prefill-decode model, which separates prompt processing from token generation by transferring caches between distributed compute nodes. It utilizes peer-to-peer orchestration to share and retrieve
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
FastMCP is a Python framework designed for building servers that expose functions, resources, and prompts to AI models using the Model Context Protocol. It simplifies the development process by automatically deriving tool metadata, input schemas, and documentation directly from Python function signatures and type hints. The framework provides a unified container for managing these components, allowing developers to build modular applications that integrate seamlessly with AI assistants. The project distinguishes itself through its support for interactive, server-defined user interface compone
MediaCrawler is an automated web scraping framework designed to extract public posts, comments, and creator metadata from various social media platforms. It functions as a headless browser automator, utilizing real browser instances to render dynamic content and execute the client-side scripts necessary for interacting with modern web interfaces. The system distinguishes itself through a focus on session persistence and network flexibility. It supports remote debugging to reuse active browser sessions and cookies, which helps minimize the risk of triggering platform security challenges. To ma
MechanicalSoup is a Python web automation library and scraping framework designed to simulate browser sessions and navigate websites without requiring JavaScript execution. It functions as an HTML parsing tool and HTTP session manager, allowing for the programmatic retrieval of page content and the automation of web interactions. The library distinguishes itself by combining session persistence with automated form interaction. It maps user data to HTML input fields and selection boxes for programmatic submission and maintains authenticated states by managing cookies and user-agent headers acr
AngleSharp is an HTML5 DOM parser and web scraping framework designed to parse HTML5, SVG, and MathML documents into a W3C-compliant document object model. It functions as a programmatic HTML generator and a CSS selector engine for querying and locating specific elements within a DOM. The project provides tools for simulating browser environments to automate web interactions, navigate URLs, and submit forms. It includes a dedicated HTML and CSS minifier to reduce the file size of web assets by removing unnecessary characters. The library supports HTML DOM manipulation and the extraction of s
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
This project is a manga source extension repository and content aggregator. It functions as an HTTP content scraping engine that retrieves images and metadata from external provider websites by parsing HTML and making network requests to display digital manga within a unified reader. The system utilizes a JSON extension repository to allow reader applications to discover and install third-party content providers. It employs an interface-based plugin framework that defines a common set of methods to ensure external sources remain compatible with a standardized internal format. The project cov
requests-html is a Python HTML parsing library and web scraping framework. It functions as an asynchronous HTTP client and a JavaScript rendering engine designed to fetch and parse web pages for structured data extraction. The project integrates a headless browser to execute JavaScript, allowing it to retrieve dynamically generated content that standard HTML parsers cannot see. It provides tools for automated data extraction using CSS selectors and XPath expressions to isolate specific text or attributes from HTML structures. The framework covers network operations including asynchronous pag
Puppeteer Sharp is a .NET wrapper and automation library used to programmatically drive headless Chrome and Chromium browsers. It functions as a Chrome DevTools Protocol client, providing a framework for web scraping and the automation of web page interactions. The project enables the execution of JavaScript within the browser context and supports attaching to remote browser sessions via WebSocket endpoints. It allows for the manipulation of browser states to perform functional web testing and visual regression analysis. Capability areas include content transformation via HTML injection, pag
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
This project is a Python-based automation toolkit designed to manage programmatic authentication and session persistence across web services. It provides a framework for executing automated login sequences, including the handling of interactive security challenges such as QR code verification and captcha resolution. The toolkit distinguishes itself by simulating native mobile application environments, allowing for the execution of scripts that require specific device-level headers and behaviors. It also incorporates hook-based interception to monitor workflow states and manage exceptions duri
Social-analyzer is an open-source intelligence framework designed for the automated discovery, correlation, and verification of digital identities across online platforms. It functions as a comprehensive engine for gathering social media intelligence, utilizing distributed browser automation to extract metadata and profile information from hundreds of websites simultaneously. The platform distinguishes itself through its ability to perform cross-platform identity correlation using heuristic-based pattern matching and name permutation generation. It processes these findings through a confidenc
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
This project is a Python web scraping library and automated data collection suite. It provides tools for extracting structured data from websites, implementing web crawlers to navigate site links, and parsing HTML DOM structures to isolate specific elements and attributes. The toolkit includes a pipeline for processing unstructured text and cleaning raw web content to extract meaningful information. It also features capabilities for image data extraction and the integration of external APIs to retrieve structured data from remote endpoints. The system covers broad capability areas including
JMComic-Crawler-Python is a high-performance asynchronous web scraper and API client designed to programmatically retrieve images and metadata from a comic hosting service. It functions as a media archiving tool for batch downloading albums and chapters, automating the process of saving content to a local filesystem. The project is distinguished by its ability to reverse server-side pixel obfuscation, using a decryption tool to reconstruct sliced and shuffled images. To maintain stable connectivity, it utilizes a network bypass utility featuring dynamic domain rotation and proxy routing to ci
Puppeteer is a JavaScript library for programmatically controlling Chrome and Firefox through the Chrome DevTools Protocol or the WebDriver BiDi protocol. It launches and manages browser instances—typically without a visible user interface—to automate interactions with web pages, enabling navigation, clicking, typing, and data extraction entirely through code. The library distinguishes itself through deep integration with the Chromium embedding layer, allowing fine-grained process configuration with custom flags, permissions, and sandbox policies. It maintains multiple concurrent command stre
Helium is a Python library and high-level wrapper for Selenium designed for browser automation, functional UI testing, and web scraping. It provides a simplified interface for interacting with web applications across different browser engines. The library distinguishes itself by allowing users to identify and interact with web elements using visible text labels rather than relying exclusively on technical identifiers like XPaths or CSS selectors. This approach enables the creation of automation scripts based on human-readable labels. The toolkit covers a broad range of browser automation cap
JobSpy is a job board scraper and listing aggregator designed to extract employment opportunities from multiple websites and compile them into a unified dataset. It functions as a job search automation tool that programmatically collects vacancies based on keywords, locations, and specific filters. The project serves as a web scraping framework that utilizes proxy routing and user-agent rotation to bypass rate limits and avoid server-side blocking during data extraction. It includes infrastructure for concurrent request aggregation and schema-based data normalization to ensure consistent form
ProxyBroker is a tool for scraping public HTTP and SOCKS proxy addresses, validating their connectivity, and managing a curated pool of functional proxies. It consists of a proxy scraper for discovery, a validation engine to check anonymity and response times, and a pool manager to maintain a filtered queue of servers. The project includes a local rotating proxy server that acts as a single entry point, automatically distributing incoming network traffic across a pool of validated external proxies. This infrastructure allows for the rotation of IP addresses to maintain resilience during web d
qd is a server-side execution engine and request scheduler designed to automate recurring network tasks. It functions as a task automator that converts HTTP Archive files into reusable request templates for scheduled execution. The system is powered by a non-blocking server that manages a timer-driven execution engine. This allows the project to orchestrate API tasks by replaying captured network traffic and triggering network requests based on defined recurring intervals. The tool covers a broad range of automation capabilities, including schema-driven task configuration and stateless reque
Agent-Reach is an AI agent web gateway and search tool that provides language models with the ability to search and read content from the open web, social media, and community forums without using official APIs. It functions as a routing layer that connects large language models to various internet backends while managing content parsing and connection health. The system enables API-free information retrieval by using open-source backends to extract text and metadata from platforms such as Twitter, Reddit, and YouTube. It converts unstructured website content, RSS feeds, and video transcripts
WebAgent is an autonomous web navigation agent and research system designed to browse the internet and synthesize information to answer complex queries. It functions as a reasoning orchestrator that navigates the web iteratively to perform deep research and extract structured data. The project includes a reinforcement learning training pipeline that generates synthetic interaction datasets for model pre-training and fine-tuning. It employs token-level policy gradients to stabilize training in non-stationary environments and uses a dual-mode inference scaling mechanism to balance execution bet
Grequests is an asynchronous HTTP batcher and Gevent-based client library used to execute large sets of network requests simultaneously. It functions as a concurrent request wrapper for the Requests library, enabling non-blocking operations to reduce the total time spent waiting for server responses. The project provides a task-pool execution model to handle batch network operations, such as high-throughput web scraping and API polling. It can stream responses as they arrive via a generator, allowing for immediate data processing without waiting for the entire batch to complete. The library