30 open-source projects similar to dedsecinside/torbot, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best TorBot alternative.
This project is a Python web scraping library and automated data collection suite. It provides tools for extracting structured data from websites, implementing web crawlers to navigate site links, and parsing HTML DOM structures to isolate specific elements and attributes. The toolkit includes a pipeline for processing unstructured text and cleaning raw web content to extract meaningful information. It also features capabilities for image data extraction and the integration of external APIs to retrieve structured data from remote endpoints. The system covers broad capability areas including
This project is a Node.js web scraping framework designed to automate data extraction through a programmatic workflow of requests, parsing, and document interaction. It functions as a headless web crawler, an HTTP request manager, and a DOM parser and extractor. The framework distinguishes itself by combining a JavaScript execution engine to interact with dynamic content and a hybrid selection system that utilizes both CSS and XPath selectors. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated states and manage automated traffic.
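As a rough illustration of the proxy rotation and cookie-jar session management described above, here is a language-agnostic sketch written in Python using only the standard library. It is not this framework's API: the `RotatingSession` class, its method names, and the proxy URLs in the usage note are all hypothetical.

```python
import itertools
import urllib.request
from http.cookiejar import CookieJar


class RotatingSession:
    """Round-robin proxy rotation with one shared cookie jar (illustrative only)."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() repeats the proxy list endlessly in its original order.
        self._proxies = itertools.cycle(proxies)
        # One jar for the whole session, so cookies survive proxy changes.
        self.cookies = CookieJar()

    def next_proxy(self):
        """Return the next proxy URL in rotation."""
        return next(self._proxies)

    def build_opener(self):
        """Build a urllib opener bound to the next proxy and the shared jar."""
        proxy = self.next_proxy()
        return urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
            urllib.request.HTTPCookieProcessor(self.cookies),
        )
```

Calling `RotatingSession(["http://127.0.0.1:8001", "http://127.0.0.1:8002"]).build_opener()` returns an opener whose requests go through the first proxy. Each later call moves to the next proxy while reusing the same cookies.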
OnionScan is a free and open source tool for investigating the Dark Web.
reconftw is an attack surface management framework and reconnaissance workflow orchestrator designed to automate the discovery, mapping, and monitoring of external digital assets. It operates as a modular tool-chain pipeline that coordinates a sequence of security tools to perform intelligence gathering and vulnerability scanning. The project distinguishes itself through a cloud-native deployment model that parallelizes scanning workloads across a fleet of remote VPS instances to bypass local resource constraints. It utilizes container-based environment isolation to ensure consistent executio
OnionSearch is a script that scrapes URLs from different .onion search engines.
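To make the idea concrete, the sketch below pulls .onion URLs out of a page of search results that has already been fetched. It is not OnionSearch's own code: the `extract_onion_urls` helper and the pattern are assumptions for illustration, and the pattern matches only v3 onion addresses (56 base32 characters).

```python
import re

# v3 onion hostnames are 56 base32 characters (a-z, 2-7) followed by ".onion".
# Optional subdomains may precede the hostname and an optional path may follow it.
ONION_URL = re.compile(
    r"https?://(?:[a-z0-9-]+\.)*[a-z2-7]{56}\.onion(?:/[^\s\"'<>]*)?",
    re.IGNORECASE,
)


def extract_onion_urls(html: str) -> list[str]:
    """Return the unique .onion URLs found in html, in first-seen order."""
    seen = {}
    for match in ONION_URL.finditer(html):
        # Dicts preserve insertion order, so this deduplicates without reordering.
        seen.setdefault(match.group(0), None)
    return list(seen)
```

Fetching the result pages themselves requires routing requests through a Tor SOCKS proxy, which this sketch leaves out.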
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis
This project is a software engineering educational resource providing a collection of canonical system implementations. It serves as a library of computer science case studies and polyglot code examples designed to demonstrate architectural tradeoffs and design patterns through concise versions of fundamental software components. The repository focuses on studying the implementation of core concepts such as consensus algorithms, interpreters, and database engines. It provides minimal versions of complex systems to facilitate the analysis of language design, data structure implementation, and
ByeDPIAndroid is a deep packet inspection bypass tool for Android that functions as a local SOCKS5 proxy. It modifies TCP packets to evade network censorship and bypass regional internet restrictions on mobile devices. The project operates as a network traffic obfuscator and TCP packet fragmenter. It splits network data into smaller pieces and hides the nature of internet requests to prevent automated blocking and traffic shaping by internet service providers. The system covers a range of capabilities including host-based traffic interception and dynamic packet modification. It utilizes non-
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
Goutte is a PHP web scraper and DOM crawler designed for extracting data from websites. It functions as an HTTP client wrapper that enables the retrieval of web pages and the parsing of HTML content. The project provides a web form automator to programmatically fill and submit HTML forms to remote servers. It also includes a mechanism for automated website crawling by following links to discover and archive web content. The system supports stateful session management to maintain cookies and headers across requests. It further covers HTML data extraction through DOM-based element selection an
Crawlab is a distributed web scraping platform designed to centralize the management, deployment, and execution of large-scale data extraction tasks. It functions as a control plane that orchestrates scraping scripts and automated workflows across multiple nodes, providing a unified environment for managing complex data collection operations. The platform distinguishes itself through a distributed architecture that coordinates worker nodes via a central master, utilizing real-time communication to maintain oversight of all active processes. It ensures operational consistency by isolating task
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
DotnetSpider is a .NET web crawler framework for traversing websites and capturing structured data from web pages. Its distributed crawling engine spreads crawling tasks across multiple servers to process large volumes of web content, supporting high-performance web scraping and enterprise data collection workflows.
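One common way to spread crawl tasks across servers is to partition URLs by host, so that each site is always handled by the same worker. The sketch below shows that general technique in Python. It is an assumption for illustration, not a description of how DotnetSpider schedules work, and `assign_worker` is a hypothetical helper.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, worker_count: int) -> int:
    """Map a URL to a worker index in [0, worker_count) based on its host."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    host = urlsplit(url).hostname or ""
    # A stable hash (unlike built-in hash()) gives the same answer on every machine.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count
```

Keeping a host on one worker makes per-site rate limiting and deduplication local to that worker. The trade-off is that a very large site can overload its assigned node.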
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
shadowsocks-libev is an event-driven network daemon that provides an encrypted SOCKS5 proxy. It functions as a lightweight proxy server using a non-blocking event loop to route TCP and UDP traffic through encrypted tunnels to bypass network restrictions. The project implements a transparent proxy gateway capable of intercepting outbound packets at the network layer, allowing system traffic to be redirected through the encrypted tunnel without per-application configuration. It also includes a daemon process manager to control multiple proxy server instances as child processes via local communi
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Photon is a command-line web crawler designed for security reconnaissance and information gathering. It systematically traverses websites to discover URLs, map domain infrastructure, and identify associated subdomains by retrieving DNS records. The tool distinguishes itself through its ability to perform deep content analysis, including the extraction of sensitive data such as API keys and authentication tokens using user-defined regular expressions. It supports offline inspection by cloning crawled web content to the local filesystem, allowing for structural analysis without additional netwo
Katana is a web crawler and spider designed for security reconnaissance and web application mapping. It functions as a utility for identifying endpoints, forms, and API structures across web targets by combining standard HTTP request traversal with headless browser automation to render dynamic, JavaScript-heavy content. The tool distinguishes itself through its ability to maintain authenticated sessions and handle complex web interactions, such as automated form submission and captcha resolution. It provides granular control over the discovery process, allowing users to define specific crawl
Subfinder is a security reconnaissance framework designed for subdomain enumeration and attack surface management. It functions as a discovery engine that identifies and maps internet-exposed infrastructure, cloud-hosted assets, and network ranges to maintain a comprehensive inventory of an organization's digital footprint. The project distinguishes itself through a modular, template-driven scanning engine that executes security checks against discovered assets. It leverages cloud-native asset discovery to query provider APIs and infrastructure metadata, while supporting distributed agent orc
wstunnel is a tool that tunnels arbitrary TCP traffic through WebSocket connections, enabling communication across restrictive firewalls and proxies. It operates as both a client and server, encapsulating TCP data within WebSocket binary frames and multiplexing multiple connections over a single WebSocket link. The tool supports mutual TLS authentication, requiring clients to present signed certificates for verification before establishing a tunnel, and provides shared secret access control and tunnel forwarding restrictions for additional security. The project distinguishes itself by offerin
This project is a censorship circumvention tool and transparent proxy gateway designed to bypass local network restrictions. It functions as a SOCKS5 proxy server, a DNS tunneling tool, and a network traffic obfuscator to help users access blocked websites. The software implements masking protocols to hide the origin and destination of data to evade restrictive firewalls. It provides capabilities for network traffic obfuscation and secure DNS tunneling to protect network privacy and resolve blocked domains. The system handles wide-scale traffic management by intercepting system network traff
naiveproxy is a censorship circumvention tool and traffic obfuscation proxy. It functions as an HTTP/2 transport proxy that tunnels SOCKS5 traffic over HTTP/2 to hide network activity and bypass network blocks. The project distinguishes itself by mimicking standard web browser requests to evade deep packet inspection. It employs traffic camouflage techniques such as redirecting unauthorized probing requests to decoy web servers and using randomized packet padding to defeat length-based traffic analysis. The software provides a local SOCKS5 proxy endpoint, credential-based request authenticat
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
This project is a Model Context Protocol server that connects large language models to web scraping and crawling tools. It functions as a bridge, allowing LLM clients to utilize a web crawling engine and scraping utilities to extract and process web data. The server integrates a markdown web converter that transforms dynamic web pages and PDF documents into clean markdown to optimize consumption by AI models. It also provides a browser automation interface for controlling headless sessions and bypassing access restrictions. The system covers broad capabilities including large-scale website d
Pholcus is a distributed web crawling system designed for large-scale data scraping. It employs a master-worker distribution model to coordinate high-concurrency scraping tasks across a network of remote client nodes, enabling both horizontal and vertical data collection. The system features a hot-loadable rule engine that allows extraction and navigation logic to be updated at runtime without restarting the process. It handles dynamic content through headless browser integration and bypasses bot detection using proxy rotation, automated user authentication, and simulated human behavior. The
DocSearch is an integrated toolset for adding search capabilities to documentation websites. It provides a JavaScript and React search interface for embedding autocomplete search bars, a dedicated web crawler to extract and synchronize site content into a searchable index, and a monitoring system to track user queries and interaction events. The project distinguishes itself by incorporating a conversational AI assistant powered by retrieval-augmented generation. This assistant grounds a large language model in a specific documentation index to provide factual answers, with configurable system