30 open-source projects similar to jack-cherish/python-spider, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Python Spider alternative.
TikTokDownload is a configurable batch video downloader for TikTok and Douyin that strips watermarks and supports automated downloads from user profiles, likes, and collections. It functions as a social media content archiving tool, enabling users to download videos and audio from these platforms for offline viewing or personal backup. The project distinguishes itself through a modular download pipeline that combines audio extraction, batch scheduling, config-driven workflows, cookie-based authentication, URL parsing, paginated API scraping, and watermark removal. It uses a settings file to c
This project is a specialized TikTok API scraper and data extractor. It functions as a proxy-based web scraper designed to collect user metadata, video posts, and trend feeds, while providing a webhook data pipeline to route scraped information to external URLs via HTTP requests. The tool includes a watermark-free video downloader that saves high-definition content to local storage. It employs cryptographic request signing for server authentication and utilizes session cookie authentication combined with proxy rotation to manage network traffic and avoid rate limits. Capabilities cover bulk
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis
PROXY-List is a public proxy aggregator that provides data structures for storing and aggregating publicly available HTTP and SOCKS proxy server addresses. It serves as a source for retrieving network traffic routing lists used to mask origin IP addresses during web requests. The project utilizes a data pipeline to automatically scrape, poll, and serialize proxy lists from multiple public websites. This infrastructure ensures the availability of active servers through scheduled periodic polling and automated content refreshes, delivering the resulting lists as plain text files. These capabil
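As a rough illustration of consuming a plain-text proxy list like the one described above, the sketch below parses `host:port` lines into structured records. The one-entry-per-line `host:port` layout, the `Proxy` class, and the `parse_proxy_list` helper are assumptions made for this example; the project's actual file format is not stated here, and the sample addresses are reserved documentation ranges, not real servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    """Hypothetical record type for one proxy entry."""
    host: str
    port: int


def parse_proxy_list(text: str) -> list[Proxy]:
    """Parse 'host:port' lines, skipping blanks, comments, and malformed rows."""
    proxies = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last colon so the port is always the final field.
        host, sep, port = line.rpartition(":")
        if not sep or not host or not port.isdigit():
            continue  # not a host:port pair
        port_number = int(port)
        if 0 < port_number < 65536:
            proxies.append(Proxy(host, port_number))
    return proxies


# Assumed sample input (documentation-only IP ranges).
sample = """\
# hypothetical sample, not real servers
192.0.2.10:8080
198.51.100.7:1080

not-a-proxy
203.0.113.5:99999
"""

parsed = parse_proxy_list(sample)
print(parsed)
```

Malformed rows and out-of-range ports are dropped silently in this sketch; a real consumer might prefer to log them.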
nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping capabilities. It functions as a Chrome DevTools Protocol client, allowing for granular engine control by attaching directly to the browser's debug port without the need for external driver binaries. The framework is specifically designed as an anti-bot detection bypass tool. It modifies browser fingerprints and protocol headers to evade automated security systems, handle security warnings, and bypass common obstacles like insecure connection alerts. The system covers a broad rang
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
This project is a railway booking automation tool designed to monitor ticket inventory and execute purchases on the 12306 platform. Its primary purpose is to secure high-demand train tickets by automating the login, booking, and checkout processes. The system utilizes automated captcha solving and headless session management to bypass security barriers and maintain user authentication. It employs a concurrent request queue and polling-based inventory monitoring to track seat availability and execute purchases immediately as they open. The automation surface includes waitlist management for r
JobSpy is a job board scraper and listing aggregator designed to extract employment opportunities from multiple websites and compile them into a unified dataset. It functions as a job search automation tool that programmatically collects vacancies based on keywords, locations, and specific filters. The project serves as a web scraping framework that utilizes proxy routing and user-agent rotation to bypass rate limits and avoid server-side blocking during data extraction. It includes infrastructure for concurrent request aggregation and schema-based data normalization to ensure consistent form
This project is a public proxy aggregator and directory providing curated lists of validated HTTP and SOCKS proxy servers. It features a machine-readable API service and tools designed for anonymous network routing and the automated rotation of outgoing IP addresses. The system distinguishes itself through a proxy rotation tool used to bypass rate limits and prevent detection by automated security systems. It provides a programmatic interface for retrieving and filtering verified proxies by country and protocol, delivering this data in JSON and text formats for integration into custom applica
BiliTools is a modular download tool for Bilibili, supporting authentication, media extraction, metadata management, and user content backup. It provides a configurable download pipeline with QR-based session authentication, automatic captcha and device verification, and stream muxing that merges separate audio and video segments into a single file. A plugin-based media extractor handles multiple content types and streaming endpoints, while a metadata scraping and tagging pipeline writes structured tags into files for media organizers. Subtitle and caption synchronization converts comment o
This is a collection of Python automation scripts and utility tools designed to handle repetitive technical tasks, system administration, and developer workflows. The project serves as a suite for task automation, data utility, and web automation. The collection includes specialized tools for multimedia processing, such as optical character recognition for extracting text from images, speech-to-text conversion, and real-time face and human body detection. It also features web scraping and monitoring capabilities to track product prices, fetch external API content, and automate interactions wi
Scylla is a system for managing HTTP proxy pools and automating web extraction. It provides a specialized data acquisition pipeline designed for gathering large-scale internet datasets for training and fine-tuning large language models. The project features a proxy rotation gateway that assigns fresh proxy addresses to incoming requests to mask origin traffic and avoid IP blocking. It includes a proxy pool manager that handles the collection, functional validation, and orchestration of proxy servers, complemented by a web dashboard for monitoring the health and geographic distribution of the
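The idea of handing a fresh proxy address to each request can be sketched as a round-robin rotator that skips proxies marked unhealthy. This is a minimal illustration, not Scylla's implementation: the `ProxyRotator` class and its method names are hypothetical, and real validation would involve actual network checks.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies in round-robin order, skipping ones marked as failed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._failed = set()
        self._cycle = cycle(self._proxies)

    def mark_failed(self, proxy):
        """Record that a proxy failed validation so it is no longer returned."""
        self._failed.add(proxy)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none remain."""
        # Inspect each proxy at most once per call to avoid looping forever.
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._failed:
                return candidate
        raise RuntimeError("no healthy proxies left")


# Hypothetical pool; the names stand in for real proxy addresses.
rotator = ProxyRotator(["proxy-a", "proxy-b", "proxy-c"])
first = rotator.next_proxy()
second = rotator.next_proxy()
rotator.mark_failed("proxy-c")
third = rotator.next_proxy()  # proxy-c is skipped
```

Round-robin is only one possible policy; weighting by latency or success rate would be a natural variation.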
Spider_XHS is a data extraction and automation tool built specifically for the Xiaohongshu social platform. It orchestrates multi-step workflows that combine comment tree traversal, cookie-based session reuse, high-resolution media retrieval, keyword search, proxy-backed retries, QR-code login, structured file export, and aggregated user profile collection into a single pipeline. The tool distinguishes itself through its integrated authentication and publishing capabilities, supporting login via QR code scanning or phone verification codes to establish and maintain authenticated sessions. It
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
This project is a web-based manga and novel downloader and multi-site web scraper designed to extract images and text from diverse media platforms. It functions as a digital media archiver and EPUB e-book generator, using a plugin-based crawler architecture with site-specific scripts to define how content is extracted from various international websites. The system distinguishes itself through authenticated web crawling, using browser cookie simulation to access restricted or member-only content. It includes specialized capabilities for digital comic archiving, which organizes image sequences
BiliBiliToolPro is an account automation tool for Bilibili designed to manage multiple profiles, claim rewards, and maintain session cookies via QR code authentication. It functions as a growth bot and reward collector that automates daily activities to increase account rank and experience points. The project is built as a containerized automation suite, allowing for scheduled task execution across Docker, Kubernetes, or other automation panels. It features multi-account profile isolation, which separates user credentials and session data to execute tasks independently for different accounts.
node-lessons is a comprehensive Node.js programming course and instructional guide. It provides a collection of guided lessons and code examples designed to teach the fundamentals of the Node.js runtime and server-side JavaScript development. The project serves as a practical guide for building web servers and backend applications, specifically covering the implementation of HTTP servers, request routing, and middleware chains. It includes specialized instructional material on managing asynchronous JavaScript workflows through promises and flow control, as well as guides for integrating NoSQL
This is a tool for downloading videos, images, and audio from the Douyin social media platform using shareable URLs or profile links. It can download individual posts, entire user profiles including all posts and liked content, collections, and music tracks, with options for watermark-free and high-quality output. The tool also supports live stream recording, comment collection, and keyword-based content search with JSONL export. The project distinguishes itself through an integrated REST API server that accepts download and transcription requests, tracks job status, and exposes health check
cloudscraper is a Python library designed to bypass Cloudflare anti-bot protections by resolving JavaScript challenges and mimicking browser fingerprints. It functions as a specialized tool for accessing websites that employ automated security systems to block scripts and headless browsers. The project differentiates itself through the use of interchangeable JavaScript runtimes, such as Node.js or V8, to execute challenge code and obtain security clearance tokens. It employs a fingerprint rotation engine and HTTP request emulation to rotate browser headers and device identifiers, mimicking hu
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
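One way to picture a resource-aware concurrency controller is a function that raises or lowers the number of parallel tasks based on the busier of CPU and memory. The sketch below is an assumption-laden illustration, not Crawlee's actual algorithm: `next_concurrency`, its thresholds, and the halve-on-overload, add-one-when-idle policy are all invented for this example.

```python
def next_concurrency(current, cpu_ratio, mem_ratio, *,
                     min_c=1, max_c=64, high=0.85, low=0.60):
    """Return the desired task count for the next interval.

    cpu_ratio and mem_ratio are utilisation values in the range 0.0 to 1.0.
    All thresholds and limits are hypothetical defaults.
    """
    load = max(cpu_ratio, mem_ratio)  # the scarcer resource drives the decision
    if load > high:
        desired = current // 2        # back off quickly when overloaded
    elif load < low:
        desired = current + 1         # probe upward slowly when idle
    else:
        desired = current             # hold steady inside the target band
    # Clamp to the configured bounds.
    return max(min_c, min(max_c, desired))


overloaded = next_concurrency(8, cpu_ratio=0.95, mem_ratio=0.30)
idle = next_concurrency(8, cpu_ratio=0.20, mem_ratio=0.30)
steady = next_concurrency(8, cpu_ratio=0.70, mem_ratio=0.50)
```

Backing off multiplicatively and growing additively keeps the controller from oscillating sharply around the overload threshold.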
OkHttpUtils is a convenience wrapper for the OkHttp HTTP client that simplifies common networking operations on Android. It provides a straightforward interface for executing GET and POST requests, including sending form parameters and JSON payloads, as well as uploading files via multipart form data and downloading remote files to local storage. The library distinguishes itself through a set of practical utilities built on top of OkHttp's core architecture. It wraps synchronous calls into an asynchronous callback pattern, includes an interceptor-based logging layer for request and response d
MVVMHabit is an Android development framework and base library that implements the MVVM architecture using Android Architecture Components. It provides a pre-integrated foundation designed to decouple business logic from user interface rendering and lifecycle management. The project distinguishes itself by bundling a comprehensive set of architectural templates, including a reactive event bus for decoupled component communication, token-based data exchange between logic instances, and a single-activity fragment hosting system to reduce manifest overhead. The framework covers broad capability
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Jsoup is a Java library designed for parsing, extracting, and manipulating HTML and XML content. It provides a document object model that represents web content as a hierarchical tree, allowing for programmatic navigation and modification of elements, attributes, and text. The library functions as a toolkit for web scraping, enabling the retrieval of remote content via standard web protocols and the management of HTTP sessions for automated form interaction. The library distinguishes itself through its fault-tolerant tokenization, which reconstructs valid document structures from malformed or
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
youtube-transcript-api is a Python library designed to retrieve and download subtitles and captions from YouTube videos using video IDs. It functions as an API client that extracts text and timing data for video content. The project includes a wrapper for automated translation, allowing transcripts to be converted into different target languages. It also features a retrieval system that supports routing requests through HTTP, HTTPS, or SOCKS proxies to avoid IP blocking and regional restrictions. The library provides tools for identifying available subtitle tracks and converting raw transcri
Integuru is a system of AI-driven agents and frameworks designed to document undocumented APIs and convert network traffic into automation scripts. It functions as a headless API automation framework that replaces browser-based tools with direct HTTP requests to increase throughput and reliability. The project features an LLM-based reverse engineering agent that analyzes network traffic to discover internal APIs and a natural language integration engine that transforms text descriptions of workflows into sequences of valid API calls. It includes tools for extracting request and response forma
BilibiliDown is a cross-platform desktop application designed for downloading high-resolution videos, audio, and images from the Bilibili platform. It functions as a batch media downloader and content archiver, enabling the retrieval of content from user spaces, playlists, and curated favorites lists for local offline storage. The tool distinguishes itself through media processing and archiving capabilities, including the ability to save video supplements such as closed captions and bullet comments. It features a media transcoding tool to convert streams into standard audio and video formats
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col