# jack-cherish/python-spider

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/jack-cherish-python-spider).**

19,660 stars · 5,974 forks · Python

## Links

- GitHub: https://github.com/Jack-Cherish/python-spider
- Homepage: https://cuijiahua.com/blog/spider/
- awesome-repositories: https://awesome-repositories.com/repository/jack-cherish-python-spider.md

## Topics

`python` `python-spider` `python3` `webspider`

## Description

This is a collection of Python scripts for extracting data from popular Chinese websites and mobile applications. The scripts automate tasks such as downloading videos from Bilibili and Douyin, scraping product reviews and images from e-commerce sites such as Taobao and JD.com, and booking train tickets on the 12306 railway system.

The project focuses on automating specific tasks on Chinese platforms rather than providing a general-purpose crawler. It includes scripts for solving Chinese CAPTCHA challenges such as GEETEST, removing watermarks from downloaded videos, and building a pool of proxy IPs to avoid blocking during large-scale scraping. It can also assist with live quiz games by capturing questions from streaming apps, searching for answers online, and broadcasting the results in real time over WebSocket.

The toolkit also covers a broad range of standard web scraping techniques. It handles both static and dynamic web content, manages session-based authentication for sites such as Taobao, and provides utilities for downloading images, music, and novels. The project also includes scripts for querying university academic systems and for scheduling automated actions, such as booking tickets at a precise time.

## Tags

### Web Development

- [Web Scraping](https://awesome-repositories.com/f/web-development/web-scraping.md) — A collection of Python scripts for scraping data from popular Chinese websites and apps.
- [HTML and JSON Parsers](https://awesome-repositories.com/f/web-development/backend-development/request-response-handling/http-request-handling/request-parsing/html-and-json-parsers.md) — Chains HTTP requests with HTML and JSON parsing to extract structured data from websites.
- [API Reverse Engineering](https://awesome-repositories.com/f/web-development/web-scraping-engines/api-reverse-engineering.md) — Intercepts mobile app and web traffic to reverse-engineer undocumented API endpoints for data extraction.
- [Real-Time Data Pushing](https://awesome-repositories.com/f/web-development/websocket-integrations/real-time-data-pushing.md) — Broadcasts scraped or computed results to connected clients instantly using persistent WebSocket connections.
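
The HTML and JSON parsing entry above can be illustrated with a minimal standard-library sketch. It is not taken from the repository: the `LinkExtractor` class, the helper names, and the `items`/`title` JSON layout are assumptions chosen for illustration, and the input is a literal string rather than a live response.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    """Return all link targets found in an HTML document, in page order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


def extract_titles(json_text):
    """Return the title of each item in an assumed {"items": [...]} payload."""
    payload = json.loads(json_text)
    return [item["title"] for item in payload.get("items", [])]


if __name__ == "__main__":
    page = '<ul><li><a href="/video/1">One</a></li><li><a href="/video/2">Two</a></li></ul>'
    print(extract_links(page))
    print(extract_titles('{"items": [{"title": "One"}, {"title": "Two"}]}'))
```

In a real scraper the strings would come from HTTP responses, and the extracted links would typically feed the next request in the chain.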

### Part of an Awesome List

- [Live Quiz Answerers](https://awesome-repositories.com/f/awesome-lists/ai/question-answering/live-quiz-answerers.md) — Captures live quiz questions from streaming apps and automatically searches for and broadcasts answers in real time. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Web Search Answer Retrievers](https://awesome-repositories.com/f/awesome-lists/ai/question-answering/web-search-answer-retrievers.md) — Automates searching Baidu Zhidao for quiz answers by submitting extracted questions and parsing results. ([source](http://cuijiahua.com/blog/2018/01/spider_3.html))
- [CAPTCHA Solving](https://awesome-repositories.com/f/awesome-lists/security/captcha-solving.md) — Automates the recognition and solving of GEETEST-style CAPTCHA challenges for scraping workflows. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Chinese CAPTCHA Solvers](https://awesome-repositories.com/f/awesome-lists/security/captcha-solving/chinese-captcha-solvers.md) — Automates the recognition and solving of Chinese GEETEST-style CAPTCHA challenges for scraping workflows.

### Business & Productivity Software

- [Automated Ticket Booking Systems](https://awesome-repositories.com/f/business-productivity-software/automated-ticket-booking-systems.md) — Submits purchase requests to a ticketing system at a scheduled time to secure a train seat. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Chinese E-Commerce Scrapers](https://awesome-repositories.com/f/business-productivity-software/e-commerce-product-data-extraction/chinese-e-commerce-scrapers.md) — Scrapes product listings, reviews, and images from major Chinese e-commerce platforms like Taobao and JD.com.
- [Multi-Platform Media Downloads](https://awesome-repositories.com/f/business-productivity-software/media-downloaders/multi-platform-media-downloads.md) — Downloads multiple videos, images, or music files from platforms like Bilibili, JD, and Netease. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Railway Ticket Booking Automators](https://awesome-repositories.com/f/business-productivity-software/purchase-order-management/automated-order-issuance/ticket-purchase-submissions/railway-ticket-booking-automators.md) — Submits purchase requests to the 12306 railway booking system at scheduled times to secure train tickets.
- [Railway Booking Automation](https://awesome-repositories.com/f/business-productivity-software/railway-booking-automation.md) — Submits purchase requests to the 12306 railway booking system to secure tickets as soon as they become available. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))

### Graphics & Multimedia

- [Multi-Platform Content Extractors](https://awesome-repositories.com/f/graphics-multimedia/multi-platform-video-extraction/multi-platform-content-extractors.md) — Extracts content from e-commerce, social media, video, and ticketing platforms using automated requests.
- [Bilibili Video Downloads](https://awesome-repositories.com/f/graphics-multimedia/video-downloaders/bilibili-video-downloads.md) — Downloads videos and real-time comments from Bilibili using advanced techniques to bypass anti-scraping measures. ([source](https://cuijiahua.com/blog/spider/))
- [Chinese Video Platform Scrapers](https://awesome-repositories.com/f/graphics-multimedia/video-downloaders/bilibili-video-downloads/chinese-video-platform-scrapers.md) — Downloads videos and comments from Chinese platforms like Bilibili and Douyin, including watermark removal.
- [Video Downloaders](https://awesome-repositories.com/f/graphics-multimedia/video-downloaders.md) — Captures video files from mobile apps by intercepting network requests. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Douyin Video Downloads](https://awesome-repositories.com/f/graphics-multimedia/video-downloaders/douyin-video-downloads.md) — Downloads videos from a Douyin user's profile by searching for the user and fetching each video's download URL. ([source](http://cuijiahua.com/blog/2018/03/spider-5.html))
- [Watermark-Free Media Retrieval](https://awesome-repositories.com/f/graphics-multimedia/watermark-free-media-retrieval.md) — Retrieves a video from a sharing URL and strips the platform's watermark using a third-party or direct API. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))

### Networking & Communication

- [Quiz Facilitators](https://awesome-repositories.com/f/networking-communication/real-time-event-streams/quiz-facilitators.md) — Fetches quiz questions via packet capture, searches for answers using Baidu Zhidao, and pushes results to a client. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Live Quiz Answer Broadcasters](https://awesome-repositories.com/f/networking-communication/communication-platforms-services/messaging-notification-systems/real-time-notification-broadcasters/live-quiz-answer-broadcasters.md) — Ships a WebSocket broadcaster that pushes live quiz answers to browser clients in real time. ([source](http://cuijiahua.com/blog/2018/01/spider_3.html))
- [Proxy Pool Builders](https://awesome-repositories.com/f/networking-communication/network-reliability-diagnostics/network-filtering/ip-address-filters/network-traffic-proxying/outbound-ip-rotation/proxy-pool-builders.md) — Collects and manages a rotating pool of proxy IPs from public sources to avoid blocking during scraping. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))
- [Proxy Pool Automation](https://awesome-repositories.com/f/networking-communication/proxy-pool-automation.md) — Collects and manages a pool of proxy IPs from public sources to rotate during large-scale scraping operations.
- [Proxy and Fingerprint Rotation](https://awesome-repositories.com/f/networking-communication/proxy-rotation-services/proxy-and-fingerprint-rotation.md) — Distributes requests across a dynamic pool of proxy IPs to avoid rate limiting and IP-based blocking.
- [Live Quiz Answer Assistants](https://awesome-repositories.com/f/networking-communication/real-time-event-streams/quiz-facilitators/live-quiz-answer-assistants.md) — Captures quiz questions from live-streaming apps and searches for answers in real time.
- [Live Quiz Answer Automators](https://awesome-repositories.com/f/networking-communication/real-time-event-streams/quiz-facilitators/live-quiz-answer-automators.md) — Captures live quiz questions from streaming apps and automatically searches for and broadcasts answers in real time.
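
The proxy pool entries above describe distributing requests across a set of proxy IPs and dropping ones that stop working. A minimal sketch of that idea follows, assuming simple round-robin rotation; the `ProxyPool` class, its method names, and the example addresses are hypothetical and do not reflect the repository's actual code.

```python
class ProxyPool:
    """Round-robin pool of proxy URLs (illustrative, not the repository's API)."""

    def __init__(self, proxies):
        # Drop duplicates while keeping the original order.
        self._proxies = list(dict.fromkeys(proxies))
        self._index = 0

    def __len__(self):
        return len(self._proxies)

    def get(self):
        """Return the next proxy in rotation."""
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def mark_bad(self, proxy):
        """Remove a proxy that failed so it is not handed out again."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)


if __name__ == "__main__":
    pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
    for _ in range(3):
        print(pool.get())
```

A caller would pass the value from `get()` to its HTTP client as the proxy for one request and call `mark_bad()` when that request fails or is blocked.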

### Software Engineering & Architecture

- [Scraping Platform Adapters](https://awesome-repositories.com/f/software-engineering-architecture/adapter-patterns/scraping-platform-adapters.md) — Provides a unified interface for downloading media and content from multiple Chinese platforms.

### Artificial Intelligence & ML

- [Quiz OCR Extractors](https://awesome-repositories.com/f/artificial-intelligence-ml/on-device-models/vision-language-models/ocr/quiz-ocr-extractors.md) — Extracts text from images using OCR to automate answering quiz questions on live-streaming platforms. ([source](https://cuijiahua.com/blog/spider/))

### Content Management & Publishing

- [Web Content Scraping](https://awesome-repositories.com/f/content-management-publishing/web-content-scraping.md) — Downloads images or videos from pages that load content asynchronously. ([source](https://cdn.jsdelivr.net/gh/jack-cherish/python-spider@master/README.md))

### DevOps & Infrastructure

- [Task Schedulers](https://awesome-repositories.com/f/devops-infrastructure/automation-orchestration/task-execution-frameworks/task-job-management/task-schedulers.md) — Triggers automated actions at precise times using system timers or sleep loops for time-sensitive services.
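
A sleep loop of the kind mentioned above might look like the following sketch. The `wait_until` function and its parameters are assumptions for illustration; the clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from datetime import datetime, timedelta


def wait_until(target, now=datetime.now, sleep=time.sleep, max_step=0.5):
    """Block until `target` is reached, sleeping in steps of at most `max_step` seconds.

    Short steps keep the loop responsive to clock adjustments; the final
    step sleeps only for the time that actually remains.
    """
    while True:
        remaining = (target - now()).total_seconds()
        if remaining <= 0:
            return
        sleep(min(remaining, max_step))


if __name__ == "__main__":
    # Fire a placeholder action a fraction of a second from now.
    target = datetime.now() + timedelta(seconds=0.2)
    wait_until(target)
    print("target time reached, submit the request here")
```

For time-sensitive services, the action that follows `wait_until` should already have its session and request data prepared, since any setup done after the wake-up adds latency.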

### Security & Cryptography

- [E-Commerce Login Simulators](https://awesome-repositories.com/f/security-cryptography/authentication-services/automated-login-frameworks/e-commerce-login-simulators.md) — Automates the login process on Taobao to enable authenticated scraping of user-specific data. ([source](https://cuijiahua.com/blog/spider/))
- [OCR Captcha Solving](https://awesome-repositories.com/f/security-cryptography/authentication-services/automated-login-frameworks/ocr-captcha-solving.md) — Automates GEETEST and other visual CAPTCHA solving using OCR and pattern-matching algorithms.
- [Session-Cookie Persistences](https://awesome-repositories.com/f/security-cryptography/session-cookie-handlers/session-cookie-persistences.md) — Maintains login state across requests by storing and reusing session cookies from initial authentication.
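
The session cookie entry above amounts to remembering the cookies a server sets at login and sending them back on later requests. The sketch below shows that bookkeeping with the standard library only; `CookieStore`, its methods, and the cookie names are hypothetical, and real login flows (including Taobao's) involve additional steps not shown here.

```python
from http.cookies import SimpleCookie


class CookieStore:
    """Keeps cookie name/value pairs between requests (illustrative only)."""

    def __init__(self):
        self._cookies = {}

    def update_from_response(self, set_cookie_headers):
        """Record cookies from a list of raw Set-Cookie header values."""
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self._cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to send with the next request."""
        return "; ".join(f"{name}={value}" for name, value in self._cookies.items())


if __name__ == "__main__":
    store = CookieStore()
    store.update_from_response(["sessionid=abc123; Path=/"])
    print(store.cookie_header())
```

Libraries such as `requests` offer a session object that performs this bookkeeping automatically; the manual version is shown only to make the mechanism explicit.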

### User Interface & Experience

- [Watermark Removal](https://awesome-repositories.com/f/user-interface-experience/content-rendering-components/image-overlays/media-watermarking-tools/watermark-removal.md) — Uses browser automation to interact with a third-party service that strips watermarks from downloaded videos. ([source](http://cuijiahua.com/blog/2018/03/spider-5.html))
