# wistbean/learn_python3_spider

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/wistbean-learn-python3-spider).**

21,802 stars · 3,920 forks · Python · MIT

## Links

- GitHub: https://github.com/wistbean/learn_python3_spider
- Homepage: http://fxxkpython.com
- awesome-repositories: https://awesome-repositories.com/repository/wistbean-learn-python3-spider.md

## Topics

`python-script` `python-spider` `python3`

## Description

This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis.

The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic.

The project also covers distributed scraping architectures that spread data collection across multiple servers, and it speeds up requests with multi-threading and multi-processing. Further topics include browser automation for dynamic content, captcha solving, and persisting extracted data to relational databases, document stores, or spreadsheets.
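As a rough illustration of the persistence step described above, the sketch below writes a few scraped records to a relational database and to spreadsheet-friendly CSV. It is not taken from the repository: the record fields, the `pages` table, and the helper names `save_to_sqlite` and `to_csv_text` are all assumptions made for the example, and SQLite stands in for whichever database a real crawler would use.

```python
import csv
import io
import sqlite3

# Hypothetical scraped records; the field names are illustrative only.
rows = [
    {"title": "Example A", "url": "https://example.com/a"},
    {"title": "Example B", "url": "https://example.com/b"},
]


def save_to_sqlite(conn, records):
    """Insert records into a table, skipping URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO pages (url, title) VALUES (:url, :title)",
        records,
    )
    conn.commit()


def to_csv_text(records):
    """Serialize records as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


conn = sqlite3.connect(":memory:")
save_to_sqlite(conn, rows)
save_to_sqlite(conn, rows)  # re-running a crawl adds nothing: URLs are unique
stored_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
csv_text = to_csv_text(rows)
```

Using the URL as the primary key is one simple way to keep repeated crawls from storing duplicates; a real project may prefer a different key.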

## Tags

### Data & Databases

- [Web Data Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction.md) — Provides a comprehensive framework for programmatically scraping and processing structured web content.
- [Web Crawlers](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-collection-tools/web-crawlers.md) — Ships a collection of automated scripts and frameworks using Scrapy and Selenium for systematic web indexing.
- [Data Parsing and Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/data-parsing-extraction.md) — Extracts specific data from web formats using regular expressions and specialized parsing logic. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/README.md))
- [Distributed Crawling Systems](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/distributed-crawling-systems.md) — Implements scalable architectures for managing high-volume, asynchronous web crawling across multiple nodes.
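
The regular-expression extraction mentioned under Data Parsing and Extraction can be sketched as follows. The HTML snippet, the pattern, and the `extract_items` helper are hypothetical and only show the general technique, not the repository's own code.

```python
import re

# Hypothetical page fragment; real markup will differ.
html = """
<ul>
  <li><a href="/book/1">Book One</a> <span class="price">12.50</span></li>
  <li><a href="/book/2">Book Two</a> <span class="price">8.00</span></li>
</ul>
"""

# Named groups keep the pattern readable and the results self-describing.
ITEM_PATTERN = re.compile(
    r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="price">(?P<price>[\d.]+)</span>'
)


def extract_items(text):
    """Return one dict per matched list item."""
    return [
        {"link": m["link"], "title": m["title"], "price": float(m["price"])}
        for m in ITEM_PATTERN.finditer(text)
    ]


items = extract_items(html)
```

Regular expressions work well on small, predictable fragments like this one; for irregular markup an HTML parser is usually the more robust choice.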

### Part of an Awesome List

- [HTML Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-parsing.md) — Uses CSS selectors and BeautifulSoup to extract and manipulate structured data from HTML content.
- [Scraping and Anti-Detection](https://awesome-repositories.com/f/awesome-lists/security/scraping-and-anti-detection.md) — Bypasses website restrictions using proxy rotation, header spoofing, and automated captcha solving.
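
Header spoofing, as named in the Scraping and Anti-Detection entry, usually means sending browser-like request headers. A minimal sketch using only the standard library is shown below; the user-agent strings, the referer, and the `build_request` helper are assumptions for illustration, and the request is only constructed, never sent.

```python
import random
import urllib.request

# Hypothetical browser-like user agents; any realistic strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_request(url, rng=random):
    """Build a request whose headers resemble an ordinary browser visit."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",
    }
    return urllib.request.Request(url, headers=headers)


request = build_request("https://example.com/list?page=1")
```

The resulting object could then be passed to `urllib.request.urlopen`; whether spoofed headers are enough depends entirely on the target site's checks.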

### Development Tools & Productivity

- [Headless Browser Automation](https://awesome-repositories.com/f/development-tools-productivity/headless-browser-automation.md) — Controls browser engines via Selenium or Appium to interact with dynamic content and simulate human behavior.
- [Parallel Execution](https://awesome-repositories.com/f/development-tools-productivity/parallel-execution.md) — Accelerates data collection by executing multiple download jobs concurrently via a process pool. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/meizitu.py))
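
The process-pool approach from the Parallel Execution entry can be sketched like this. It is a simplified stand-in rather than the code in `meizitu.py`: the URLs, the `build_jobs` and `download` helpers, and the pool size are assumptions, and `download` only returns the filename it would write instead of touching the network.

```python
from multiprocessing import Pool


def build_jobs(page_count):
    """Return one hypothetical (url, filename) job per gallery page."""
    return [
        (f"https://example.com/gallery/{n}.jpg", f"{n:03d}.jpg")
        for n in range(1, page_count + 1)
    ]


def download(job):
    """Stand-in for a real download: report the filename it would write."""
    url, filename = job
    # A real worker would fetch `url` here and write the bytes to `filename`.
    return filename


if __name__ == "__main__":
    jobs = build_jobs(8)
    # Four worker processes share the eight jobs between them.
    with Pool(processes=4) as pool:
        saved = pool.map(download, jobs)
    print(saved)
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the script.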

### Education & Learning Resources

- [Web Scraping Courses](https://awesome-repositories.com/f/education-learning-resources/python-programming-guides/web-scraping-courses.md) — Offers a comprehensive educational guide and course for building data extractors using Python.
- [Web Scraping Tutorials](https://awesome-repositories.com/f/education-learning-resources/educational-resources/reference-and-media/tutorials-media-curated-lists/technical-tutorials/data-analytics/web-scraping-tutorials.md) — Provides guides on decrypting obfuscated JavaScript and bypassing anti-scraping measures to access hidden data.

### Mobile Development

- [Mobile App Scraping](https://awesome-repositories.com/f/mobile-development/mobile-app-scraping.md) — Implements a methodology for capturing data from mobile applications using network interception and Appium automation.
- [Native Mobile Automation](https://awesome-repositories.com/f/mobile-development/mobile-infrastructure-security/mobile-synchronization/automation-frameworks/mobile-browser-automation/native-mobile-automation.md) — Uses Appium to execute scripts within native mobile environments for automated data collection. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/wechat_moment.py))
- [UI Content Extraction](https://awesome-repositories.com/f/mobile-development/mobile-infrastructure-security/mobile-synchronization/automation-frameworks/mobile-browser-automation/mobile-ui-automation-frameworks/ui-content-extraction.md) — Retrieves text and data from mobile UI elements by iterating through lists and scrolling. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/wechat_moment.py))

### Networking & Communication

- [Distributed Crawl Coordination](https://awesome-repositories.com/f/networking-communication/distributed-systems-p2p/distributed-computing/distributed-crawl-coordination.md) — Provides mechanisms for partitioning and synchronizing web discovery tasks across multiple worker nodes.
- [Network Traffic Inspectors](https://awesome-repositories.com/f/networking-communication/network-traffic-inspectors.md) — Provides tools for examining raw network request and response data to identify hidden data sources. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/README.md))
- [Traffic Proxying](https://awesome-repositories.com/f/networking-communication/traffic-proxying.md) — Intercepts network requests using proxy tools like Fiddler or mitmproxy to reverse engineer API calls.
- [Outbound IP Rotation](https://awesome-repositories.com/f/networking-communication/network-reliability-diagnostics/network-filtering/ip-address-filters/network-traffic-proxying/outbound-ip-rotation.md) — Implements outbound IP rotation through a proxy pool to evade detection and prevent permanent IP blocks.
- [Network Traffic Analyzers](https://awesome-repositories.com/f/networking-communication/network-traffic-analyzers.md) — Provides tutorials on capturing and analyzing HTTP requests using proxy tools like Fiddler and mitmproxy.
- [Traffic Analysis Tools](https://awesome-repositories.com/f/networking-communication/traffic-analysis-tools.md) — Inspects network requests and responses using proxy tools to reverse engineer client-server data exchange. ([source](https://github.com/wistbean/learn_python3_spider#readme))
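
Outbound IP rotation through a proxy pool, as listed above, can be sketched as a round-robin picker that skips proxies known to be blocked. The `ProxyPool` class and the proxy addresses are hypothetical and are not the repository's implementation.

```python
import itertools


class ProxyPool:
    """Hand out proxies in round-robin order, skipping ones marked bad."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._banned = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_bad(self, proxy):
        """Record a proxy that was blocked or timed out."""
        self._banned.add(proxy)

    def next_proxy(self):
        """Return the next usable proxy, or raise if none remain."""
        # One full lap around the pool is enough to find any usable proxy.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("no usable proxies left")


pool = ProxyPool([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])
first = pool.next_proxy()
pool.mark_bad("http://10.0.0.2:8080")  # pretend this one got blocked
second = pool.next_proxy()
third = pool.next_proxy()
```

With an HTTP client such as `requests`, the chosen address would typically be passed as `proxies={"http": proxy, "https": proxy}` on each call.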

### Software Engineering & Architecture

- [Scraping Architectures](https://awesome-repositories.com/f/software-engineering-architecture/distributed-systems-architectures/scraping-architectures.md) — Implements an architectural approach to scale data extraction across multiple servers using Python concurrency.

### Web Development

- [Browser Automation](https://awesome-repositories.com/f/web-development/browser-automation.md) — Controls headless browsers to interact with dynamic content, perform searches, and navigate pages. ([source](https://github.com/wistbean/learn_python3_spider#readme))
- [Web Page Retrievers](https://awesome-repositories.com/f/web-development/web-page-retrievers.md) — Enables retrieving content from websites using network libraries to simulate browser requests for data extraction. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/README.md))
- [Concurrent Request Pooling](https://awesome-repositories.com/f/web-development/http-request-managers/concurrent-request-pooling.md) — Implements parallel HTTP request execution using capped pools to optimize data collection throughput.
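
A capped pool for parallel requests, as described under Concurrent Request Pooling, might look like the sketch below. The `fetch_all` helper, the `fake_fetch` stand-in, and the URLs are assumptions; a thread pool is used here because request work is mostly waiting on I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the input URLs.
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for an HTTP call so the sketch runs without a network."""
    return f"body of {url}"


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
results = fetch_all(urls, fake_fetch, max_workers=2)
```

Capping `max_workers` bounds how hard the crawler hits the target site as well as how many connections it holds open locally.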

### Security & Cryptography

- [Automated Captcha Solvers](https://awesome-repositories.com/f/security-cryptography/captcha-services/automated-captcha-solvers.md) — Automates the resolution of CAPTCHA challenges by calculating drag distances for slider movement. ([source](https://github.com/wistbean/learn_python3_spider/blob/master/fuck_bilibili_captcha.py))
- [JavaScript De-obfuscation](https://awesome-repositories.com/f/security-cryptography/javascript-de-obfuscation.md) — Analyzes obfuscated scripts and decryption logic to extract hidden data from encrypted web responses.
- [Obfuscated Data Decoders](https://awesome-repositories.com/f/security-cryptography/obfuscated-data-decoders.md) — Decodes obfuscated scripts and decrypts font-mapping or app-level encryption to retrieve hidden information. ([source](https://github.com/wistbean/learn_python3_spider#readme))
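
The drag-distance calculation mentioned under Automated Captcha Solvers is commonly done by comparing the complete background image with the one that has a gap. The sketch below shows that idea on tiny grayscale grids held in plain lists; the function names, the threshold, and the evenly split track are assumptions and do not reproduce `fuck_bilibili_captcha.py`.

```python
def find_gap_offset(full, gapped, threshold=60):
    """Return the x of the first column where the two images clearly differ."""
    height = len(full)
    width = len(full[0])
    for x in range(width):
        for y in range(height):
            if abs(full[y][x] - gapped[y][x]) > threshold:
                return x
    return None  # images match everywhere: no gap found


def build_track(distance, steps=4):
    """Split a drag distance into integer moves that sum to the distance."""
    base, extra = divmod(distance, steps)
    return [base + (1 if i < extra else 0) for i in range(steps)]


# Hypothetical 4x10 grayscale images: the gap darkens columns 6 to 8.
full = [[200] * 10 for _ in range(4)]
gapped = [row[:] for row in full]
for y in (1, 2):
    for x in range(6, 9):
        gapped[y][x] = 50

slider_left = 1  # assumed starting x of the slider piece
offset = find_gap_offset(full, gapped)
drag_distance = offset - slider_left
track = build_track(drag_distance)
```

In practice the pixel values would come from screenshots, and the track is often shaped to accelerate and then slow down so the movement looks less mechanical.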
