# nanmicoder/crawlertutorial

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/nanmicoder-crawlertutorial).**

4,262 stars · 471 forks · Python

## Links

- GitHub: https://github.com/NanmiCoder/CrawlerTutorial
- Homepage: https://nanmicoder.github.io/CrawlerTutorial/
- awesome-repositories: https://awesome-repositories.com/repository/nanmicoder-crawlertutorial.md

## Description

CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage.

The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network traffic and request signatures.

The framework encompasses a broad set of capabilities, including browser automation for JavaScript-heavy pages, automated user authentication via QR codes or SMS, and session persistence management. It also features data preprocessing tools for cleaning raw text, removing duplicate records, and persisting gathered information into flat files or relational databases.

## Tags

### Data & Databases

- [Web Data Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction.md) — Functions as a comprehensive system for programmatically scraping and processing web content from various sources. ([source](https://cdn.jsdelivr.net/gh/nanmicoder/crawlertutorial@main/README.md))
- [Web Data Pipelines](https://awesome-repositories.com/f/data-databases/data-integration-synchronization/event-driven-data-pipelines/web-data-pipelines.md) — Implements an automated pipeline for fetching, cleaning, and storing scraped web data into flat files or databases.
- [Web Page Parsing](https://awesome-repositories.com/f/data-databases/web-page-parsing.md) — Extracts specific text, links, and images by analyzing the structure of downloaded web pages. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/03_%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E6%98%AF%E4%BB%80%E4%B9%88.html))
- [Text Cleaning](https://awesome-repositories.com/f/data-databases/client-side-data-processing/text-cleaning.md) — Cleans raw scraped text by removing HTML tags and fixing encoding for structured analysis. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/09_%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97%E4%B8%8E%E9%A2%84%E5%A4%84%E7%90%86.html))
- [Data Cleaning Pipelines](https://awesome-repositories.com/f/data-databases/data-cleaning-pipelines.md) — Transforms raw scraped content into usable formats for analysis. ([source](https://nanmicoder.github.io/CrawlerTutorial/))
- [Data Processing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing.md) — Transforms raw crawled data into structured formats suitable for visualization and analysis. ([source](https://cdn.jsdelivr.net/gh/nanmicoder/crawlertutorial@main/README.md))
- [Text Preprocessing](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/text-nlp-preprocessing/text-preprocessing.md) — Includes tools for cleaning raw scraped text, removing duplicate records, and transforming data into analysis-ready formats.
- [Web Document Parsing](https://awesome-repositories.com/f/data-databases/document-parsing-engines/web-document-parsing.md) — Extracts specific data from HTML and XML documents using specialized parsing libraries. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/06_%E4%B8%BA%E4%BB%80%E4%B9%88%E8%AF%B4%E7%94%A8Python%E5%86%99%E7%88%AC%E8%99%AB%E6%9C%89%E5%A4%A9%E7%94%9F%E4%BC%98%E5%8A%BF.html))
- [Extracted Data Storage](https://awesome-repositories.com/f/data-databases/extracted-data-storage.md) — Saves gathered information into databases or indexes to support searching and analysis. ([source](https://nanmicoder.github.io/CrawlerTutorial/))
- [Flat-File Storage](https://awesome-repositories.com/f/data-databases/flat-file-storage.md) — Writes extracted information to simple text files like CSV or JSON using asynchronous I/O. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/10_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%983_%E6%95%B0%E6%8D%AE%E5%AD%98%E5%82%A8%E5%AE%9E%E7%8E%B0.html))
- [Multi-Format Data Persistence](https://awesome-repositories.com/f/data-databases/multi-format-data-persistence.md) — Saves extracted information across multiple storage types including JSON and CSV flat files. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/11_%E8%BF%9B%E9%98%B6%E7%BB%BC%E5%90%88%E5%AE%9E%E6%88%98%E9%A1%B9%E7%9B%AE.html))
- [Relational Database Persistence](https://awesome-repositories.com/f/data-databases/mysql-integrations/mysql-storage-support/relational-database-persistence.md) — Persists structured scraped data into relational databases using upsert logic to prevent duplicates. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/10_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%983_%E6%95%B0%E6%8D%AE%E5%AD%98%E5%82%A8%E5%AE%9E%E7%8E%B0.html))
- [Pagination Pattern Analysis](https://awesome-repositories.com/f/data-databases/query-aggregates/paginated-results/pagination-pattern-analysis.md) — Includes methods for analyzing URL structures and HTML elements to determine how to navigate paginated results. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/08_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%981_%E9%9D%99%E6%80%81%E7%BD%91%E9%A1%B5%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96.html))
- [API-Based Extractions](https://awesome-repositories.com/f/data-databases/structured-data-extraction/api-based-extractions.md) — Retrieves structured data by constructing authenticated HTTP requests to identified API endpoints. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/04_%E7%88%AC%E8%99%AB%E7%9A%84%E5%9F%BA%E6%9C%AC%E5%B7%A5%E4%BD%9C%E5%8E%9F%E7%90%86.html))
- [Tabular Data Manipulations](https://awesome-repositories.com/f/data-databases/tabular-data-manipulations.md) — Uses data frames to structure and transform extracted information. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/10_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E4%B8%8E%E5%8F%AF%E8%A7%86%E5%8C%96.html))
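
The upsert-based persistence described in this list can be illustrated with a minimal sketch. It assumes SQLite (via the standard-library `sqlite3` module) as a stand-in for whichever relational database the tutorial targets. The `items` table, its columns, and the `save_items` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def save_items(conn, items):
    """Insert scraped items keyed by URL; update the title if the URL exists."""
    # Hypothetical schema: the URL is the natural key used to detect duplicates.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (url, title) VALUES (:url, :title) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
        items,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_items(conn, [{"url": "https://example.com/a", "title": "First"}])
# A second run with the same URL updates the row instead of duplicating it.
save_items(conn, [
    {"url": "https://example.com/a", "title": "Updated"},
    {"url": "https://example.com/b", "title": "Second"},
])
rows = conn.execute("SELECT url, title FROM items ORDER BY url").fetchall()
```

Keying on the URL means that re-running a crawl refreshes existing rows rather than appending copies. Other databases express the same idea with their own syntax.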

### Education & Learning Resources

- [Web Scraping Tutorials](https://awesome-repositories.com/f/education-learning-resources/educational-resources/reference-and-media/tutorials-media-curated-lists/technical-tutorials/data-analytics/web-scraping-tutorials.md) — Provides an extensive educational guide and project-based materials for performing web scraping using Python.

### Part of an Awesome List

- [HTML Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-parsing.md) — Provides tools for extracting structured data from HTML using selectors to isolate repeating data blocks. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/08_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%981_%E9%9D%99%E6%80%81%E7%BD%91%E9%A1%B5%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96.html))
- [Web Parsing](https://awesome-repositories.com/f/awesome-lists/devtools/web-parsing.md) — Retrieves information from web pages using CSS selectors and XPath expressions. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/04_%E7%88%AC%E8%99%AB%E7%9A%84%E5%9F%BA%E6%9C%AC%E5%B7%A5%E4%BD%9C%E5%8E%9F%E7%90%86.html))
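
As a rough illustration of isolating repeating data blocks with path expressions, the sketch below uses the limited XPath subset in the standard-library `xml.etree.ElementTree`. This is a stand-in chosen to keep the example dependency-free: it only works on well-formed markup, and the tutorial itself may use a different parsing library. The sample page and the `extract_items` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with two repeating "item" blocks.
PAGE = """
<html><body>
  <div class="item"><a href="/p/1">Post one</a><span class="author">ann</span></div>
  <div class="item"><a href="/p/2">Post two</a><span class="author">bob</span></div>
  <div class="ad">Sponsored</div>
</body></html>
"""


def extract_items(markup):
    """Return one dict per repeating block, skipping unrelated elements."""
    root = ET.fromstring(markup.strip())
    items = []
    # Select every div whose class attribute is exactly "item".
    for block in root.findall(".//div[@class='item']"):
        link = block.find("a")
        author = block.find("span[@class='author']")
        items.append(
            {"title": link.text, "href": link.get("href"), "author": author.text}
        )
    return items


items = extract_items(PAGE)
```

The pattern is to select the repeating container first, then read fields relative to each container, which keeps the fields of one record from being mixed with another's.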

### Development Tools & Productivity

- [Browser Automation Frameworks](https://awesome-repositories.com/f/development-tools-productivity/browser-automation-frameworks.md) — Implements a framework for simulating user interactions to extract content from JavaScript-heavy, dynamic webpages.
- [Headless Browser Automation](https://awesome-repositories.com/f/development-tools-productivity/headless-browser-automation.md) — Automates headless browser engines to render JavaScript and interact with dynamic web content.
- [Browser Automation](https://awesome-repositories.com/f/development-tools-productivity/natural-language-interfaces/browser-interactions/browser-automation.md) — Provides programmatic control of browsers to simulate user interactions and manage isolated execution contexts.

### Networking & Communication

- [HTTP Request Execution](https://awesome-repositories.com/f/networking-communication/http-request-execution.md) — Executes synchronous and asynchronous HTTP requests to retrieve data from web servers. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/06_%E4%B8%BA%E4%BB%80%E4%B9%88%E8%AF%B4%E7%94%A8Python%E5%86%99%E7%88%AC%E8%99%AB%E6%9C%89%E5%A4%A9%E7%94%9F%E4%BC%98%E5%8A%BF.html))
- [Outbound IP Rotation](https://awesome-repositories.com/f/networking-communication/network-reliability-diagnostics/network-filtering/ip-address-filters/network-traffic-proxying/outbound-ip-rotation.md) — Distributes requests across a pool of rotating proxy addresses to bypass server-side IP rate limits.
- [QR Code & Phone Verification Logins](https://awesome-repositories.com/f/networking-communication/communication-platforms-services/messaging-notification-systems/messaging-automation/account-authentication-gateways/account-authentication/qr-code-phone-verification-logins.md) — Automates the login process by simulating QR code scanning and polling for authentication status. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/07_%E7%99%BB%E5%BD%95%E8%AE%A4%E8%AF%81_%E6%89%AB%E7%A0%81%E4%B8%8E%E7%9F%AD%E4%BF%A1%E7%99%BB%E5%BD%95%E5%AE%9E%E7%8E%B0.html))
- [Phone Verification Code Logins](https://awesome-repositories.com/f/networking-communication/communication-platforms-services/messaging-notification-systems/messaging-automation/account-authentication-gateways/account-authentication/qr-code-phone-verification-logins/phone-verification-code-logins.md) — Handles SMS-based authentication by automating the entry of phone numbers and verification codes. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/07_%E7%99%BB%E5%BD%95%E8%AE%A4%E8%AF%81_%E6%89%AB%E7%A0%81%E4%B8%8E%E7%9F%AD%E4%BF%A1%E7%99%BB%E5%BD%95%E5%AE%9E%E7%8E%B0.html))
- [HTTP Request Orchestrators](https://awesome-repositories.com/f/networking-communication/http-request-orchestrators.md) — Ships a system for orchestrating concurrent network requests with integrated proxy rotation and session persistence.
- [HTTP Traffic Inspection](https://awesome-repositories.com/f/networking-communication/http-traffic-inspection.md) — Identifies hidden API endpoints by capturing and analyzing HTTP request and response traffic. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/05_%E5%B8%B8%E7%94%A8%E7%9A%84%E6%8A%93%E5%8C%85%E5%B7%A5%E5%85%B7%E6%9C%89%E9%82%A3%E4%BA%9B.html))
- [Bot Detection Bypass](https://awesome-repositories.com/f/networking-communication/request-header-configuration/request-header-overrides/bot-detection-bypass.md) — Avoids IP bans by applying random request delays and capping crawl volume to mimic human behavior. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/11_%E8%BF%9B%E9%98%B6%E7%BB%BC%E5%90%88%E5%AE%9E%E6%88%98%E9%A1%B9%E7%9B%AE.html))
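
The random request delays and crawl-volume cap mentioned in this list can be sketched as a small pacing loop. The `polite_crawl` function, its parameters, and the injected `fetch` and `sleep` callables are hypothetical and not taken from the tutorial's code. Injecting them keeps the sketch free of network access.

```python
import random
import time


def polite_crawl(urls, fetch, max_pages=50, delay_range=(1.0, 3.0), sleep=time.sleep):
    """Fetch at most max_pages URLs, pausing a random interval between requests."""
    results = []
    for count, url in enumerate(urls):
        if count >= max_pages:
            break  # cap the crawl volume
        if count > 0:
            # Random pause so requests are not sent at a fixed rhythm.
            sleep(random.uniform(*delay_range))
        results.append(fetch(url))
    return results


# Usage with stand-in callables: no network traffic, no real waiting.
recorded_delays = []
pages = polite_crawl(
    ["a", "b", "c", "d"],
    fetch=lambda url: f"body of {url}",
    max_pages=2,
    sleep=recorded_delays.append,
)
```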

### Security & Cryptography

- [Anti-Bot Evasion](https://awesome-repositories.com/f/security-cryptography/bot-detection/anti-bot-evasion.md) — Implements browser fingerprint masquerading and proxy rotation to evade sophisticated bot detection systems.
- [Browser Fingerprint Spoofing](https://awesome-repositories.com/f/security-cryptography/browser-fingerprinting-services/browser-fingerprint-generators/spoofing-tools/browser-fingerprint-spoofing.md) — Provides tools to spoof browser fingerprints and TLS parameters to evade automated bot detection systems.
- [API Endpoint Discovery](https://awesome-repositories.com/f/security-cryptography/api-endpoint-discovery.md) — Provides techniques for discovering hidden API endpoints by analyzing network traffic and request signatures. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/09_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%982_%E5%8A%A8%E6%80%81%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96.html))
- [Session State Management](https://awesome-repositories.com/f/security-cryptography/authentication-clients/client-to-server-authentication/session-state-management.md) — Manages local session state by tracking authentication tokens and cookies throughout the extraction process. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/06_%E7%99%BB%E5%BD%95%E8%AE%A4%E8%AF%81_Cookie%E4%B8%8ESession%E7%AE%A1%E7%90%86.html))
- [Anti-Captcha Strategies](https://awesome-repositories.com/f/security-cryptography/captcha-services/automated-captcha-solvers/captcha-interaction-simulations/anti-captcha-strategies.md) — Reduces the occurrence of verification walls by managing request frequency and reusing session cookies. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/08_%E9%AA%8C%E8%AF%81%E7%A0%81%E8%AF%86%E5%88%AB%E4%B8%8E%E5%A4%84%E7%90%86.html))
- [Rate Limit Bypasses](https://awesome-repositories.com/f/security-cryptography/remote-access-management/content-access-controllers/bypassing-access-restrictions/rate-limit-bypasses.md) — Circumvents server-side IP blocks through identity randomization and request frequency management. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/03_%E4%BB%A3%E7%90%86IP%E7%9A%84%E4%BD%BF%E7%94%A8%E4%B8%8E%E7%AE%A1%E7%90%86.html))
- [Session Authentication Strategies](https://awesome-repositories.com/f/security-cryptography/session-authentication-strategies.md) — Authenticates sessions, including through QR-based authorization, to access restricted data. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/11_%E8%BF%9B%E9%98%B6%E7%BB%BC%E5%90%88%E5%AE%9E%E6%88%98%E9%A1%B9%E7%9B%AE.html))
- [Session-Cookie Persistences](https://awesome-repositories.com/f/security-cryptography/session-cookie-handlers/session-cookie-persistences.md) — Persists and reuses session cookies and tokens to maintain authenticated states across scraping runs.
- [Authentication Token Extraction](https://awesome-repositories.com/f/security-cryptography/token-based-authentication/authentication-token-extraction.md) — Retrieves authentication tokens and session cookies from browser contexts for use in programmatic requests. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/06_%E7%99%BB%E5%BD%95%E8%AE%A4%E8%AF%81_Cookie%E4%B8%8ESession%E7%AE%A1%E7%90%86.html))
- [Protection Bypassers](https://awesome-repositories.com/f/security-cryptography/traffic-protection/protection-bypassers.md) — Implements signature algorithms and header management to circumvent automated API security challenges. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/11_%E8%BF%9B%E9%98%B6%E7%BB%BC%E5%90%88%E5%AE%9E%E6%88%98%E9%A1%B9%E7%9B%AE.html))
- [User Authentications](https://awesome-repositories.com/f/security-cryptography/user-authentications.md) — Handles user identity verification and session persistence through the management of cookies and CAPTCHA resolution. ([source](https://cdn.jsdelivr.net/gh/nanmicoder/crawlertutorial@main/README.md))
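
Persisting and reusing session cookies across runs, as several entries above describe, might look roughly like the sketch below. The cookie shape (a list of dicts with `name`, `value`, and an `expires` Unix timestamp, where `-1` marks a session cookie) is an assumption made for illustration, as are the `save_cookies` and `load_cookies` helpers.

```python
import json
import tempfile
import time
from pathlib import Path


def save_cookies(path, cookies):
    """Write the cookie list to a JSON file for reuse by a later run."""
    Path(path).write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path, now=None):
    """Load saved cookies, dropping any whose expiry time has passed."""
    now = time.time() if now is None else now
    file = Path(path)
    if not file.exists():
        return []  # first run: nothing saved yet
    cookies = json.loads(file.read_text(encoding="utf-8"))
    # Assumed convention: expires == -1 means "no explicit expiry".
    return [c for c in cookies if c.get("expires", -1) == -1 or c["expires"] > now]


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "cookies.json"
    save_cookies(state_file, [
        {"name": "sid", "value": "abc", "expires": -1},
        {"name": "old", "value": "x", "expires": 1},  # expired long ago
    ])
    restored = load_cookies(state_file)
```

Filtering on load avoids replaying a stale login. A real crawler would still need to handle the case where the server invalidates a cookie early.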

### Web Development

- [Browser Automation](https://awesome-repositories.com/f/web-development/browser-automation.md) — Simulates user interactions in headless browsers to extract data from JavaScript-heavy dynamic pages. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/06_%E4%B8%BA%E4%BB%80%E4%B9%88%E8%AF%B4%E7%94%A8Python%E5%86%99%E7%88%AC%E8%99%AB%E6%9C%89%E5%A4%A9%E7%94%9F%E4%BC%98%E5%8A%BF.html))
- [Concurrent Crawling Engines](https://awesome-repositories.com/f/web-development/concurrent-crawling-engines.md) — Executes multiple network requests simultaneously using asynchronous tasks to accelerate data extraction. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/11_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%984_%E9%AB%98%E6%95%88%E7%8E%87%E7%9A%84%E7%88%AC%E8%99%AB%E5%AE%9E%E7%8E%B0.html))
- [JavaScript-Rendered Content Extractors](https://awesome-repositories.com/f/web-development/data-extraction/javascript-rendered-content-extractors.md) — Includes capabilities to wait for JavaScript execution and ensure dynamic content is fully rendered before extraction. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/04_Playwright%E6%B5%8F%E8%A7%88%E5%99%A8%E8%87%AA%E5%8A%A8%E5%8C%96%E5%85%A5%E9%97%A8.html))
- [Network Request Interception](https://awesome-repositories.com/f/web-development/network-request-interception.md) — Captures API responses and JSON data directly from network traffic to avoid complex DOM parsing.
- [HTTP Header and Cookie Management](https://awesome-repositories.com/f/web-development/parameter-encoding-schemes/http-header-and-parameter-management/http-header-and-cookie-management.md) — Manages HTTP headers and cookies to mimic real browser behavior and maintain request identity. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/07_Python%E5%B8%B8%E8%A7%81%E7%9A%84%E7%BD%91%E7%BB%9C%E8%AF%B7%E6%B1%82%E5%BA%93.html))
- [Anti-Detection Automations](https://awesome-repositories.com/f/web-development/web-automation-frameworks/anti-detection-automations.md) — Modifies browser properties to bypass bot detection mechanisms. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/05_Playwright%E8%BF%9B%E9%98%B6_%E5%8F%8D%E6%A3%80%E6%B5%8B%E4%B8%8E%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96.html))
- [Browser Automation](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/browser-automation.md) — Controls browser processes and isolated contexts to perform complex web actions and manage sessions. ([source](https://nanmicoder.github.io/CrawlerTutorial/))
- [Large-Scale Domain Crawlers](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-crawling/large-scale-domain-crawlers.md) — Provides a unified framework to manage the full lifecycle of large-scale web scraping applications. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/06_%E4%B8%BA%E4%BB%80%E4%B9%88%E8%AF%B4%E7%94%A8Python%E5%86%99%E7%88%AC%E8%99%AB%E6%9C%89%E5%A4%A9%E7%94%9F%E4%BC%98%E5%8A%BF.html))
- [Web Scraping Evasion Tools](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping-evasion-tools.md) — Provides specialized tools and techniques to mask automation traces and evade bot detection systems. ([source](https://cdn.jsdelivr.net/gh/nanmicoder/crawlertutorial@main/README.md))
- [Web Page Retrievers](https://awesome-repositories.com/f/web-development/web-page-retrievers.md) — Downloads HTML content from websites while managing request frequency and concurrency rules. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/03_%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E6%98%AF%E4%BB%80%E4%B9%88.html))
- [Web Scraping and Extraction](https://awesome-repositories.com/f/web-development/web-scraping-and-extraction.md) — Provides tools for parsing HTML documents to extract structured text and links from static websites. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/08_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%981_%E9%9D%99%E6%80%81%E7%BD%91%E9%A1%B5%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96.html))
- [API Reverse Engineering](https://awesome-repositories.com/f/web-development/web-scraping-engines/api-reverse-engineering.md) — Offers methods for analyzing network traffic to identify hidden API endpoints and bypass request signatures. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/09_%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8%E5%AE%9E%E6%88%982_%E5%8A%A8%E6%80%81%E6%95%B0%E6%8D%AE%E6%8F%90%E5%8F%96.html))
- [Crawl Request Deduplications](https://awesome-repositories.com/f/web-development/api-request-deduplication/crawl-request-deduplications.md) — Prevents redundant crawling by filtering and deduplicating extracted URLs using a tracking system. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E5%85%A5%E9%97%A8/03_%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E6%98%AF%E4%BB%80%E4%B9%88.html))
- [Browser Session Managers](https://awesome-repositories.com/f/web-development/browser-session-managers.md) — Reuses existing browser sessions to minimize resource overhead and improve execution efficiency. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/05_Playwright%E8%BF%9B%E9%98%B6_%E5%8F%8D%E6%A3%80%E6%B5%8B%E4%B8%8E%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96.html))
- [Resource Blocking](https://awesome-repositories.com/f/web-development/page-speed-optimizations/resource-blocking.md) — Blocks non-essential resources like images and fonts to reduce bandwidth and increase extraction speed. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/05_Playwright%E8%BF%9B%E9%98%B6_%E5%8F%8D%E6%A3%80%E6%B5%8B%E4%B8%8E%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96.html))
- [Crawler Identity Masking](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping/web-crawlers/crawler-configuration-managers/crawler-identity-masking.md) — Configures HTTP headers and TLS fingerprints to simulate real browser behavior and evade detection. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/02_%E5%8F%8D%E7%88%AC%E8%99%AB%E5%AF%B9%E6%8A%97%E5%9F%BA%E7%A1%80_%E8%AF%B7%E6%B1%82%E4%BC%AA%E8%A3%85.html))
- [Request Proxying](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping/web-crawlers/crawler-configuration-managers/crawler-identity-masking/request-proxying.md) — Hides the origin IP address by routing web requests through a pool of intermediate proxy servers. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/03_%E4%BB%A3%E7%90%86IP%E7%9A%84%E4%BD%BF%E7%94%A8%E4%B8%8E%E7%AE%A1%E7%90%86.html))
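
The concurrent crawling and URL deduplication entries in this list can be combined in a short `asyncio` sketch. The `crawl` function and the `fake_fetch` coroutine are hypothetical. A real crawler would replace `fake_fetch` with an asynchronous HTTP client call.

```python
import asyncio


async def crawl(urls, fetch, max_concurrency=5):
    """Fetch each distinct URL once, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_urls = list(dict.fromkeys(urls))

    async def bounded_fetch(url):
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded_fetch(u) for u in unique_urls))
    return dict(pairs)


async def fake_fetch(url):
    """Stand-in for a network request: yields control, then returns a body."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(
    crawl(
        ["https://example.com/1", "https://example.com/2", "https://example.com/1"],
        fake_fetch,
    )
)
```

The semaphore bounds how many requests run at once, so throughput improves without sending every request simultaneously.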

### Artificial Intelligence & ML

- [Regex Data Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/data-indexing/schema-less/value-extraction/regex-data-extraction.md) — Uses regular expressions to isolate specific values from unstructured text strings. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/09_%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97%E4%B8%8E%E9%A2%84%E5%A4%84%E7%90%86.html))
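
A minimal sketch of regex-based value extraction follows. The patterns, the sample string, and the `extract_fields` helper are invented for illustration. Real scraped text usually needs patterns tuned to the specific site.

```python
import re

# Currency symbol, optional space, then an amount with up to two decimals.
PRICE_RE = re.compile(r"(?P<currency>[$€¥])\s?(?P<amount>\d+(?:\.\d{1,2})?)")
# ISO-style date such as 2024-03-15.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(text):
    """Pull the first price and first date out of free-form text."""
    price = PRICE_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "price": float(price.group("amount")) if price else None,
        "currency": price.group("currency") if price else None,
        "date": date.group(0) if date else None,
    }


record = extract_fields("Posted 2024-03-15 | Now only $ 19.99 (was $29.99)")
```

Returning `None` for missing fields, rather than raising, lets a cleaning step decide later whether an incomplete record should be dropped.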

### DevOps & Infrastructure

- [Token Bucket Implementations](https://awesome-repositories.com/f/devops-infrastructure/rate-limiters/rate-limiting-algorithms/token-bucket-implementations.md) — Implements a token bucket algorithm to control request frequency and prevent server overloading.
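
For readers unfamiliar with the token bucket algorithm, here is a minimal, single-threaded sketch. The `TokenBucket` class and its parameter names are illustrative and not taken from the tutorial. The injectable `clock` is an assumption added so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.updated
        # Add tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1):
        """Consume `cost` tokens if available; return whether the request may go."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: 2 tokens/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]
fake_now[0] = 0.5  # half a second later, one token has been refilled
after_wait = bucket.try_acquire()
```

A caller that receives `False` would typically sleep briefly and try again, which keeps the long-run request rate at or below `rate`.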

### Software Engineering & Architecture

- [Concurrent Task Execution](https://awesome-repositories.com/f/software-engineering-architecture/concurrent-task-execution.md) — Implements concurrent execution of network requests to improve the throughput of data extraction.
- [Retry Policies](https://awesome-repositories.com/f/software-engineering-architecture/error-handling/retry-policies.md) — Handles network instability and transient errors by automatically retrying failed scraping operations. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/01_%E5%B7%A5%E7%A8%8B%E5%8C%96%E7%88%AC%E8%99%AB%E5%BC%80%E5%8F%91%E8%A7%84%E8%8C%83.html))
- [Request Rate Limiting](https://awesome-repositories.com/f/software-engineering-architecture/traffic-management/request-rate-limiting.md) — Controls outgoing request frequency using a token bucket mechanism to avoid server overloading. ([source](https://nanmicoder.github.io/CrawlerTutorial/%E7%88%AC%E8%99%AB%E8%BF%9B%E4%BB%B7/02_%E5%8F%8D%E7%88%AC%E8%99%AB%E5%AF%B9%E6%8A%97%E5%9F%BA%E7%A1%80_%E8%AF%B7%E6%B1%82%E4%BC%AA%E8%A3%85.html))
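
The retry policy entry above can be illustrated with a small exponential backoff helper. The `retry` function, its defaults, and the choice of which exceptions count as transient are assumptions made for this sketch.

```python
import time


def retry(operation, attempts=3, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The delay doubles after each failure: base_delay, 2 * base_delay, and so on.
    The final failure is re-raised so the caller can log or skip the item.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = retry(flaky, sleep=delays.append)
```

Only exceptions listed in `retry_on` are retried, so programming errors such as a `KeyError` surface immediately instead of being masked by repeated attempts.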