# dataabc/weibospider

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/dataabc-weibospider).**

9,630 stars · 2,071 forks · Python

## Links

- GitHub: https://github.com/dataabc/weiboSpider
- awesome-repositories: https://awesome-repositories.com/repository/dataabc-weibospider.md

## Topics

`help-wanted` `python` `python3` `weibo` `weibospider`

## Description

weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content.

The tool authenticates requests with browser session cookies, which lets it access protected profiles. It spaces requests with randomized waits and periodic global pauses to stay under platform rate limits, and it supports incremental crawling that uses timestamps to capture only new content.
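
The pacing idea can be sketched in a few lines of Python. This is a minimal illustration, not weiboSpider's actual code: the function name `paced` and all of its parameters (`min_wait`, `max_wait`, `pause_every`, `pause_seconds`) are hypothetical, and the real tool's settings may be named and structured differently.

```python
import random
import time


def paced(items, min_wait=1.0, max_wait=3.0, pause_every=5,
          pause_seconds=10.0, sleep=time.sleep, rng=random):
    """Yield items one at a time, waiting between them.

    Hypothetical helper: after most items it sleeps for a random
    interval in [min_wait, max_wait]; after every `pause_every`-th
    item it takes a longer fixed pause instead.
    """
    for index, item in enumerate(items, start=1):
        yield item
        if index % pause_every == 0:
            sleep(pause_seconds)  # longer, global pause
        else:
            sleep(rng.uniform(min_wait, max_wait))  # randomized short wait
```

A caller would wrap whatever it iterates over, for example `for page in paced(page_numbers): ...`. Passing `sleep` as a parameter is only a convenience that lets the waits be replaced by a recorder when testing.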

Capabilities include keyword-based post searches within defined time windows, the harvesting of original images and videos, and social network mapping via the extraction of following lists. Extracted data can be filtered by date or originality and is persisted to flat files, relational databases, or document databases, or transmitted as token-authenticated JSON payloads to remote API endpoints.
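
As a rough illustration of the filtering and API-delivery steps, the sketch below filters posts by date and originality, then prepares a JSON request that carries a token. Everything here is assumed for the example: the post fields (`published`, `original`), the function names, the payload shape, and the use of a `Bearer` token in an `Authorization` header. The project may use a different schema and a different way of passing the token.

```python
import json
from datetime import date
from urllib.request import Request


def select_posts(posts, since, until, originals_only=False):
    """Keep posts published within [since, until]; optionally drop reposts."""
    return [
        p for p in posts
        if since <= date.fromisoformat(p["published"]) <= until
        and (p["original"] or not originals_only)
    ]


def build_request(posts, endpoint, token):
    """Prepare (but do not send) a JSON POST carrying the posts and a token."""
    body = json.dumps({"posts": posts}, ensure_ascii=False).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",  # assumed header scheme
    }
    return Request(endpoint, data=body, headers=headers, method="POST")
```

The request object could then be sent with `urllib.request.urlopen`; it is left unsent here so the example stays free of network access.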

## Tags

### Data & Databases

- [Social Platform Data Extraction](https://awesome-repositories.com/f/data-databases/data-extraction-tools/social-platform-data-extraction.md) — Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/academic.md))
- [Data Extraction Pipelines](https://awesome-repositories.com/f/data-databases/data-extraction-pipelines.md) — Automates the workflow of parsing and synchronizing social media data into structured JSON payloads.
- [Temporal Range Filtering](https://awesome-repositories.com/f/data-databases/custom-time-range-queries/temporal-range-filtering.md) — Allows users to define specific start and end dates to collect posts within a designated time window. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/settings.md))
- [Keyword-Based Content Collectors](https://awesome-repositories.com/f/data-databases/keyword-based-content-collectors.md) — Retrieves social media posts containing specific keywords within defined time windows.
- [Keyword-Based Trend Monitoring](https://awesome-repositories.com/f/data-databases/keyword-based-trend-monitoring.md) — Searches for and extracts posts containing specific terms within defined time windows for trend analysis.
- [Multi-Destination Data Routing](https://awesome-repositories.com/f/data-databases/multi-destination-data-routing.md) — Routes extracted social media data to multiple destinations including flat files, relational databases, and document databases.
- [Multi-Format Data Persistence](https://awesome-repositories.com/f/data-databases/multi-format-data-persistence.md) — Persists extracted information across various storage types, including flat files and relational or document databases. ([source](https://github.com/dataabc/weibospider#readme))
- [Date-Based Filters](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-indexing/search-and-indexing/content-search-filters/date-based-filters.md) — Limits the extraction of posts to a specific time window using configurable date parameters. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/example.md))

### Web Development

- [Data Extraction](https://awesome-repositories.com/f/web-development/data-extraction.md) — Implements a specialized system for extracting posts, profiles, and engagement metrics from Sina Weibo.
- [Web Scrapers](https://awesome-repositories.com/f/web-development/web-scrapers.md) — Functions as a Python-based tool for extracting structured data from the Sina Weibo platform.

### Part of an Awesome List

- [Social Media Post Retrievers](https://awesome-repositories.com/f/awesome-lists/media/media-and-content/social-media-post-retrievers.md) — Collects posts and engagement metrics from social networks for academic research and analysis.
- [User Identifier Harvesting](https://awesome-repositories.com/f/awesome-lists/ai/reasoning-frameworks/cognitive-reasoning-patterns/memory-pattern-extraction/user-profile-extraction/user-identifier-harvesting.md) — Extracts specific user IDs from profile pages or discovers new IDs by crawling following lists. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/userid.md))

### Content Management & Publishing

- [Social Media Archiving Tools](https://awesome-repositories.com/f/content-management-publishing/social-media-archiving-tools.md) — Schedules regular crawls to permanently save and organize content from social media platforms.
- [Full-Text Content Extraction](https://awesome-repositories.com/f/content-management-publishing/full-text-content-extraction.md) — Extracts full-text versions of long-form posts by visiting the specific detail pages of the social platform. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/FAQ.md))

### DevOps & Infrastructure

- [Incremental Data Collection](https://awesome-repositories.com/f/devops-infrastructure/crawl-initiators/incremental-data-collection.md) — Resumes extraction from the last successful crawl timestamp to capture only new posts. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/automation.md))

### Networking & Communication

- [Social Profile Extractors](https://awesome-repositories.com/f/networking-communication/http-data-extractors/social-profile-extractors.md) — Gathers long-form text, images, and videos from Sina Weibo profiles using session cookies.

### Security & Cryptography

- [Session & Cookie Handlers](https://awesome-repositories.com/f/security-cryptography/session-cookie-handlers.md) — Retrieves session cookies from browser developer tools to maintain authenticated sessions for data extraction. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/cookie.md))
- [Cookie-Based Authentication Bridges](https://awesome-repositories.com/f/security-cryptography/session-cookie-handlers/cookie-based-authentication-bridges.md) — Utilizes browser session cookies to authenticate requests and access protected user profiles on Sina Weibo.
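
Cookie-based authentication of this kind generally means copying the `Cookie` string from the browser's developer tools and sending it back with each request. The sketch below shows one way that could look; the function names and the example cookie names are hypothetical and are not taken from the project.

```python
def parse_cookie_string(raw):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def auth_headers(raw_cookie, user_agent="Mozilla/5.0"):
    """Build request headers that replay the browser session cookie."""
    return {"Cookie": raw_cookie.strip(), "User-Agent": user_agent}
```

For example, `parse_cookie_string("a=1; b=x=y")` gives `{"a": "1", "b": "x=y"}`, because only the first `=` in each fragment separates the name from the value.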

### Software Engineering & Architecture

- [Incremental Crawling](https://awesome-repositories.com/f/software-engineering-architecture/request-throttling/crawling-request-throttlers/incremental-crawling.md) — Tracks the most recent post date per user to capture only new content since the last execution.
- [Client-Side Request Pacing](https://awesome-repositories.com/f/software-engineering-architecture/traffic-management/request-rate-limiting/client-side-request-pacing.md) — Implements randomized wait intervals between requests to mimic human behavior and avoid platform rate limits.
- [Community Content Details](https://awesome-repositories.com/f/software-engineering-architecture/component-lifecycle-management/component-detail-retrievers/content-detail-retrievers/community-content-details.md) — Navigates to detailed post pages to retrieve complete long-form text when summary views are truncated.
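
Incremental crawling as described above amounts to remembering the newest post date per user and skipping anything older on the next run. The following is a minimal sketch under assumptions: the state file format, the `published` field, and the function names are invented for illustration and do not reflect how weiboSpider stores its progress.

```python
import json
from datetime import datetime
from pathlib import Path


def load_state(path):
    """Return {user_id: ISO timestamp of newest stored post}, or {} on first run."""
    file = Path(path)
    return json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}


def only_new(posts, last_seen):
    """Keep posts strictly newer than `last_seen` (None keeps everything)."""
    if last_seen is None:
        return list(posts)
    cutoff = datetime.fromisoformat(last_seen)
    return [p for p in posts if datetime.fromisoformat(p["published"]) > cutoff]


def record_run(path, state, user_id, posts):
    """Advance the stored timestamp for `user_id` and write the state file."""
    if posts:
        newest = max(posts, key=lambda p: datetime.fromisoformat(p["published"]))
        state[user_id] = newest["published"]
    Path(path).write_text(json.dumps(state), encoding="utf-8")
    return state
```

A run would call `load_state`, pass `state.get(user_id)` to `only_new`, store the surviving posts, and finish with `record_run` so the next execution starts from the updated timestamp.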

### Artificial Intelligence & ML

- [Keyword Search Crawlers](https://awesome-repositories.com/f/artificial-intelligence-ml/bot-platforms/platform-normalization-adapters/platform-search-adapters/keyword-search-crawlers.md) — Queries the platform search API by keyword to collect matching posts within specific time ranges. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/FAQ.md))

### Business & Productivity Software

- [Public Account Following Retrievers](https://awesome-repositories.com/f/business-productivity-software/polls-and-events/audience-engagement-tools/follower-notification-systems/follower-relationship-managers/public-account-following-retrievers.md) — Retrieves complete lists of accounts that a public social media user follows. ([source](https://github.com/dataabc/weiboSpider/blob/master/docs/FAQ.md))
- [Automated Extraction Schedulers](https://awesome-repositories.com/f/business-productivity-software/scheduling-automation/automated-extraction-schedulers.md) — Runs automated extraction processes at regular intervals to capture new social media content. ([source](https://github.com/dataabc/weibospider#readme))

### Development Tools & Productivity

- [Social Network Discovery](https://awesome-repositories.com/f/development-tools-productivity/search-discovery-tools/recursive-discovery-engines/social-network-discovery.md) — Expands the target user list by recursively extracting identifiers from the following lists of crawled accounts.
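
Recursive discovery of this sort is essentially a breadth-first traversal over following lists. The sketch below assumes a caller-supplied `get_following(user_id)` function that returns an iterable of user IDs; that function, the name `discover_users`, and the `max_depth` limit are all hypothetical and stand in for whatever the tool does internally.

```python
from collections import deque


def discover_users(seeds, get_following, max_depth=1):
    """Breadth-first expansion of a seed user list via following lists.

    `get_following` is a placeholder for the code that fetches the IDs
    a given user follows. Users at `max_depth` are recorded but their
    own following lists are not fetched.
    """
    seen = set(seeds)
    queue = deque((uid, 0) for uid in seeds)
    while queue:
        uid, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for followed in get_following(uid):
            if followed not in seen:
                seen.add(followed)
                queue.append((followed, depth + 1))
    return seen
```

The `seen` set keeps accounts that follow each other from being queued twice, and the depth limit bounds how far the crawl spreads from the seed users.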

### Graphics & Multimedia

- [Social Media Content Extractors](https://awesome-repositories.com/f/graphics-multimedia/multimedia-asset-extractors/social-media-content-extractors.md) — Provides automated downloading of original images, videos, and live photos from social media posts.
- [Social Relationship Maps](https://awesome-repositories.com/f/graphics-multimedia/relationship-network-maps/social-relationship-maps.md) — Extracts follower lists and user identifiers to visualize and analyze relationships between accounts.
- [Social Media Asset Downloaders](https://awesome-repositories.com/f/graphics-multimedia/social-media-asset-downloaders.md) — Saves original images, videos, and live photo formats from posts and retweets to local directories. ([source](https://github.com/dataabc/weibospider#readme))
