MediaCrawler

MediaCrawler is an automated web scraping framework designed to extract public posts, comments, and creator metadata from various social media platforms. It functions as a headless browser automator, utilizing real browser instances to render dynamic content and execute the client-side scripts necessary for interacting with modern web interfaces.

The system distinguishes itself through a focus on session persistence and network flexibility. It supports remote debugging to reuse active browser sessions and cookies, which helps minimize the risk of triggering platform security challenges. To maintain stable data collection at scale, the tool integrates proxy-based request routing, allowing users to distribute traffic across external IP services to bypass rate limits and geographic restrictions.
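The session-reuse idea above can be illustrated with the HTTP discovery endpoint that a Chromium-based browser exposes when it is started with --remote-debugging-port. The sketch below is not MediaCrawler's code: the helper names (version_endpoint, extract_ws_url, discover_ws_url) and the default port 9222 are assumptions made for illustration. It only shows how a client could locate the WebSocket URL of a browser that is already running and logged in.

```python
import json
import urllib.request


def version_endpoint(host="127.0.0.1", port=9222):
    """Build the URL of the browser's /json/version discovery endpoint."""
    return f"http://{host}:{port}/json/version"


def extract_ws_url(payload):
    """Pull the browser-level WebSocket debugger URL out of the JSON reply."""
    info = json.loads(payload)
    ws_url = info.get("webSocketDebuggerUrl")
    if not ws_url:
        raise ValueError("no webSocketDebuggerUrl in response")
    return ws_url


def discover_ws_url(host="127.0.0.1", port=9222, timeout=3.0):
    """Query a running browser; needs one started with remote debugging on."""
    with urllib.request.urlopen(version_endpoint(host, port), timeout=timeout) as resp:
        return extract_ws_url(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written sample reply (illustrative values).
sample = (
    '{"Browser": "Chrome/120.0.0.0", '
    '"webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/abc"}'
)
ws_url = extract_ws_url(sample)
```

An automation library can then attach to the returned URL instead of launching a fresh browser, so the cookies and login state of the existing profile carry over.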

The architecture is built for extensibility and modularity, employing a provider pattern that allows developers to integrate new platforms or custom storage backends through standardized interfaces. Users can manage complex scraping workflows via command-line configuration, enabling the definition of specific targets and storage formats—such as JSON, CSV, or various database systems—without modifying the core logic. The project also includes utilities for data visualization, such as generating word clouds from collected comments.
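A minimal sketch of how a provider pattern and a command-line option can fit together is shown here. Every name in it (Store, JsonStore, CsvStore, STORES, the --save-format flag, run_export) is hypothetical and chosen for illustration; the project's real interfaces and flags may differ. The point is that the export routine never names a concrete backend, so a new format is added by registering one more class.

```python
import abc
import argparse
import csv
import io
import json


class Store(abc.ABC):
    """Interface every storage backend in this sketch implements."""

    @abc.abstractmethod
    def save(self, records, out):
        """Write records (a list of dicts) to the text stream `out`."""


class JsonStore(Store):
    def save(self, records, out):
        json.dump(records, out, ensure_ascii=False)


class CsvStore(Store):
    def save(self, records, out):
        if not records:
            return
        # Column order follows the keys of the first record.
        writer = csv.DictWriter(out, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


# Registry of providers; adding a backend means adding one entry here.
STORES = {"json": JsonStore, "csv": CsvStore}


def build_parser():
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    parser.add_argument("--save-format", choices=sorted(STORES), default="json")
    return parser


def run_export(argv, records):
    """Pick the backend named on the command line and serialize the records."""
    args = build_parser().parse_args(argv)
    out = io.StringIO()
    STORES[args.save_format]().save(records, out)
    return out.getvalue()


records = [{"id": "1", "text": "hello"}, {"id": "2", "text": "world"}]
as_json = run_export(["--save-format", "json"], records)
as_csv = run_export(["--save-format", "csv"], records)
```

Writing to an in-memory stream keeps the example self-contained; a database-backed provider would implement the same save method against a connection instead.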

Installation requires a Python environment plus two supporting runtimes: a JavaScript engine (Node.js) for handling complex client-side rendering, and the browser automation drivers (the project uses Playwright) that control the browser.
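Before a first run it can help to confirm that the required executables are on the PATH. The following is a small preflight sketch, not part of the project: check_runtime and missing_tools are hypothetical helpers, and the command name "node" is an assumption about how the JavaScript engine is installed on a given machine.

```python
import shutil


def check_runtime(commands):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in commands}


def missing_tools(commands):
    """Return the command names that could not be found."""
    return [name for name, path in check_runtime(commands).items() if path is None]


# The second name is deliberately bogus to show how a missing tool is reported.
report = missing_tools(["node", "no-such-tool-for-this-sketch"])
```

Any name that appears in report needs to be installed or added to the PATH before the crawler is started.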

Features

Web Scrapers - Collects posts, comments, and creator details from social platforms using a unified interface.
Web Scraping Frameworks - Implements automated pipelines for navigating websites and collecting data at scale.
Browser Automation - Controls real web browsers to render dynamic content and execute client-side scripts.
Headless Browser Controllers - Manages headless browser instances to navigate dynamic content and bypass security challenges.
Social Media Scrapers - Automates browser interaction to extract posts, comments, and creator metadata from social platforms.
Browser Session Persistence - Maintains persistent login states to minimize detection and avoid repetitive security challenges.
Media Crawlers - Automates media retrieval across various social platforms using command-line configuration.
Task Execution Engines - Retrieves specific content or user contributions by running crawling tasks across supported platforms.
Automation Scripts - Automates scraping tasks via command-line arguments and configuration files.
Proxy Management - Distributes network traffic across external IP services to circumvent rate limits and access restricted content.
Proxy Management Services - Routes network requests through external proxies to bypass rate limits and geo-blocking.
Data Exporters - Saves collected information into formats like CSV, JSON, SQLite, MySQL, or MongoDB.
Data Storage Adapters - Persists data through interchangeable drivers that abstract the underlying database implementation.
Proxy-Aware Clients - Routes traffic through external services to manage rate limits and access geo-restricted content.
Social Media Extraction Tools - Extracts public posts, comments, and creator metadata from various social platforms.
Data Aggregation Pipelines - Standardizes data retrieval from multiple services into a unified format for consistent processing.
Configuration Management - Decouples task logic from the runtime environment using external configuration files and command-line arguments.
Proxy Configuration Tools - Configures network requests through external proxy services to bypass rate limits.
Remote Debugging Tools - Connects to existing browser instances to reuse cookies and login sessions.
Session Management - Attaches to existing browser instances to reuse active cookies and login sessions.
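The proxy-related features listed above all come down to choosing an outbound proxy for each request. One simple strategy is round-robin rotation, sketched below with the standard library only. ProxyPool is a hypothetical helper and the proxy URLs are placeholders; the project's own proxy handling may use a different selection policy or an external IP provider's API.

```python
import itertools
import urllib.request


class ProxyPool:
    """Round-robin pool of proxy URLs (illustrative, not the project's API)."""

    def __init__(self, proxy_urls):
        proxy_urls = list(proxy_urls)
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy_url, opener) where the opener routes via that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


pool = ProxyPool(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])
# No request is sent here; the openers are only constructed.
picked = [pool.build_opener()[0] for _ in range(3)]
```

Each opener returned by build_opener can be used with its open method to send a request through the selected proxy, so consecutive requests alternate between the configured addresses.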

Name: nanmicoder/mediacrawler
Author: NanmiCoder

51,294 stars | 10,747 forks | Python | 32 views | nanmicoder.github.io/MediaCrawler

Open-source alternatives to MediaCrawler

Similar open-source projects, ranked by how many features they share with MediaCrawler.

apify/crawlee
24,002 stars
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
TypeScript | apify | automation | crawler
kovidgoyal/kitty
33,462 stars
Kitty is a high-performance, GPU-accelerated terminal emulator designed to provide a consistent and extensible workspace across different operating systems. It leverages graphics hardware to render text, images, and complex layouts with low latency, while providing a robust environment for demanding command-line workflows. The project distinguishes itself through its integrated workspace management and programmable interface. It functions as a tiling window manager that organizes terminal windows, tabs, and layouts into persistent, keyboard-driven sessions. Users can automate complex workflow
Pythoncgogolang
ultrafunkamsterdam/nodriver
3,578 stars
nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping capabilities. It functions as a Chrome DevTools Protocol client, allowing for granular engine control by attaching directly to the browser's debug port without the need for external driver binaries. The framework is specifically designed as an anti-bot detection bypass tool. It modifies browser fingerprints and protocol headers to evade automated security systems, handle security warnings, and bypass common obstacles like insecure connection alerts. The system covers a broad rang
Python

Frequently asked questions

What does nanmicoder/mediacrawler do?

What are the main features of nanmicoder/mediacrawler?

The main features of nanmicoder/mediacrawler are: Web Scrapers, Web Scraping Frameworks, Browser Automation, Headless Browser Controllers, Social Media Scrapers, Browser Session Persistence, Media Crawlers, Task Execution Engines.

What are some open-source alternatives to nanmicoder/mediacrawler?

Open-source alternatives to nanmicoder/mediacrawler include:

apify/crawlee - Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction…
kovidgoyal/kitty - Kitty is a high-performance, GPU-accelerated terminal emulator designed to provide a consistent and extensible…
ultrafunkamsterdam/nodriver - nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping…
sawyerhood/dev-browser - Dev-browser is a browser automation framework and headless browser controller that provides a sandboxed script runner…
apify/crawlee-python - Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive…
twintproject/twint - Twint is an open-source intelligence and data extraction framework designed to gather public social media information.…
