30 open-source projects similar to nanmicoder/mediacrawler, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best MediaCrawler alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
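The resource-aware concurrency idea described for Crawlee can be illustrated with a minimal sketch. This is not Crawlee's API or its actual scaling algorithm: the class name `AutoscaledLimit`, the thresholds, and the halve-or-increment policy are all assumptions chosen for illustration, and the CPU and memory readings are passed in by the caller rather than measured.

```python
from dataclasses import dataclass


@dataclass
class AutoscaledLimit:
    """Hypothetical controller that adjusts a concurrency limit from load samples."""

    min_concurrency: int = 1
    max_concurrency: int = 32
    max_cpu: float = 0.90      # assumed CPU utilisation ceiling (0..1)
    max_memory: float = 0.80   # assumed memory utilisation ceiling (0..1)
    current: int = 4

    def update(self, cpu: float, memory: float) -> int:
        """Take one utilisation sample and return the new concurrency limit."""
        overloaded = cpu > self.max_cpu or memory > self.max_memory
        if overloaded:
            # Back off quickly: halve the limit, but never go below the minimum.
            self.current = max(self.min_concurrency, self.current // 2)
        else:
            # Probe upward slowly: allow one more task, up to the maximum.
            self.current = min(self.max_concurrency, self.current + 1)
        return self.current


limit = AutoscaledLimit()
after_idle = limit.update(cpu=0.50, memory=0.50)   # headroom available, scale up
after_spike = limit.update(cpu=0.95, memory=0.50)  # CPU over the ceiling, scale down
```

A real controller would sample the host periodically and smooth the readings over a window; the sketch only shows the feedback rule itself.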
Kitty is a high-performance, GPU-accelerated terminal emulator designed to provide a consistent and extensible workspace across different operating systems. It leverages graphics hardware to render text, images, and complex layouts with low latency, while providing a robust environment for demanding command-line workflows. The project distinguishes itself through its integrated workspace management and programmable interface. It functions as a tiling window manager that organizes terminal windows, tabs, and layouts into persistent, keyboard-driven sessions. Users can automate complex workflow
nodriver is an asynchronous Chromium browser automation framework that provides headless control and web scraping capabilities. It functions as a Chrome DevTools Protocol client, allowing for granular engine control by attaching directly to the browser's debug port without the need for external driver binaries. The framework is specifically designed as an anti-bot detection bypass tool. It modifies browser fingerprints and protocol headers to evade automated security systems, handle security warnings, and bypass common obstacles like insecure connection alerts. The system covers a broad rang
Dev-browser is a browser automation framework and headless browser controller that provides a sandboxed script runner for executing web tasks. It functions as a vision-based web automator and a specialized interface for large language models, enabling navigation of and interaction with web pages within isolated execution environments. The project distinguishes itself by converting complex web pages into simplified representations and coordinate-based maps, allowing AI agents to analyze layouts and perform actions based on pixel locations. It employs a mapping system that assigns unique identif
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Twint is an open-source intelligence and data extraction framework designed to gather public social media information. It functions as a command-line utility that retrieves posts, user profiles, and follower lists directly from web interfaces, bypassing the need for official platform developer credentials or authentication keys. The tool distinguishes itself by enabling automated, large-scale data collection through terminal-based orchestration. It supports granular filtering by keywords, geographic locations, time ranges, and account status, allowing researchers to build targeted datasets fo
Automa is a browser-based automation platform that enables users to build, schedule, and execute repetitive web tasks through a visual, no-code interface. By operating as a browser extension, it provides a canvas-based environment where users construct workflows by connecting functional blocks to interact with web elements, manage browser state, and process data. The platform distinguishes itself through its deep integration with the browser environment, allowing for complex orchestration such as event-driven triggers, cross-origin request handling, and the ability to package workflows as sta
Aria2 is a multi-protocol command-line download manager designed to maximize bandwidth utilization by retrieving files from multiple sources and protocols simultaneously. It functions as an asynchronous, event-driven engine that handles complex download lifecycles, including peer-to-peer transfers via BitTorrent, while ensuring data integrity through continuous chunk-based verification. The utility distinguishes itself through its ability to act as a background process that can be controlled programmatically via a remote procedure call interface. This allows external applications to manage, m
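As a rough illustration of the programmatic control described for aria2, the sketch below builds a JSON-RPC request body for the `aria2.addUri` method. The helper name `add_uri_request`, the example URL, and the secret value are hypothetical; the sketch only constructs the payload and does not contact a running aria2 process.

```python
import json


def add_uri_request(uris, secret=None, request_id="1"):
    """Build a JSON-RPC 2.0 body asking aria2 to download one file from the given URIs."""
    params = []
    if secret is not None:
        # When an RPC secret is configured, it is sent as the first parameter.
        params.append(f"token:{secret}")
    # All URIs in this list are treated as sources for the same file.
    params.append(list(uris))
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "aria2.addUri",
            "params": params,
        }
    )


body = add_uri_request(["https://example.com/file.iso"], secret="s3cret")
```

The resulting string would be sent in an HTTP POST to the RPC endpoint of an aria2 instance started with RPC enabled; the host, port, and secret depend on how that instance is configured.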
This project is a command-line interface that bridges local development workflows with remote platform services. It functions as a terminal-based platform client, enabling users to manage repositories, issues, and pull requests directly from their command line through authenticated API interactions. The tool provides a modular environment that supports custom binary extensions and command aliases, allowing developers to tailor their terminal experience to specific project needs. Beyond standard repository management, the tool serves as a remote development manager, offering capabilities to pr
Nightmare is an Electron-based browser automation library and headless browser controller. It provides the infrastructure to programmatically navigate web pages, interact with DOM elements, and execute JavaScript within a background browser instance. The project distinguishes itself by integrating a full Chromium instance within an Electron shell, allowing for the management of browser sessions, network proxy settings, and persistent storage partitions. It enables the capture of page states as PNG screenshots, PDF documents, or HTML files. The tool covers a broad range of capabilities includ
Botasaurus is a Python web scraping framework and headless browser automation system used to build scalable data extraction tools. It functions as a web data extraction tool and OCR document parser, converting website content, images, and PDF files into structured formats such as JSON, CSV, and Excel. The framework distinguishes itself by providing a scraper management interface that allows Python functions to be wrapped in a web-based UI or deployed as standalone desktop applications. This enables non-technical users to trigger extraction jobs and manage tasks via a graphical interface or RE
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
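The Bezier-curve mouse movement mentioned for pydoll can be sketched generically. This is not pydoll's implementation or API: the function name `bezier_path`, the `jitter` parameter, and the way control points are randomised are assumptions made for illustration. The sketch computes points along a cubic Bezier curve between two screen coordinates so that a pointer would follow a curved path instead of a straight line.

```python
import random


def bezier_path(start, end, steps=20, jitter=100.0, rng=None):
    """Return steps + 1 points on a cubic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Place two control points roughly a third and two thirds of the way along,
    # then offset them randomly so each generated path bends differently.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula evaluated at parameter t.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


path = bezier_path((0, 0), (300, 200), steps=10, rng=random.Random(1))
```

A caller would then move the pointer through `path` with small, varying delays between points; the timing model is outside the scope of this sketch.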
This project is a graphical user interface for controlling, configuring, and monitoring AI agents that automate web browser interactions. It provides a visual dashboard to execute autonomous web tasks and manage the behavior of browser-based agents without requiring raw code for every operation. The system includes a browser profile manager to link agents to local executables and user data directories, which allows for persistent authenticated sessions. To support remote observation, it features a VNC streamer that provides a real-time visual feed of headless browser agents operating within a
Social-analyzer is an open-source intelligence framework designed for the automated discovery, correlation, and verification of digital identities across online platforms. It functions as a comprehensive engine for gathering social media intelligence, utilizing distributed browser automation to extract metadata and profile information from hundreds of websites simultaneously. The platform distinguishes itself through its ability to perform cross-platform identity correlation using heuristic-based pattern matching and name permutation generation. It processes these findings through a confidenc
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation for interacting with JavaScript-rendered pages and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
Instaloader is a Python library and command-line utility designed for the automated retrieval, archiving, and analysis of Instagram content. It provides a programmatic interface to fetch media, captions, and metadata from public or private profiles, hashtags, and stories, while maintaining persistent user sessions for authorized access. The tool distinguishes itself through robust archive management and traffic control mechanisms. It supports incremental synchronization, allowing users to resume interrupted downloads and update local collections without redundant requests. To ensure reliable
LaVague is an LLM web agent framework and large action model designed to translate natural language instructions into executable browser automation scripts. It functions as a multi-modal orchestrator that reasons over web page states and HTML content to automate multi-step tasks via a Selenium-based automation engine. The framework features a modular model provider layer, allowing users to swap between different language and vision models from providers such as Anthropic, Gemini, and Azure OpenAI. It employs a multi-modal world model to process screenshots and HTML structures, utilizing retri
Agent-Reach is an AI agent web gateway and search tool that provides language models with the ability to search and read content from the open web, social media, and community forums without using official APIs. It functions as a routing layer that connects large language models to various internet backends while managing content parsing and connection health. The system enables API-free information retrieval by using open-source backends to extract text and metadata from platforms such as Twitter, Reddit, and YouTube. It converts unstructured website content, RSS feeds, and video transcripts
Osintgram is a command-line utility designed for open-source intelligence gathering and the extraction of public data from social media profiles. It functions as a framework for collecting and processing user information to assist in digital investigations and the mapping of digital footprints. The tool distinguishes itself through a modular architecture that organizes intelligence-gathering tasks into independent scripts, all sharing a unified session state and data processing pipeline. It utilizes headless browser automation and session-based interactions to mimic legitimate user behavior,
Streamlink is a command-line video stream extractor that retrieves direct stream URLs from online services for use in external media players. It functions as a local media stream pipe, redirecting raw video data from web services into local files or players via standard input or HTTP. The project includes a headless browser stream scraper to intercept network requests and extract media data from script-heavy websites, alongside a dedicated processor for HLS and DASH segmented media streams. The tool utilizes a modular video plugin framework, allowing support for new streaming platforms to be
This project is a RESTful media extraction service that provides a programmatic interface for downloading video and image content from social media platforms. It functions as a scraper that parses shared URLs and user profile identifiers to isolate direct media streams and associated metadata from platform-specific data structures. The service distinguishes itself through its ability to emulate cryptographic signatures and security tokens required to authenticate requests against protected backend services. By simulating headless browser behavior and managing cookies and headers, the system b
DrissionPage is a Python library designed for web automation, data scraping, and testing. It functions as a browser automation framework that communicates directly with the browser engine via the Chrome DevTools Protocol, allowing for precise control over browser instances and page states. The library distinguishes itself by providing a unified interface that combines full browser automation with raw HTTP request capabilities. This hybrid approach allows users to switch between lightweight network requests and heavy browser-based interactions within a single workflow. By wrapping asynchronous
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Clashfree is a network traffic routing platform designed to facilitate access to restricted online resources and digital services. It functions as a proxy configuration management tool that enables users to route internet traffic through encrypted tunnels, effectively bypassing regional access restrictions. The system provides a centralized way to manage network proxy connections and organize multiple routing profiles across various environments. The project distinguishes itself by providing automated subscription services that distribute daily updated proxy node lists and configuration files
workerd is a serverless edge runtime designed for executing lightweight, distributed functions at the network edge. It utilizes a V8-based JavaScript engine to provide fast startup and low memory overhead, while maintaining a WebAssembly-compatible execution environment that allows modules to run alongside JavaScript for high-performance computational tasks. The runtime supports isolate-based multi-tenancy to run multiple independent execution contexts within a single process. It implements an event-driven execution model that triggers code based on network requests or scheduled events and in
RSSHub is a headless, server-side engine designed to generate standardized RSS and Atom feeds from websites that do not natively provide them. By acting as an extensible data aggregator, it enables the automated collection of web content, allowing users to monitor updates from disparate sources through centralized feed readers or workflow automation tools. The platform distinguishes itself through a route-based data extraction framework that maps specific URL patterns to custom scraping logic. This modular architecture is supported by a middleware-driven request pipeline and declarative route
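The route-based extraction idea described for RSSHub, where URL patterns map to custom scraping logic, can be sketched in a few lines. This is not RSSHub's code or route syntax: the `route` decorator, the `:name` parameter style, the `/example/user/:name` path, and the returned dictionary shape are all hypothetical, and the handler returns a stub instead of scraping anything.

```python
import re

# Registry of (compiled pattern, handler) pairs, filled in by the decorator below.
ROUTES = []


def route(pattern):
    """Register a handler for a path pattern such as '/example/user/:name'."""
    # Turn each ':param' segment into a named regex group matching one path segment.
    regex = re.compile("^" + re.sub(r":(\w+)", r"(?P<\1>[^/]+)", pattern) + "$")

    def register(handler):
        ROUTES.append((regex, handler))
        return handler

    return register


@route("/example/user/:name")
def user_feed(name):
    # A real handler would fetch and parse the target site here.
    return {"title": f"Posts by {name}", "items": []}


def dispatch(path):
    """Find the first route matching the path and call its handler."""
    for regex, handler in ROUTES:
        match = regex.match(path)
        if match:
            return handler(**match.groupdict())
    return None


feed = dispatch("/example/user/alice")
```

Keeping each site's logic behind its own route is what makes this style of aggregator easy to extend: adding a source means registering one more handler, without touching the dispatcher.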
gstack is an AI agent framework and development workflow system designed to automate the software development lifecycle. It coordinates specialized AI personas to manage tasks across product design, engineering management, and quality assurance, transforming product intent into technical specifications and final releases. The project is distinguished by its deep integration of headless browser automation and semantic code memory. It utilizes a persistent Chromium daemon for web scraping and visual auditing, and implements a searchable knowledge base that logs architectural decisions and repos