30 open-source projects similar to hect0x7/jmcomic-crawler-python, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best JMComic Crawler Python alternative.
Hakuneko is a cross-platform manga downloader and media scraper designed to save manga and anime images and videos from various websites. It functions as a tool for offline media consumption, allowing users to extract visual content from web sources and save it to local storage. The application enables media archiving on Windows, Linux, and macOS. It focuses on web content scraping to create local archives of images and videos, ensuring content remains accessible without an internet connection. The system manages these tasks through a connector architecture and
Copymanga is a mobile manga reader application designed for browsing, reading, and downloading manga content from remote web sources. It functions as a manga library manager that allows users to organize favorite series and track their reading progress. The application includes a cross-device synchronization client that maintains user identity, preferences, and reading history across multiple devices via cloud storage. It also operates as an offline downloader, saving manga chapters as compressed image files to local storage for consumption without an internet connection. The platform covers
This project is a web-based manga and novel downloader and multi-site web scraper designed to extract images and text from diverse media platforms. It functions as a digital media archiver and EPUB e-book generator, using a plugin-based crawler architecture with site-specific scripts to define how content is extracted from various international websites. The system distinguishes itself through authenticated web crawling, using browser cookie simulation to access restricted or member-only content. It includes specialized capabilities for digital comic archiving, which organizes image sequences
This project is a Telegram API client and media archiving system designed to programmatically retrieve chat histories and export media. It functions as a download manager and message forwarder, allowing users to back up photos, videos, and documents from Telegram chats into structured local archives. The system distinguishes itself through advanced content filtering and forwarding capabilities. It can monitor chats for new messages, apply custom regular expressions to filter media by size or date, and automatically forward content between chats. This includes the ability to export protected c
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
udemy-dl is a Python command-line tool and web content scraper designed to download Udemy course videos, subtitles, and supplementary materials for offline personal use. It functions as a course media archiver that authenticates via user credentials or cookies to retrieve restricted media and metadata. The utility distinguishes itself through batch media retrieval, allowing the sequential download of multiple courses from a list of URLs. It provides granular control over the archive process, including the ability to filter specific chapters or lectures and export direct download links to a fi
PythonSpiderNotes is a comprehensive instructional resource and framework for building web crawlers and extracting data using the Python programming language. It provides a set of methods for parsing unstructured HTML and JSON data into structured formats for persistent storage. The project includes detailed guides and tutorials on browser automation for retrieving dynamic content, as well as a framework for data extraction. It specifically covers anti-bot bypass techniques, such as rotating proxies and spoofing headers, to avoid IP blocks and detection systems. The capability surface extend
Ladder is a web proxy server and HTTP response modifier designed to circumvent bot protections, CORS restrictions, and paywalls. It functions by intercepting traffic to modify HTML, CSS, and JavaScript via regular expressions and altering HTTP headers to reveal restricted content. The project distinguishes itself through its ability to bypass anti-scraping mechanisms and specialized bot detection, such as Cloudflare, by integrating with external challenge-solving services. It also enables client identity emulation by spoofing user agents and network identifiers to masquerade as different brow
This project is a manga source extension repository and content aggregator. It functions as an HTTP content scraping engine that retrieves images and metadata from external provider websites by parsing HTML and making network requests to display digital manga within a unified reader. The system utilizes a JSON extension repository to allow reader applications to discover and install third-party content providers. It employs an interface-based plugin framework that defines a common set of methods to ensure external sources remain compatible with a standardized internal format. The project cov
TopList is a trending news aggregator and headline aggregation dashboard. Built as a Go web crawler, it functions as a centralized tool for collecting and displaying popular current topics from diverse online news and social sources in a single view. The system focuses on multi-source content scraping and web headline monitoring. It retrieves information from various web pages to consolidate trending headlines into a unified feed.
This project is a vulnerability search engine and security knowledge base designed to collect and index public security disclosures. It functions as a vulnerability database crawler that extracts technical reports and security flaws from websites to create a searchable local archive. The system utilizes a security knowledge indexer and a full-text inverted index to convert unstructured crawled data into a structured format. This allows for keyword-based information retrieval, enabling the location of specific security flaws and technical details through a dedicated search interface. The plat
Karakeep is a self-hosted, open-source platform designed for personal knowledge management and web content archiving. It functions as a centralized repository where users can capture, organize, and preserve bookmarks, notes, and media files, ensuring long-term access to digital information even if original sources are removed or modified. The system distinguishes itself through its automated content processing and security-focused architecture. It utilizes headless browser crawling and optical character recognition to ingest and index web content, while a modular artificial intelligence pipel
Nominatim is a self-hosted geospatial search engine and geocoding server that utilizes OpenStreetMap data. It provides a complete infrastructure for forward geocoding, converting addresses or place names into geographic coordinates, and reverse geocoding, translating coordinates into human-readable physical addresses. The project features a dedicated data importer that parses raw map data into a PostgreSQL geospatial database. It distinguishes itself through a configurable import pipeline that uses style files to filter map features and an importance-based ranking system to prioritize search
JHenTai is a specialized gallery client and media browser for accessing adult image galleries and manga. It functions as a multi-source media downloader and local image gallery manager, allowing users to find, view, and save content from remote sources. The application features biometric-based access control, using fingerprint scanning to secure private image libraries and account profiles. It also includes a fallback-based request routing system that redirects content requests to secondary sources when primary data providers fail to respond. The platform covers broad capabilities for conten
This project is a Python-based web scraping tool and command line image downloader designed to automate the retrieval of images from Google Images. It functions as an image dataset collector, allowing users to gather large sets of images for data analysis or research through a terminal interface or programmatic scripts. The tool features advanced search filtering to restrict results by file format, color, size, aspect ratio, and usage rights. It also supports reverse image search to find visually similar media based on a provided URL and offers search scope expansion to increase result volume
This project is a cross-platform messaging SDK and client development library used to build custom Telegram applications. It functions as a comprehensive framework that manages network encryption, local data storage, and API communication, providing a C-compatible JSON interface that allows integration with any programming language. The library distinguishes itself by providing a full database manager for encrypted local caching and synchronized state, alongside a dedicated bot framework for creating interactive bots with business account integration. It enables the implementation of speciali
picacg-qt is a cross-platform desktop client for browsing and reading digital manga and comics from remote services. It provides a dedicated interface for navigating remote libraries, searching for specific titles, and viewing media content on Windows, Linux, and macOS. The application includes an integrated artificial intelligence tool to upscale the resolution and visual quality of comic images. It also functions as an offline downloader, allowing users to archive comic sets and images from remote providers to local storage. The system handles asynchronous media downloading and local image
This project is a community-curated directory of open-source software designed for deployment in private server environments and home labs. It serves as a comprehensive resource for discovering independent, self-hosted alternatives to mainstream cloud services, enabling users to maintain full data ownership and control over their digital infrastructure. The directory is structured through a hierarchical taxonomy that organizes a vast collection of applications into logical categories, ranging from media management and data analytics to private communication and team productivity tools. It dis
UltimaScraper is an automated content downloader and media scraper designed to capture images and videos from OnlyFans accounts and save them to local storage. It functions as a media archive manager that organizes large volumes of downloaded content into structured folder hierarchies. The system includes a webhook notifier that sends automated alerts to external URLs upon the completion of download tasks. It utilizes custom naming patterns and directory structuring to sort files by username and media type. The tool provides resource control through concurrent download limiting to manage net
XHS-Downloader is a media downloader and content scraper for Xiaohongshu designed to extract and save images, videos, and metadata from profiles, search results, and shared links. It functions as a background service that can automatically detect and download media when platform URLs are copied to the system clipboard. The project provides a server with an HTTP API endpoint for programmatically triggering media downloads and extracting work details via external scripts. It includes a media asset manager that sorts downloaded content into custom folders using filename patterns based on author
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
Jasmine is a digital comic reader and community content platform designed for browsing, reading, and organizing digital comic collections. It functions as a comic library manager that allows users to track reading progress, save favorite titles, and categorize comic series. The application is an offline-capable web app that employs local-first data and content caching to ensure comic pages remain accessible without an internet connection. It features a responsive page viewer that adjusts comic dimensions based on the device screen size to maintain readability. The platform integrates social
This project is a local media management platform designed for organizing, browsing, and analyzing large collections of AI-generated images and videos. It functions as a specialized browser that extracts and parses embedded generation parameters, allowing users to manage their creative assets through a high-performance interface. The platform distinguishes itself through semantic search and organization capabilities, which use vector indexing to enable natural language queries across local file libraries. It automates the sorting and tagging of media based on prompt similarity and visual cont
ConvertX is a web-based file conversion management platform designed to transform documents, images, and video files between various formats. It utilizes system-level binary orchestration to execute conversion tasks, leveraging background worker threads to handle concurrent, high-volume bulk processing without blocking the user interface. The platform distinguishes itself through a comprehensive security and access control framework, which includes multi-user account management, session-based token authentication, and role-based permissions. Users can secure their output files with passwords
node-crawler is a programmable web crawler for Node.js that manages request queues and automates data extraction. It functions as a rate-limited HTTP client and a headless HTML parser, providing the infrastructure to visit large sets of URLs asynchronously while preventing duplicate processing through task deduplication. The project distinguishes itself through a proxy rotation manager that cycles user agents and proxy servers to bypass access restrictions. It utilizes the HTTP/2 protocol to improve request performance and server compatibility during large-scale scraping operations. The syst
Amundsen is a data catalog and discovery platform that provides a centralized directory for indexing tables and dashboards. It functions as a metadata management system and search engine, allowing users to locate and understand available data assets across diverse distributed sources. The platform includes capabilities for data lineage tracking to map the origin and movement of datasets between systems. It also serves as a data profiling tool, calculating distribution and quality statistics for individual table columns to provide automated insights into the nature of the data. The system man
EverythingPowerToys is a high-performance file and folder search tool for Windows that functions as a system-wide file indexer. It provides near-instant retrieval of files and directories across local storage using a centralized interface and a high-performance indexing engine. The utility specializes in advanced file querying by supporting regular expression patterns to locate files based on complex naming schemes. It also resolves system environment variables within search queries to find files in dynamic directory paths. The project covers a broad range of file management and search capab
This project is a Node.js web scraping framework designed to automate data extraction through a programmatic workflow of requests, parsing, and document interaction. It functions as a headless web crawler, an HTTP request manager, and a DOM parser and extractor. The framework distinguishes itself by combining a JavaScript execution engine to interact with dynamic content and a hybrid selection system that utilizes both CSS and XPath selectors. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated states and manage automated traffic.
Res-downloader is a network proxy utility designed to intercept, analyze, and extract multimedia assets from web traffic. It functions as a gateway that captures video, audio, and image files directly from data streams for local storage and offline access. The tool employs man-in-the-middle interception to decrypt and inspect network packets, allowing it to identify media resources through pattern matching and content type filtering. It integrates proxy-based routing to manage outgoing requests, enabling the retrieval of content that may be subject to regional restrictions or network-level ac