21 open-source projects similar to cocrawler/cocrawler, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Cocrawler alternative.
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager to interact with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing for programmatic website navigation and the simulation of complex browser interactions. Its capabilities cover automated form population and submission using CSS selectors
Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster.
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
RoboBrowser: Your friendly neighborhood web scraper
High-speed web crawler built on Eventlet. Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite. Includes command-line tools. Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python). Cookie handlers. Very easy to use (see the example).
The information security department of 360 has a long-running recruitment drive; anyone interested can contact zhangxin1at360.cn.
Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler. Given a list of web links, it uses the Python requests library to query the webpages. Spidy then uses lxml to extract all links from the page and adds them to its list. Pretty simple!
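The fetch, extract, and enqueue loop described for Spidy can be sketched roughly as follows. This is an illustration only, not Spidy's code: it swaps in the standard library's `html.parser` for lxml and a hypothetical `fetch` callable for the requests library so that it runs without network access. The names `LinkParser`, `extract_links`, `crawl`, and `PAGES` are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for lxml)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every link found in `html`."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` is a hypothetical callable: url -> HTML text."""
    todo = deque(seeds)
    seen = set(seeds)
    visited = []
    while todo and len(visited) < max_pages:
        url = todo.popleft()
        html = fetch(url)
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:  # only queue links not already listed
                seen.add(link)
                todo.append(link)
    return visited


# Tiny in-memory "web" so the sketch runs offline (made-up pages).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

order = crawl(["http://example.test/"], lambda url: PAGES.get(url, ""))
print(order)
```

In a real crawler the `fetch` argument would wrap an HTTP client call; the `seen` set is what keeps the link list from growing with repeats.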
This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
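The shared request queue, duplicate filter, and JSON-serialized request state mentioned in this entry can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: `SharedStore` is an in-memory stand-in for Redis, and `fingerprint`, `enqueue`, and `dequeue` are hypothetical helper names.

```python
import hashlib
import json
from collections import deque


class SharedStore:
    """In-memory stand-in for the remote store (assumption: a real
    deployment would keep these structures in Redis instead)."""

    def __init__(self):
        self.queue = deque()  # plays the role of a shared request list
        self.seen = set()     # plays the role of a shared fingerprint set


def fingerprint(request):
    """Stable hash of method + URL, used as the duplicate-filter key."""
    key = f"{request['method']} {request['url']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def enqueue(store, request):
    """Queue a request unless any worker already queued it.

    Returns True if the request was added, False if it was a duplicate.
    """
    fp = fingerprint(request)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    # Request state and metadata travel as a JSON string.
    store.queue.append(json.dumps(request, sort_keys=True))
    return True


def dequeue(store):
    """Pop the oldest request and decode it, or return None if empty."""
    if not store.queue:
        return None
    return json.loads(store.queue.popleft())


store = SharedStore()
# Two workers discover the same page; only the first enqueue succeeds.
added_by_worker_1 = enqueue(store, {"method": "GET", "url": "http://example.test/p", "meta": {"depth": 1}})
added_by_worker_2 = enqueue(store, {"method": "GET", "url": "http://example.test/p", "meta": {"depth": 2}})
next_request = dequeue(store)
print(added_by_worker_1, added_by_worker_2, next_request)
```

With a real shared store, the membership check and the insert would need to happen as one atomic operation so that two workers cannot both pass the check at the same moment; the in-memory version above sidesteps that because it runs in a single process.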
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
A simple web spider framework written in Python; requires Python 3.8+.