24 repositories
Libraries and frameworks for web scraping and crawling in Python.
Part of an awesome list · Python Crawling Frameworks.
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
High-level framework for screen scraping and web crawling.
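The priority-based scheduling mentioned in the Scrapy description can be illustrated with a small standard-library sketch. This is not Scrapy's scheduler: the `PriorityScheduler` class, its method names, and the example URLs are hypothetical, and the sketch assumes that higher priority values should be dequeued first.

```python
import heapq
import itertools


class PriorityScheduler:
    """Toy request scheduler (hypothetical): higher priority is dequeued first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so equal priorities stay FIFO.
        self._counter = itertools.count()

    def enqueue(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated on the way in.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def next_request(self):
        # Return the most urgent URL, or None when nothing is pending.
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


scheduler = PriorityScheduler()
scheduler.enqueue("https://example.com/archive", priority=0)
scheduler.enqueue("https://example.com/", priority=10)
scheduler.enqueue("https://example.com/about", priority=0)

order = []
while len(scheduler):
    order.append(scheduler.next_request())
print(order)
```

The home page is dequeued first because of its higher priority; the two remaining URLs come out in the order they were enqueued.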
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
Command-line tool for downloading web content.
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Powerful, full-featured spider system.
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Extraction of news, full-text, and article metadata.
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Visual scraping tool for Scrapy.
This project is a proxy aggregation platform designed to collect and verify free proxy server lists from web platforms, social media, and public repositories. It functions as a crawler framework that gathers proxy data and subscription links, a validation tool for testing server liveness, and a synchronization service for distributing the results. The system uses a plugin-based architecture that allows for the integration of custom Python scripts to handle diverse web source structures. It also includes utilities to transform raw proxy data into standardized configuration formats compatible w
Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.
QOwnNotes is a desktop note editor that stores each note as a plain-text Markdown file on the local filesystem, avoiding proprietary formats and enabling direct file access. It functions as a Nextcloud Notes client, syncing notes and metadata with Nextcloud or ownCloud servers through a companion API service for versioning and sharing. The application also integrates with AI providers and exposes a local MCP server for external agents to search and fetch notes, and includes a companion browser extension for capturing web content, bookmarks, and screenshots. The editor distinguishes itself thr
Extends functionality by running user-written scripts from an online repository.
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request-queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling tasks by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings in the shared remote store. The framework also provides capabilities for distributed data processing by pushing extracted items into a shared queue for parallel consumption by separate processing workers.
Redis-based components for distributed Scrapy projects.
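The shared request queue and duplicate filter described for this project can be sketched as follows. This is a minimal illustration, not the project's API: `SharedStore` is an in-memory stand-in for the shared Redis instance, and `fingerprint`, `push_request`, and `pop_request` are hypothetical helper names. Requests are stored as JSON strings, as the description states, and the fingerprint scheme (SHA-1 of method and URL) is an assumption.

```python
import hashlib
import json
from collections import deque


class SharedStore:
    """In-memory stand-in (assumption) for the Redis instance shared by workers."""

    def __init__(self):
        self.queue = deque()  # shared request queue
        self.seen = set()     # shared set of request fingerprints


def fingerprint(method, url):
    # Hypothetical fingerprint: a hash of the HTTP method and the URL.
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


def push_request(store, url, method="GET", meta=None):
    """Enqueue a request unless some worker has already scheduled it."""
    fp = fingerprint(method, url)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    # Request state and metadata are serialized as a JSON string.
    store.queue.append(json.dumps({"url": url, "method": method, "meta": meta or {}}))
    return True


def pop_request(store):
    """Take the next request from the shared queue, or None if it is empty."""
    if not store.queue:
        return None
    return json.loads(store.queue.popleft())


store = SharedStore()
results = [
    push_request(store, "https://example.com/a", meta={"depth": 1}),
    push_request(store, "https://example.com/b"),
    push_request(store, "https://example.com/a"),  # duplicate from another worker
]
first = pop_request(store)
print(results, first["url"], first["meta"])
```

With a real Redis backend, the set and the queue would live on the server so that every worker sees the same state; the in-memory version only shows the control flow.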
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager for interacting with websites programmatically. The library enables headless web interaction by mimicking a real user session. It maintains persistent state through cookie handling and automatic redirect following, enabling programmatic navigation of websites and simulation of complex browser interactions. Its capabilities cover automatically populating and submitting forms using CSS selectors, as well as extracting data from HTML responses. The toolkit includes utilities for downloading linked files, specifying custom user agents, and searching pages for specific keywords. It also offers diagnostic tools that render the current page state in a browser for visual verification.
Automates website interactions for scraping.
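The form population step described above can be illustrated without any network access. The sketch below uses only the Python standard library rather than MechanicalSoup itself, so `FormParser`, `build_submission`, the page URL, and the HTML are all hypothetical. It reads a form's default fields, overrides some values, and builds the request that a browser-like client would send.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects the action, method, and named <input> defaults of the first form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep the default value so hidden fields are submitted unchanged.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(page_url, html, values):
    """Return (method, target URL, encoded body) for the first form on the page."""
    parser = FormParser()
    parser.feed(html)
    data = dict(parser.fields)
    data.update(values)  # user-supplied values override the defaults
    return parser.method, urljoin(page_url, parser.action), urlencode(data)


PAGE_URL = "https://example.com/login"  # hypothetical page
PAGE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

method, target, body = build_submission(
    PAGE_URL, PAGE_HTML, {"username": "alice", "password": "s3cret"}
)
print(method, target, body)
```

A library such as MechanicalSoup additionally keeps cookies and follows redirects across requests, which this sketch leaves out.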
RoboBrowser: Your friendly neighborhood web scraper
Library for browsing the web without a standalone browser.
Scrapely
Pure-python library for HTML screen-scraping.
A simple web spider framework written in Python; requires Python 3.8+.
Simple spider framework for Python 3.
Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster.
Asyncio-based micro-framework for web scraping.
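As a rough illustration of what asyncio-based scraping looks like, the sketch below runs several simulated fetches concurrently with a concurrency limit. It does not use Ruia's API: `fake_fetch` and `fetch_all` are hypothetical names, and the network call is replaced by a short `asyncio.sleep` so the example runs offline.

```python
import asyncio


async def fake_fetch(url):
    # Stand-in for a non-blocking HTTP request (assumption: no real network I/O).
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls, concurrency=2):
    # The semaphore caps how many fetches are in flight at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with semaphore:
            return url, await fake_fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/{i}" for i in range(1, 5)]
pages = asyncio.run(fetch_all(urls))
print([url for url, _ in pages])
```

While one simulated request is waiting, the event loop runs the others, which is where asyncio-based crawlers get their throughput.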
A high-level distributed crawling framework.
Distributed framework for web crawling.
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Distributed scraping cluster using Redis and Kafka.
Minimalist and powerful Web Crawler.
Minimalist and powerful web crawler.
Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler. Given a list of web links, it uses the Python requests library to query the webpages. Spidy then uses lxml to extract all links from the page and adds them to its list. Pretty simple!
Simple command-line web crawler.
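The fetch, extract, and enqueue loop described for Spidy can be sketched as below. This is not Spidy's code: to stay runnable offline it swaps in the standard library's `html.parser` for lxml and a dictionary lookup for `requests`, and the names `LinkParser`, `extract_links`, `crawl`, and `FAKE_SITE` are hypothetical. The `seen` set that skips already-queued links is also an assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_urls, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML, or None on failure."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site standing in for real HTTP responses.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b.html": '<a href="/a">A</a>',
}

visited = crawl(["https://example.com/"], FAKE_SITE.get)
print(visited)
```

Passing the fetch function in as a parameter keeps the crawl loop independent of the HTTP client, so a real client can replace the dictionary lookup without changing the loop.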
The information security department of 360 has a long-running recruitment drive; anyone interested can contact zhangxin1at360.cn.
Simple spider using gevent and JavaScript rendering.
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Versatile crawler built with modern concurrency tools.
High-speed web crawler built on Eventlet. Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite. Includes command-line tools. Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python). Provides cookie handlers. Very easy to use (see the example).
Pythonic framework based on non-blocking I/O.