What are the best Python crawling framework repositories on GitHub?

Libraries and frameworks for web scraping and crawling in Python. Explore 24 GitHub repositories from the Python Crawling Frameworks section of an awesome list. Refine with filters or upvote what's useful. Top picks: scrapy/scrapy, soimort/you-get, binux/pyspider, codelucas/newspaper, scrapinghub/portia, wzdnzd/aggregator, pbek/qownnotes, rolando/scrapy-redis, hickford/mechanicalsoup, jmcarp/robobrowser.

Why is scrapy/scrapy a recommended Python crawling framework repository?

High-level framework for screen scraping and web crawling.

Why is soimort/you-get a recommended Python crawling framework repository?

Command-line tool for downloading web content.

Why is binux/pyspider a recommended Python crawling framework repository?

Powerful, full-featured spider system.

Why is codelucas/newspaper a recommended Python crawling framework repository?

Extraction of news, full-text, and article metadata.

Why is scrapinghub/portia a recommended Python crawling framework repository?

Visual scraping tool for Scrapy.

Why is wzdnzd/aggregator a recommended Python crawling framework repository?

Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.

Why is pbek/qownnotes a recommended Python crawling framework repository?

Extends functionality by running user-written scripts from an online repository.

Why is rolando/scrapy-redis a recommended Python crawling framework repository?

Redis-based components for distributed Scrapy projects.

Why is hickford/mechanicalsoup a recommended Python crawling framework repository?

Automates website interactions for scraping.

Why is jmcarp/robobrowser a recommended Python crawling framework repository?

Library for browsing the web without a standalone browser.

24 repositories

scrapy/scrapy
62,274 stars
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
High-level framework for screen scraping and web crawling.
Python, crawler, crawling, framework
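The pairing of a priority-based scheduler with non-blocking requests described above can be illustrated with a small standard-library sketch. It is not Scrapy's code or API: `crawl`, `fake_fetch`, and the `(priority, url)` tuples are hypothetical names chosen for this example, and network I/O is simulated with `asyncio.sleep`.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a non-blocking HTTP request (assumption: no real network I/O)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(requests, concurrency=2, fetch=fake_fetch):
    """Process (priority, url) pairs; in this sketch lower numbers are dequeued first."""
    queue = asyncio.PriorityQueue()
    for item in requests:
        queue.put_nowait(item)

    results = []  # (url, body) pairs in completion order

    async def worker():
        while True:
            try:
                _priority, url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to schedule
            body = await fetch(url)  # yields control while "waiting on the network"
            results.append((url, body))

    # Several workers share one queue, so slow responses do not block the others.
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


def run_demo():
    requests = [
        (2, "https://example.com/c"),
        (0, "https://example.com/a"),
        (1, "https://example.com/b"),
    ]
    # A single worker makes the dequeue order easy to observe.
    return asyncio.run(crawl(requests, concurrency=1))
```

Calling `run_demo()` processes the three placeholder URLs in priority order. Raising `concurrency` lets several simulated requests overlap while the queue still decides which URL is started next.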

soimort/you-get
56,839 stars
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
Command-line tool for downloading web content.
Python

binux/pyspider
16,809 stars
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Powerful, full-featured spider system.
Python

codelucas/newspaper
14,982 stars
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Extraction of news, full-text, and article metadata.
HTML, crawler, crawling, news

scrapinghub/portia
9,509 stars
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Visual scraping tool for Scrapy.
Python

wzdnzd/aggregator
6,689 stars
This project is a proxy aggregation platform designed to collect and verify free proxy server lists from web platforms, social media, and public repositories. It functions as a crawler framework that gathers proxy data and subscription links, a validation tool for testing server liveness, and a synchronization service for distributing the results. The system uses a plugin-based architecture that allows for the integration of custom Python scripts to handle diverse web source structures. It also includes utilities to transform raw proxy data into standardized configuration formats compatible w
Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.
Python, proxypool

pbek/QOwnNotes
5,792 stars
QOwnNotes is a desktop note editor that stores each note as a plain-text Markdown file on the local filesystem, avoiding proprietary formats and enabling direct file access. It functions as a Nextcloud Notes client, syncing notes and metadata with Nextcloud or ownCloud servers through a companion API service for versioning and sharing. The application also integrates with AI providers and exposes a local MCP server for external agents to search and fetch notes, and includes a companion browser extension for capturing web content, bookmarks, and screenshots. The editor distinguishes itself thr
Extends functionality by running user-written scripts from an online repository.
C++

rolando/scrapy-redis
5,639 stars
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling tasks by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings in the shared remote store. The framework also supports distributed data processing by pushing extracted items to a shared queue for parallel consumption by separate processing workers.
Redis-based components for distributed Scrapy projects.
Python
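The shared request queue and duplicate filter described above can be sketched without a Redis server. The following is a minimal illustration, not scrapy-redis's implementation or API: `FakeSharedStore`, `fingerprint`, `enqueue`, and `next_request` are hypothetical names, and an in-memory object stands in for the Redis list and set that several workers would share.

```python
import hashlib
import json


class FakeSharedStore:
    """In-memory stand-in for a shared Redis instance (an assumption of this sketch)."""

    def __init__(self):
        self.queue = []    # plays the role of a Redis list holding JSON strings
        self.seen = set()  # plays the role of a Redis set of request fingerprints


def fingerprint(request):
    """Hash the method and URL so identical requests map to the same key."""
    key = f"{request['method']} {request['url']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def enqueue(store, request):
    """Schedule a request unless some worker has already scheduled it."""
    fp = fingerprint(request)
    if fp in store.seen:
        return False  # duplicate: another worker got there first
    store.seen.add(fp)
    # Request state and metadata are persisted as a JSON string.
    store.queue.append(json.dumps(request, sort_keys=True))
    return True


def next_request(store):
    """Pop the oldest request (FIFO) and decode it; return None when the queue is empty."""
    if not store.queue:
        return None
    return json.loads(store.queue.pop(0))
```

Any number of workers could call `enqueue` and `next_request` against the same store. With a real Redis backend the check-and-add step would need to be atomic, which this single-process sketch does not attempt to show.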

hickford/MechanicalSoup
4,868 stars
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager for interacting with websites programmatically. The library enables headless web interaction by imitating a real user session. It maintains persistent state through cookie handling and automatic redirect following, allowing programmatic navigation of websites and simulation of complex browser interactions. Its capabilities cover automatically filling in and submitting forms using CSS selectors, as well as extracting data from HTML responses. The toolkit includes utilities for downloading linked files, specifying custom user agents, and searching pages for specific keywords. It also offers diagnostic tools that render the current page state in a browser for visual verification.
Automates website interactions for scraping.
Python

jmcarp/robobrowser
3,696 stars
RoboBrowser: Your friendly neighborhood web scraper
Library for browsing the web without a standalone browser.
Python

scrapy/scrapely
1,887 stars
Scrapely
Pure-python library for HTML screen-scraping.
HTML

xianhu/PSpider
1,840 stars
A simple web spider framework written in Python; requires Python 3.8+.
Simple spider framework for Python 3.
Python

howie6879/aspider
1,742 stars
Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster.
Asyncio-based micro-framework for web scraping.
Python

chineking/cola
1,501 stars
A high-level distributed crawling framework.
Distributed framework for web crawling.
Python

istresearch/scrapy-cluster
1,224 stars
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
Distributed scraping cluster using Redis and Kafka.
Python

iogf/sukhoi
873 stars
Minimalist and powerful web crawler.
Python

rivermont/spidy
354 stars
Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler. Given a list of web links, it uses the Python requests library to query the webpages. Spidy then uses lxml to extract all links from the page and adds them to its list. Pretty simple!
Simple command-line web crawler.
Python
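The loop described above (query a page, extract its links, add them to the list) can be sketched as follows. This is an illustration under stated assumptions, not Spidy's code: it swaps in the standard library's `html.parser` for lxml and takes a caller-supplied `fetch` function in place of requests, so it runs offline. `LinkExtractor`, `crawl`, and `max_pages` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


def crawl(seeds, fetch, max_pages=10):
    """Visit pages breadth-first, starting from the seed links."""
    todo = deque(seeds)  # links waiting to be queried
    seen = set(seeds)    # every link ever queued, to avoid revisits
    done = []            # links visited, in order
    while todo and len(done) < max_pages:
        url = todo.popleft()
        html = fetch(url)  # assumption: returns the page body as a string
        done.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return done
```

Passing a dictionary lookup as `fetch` is enough to try the loop on canned pages. A real crawler would also need error handling, politeness delays, and robots.txt checks, which are left out here.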

manning23/MSpider
345 stars
The information security department of 360 company has been recruiting for a long time and is interested in contacting the mailbox zhangxin1at360.cn.
Simple spider using gevent and JavaScript rendering.
Python

cocrawler/cocrawler
194 stars
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Versatile crawler built with modern concurrency tools.
Python

jmg/crawley
191 stars
High-speed web crawler built on Eventlet. Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite. Includes command-line tools. Extract data with your favourite tool: XPath or PyQuery (a jQuery-like library for Python). Cookie handlers. Very easy to use (see the example).
Pythonic framework based on non-blocking I/O.
Python