What are the best GitHub repositories for Python crawling frameworks?

Libraries and frameworks for web scraping and crawling in Python. This page lists 24 GitHub repositories drawn from the Python Crawling Frameworks section of an awesome list. Top picks: scrapy/scrapy, soimort/you-get, binux/pyspider, codelucas/newspaper, scrapinghub/portia, wzdnzd/aggregator, pbek/qownnotes, rolando/scrapy-redis, hickford/mechanicalsoup, jmcarp/robobrowser.

Why is scrapy/scrapy a recommended Python Crawling Frameworks repository?

High-level framework for screen scraping and web crawling.

Why is soimort/you-get a recommended Python Crawling Frameworks repository?

Command-line tool for downloading web content.

Why is binux/pyspider a recommended Python Crawling Frameworks repository?

Powerful, full-featured spider system.

Why is codelucas/newspaper a recommended Python Crawling Frameworks repository?

Extraction of news, full-text, and article metadata.

Why is scrapinghub/portia a recommended Python Crawling Frameworks repository?

Visual scraping tool for Scrapy.

Why is wzdnzd/aggregator a recommended Python Crawling Frameworks repository?

Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.

Why is pbek/qownnotes a recommended Python Crawling Frameworks repository?

Extends functionality by running user-written scripts from an online repository.

Why is rolando/scrapy-redis a recommended Python Crawling Frameworks repository?

Redis-based components for distributed Scrapy projects.

Why is hickford/mechanicalsoup a recommended Python Crawling Frameworks repository?

Automates website interactions for scraping.

Why is jmcarp/robobrowser a recommended Python Crawling Frameworks repository?

Library for browsing the web without a standalone browser.

24 repositories

scrapy/scrapy
62,274 stars
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
High-level framework for screen scraping and web crawling.
Python · crawler · crawling · framework
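
The priority-based scheduling mentioned in the description above can be illustrated with a small standard-library sketch. This is not Scrapy's scheduler or its API; `PriorityScheduler`, `enqueue`, `next_request`, and `drain` are hypothetical names chosen for illustration, and the example only shows the general idea of serving higher-priority requests first while skipping URLs that were already queued.

```python
import heapq
import itertools


class PriorityScheduler:
    """Toy request scheduler (hypothetical, not Scrapy's implementation).

    Higher priority is served first; equal priorities keep insertion order.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO order
        self._seen = set()                 # URLs that were already enqueued

    def enqueue(self, url, priority=0):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_request(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def drain(scheduler):
    """Pop every queued URL in scheduling order."""
    order = []
    while (url := scheduler.next_request()) is not None:
        order.append(url)
    return order
```

In a real crawler the popped URL would be handed to a downloader; here `drain` just records the order so the scheduling behaviour is easy to inspect.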

soimort/you-get
56,839 stars
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
Command-line tool for downloading web content.
Python

binux/pyspider
16,809 stars
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Powerful, full-featured spider system.
Python

codelucas/newspaper
14,982 stars
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Extraction of news, full-text, and article metadata.
HTML · crawler · crawling · news

scrapinghub/portia
9,509 stars
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Visual scraping tool for Scrapy.
Python

wzdnzd/aggregator
6,689 stars
This project is a proxy aggregation platform designed to collect and verify free proxy server lists from web platforms, social media, and public repositories. It functions as a crawler framework that gathers proxy data and subscription links, a validation tool for testing server liveness, and a synchronization service for distributing the results. The system uses a plugin-based architecture that allows for the integration of custom Python scripts to handle diverse web source structures. It also includes utilities to transform raw proxy data into standardized configuration formats compatible w
Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.
Python · proxypool

pbek/QOwnNotes
5,792 stars
QOwnNotes is a desktop note editor that stores each note as a plain-text Markdown file on the local filesystem, avoiding proprietary formats and enabling direct file access. It functions as a Nextcloud Notes client, syncing notes and metadata with Nextcloud or ownCloud servers through a companion API service for versioning and sharing. The application also integrates with AI providers and exposes a local MCP server for external agents to search and fetch notes, and includes a companion browser extension for capturing web content, bookmarks, and screenshots. The editor distinguishes itself thr
Extends functionality by running user-written scripts from an online repository.
C++

rolando/scrapy-redis
5,639 stars
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings in the shared remote store. The framework also supports distributed data processing by pushing scraped items to a shared queue for parallel consumption by separate processing workers.
Redis-based components for distributed Scrapy projects.
Python
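
The shared queue, duplicate filter, and JSON-encoded request state described above can be sketched without a Redis server. Everything here is an assumption made for illustration: `InMemoryStore` is a hypothetical stand-in for the shared Redis instance, and `push_request` and `pop_request` are not part of scrapy-redis. A real deployment would replace the Python list and set with Redis data structures so that several worker processes see the same state.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for a shared Redis server (one list, one set)."""

    def __init__(self):
        self.queue = []    # plays the role of a shared request list
        self.seen = set()  # plays the role of a shared fingerprint set

    def add_if_new(self, fingerprint):
        """Record a fingerprint; return False if it was already present."""
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        return True


def push_request(store, url, meta=None):
    """Serialize a request to JSON and enqueue it unless it is a duplicate."""
    # The URL itself is used as the fingerprint to keep the sketch short.
    if not store.add_if_new(url):
        return False
    store.queue.append(json.dumps({"url": url, "meta": meta or {}}))
    return True


def pop_request(store):
    """Take the oldest queued request and decode it, or return None."""
    if not store.queue:
        return None
    return json.loads(store.queue.pop(0))
```

Any number of workers could call `pop_request` on the same store; because the duplicate check happens at push time, a page discovered by two workers is only queued once.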

hickford/MechanicalSoup
4,868 stars
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It works as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager for interacting with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing programmatic navigation of websites and simulation of complex browser interactions. Its capabilities cover automated form filling and submission using CSS selectors, as well as data extraction from HTML responses. The toolkit includes utilities for downloading linked files, setting custom user agents, and finding pages based on specific keywords. It also provides diagnostic tools for rendering the current page state in a browser for visual verification.
Automates website interactions for scraping.
Python
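
As a rough illustration of what automated form filling involves, the sketch below parses a form, overrides some field values, and builds the encoded submission body using only the standard library. It does not use MechanicalSoup and does not reflect its API; `FormFields` and `build_submission` are hypothetical helpers, and the sketch only handles named `<input>` elements in the first form on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFields(HTMLParser):
    """Collect action, method, and named <input> defaults of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False  # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute default to an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(html, values):
    """Return (method, action, encoded_body) for the first form in html."""
    parser = FormFields()
    parser.feed(html)
    data = {**parser.fields, **values}  # user-supplied values win over defaults
    return parser.method, parser.action, urlencode(data)
```

A browser-simulation library would additionally send this body through an HTTP session that keeps cookies and follows redirects; that part is omitted here.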

jmcarp/robobrowser
3,696 stars
RoboBrowser: Your friendly neighborhood web scraper
Library for browsing the web without a standalone browser.
Python

scrapy/scrapely
1,887 stars
Scrapely
Pure-python library for HTML screen-scraping.
HTML

xianhu/PSpider
1,840 stars
A simple web spider framework written in Python; requires Python 3.8+.
Simple spider framework for Python 3.
Python

howie6879/aspider
1,742 stars
Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster.
Asyncio-based micro-framework for web scraping.
Python

chineking/cola
1,501 stars
A high-level distributed crawling framework.
Distributed framework for web crawling.
Python

istresearch/scrapy-cluster
1,224 stars
This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.
Distributed scraping cluster using Redis and Kafka.
Python

iogf/sukhoi
873 stars
Minimalist and powerful web crawler.
Python

rivermont/spidy
354 stars
Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler. Given a list of web links, it uses the Python requests library to query the webpages. Spidy then uses lxml to extract all links from the page and adds them to its list. Pretty simple!
Simple command-line web crawler.
Python
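
The crawl loop described above (take a link, fetch the page, extract its links, add them to the list) can be sketched as follows. This is not Spidy's code: it swaps in the standard library's `html.parser` for lxml, and it takes a caller-supplied `fetch` function instead of calling the requests library, so it runs without network access. `LinkExtractor`, `crawl`, `fetch`, and `max_pages` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None on failure."""
    queue = deque(start_urls)
    seen = set(start_urls)  # URLs queued so far, to avoid revisiting
    visited = []            # URLs that were fetched successfully, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

For a quick offline check, `fetch` can be the `get` method of a dictionary that maps URLs to HTML strings; in a real crawler it would wrap an HTTP client and return `None` on errors.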

manning23/MSpider
345 stars
Simple spider using gevent and JavaScript rendering.
Python

cocrawler/cocrawler
194 stars
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Versatile crawler built with modern concurrency tools.
Python

jmg/crawley
191 stars
High-speed web crawler built on Eventlet. Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite. Includes command-line tools. Extract data with your preferred tool: XPath or PyQuery (a jQuery-like library for Python). Provides cookie handlers. Very easy to use (see the example).
Pythonic framework based on non-blocking I/O.
Python