24 repositories
Libraries and frameworks for web scraping and crawling in Python.
24 GitHub repositories from an awesome list · Python Crawling Frameworks.
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
High-level framework for screen scraping and web crawling.
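The priority-based scheduling mentioned in the Scrapy description can be illustrated with a small standard-library sketch. This is not Scrapy's scheduler: the `PriorityScheduler` class and its `enqueue` and `next_request` methods are hypothetical names chosen for this example, and the rule that a larger priority value is served first is an assumption made for the sketch.

```python
import heapq
import itertools


class PriorityScheduler:
    """Toy request scheduler (hypothetical): larger priority values leave first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay first-in, first-out.
        self._counter = itertools.count()

    def enqueue(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated on the way in.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def next_request(self):
        # Return the most urgent URL, or None when nothing is queued.
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


scheduler = PriorityScheduler()
scheduler.enqueue("https://example.com/page", priority=0)
scheduler.enqueue("https://example.com/sitemap.xml", priority=10)
scheduler.enqueue("https://example.com/other", priority=0)

# Drain the queue: the high-priority URL comes out before the two default ones.
order = [scheduler.next_request() for _ in range(len(scheduler))]
print(order)
```

A real crawler would pair a queue like this with a downloader that limits concurrent requests; the sketch covers only the ordering step.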
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
Command-line tool for downloading web content.
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Powerful, full-featured spider system.
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Extraction of news, full-text, and article metadata.
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Visual scraping tool for Scrapy.
This project is a proxy aggregation platform designed to collect and verify free proxy server lists from web platforms, social media, and public repositories. It functions as a crawler framework that gathers proxy data and subscription links, a validation tool for testing server liveness, and a synchronization service for distributing the results. The system uses a plugin-based architecture that allows for the integration of custom Python scripts to handle diverse web source structures. It also includes utilities to transform raw proxy data into standardized configuration formats compatible w
Integrates Python scripts as plugins to implement specialized crawling logic for unique web sources.
QOwnNotes is a desktop note editor that stores each note as a plain-text Markdown file on the local filesystem, avoiding proprietary formats and enabling direct file access. It functions as a Nextcloud Notes client, syncing notes and metadata with Nextcloud or ownCloud servers through a companion API service for versioning and sharing. The application also integrates with AI providers and exposes a local MCP server for external agents to search and fetch notes, and includes a companion browser extension for capturing web content, bookmarks, and screenshots. The editor distinguishes itself thr
Extends functionality by running user-written scripts from an online repository.
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings in the shared remote store. The framework also supports distributed data processing by pushing scraped items to a shared queue for parallel consumption by separate processing workers.
Redis-based components for distributed Scrapy projects.
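The shared queue and duplicate filter described above can be sketched without a Redis server. In the example below, `FakeSharedStore` is an in-memory stand-in for the remote store, and `fingerprint`, `push_request`, and `pop_request` are hypothetical helper names, not part of the project's API. The choice to hash only the method and URL is also an assumption made for illustration.

```python
import hashlib
import json
from collections import deque


class FakeSharedStore:
    """In-memory stand-in (assumption) for a store shared by all workers."""

    def __init__(self):
        self.queue = deque()  # shared request queue holding JSON strings
        self.seen = set()     # shared set of request fingerprints


def fingerprint(request):
    """Hash the method and URL so identical requests map to the same key."""
    key = f"{request['method']} {request['url']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def push_request(store, request):
    """Enqueue a request as a JSON string unless its fingerprint was seen."""
    fp = fingerprint(request)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    store.queue.append(json.dumps(request))
    return True


def pop_request(store):
    """Dequeue and decode the next request, or return None if the queue is empty."""
    if not store.queue:
        return None
    return json.loads(store.queue.popleft())


store = FakeSharedStore()
first = push_request(
    store, {"method": "GET", "url": "https://example.com/a", "meta": {"depth": 1}}
)
duplicate = push_request(
    store, {"method": "GET", "url": "https://example.com/a", "meta": {"depth": 2}}
)
second = push_request(
    store, {"method": "GET", "url": "https://example.com/b", "meta": {"depth": 1}}
)
print(first, duplicate, second, len(store.queue))
```

With a real shared store, the membership check and the insert would need to happen as one atomic operation so that two workers cannot both accept the same request; the single-process sketch sidesteps that concern.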
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It works as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager for interacting with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, enabling programmatic navigation of websites and the simulation of complex browser interactions. Its capabilities cover automated form filling and submission using CSS selectors, as well as data extraction from HTML responses. The toolkit includes utilities for downloading linked files, setting custom user agents, and finding pages based on specific keywords. It also provides diagnostic tools that render the current page state in a browser for visual verification.
Automates website interactions for scraping.
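To give a feel for what automated form filling involves, here is a minimal standard-library sketch. It deliberately does not use MechanicalSoup's own API: the `FormFieldParser` class, the sample `PAGE` markup, and the field names are all hypothetical. It reads the default values of a form's inputs, overrides some of them, and encodes the result as a submission body.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the action URL and named <input> defaults of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and "name" in attrs:
            # Inputs without a value attribute default to an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


# Hypothetical login page used only for this example.
PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

parser = FormFieldParser()
parser.feed(PAGE)

# Start from the page's defaults (keeps the hidden token), then fill in the rest.
payload = dict(parser.fields)
payload.update({"username": "alice", "password": "s3cret"})
body = urlencode(payload)
print(parser.action, body)
```

A browser-simulating library adds the parts this sketch leaves out: sending the body over a session that keeps cookies, following redirects, and parsing the response.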
RoboBrowser: Your friendly neighborhood web scraper
Library for browsing the web without a standalone browser.
Scrapely
Pure-python library for HTML screen-scraping.
A simple web spider framework written in Python; requires Python 3.8+.
Simple spider framework for Python 3.
Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster.
Asyncio-based micro-framework for web scraping.
A high-level distributed crawling framework.
Distributed framework for web crawling.
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Distributed scraping cluster using Redis and Kafka.
Minimalist and powerful Web Crawler.
Minimalist and powerful web crawler.
Spidy (/spˈɪdi/) is the simple, easy to use command line web crawler. Given a list of web links, it uses the Python requests library to query the webpages. Spidy then uses lxml to extract all links from the page and adds them to its list. Pretty simple!
Simple command-line web crawler.
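The loop Spidy describes (fetch a page, extract its links, append them to the list) can be sketched as follows. This is not Spidy's code: it swaps in the standard library's `html.parser` for lxml and takes the page fetcher as a parameter instead of calling requests, so the example runs offline. `LinkParser`, `crawl`, and the `PAGES` dictionary are hypothetical names for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None on failure."""
    todo = deque(seeds)
    seen = set(seeds)
    visited = []  # every URL that was attempted, in order
    while todo and len(visited) < max_pages:
        url = todo.popleft()
        html = fetch(url)
        visited.append(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return visited


# Tiny in-memory "web" standing in for real HTTP responses.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/c">C</a>',
}

visited = crawl(["https://example.com/"], PAGES.get)
print(visited)
```

Passing the fetcher in as an argument keeps the traversal logic testable; a real run would supply a function that performs the HTTP request and returns the response text.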
The information security department of 360 has long-term openings; anyone interested can contact zhangxin1at360.cn.
Simple spider using gevent and JavaScript rendering.
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Versatile crawler built with modern concurrency tools.
High-speed web crawler built on Eventlet. Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite. Includes command-line tools and cookie handlers. Extract data with your favourite tool: XPath or PyQuery (a jQuery-like library for Python). Very easy to use (see the example).
Pythonic framework based on non-blocking I/O.