30 open-source projects similar to scrapy/scrapy, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Scrapy alternative.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveraging headless browser orchestration, the system handles dynamic, JavaScript-heavy pages to ensure comprehensive data capture. The platform distinguishes itself through its focus on agentic workflows, providing a programmatic interface that allows autonomous agents to perform live
Huginn is an open-source automation platform that functions as an event-driven task automator and webhook integration engine. It enables the creation of agents that monitor web data and automate tasks across various web services, operating as a self-hosted web scraper and JavaScript workflow orchestrator. The system uses a directed graph of event flows to route and transform data between external APIs. It differentiates itself by allowing custom JavaScript execution within workflows to modify data payloads and by integrating human-in-the-loop automation to insert manual judgment or data entry
Crawlab is a distributed web scraping platform designed to centralize the management, deployment, and execution of large-scale data extraction tasks. It functions as a control plane that orchestrates scraping scripts and automated workflows across multiple nodes, providing a unified environment for managing complex data collection operations. The platform distinguishes itself through a distributed architecture that coordinates worker nodes via a central master, utilizing real-time communication to maintain oversight of all active processes. It ensures operational consistency by isolating task
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions. The project distinguishes itself through its ability to translate high-level intent into
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
This project is a comprehensive Python network request framework designed for both synchronous and asynchronous HTTP communication. It provides a high-performance client capable of executing non-blocking requests within event-driven applications, while also supporting standard blocking calls for simpler scripts. The library is built to operate natively across diverse asynchronous runtimes, automatically detecting and utilizing the underlying event loop for concurrency. What distinguishes this library is its modular architecture, which decouples request construction from network execution thro
Hyper is a low-level networking library designed for building high-performance HTTP clients and servers. It provides a foundational toolkit for creating network services that leverage asynchronous execution and memory-safe data handling, supporting both HTTP/1 and HTTP/2 protocols. The library distinguishes itself through a protocol-agnostic architecture that separates transport logic from HTTP semantics. It utilizes a service-trait abstraction to decouple network logic from the underlying transport, enabling developers to inject custom middleware for request interception and response transfo
json-server is a development toolset used to simulate a full REST API from a JSON file. It functions as a customizable mock API server that allows for the simulation of CRUD operations and resource relationships without the need to write backend code. The project enables rapid prototyping by generating a fake backend that persists data changes back to a local JSON file. It distinguishes itself by providing a static asset file server to deliver local documents, images, and stylesheets alongside the mock API endpoints. The server includes capabilities for data querying, such as parameter-based
The AWS SDK for JavaScript is a programmatic interface and API client used to manage, automate, and orchestrate AWS cloud infrastructure and services. It provides a set of tools for controlling compute, storage, and networking resources from Node.js and web browser environments. The project includes a modular asset bundler that allows for the creation of specialized service bundles. This mechanism enables the selection of specific service modules at build time to reduce the final JavaScript payload size for frontend cloud integrations. The SDK covers a broad range of cloud management capabil
RoboBrowser: Your friendly neighborhood web scraper
SymPy is a Python computer algebra system and symbolic mathematics library. It performs algebraic manipulations, calculus, and equation solving using symbolic representations to achieve exact computations rather than numerical approximations. The library includes a LaTeX expression parser that converts mathematical strings into symbolic representations for computation and formula manipulation. It also incorporates a mathematical benchmarking suite to measure execution speed and detect performance regressions across different software versions. The system provides capabilities for automated m
Scrapegraph-ai is a Python framework that uses large language models to automate the extraction of structured data from websites and documents. It functions as an AI-driven data extraction pipeline that converts unstructured web content into structured formats using natural language processing and graph-based logic. The project utilizes graph-based task orchestration to model scraping workflows as interconnected nodes. It features a pluggable model interface for connecting to cloud or local artificial intelligence providers and can generate executable Python code on the fly to handle site-spe
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media for offline use. The tool distinguishes itself through its ability to handle authenticated content, allowing users to inject browser-stored session cookies to access restricted or private media. It also supports real-time media streaming by piping remote content directly into ext
The official repository for Babel, the Python Internationalization Library
Briefcase is a cross-platform build tool and native app packager that converts Python projects into standalone applications. It functions as a deployment framework designed to package Python code into binaries for desktop, mobile, and web platforms. The project utilizes a plugin-based architecture, allowing users to extend platform support and integrate new operating systems or specialized installation formats into the build process. The system handles cross-platform development through a configuration-driven build process. This covers capabilities including project generation from templates
Portia is a containerized scraping platform and visual web scraper that enables no-code data extraction. It serves as a Scrapy visual scraping tool and spider generator, allowing users to design and deploy web scrapers through a graphical interface instead of writing manual selector code. The system distinguishes itself by converting visual web page annotations into executable Scrapy spider code and structured JSON specifications. This visual-to-code mapping allows users to define scraping logic and extraction rules through a point-and-click interface, which can then be exported for use in ex
Requests is a high-level HTTP client library designed to simplify web communication and API integration. It provides an intuitive, human-readable interface for performing standard network operations, including request execution, connection pooling, and stateful session management. By encapsulating raw network data into structured objects, the library automates the complexities of headers, cookies, and payload transmission. The library distinguishes itself through a modular transport adapter layer that allows for custom protocol handling and extensible authentication hooks. It supports a wide
mypy is a static type checker for Python that analyzes source code to detect type errors and inconsistencies without executing the program. It functions as a static analysis tool and type inference engine, providing a gradual typing system that allows type hints to be added to a codebase incrementally while maintaining compatibility with dynamic typing. The project distinguishes itself through a combination of performance and precision features. It utilizes a daemon-based incremental checking system and multi-process parallel analysis to manage large codebases, supported by binary cache persi
Realtime Web Apps and Dashboards for Python and R
Cookiecutter is a command-line project templating engine and scaffolding tool used to automate the creation of software project structures. It functions as a project automator that generates initialized repositories and directories from predefined templates using variable substitution. The system utilizes the Jinja2 templating language to render files and folders, transforming generic blueprints into customized codebases based on user input. It supports template distribution via Git repositories or zip archives, allowing for the standardization of development environments across teams. The t
FastAPI is a web framework for building APIs with Python. It leverages standard language type hints to provide automatic data validation, request parsing, and interactive API documentation generation. The framework supports asynchronous request handling and manages execution contexts to prevent blocking the main event loop. The project includes a dependency injection system that allows for the resolution and injection of reusable components into request handlers. This system supports request-scoped caching, lifecycle management, and integration with security mechanisms like OAuth2 and JSON We
This project is a browser-based interactive computing environment and data science IDE. It serves as a literate programming tool that allows users to create documents combining live code, mathematical equations, visualizations, and narrative text. As a polyglot notebook interface, it connects to various language kernels to execute code and render output within a single interface. The application distinguishes itself by separating the frontend interface from a remote compute engine through a language-agnostic kernel interface. This allows it to support multiple programming languages while main
A Python library for automating interaction with websites.
pytest is a testing framework for Python that provides a command-line runner for discovering and executing test suites. It is built on a modular architecture that uses standard language assertions to verify code correctness, automatically inspecting expressions to provide detailed failure reports without requiring specialized assertion methods. The framework distinguishes itself through a dependency injection system that manages setup and teardown logic by automatically resolving and injecting resources into test functions. It also features a hook-based plugin architecture that allows for dee