30 open-source projects similar to bjesus/pipet, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Pipet alternative.
requests-html is a Python HTML parsing library and web scraping framework. It functions as an asynchronous HTTP client and a JavaScript rendering engine designed to fetch and parse web pages for structured data extraction. The project integrates a headless browser to execute JavaScript, allowing it to retrieve dynamically generated content that standard HTML parsers cannot see. It provides tools for automated data extraction using CSS selectors and XPath expressions to isolate specific text or attributes from HTML structures. The framework covers network operations including asynchronous pag
Webmagic is a Java web crawling framework designed for building scalable automated crawlers to download and process large volumes of web pages. It functions as a distributed web crawler and dynamic content crawler, utilizing an XPath HTML parser to locate and extract specific data points from page structures. The framework distinguishes itself through its ability to handle dynamic content by rendering JavaScript and executing asynchronous requests to extract data from non-static pages. It also allows users to define and execute crawler logic via scripting languages, enabling the update of col
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and solve challenges such as those posed by Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA
This project is a distributed scraping engine designed to extract business details, customer reviews, and lead information from Google Maps. It functions as a business scraper and data extractor that can be deployed as a permanent system or as on-demand serverless functions. The system utilizes a proxy-routed web crawler to manage request origins via SOCKS5, HTTP, and HTTPS proxies. To locate contact information, it includes an email extraction tool that recursively crawls business websites linked within map listings. The software supports coordinate-based radius searches for efficient data
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It functions as an HTML data extractor that transforms raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles website content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate cont
Skill Seekers is a toolset for generating large language model knowledge bases, featuring a multi-source content scraper and a dedicated RAG data pipeline. It extracts technical data from documentation, code, and video to create structured assets and configuration files for AI-powered IDE extensions. The project distinguishes itself through the ability to transform raw data into polished tutorials and specialized skills for AI plugin marketplaces. It utilizes abstract syntax tree parsing and optical character recognition to analyze GitHub repositories, PDFs, and video frames, converting these
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks
This project serves as a comprehensive educational repository and technical reference collection, documenting a wide range of software engineering practices and modern development technologies. It provides a structured learning path for developers, curating tutorials and practical examples that cover the full lifecycle of application development, from initial project scaffolding to deployment and maintenance. The repository distinguishes itself by offering deep technical insights into complex architectural patterns, including actor-based concurrency models for managing parallel tasks and cont
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
Spider is a web-based platform designed for automated data extraction, providing a centralized framework to collect, process, and route structured information from websites. It functions as a comprehensive pipeline that manages the entire lifecycle of data gathering, from initial configuration to final storage in external databases or message queues. The platform distinguishes itself through a visual configuration interface that allows users to define extraction rules and manage scraping templates without writing custom code. It supports both static and dynamic content retrieval by integratin
Crawlab is a distributed web scraping platform designed to centralize the management, deployment, and execution of large-scale data extraction tasks. It functions as a control plane that orchestrates scraping scripts and automated workflows across multiple nodes, providing a unified environment for managing complex data collection operations. The platform distinguishes itself through a distributed architecture that coordinates worker nodes via a central master, utilizing real-time communication to maintain oversight of all active processes. It ensures operational consistency by isolating task
Nightmare is an Electron-based browser automation library and headless browser controller. It provides the infrastructure to programmatically navigate web pages, interact with DOM elements, and execute JavaScript within a background browser instance. The project distinguishes itself by integrating a full Chromium instance within an Electron shell, allowing for the management of browser sessions, network proxy settings, and persistent storage partitions. It enables the capture of page states as PNG screenshots, PDF documents, or HTML files. The tool covers a broad range of capabilities includ
Ripme is a batch media downloader and web media scraper designed for extracting images and videos from image-hosting platforms and social media sites. It functions as an image gallery downloader and a network client capable of retrieving full albums and paginated content. The project includes a custom media ripper framework that allows for the definition of new extraction rules to support websites lacking native support. It features a proxy-enabled network layer for routing requests through HTTP or SOCKS servers and supports session-based content retrieval using authentication cookies and cus
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
PhantomJS is a scriptable, headless browser engine based on WebKit that provides a programmatic interface for automating web page interactions. It operates without a graphical user interface, allowing for the execution of JavaScript to navigate pages, manipulate the document object model, and perform functional testing of web applications. The tool distinguishes itself by providing low-level control over the browser rendering lifecycle and network stack. It enables real-time interception and modification of network traffic, alongside the ability to generate visual snapshots and document expor
jsdom is a Node.js implementation of web standards that functions as a headless browser emulator. It provides a JavaScript execution environment and an HTML and XML parser to simulate a browser environment on the server side, implementing various web APIs and W3C standards. The project distinguishes itself by providing a sandboxed runtime for executing scripts embedded in HTML or external files. It includes specialized polyfills for the Canvas API and manages session state through HTTP cookie management. Its broader capabilities cover network interaction via request interception and resource
Huginn is an open-source automation platform that functions as an event-driven task automator and webhook integration engine. It enables the creation of agents that monitor web data and automate tasks across various web services, operating as a self-hosted web scraper and JavaScript workflow orchestrator. The system uses a directed graph of event flows to route and transform data between external APIs. It differentiates itself by allowing custom JavaScript execution within workflows to modify data payloads and by integrating human-in-the-loop automation to insert manual judgment or data entry
Agent-Reach is an AI agent web gateway and search tool that provides language models with the ability to search and read content from the open web, social media, and community forums without using official APIs. It functions as a routing layer that connects large language models to various internet backends while managing content parsing and connection health. The system enables API-free information retrieval by using open-source backends to extract text and metadata from platforms such as Twitter, Reddit, and YouTube. It converts unstructured website content, RSS feeds, and video transcripts
X-ray is a headless browser web scraper and HTML content crawler designed to extract structured data from websites. It functions as a stream-based data scraper and structured data extractor, using selectors to retrieve text and attributes from HTML as nested objects or arrays. The project includes a request rate controller to manage network traffic through concurrency limits, throttles, and timeouts. It handles dynamic website scraping by rendering JavaScript via a headless browser and performs automated website crawling using breadth-first link following and pagination management. The syste
Venera is a multi-source content reader and aggregator that allows users to browse and download media from various remote websites and local files through a unified interface. It functions as a local-remote media manager, synchronizing online content with local storage to enable offline viewing. The project utilizes a JavaScript-based content parser and aggregator to scrape and parse data from external web sources. This system allows for the definition of custom data extraction rules using JavaScript to fetch and display content from external websites. The platform covers remote media manage
Avbook is a self-hosted, web-based digital library and catalog manager designed for organizing personal collections of adult videos and digital books. It serves as a centralized system for tracking media entries and associated metadata within a local database. The system functions as a media metadata aggregator and web-based scraper, automatically extracting video details and descriptive tags from external sources to populate the library. It also operates as a magnet link organizer, storing and indexing peer-to-peer file identifiers to simplify the discovery of downloadable media. The platfo
Goutte is a PHP web scraper and DOM crawler designed for extracting data from websites. It functions as an HTTP client wrapper that enables the retrieval of web pages and the parsing of HTML content. The project provides a web form automator to programmatically fill and submit HTML forms to remote servers. It also includes a mechanism for automated website crawling by following links to discover and archive web content. The system supports stateful session management to maintain cookies and headers across requests. It further covers HTML data extraction through DOM-based element selection an
jsdom is a Node.js DOM implementation that functions as a headless browser emulator and virtual browser environment. It provides a pure JavaScript implementation of web standards, acting as a web standards polyfill that simulates the window and document objects within a non-browser runtime. The project implements W3C and WHATWG specifications to provide a programmatic environment for parsing HTML and manipulating content. It serves as an HTML parser and serializer, allowing for the transformation of HTML strings into document structures and the export of those structures back into text. The
FlareSolverr is a proxy server designed to provide programmatic access to websites protected by automated security challenges and firewall restrictions. It functions by orchestrating headless browser instances to render web pages, execute JavaScript, and retrieve the necessary cookies and content required to bypass common security hurdles. The service distinguishes itself by maintaining persistent browser sessions in memory, which allows for the reuse of authenticated states across multiple requests. It integrates with external captcha resolution services to handle interactive security challe
This project serves as an agentic browser controller, providing a programmatic bridge that enables autonomous software agents to navigate web pages and interact with document elements. It functions as a browser automation protocol, facilitating headless browser operations and automated web interactions to perform repetitive tasks and end-to-end testing without manual human input. The system distinguishes itself by utilizing the Chrome DevTools Protocol to establish a bidirectional communication channel with the browser engine. This allows for protocol-based remote control, where external appl
Kazumi is a cross-platform media player and streaming platform that centralizes video content from diverse third-party web sources. It functions as an automated scraping tool, utilizing configurable path patterns and selectors to extract and aggregate media streams into a unified interface. The platform distinguishes itself through its focus on synchronized group viewing and real-time state management. Users can participate in shared virtual rooms where playback progress and controls are aligned across multiple devices. Additionally, the application includes integrated image processing capabi