5 repositories
Processes multiple URLs concurrently using async concurrency controls to speed up batch browser automation tasks.
Distinct from Parallel Batch Processing: specifically targets URL processing in browser automation contexts rather than general data processing.
Explore 5 awesome GitHub repositories matching data & databases · URL Batch Processors. Refine with filters or upvote what's useful.
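As a rough illustration of what "async concurrency controls" can mean for a URL batch, here is a minimal Python sketch that caps the number of in-flight tasks with an `asyncio.Semaphore`. It is not taken from any of the repositories listed here; `process_urls`, `fake_worker`, and the `max_concurrency` parameter are hypothetical names chosen for the example, and `fake_worker` merely stands in for a real page visit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple, Union

Result = Union[str, Exception]


async def process_urls(
    urls: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> Dict[str, Result]:
    """Run `worker` on every URL with at most `max_concurrency` running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> Tuple[str, Result]:
        # The semaphore is the concurrency control: extra tasks wait here.
        async with semaphore:
            try:
                return url, await worker(url)
            except Exception as exc:
                # Record the failure so one bad URL does not abort the batch.
                return url, exc

    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


async def fake_worker(url: str) -> str:
    # Hypothetical stand-in for a real browser session or HTTP request.
    await asyncio.sleep(0.01)
    return f"processed {url}"


if __name__ == "__main__":
    batch = ["https://example.com/a", "https://example.com/b"]
    print(asyncio.run(process_urls(batch, fake_worker, max_concurrency=2)))
```

Capping concurrency this way keeps a large batch from opening more browser sessions or connections than the host can handle, while still overlapping the time spent waiting on each page.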
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Processes thousands of URLs concurrently using asynchronous queue-based controls to ensure scalable data retrieval.
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA
Processes multiple URLs concurrently using async concurrency controls to speed up batch browser automation tasks.
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks
Scrapes multiple URLs in parallel with rate limiting and returns operation status for later retrieval.
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
Fetches multiple URLs concurrently with deduplication and archive fallback.
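The general pattern of deduplicating a batch and falling back to an archived copy can be sketched as follows. This is a generic Python illustration, not Trafilatura's API: `dedupe`, `fetch_with_fallback`, `fetch_all`, and the `primary` and `archive` callables are all hypothetical, and the caller is assumed to supply the actual download functions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Fetcher = Callable[[str], Awaitable[Optional[str]]]


def dedupe(urls: List[str]) -> List[str]:
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


async def fetch_with_fallback(url: str, primary: Fetcher, archive: Fetcher) -> Optional[str]:
    """Try the live page first; use the archive fetcher if that fails or returns None."""
    try:
        body = await primary(url)
    except Exception:
        body = None
    if body is None:
        # Errors raised by the archive fetcher are not caught in this sketch.
        body = await archive(url)
    return body


async def fetch_all(
    urls: List[str],
    primary: Fetcher,
    archive: Fetcher,
    max_concurrency: int = 4,
) -> Dict[str, Optional[str]]:
    """Fetch each distinct URL once, with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)
    unique = dedupe(urls)

    async def one(url: str) -> Optional[str]:
        async with semaphore:
            return await fetch_with_fallback(url, primary, archive)

    bodies = await asyncio.gather(*(one(u) for u in unique))
    return dict(zip(unique, bodies))
```

Deduplicating before scheduling means a repeated URL costs one request rather than several, and the fallback step only runs for pages the primary fetcher could not retrieve.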
Yattee is a privacy-focused video player and multi-backend video aggregator designed for streaming online content without tracking, ads, or account requirements. It functions as a cross-platform application that collects video content from self-hosted servers, third-party APIs, and decentralized platforms into a single interface. The project features SponsorBlock integration to automatically skip sponsored or promotional segments using a community-sourced timestamp database. It also includes an Invidious-compatible API server that can replace standard endpoints to facilitate private playback.
Processes multiple URLs in parallel to extract video information efficiently in batches.