# any4ai/anycrawl

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/any4ai-anycrawl).**

2,742 stars · 289 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/any4ai/AnyCrawl
- Homepage: https://anycrawl.dev
- awesome-repositories: https://awesome-repositories.com/repository/any4ai-anycrawl.md

## Topics

`ai-scraping` `aitools` `crawl` `data` `html-to-markdown` `rag` `scrape` `scraping` `serp` `webscraper`

## Description

AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol.

The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It uses language models and JSON schemas to pull specific information into validated formats, and it also offers AI page summarization and LLM-optimized content extraction.
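
To make the idea of "validated formats" concrete, here is a minimal sketch of checking an extracted object against a schema. It assumes a deliberately simplified schema shape (`SimpleSchema`) that is not full JSON Schema and is not AnyCrawl's API; every name below is hypothetical and exists only for illustration.

```typescript
// Hypothetical, simplified schema shape. NOT AnyCrawl's API and not full JSON Schema.
type FieldType = "string" | "number" | "boolean";

interface SimpleSchema {
  properties: Record<string, FieldType>;
  required: string[];
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// Check that a candidate object (for example, parsed model output) matches the schema.
function validateExtraction(schema: SimpleSchema, candidate: unknown): ValidationResult {
  if (typeof candidate !== "object" || candidate === null || Array.isArray(candidate)) {
    return { valid: false, errors: ["candidate is not an object"] };
  }
  const record = candidate as Record<string, unknown>;
  const errors: string[] = [];

  // Every required field must be present.
  for (const key of schema.required) {
    if (!(key in record)) errors.push(`missing required field: ${key}`);
  }
  // Every declared field that is present must have the declared primitive type.
  for (const [key, expected] of Object.entries(schema.properties)) {
    if (key in record && typeof record[key] !== expected) {
      errors.push(`field ${key} should be ${expected}`);
    }
  }
  return { valid: errors.length === 0, errors };
}

const productSchema: SimpleSchema = {
  properties: { title: "string", price: "number", inStock: "boolean" },
  required: ["title", "price"],
};

const goodResult = validateExtraction(productSchema, { title: "Widget", price: 9.5 });
const badResult = validateExtraction(productSchema, { title: "Widget", price: "9.5" });
console.log(goodResult.valid, badResult.errors);
```

A production setup would more likely rely on a full JSON Schema validator; the point here is only that extraction output is checked against a declared shape before it is accepted.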

The system manages comprehensive web scraping infrastructure, including proxy rotation, stealth rendering, and asynchronous job queuing. It supports automated site traversal through recursive crawling and sitemap discovery, as well as scheduled data collection using cron-based timing and webhook notifications. Additional capabilities include search engine integration for URL discovery and the execution of custom JavaScript logic within a sandbox for result transformation.
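
The recursive traversal described above can be sketched as a breadth-first walk with a depth limit and a same-host filter. This is an illustrative sketch only: the `LinkGraph` type and `crawl` function are hypothetical, and an in-memory link map stands in for real page fetches so the example stays self-contained.

```typescript
// Hypothetical in-memory link graph standing in for real HTTP fetches.
type LinkGraph = Record<string, string[]>;

// Breadth-first traversal from a seed URL, limited by depth and
// restricted to pages on the same host as the seed.
function crawl(graph: LinkGraph, seed: string, maxDepth: number): string[] {
  const seedHost = new URL(seed).hostname;
  const visited = new Set<string>([seed]);
  const order: string[] = [];
  let frontier: string[] = [seed];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      order.push(url);
      for (const link of graph[url] ?? []) {
        const sameHost = new URL(link).hostname === seedHost;
        if (sameHost && !visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return order;
}

const site: LinkGraph = {
  "https://example.com/": ["https://example.com/a", "https://other.test/x"],
  "https://example.com/a": ["https://example.com/b", "https://example.com/"],
  "https://example.com/b": ["https://example.com/c"],
};

// Depth 1 visits the seed and its direct same-host links only.
const shallowPages = crawl(site, "https://example.com/", 1);
const deeperPages = crawl(site, "https://example.com/", 2);
console.log(shallowPages, deeperPages);
```

The visited set prevents loops such as `/a` linking back to the seed, and the host check keeps the traversal from leaving the starting domain.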

The toolkit is available for containerized deployment.

## Tags

### Artificial Intelligence & ML

- [Web Content Extractions](https://awesome-repositories.com/f/artificial-intelligence-ml/web-content-extractions.md) — Extracts structured data from websites using a multi-threaded engine specifically optimized for language models. ([source](https://docs.anycrawl.dev/))
- [AI-Powered Data Extractors](https://awesome-repositories.com/f/artificial-intelligence-ml/ai-powered-data-extractors.md) — Uses language models and JSON schemas to pull specific information from web pages into validated formats.
- [Web URL Discovery](https://awesome-repositories.com/f/artificial-intelligence-ml/web-url-discovery.md) — Automatically discovers site links by analyzing sitemaps and HTML page structures for content ingestion. ([source](https://docs.anycrawl.dev/en/general/map))

### Web Development

- [LLM Integration Gateways](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping/web-scraping-apis/llm-integration-gateways.md) — Serves as a standardized API gateway connecting web crawling and extraction tools directly to language models.
- [AI-Powered Web Summarization](https://awesome-repositories.com/f/web-development/custom-page-frameworks/content-summarization/ai-powered-web-summarization.md) — Uses artificial intelligence to generate a concise abstract of a webpage's main information. ([source](https://docs.anycrawl.dev/en/general/scrape))
- [Headless Browser Orchestrators](https://awesome-repositories.com/f/web-development/web-automation-scraping/browser-orchestration-systems/headless-browser-orchestrators.md) — Provides a system to manage concurrent crawling jobs with proxy rotation and stealth rendering using headless browsers.
- [Scraping Infrastructure Management](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/scraping-infrastructure-management.md) — Manages proxy rotation, stealth rendering, and asynchronous job queuing to bypass website restrictions.
- [Web Crawling](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-crawling.md) — Systematically discovers and archives linked pages across entire websites or search result sets at scale.
- [Web Crawlers](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping/web-crawlers.md) — Automates the traversal of domains and discovery of URLs via sitemaps to archive website content.
- [Standardized Scraping Configurations](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/web-scraping/web-scraping-apis/standardized-scraping-configurations.md) — Defines reusable configurations for scraping, crawling, or searching to ensure consistent behavior. ([source](https://docs.anycrawl.dev/en/general/template))

### Part of an Awesome List

- [Tool-Use Integrations](https://awesome-repositories.com/f/awesome-lists/ai/ai-model-and-api-integration/tool-use-integrations.md) — Provides a standardized protocol to connect web scraping and crawling capabilities as tools for AI assistants and language models. ([source](https://docs.anycrawl.dev/en/general/mcp))

### Content Management & Publishing

- [Content Extraction Engines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/content-extraction-engines.md) — Extracts content from a single URL using multiple rendering engines to handle static HTML and JavaScript. ([source](https://cdn.jsdelivr.net/gh/any4ai/anycrawl@main/README.md))
- [Web Content Scraping](https://awesome-repositories.com/f/content-management-publishing/web-content-scraping.md) — Extracts content from specific URLs into formats like markdown or JSON, including OCR options. ([source](https://docs.anycrawl.dev/en/general/mcp))
- [Pattern-Based Extraction](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/content-extraction-engines/pattern-based-extraction.md) — Visits pages to discover links while restricting full content extraction to specific URL patterns. ([source](https://docs.anycrawl.dev/en/general/crawl))
- [Search Result Extractors](https://awesome-repositories.com/f/content-management-publishing/search-result-extractors.md) — Scrapes and collects structured data and links specifically from search engine result pages. ([source](https://docs.anycrawl.dev/en/general/search))
- [Static Site Archiving](https://awesome-repositories.com/f/content-management-publishing/static-site-archiving.md) — Provides capabilities to create local snapshots of live websites by traversing them up to a specified depth. ([source](https://cdn.jsdelivr.net/gh/any4ai/anycrawl@main/README.md))

### Data & Databases

- [Web Content Scrapers](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/web-extraction-engines/web-content-scrapers.md) — Extracts web content and converts it into structured formats like markdown and JSON optimized for AI assistants.
- [Schema-Driven Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction/schema-driven-extraction.md) — Uses language models and JSON schemas to pull specific information from web pages into validated formats. ([source](https://cdn.jsdelivr.net/gh/any4ai/anycrawl@main/README.md))
- [Multi-Page Crawling](https://awesome-repositories.com/f/data-databases/multi-page-crawling.md) — Implements systems for navigating across multiple pages from a seed URL using asynchronous queues and path filters. ([source](https://docs.anycrawl.dev/en/general/crawl))
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Converts unstructured website content into structured text using static parsing or browser rendering. ([source](https://docs.anycrawl.dev/en/general/scrape))
- [Domain-Restricted Crawling](https://awesome-repositories.com/f/data-databases/url-crawl-queues/url-extraction/domain-restricted-crawling.md) — Collects data from multiple related pages starting from a base URL using domain-restricted crawling strategies. ([source](https://docs.anycrawl.dev/en/general/mcp))
- [Content Caching Controls](https://awesome-repositories.com/f/data-databases/content-caching-controls.md) — Manages the storage and expiration of retrieved web content to optimize data freshness and retrieval speed. ([source](https://docs.anycrawl.dev/en/general/scrape))
- [Web Scraping Result Caches](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/caching-performance/caching-strategies/query-result-caching/method-result-caches/web-scraping-result-caches.md) — Stores page content and discovery maps to avoid redundant network requests and increase processing speed. ([source](https://docs.anycrawl.dev/en/general/cache))
- [Batch Search Crawling](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-indexing/search-and-indexing/batch-search-crawling.md) — Retrieves search result pages from multiple engines in batch mode to gather external links and data. ([source](https://cdn.jsdelivr.net/gh/any4ai/anycrawl@main/README.md))
- [Structured Search Retrieval](https://awesome-repositories.com/f/data-databases/search-indexing-technologies/search-indexing/search-information-retrieval/structured-search-retrieval.md) — Retrieves structured data from multiple search engines synchronously, supporting multi-page results and various regional locales. ([source](https://docs.anycrawl.dev/en/general/search))
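
The caching entries above describe storing retrieved content with an expiration. A minimal sketch of that idea follows, assuming a simple maximum-age policy; the `PageCache` class and its method names are hypothetical and are not taken from AnyCrawl. The current time is passed in explicitly so the behaviour is easy to follow and test.

```typescript
interface CacheEntry {
  content: string;
  storedAt: number; // timestamp in milliseconds
}

// Hypothetical page cache with a maximum-age policy; not AnyCrawl's implementation.
class PageCache {
  private entries = new Map<string, CacheEntry>();
  private maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  // Store content for a URL together with the time it was fetched.
  set(url: string, content: string, now: number): void {
    this.entries.set(url, { content, storedAt: now });
  }

  // Return cached content if it is still fresh; evict and return undefined otherwise.
  get(url: string, now: number): string | undefined {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (now - entry.storedAt > this.maxAgeMs) {
      this.entries.delete(url);
      return undefined;
    }
    return entry.content;
  }
}

const pageCache = new PageCache(1000); // entries stay fresh for one second
pageCache.set("https://example.com/", "# Example", 0);
const freshHit = pageCache.get("https://example.com/", 500);
const staleMiss = pageCache.get("https://example.com/", 1500);
console.log(freshHit, staleMiss);
```

A caller that receives `undefined` would fetch the page again and call `set` with the new content, which is how a freshness window trades network requests against staleness.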

### Networking & Communication

- [Traffic Routing Proxies](https://awesome-repositories.com/f/networking-communication/network-infrastructure-routing/network-infrastructure-configuration/network-infrastructure/traffic-routing-proxies.md) — Directs network traffic through specific proxies based on domain patterns to bypass site restrictions. ([source](https://docs.anycrawl.dev/en/general/proxy-rule))
- [Proxy Management](https://awesome-repositories.com/f/networking-communication/proxy-management.md) — Manages the routing of network traffic through custom proxy servers to avoid rate limits. ([source](https://cdn.jsdelivr.net/gh/any4ai/anycrawl@main/README.md))
- [Proxy Rotation Services](https://awesome-repositories.com/f/networking-communication/proxy-rotation-services.md) — Features a proxy-rotation layer with failover logic to bypass rate limits and domain restrictions.
- [Failover Proxy Routers](https://awesome-repositories.com/f/networking-communication/proxy-servers/failover-proxy-routers.md) — Automatically rotates through a tiered list of backup proxies when the primary connection for a domain fails. ([source](https://docs.anycrawl.dev/en/general/proxy-rule))
- [Per-Request Proxy Assignments](https://awesome-repositories.com/f/networking-communication/request-proxies/per-request-proxy-assignments.md) — Toggles between base, stealth, and custom proxy settings on a per-request basis to optimize speed and access. ([source](https://docs.anycrawl.dev/en/general/proxy-rule))
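
The entries above combine domain-pattern routing with failover across backup proxies. The sketch below shows one way those two ideas fit together. The `ProxyRule` shape, the wildcard syntax, and the proxy URLs are assumptions made for illustration; they are not AnyCrawl's proxy-rule format.

```typescript
// Hypothetical rule shape for illustration only; not AnyCrawl's configuration format.
interface ProxyRule {
  pattern: string;   // "*.example.com" (subdomains) or an exact hostname
  proxies: string[]; // ordered: primary first, then backups
}

// "*.example.com" matches any subdomain of example.com; other patterns match exactly.
function matchesPattern(pattern: string, hostname: string): boolean {
  if (pattern.startsWith("*.")) {
    return hostname.endsWith(pattern.slice(1));
  }
  return hostname === pattern;
}

// Return the first proxy of the first matching rule that has not been marked as failed,
// or null when no rule matches or every proxy in the matching rule has failed.
function selectProxy(rules: ProxyRule[], hostname: string, failed: Set<string>): string | null {
  const rule = rules.find((r) => matchesPattern(r.pattern, hostname));
  if (!rule) return null;
  return rule.proxies.find((p) => !failed.has(p)) ?? null;
}

const proxyRules: ProxyRule[] = [
  {
    pattern: "*.example.com",
    proxies: ["http://proxy-a.internal:8080", "http://proxy-b.internal:8080"],
  },
  { pattern: "news.test", proxies: ["http://proxy-c.internal:8080"] },
];

const primaryChoice = selectProxy(proxyRules, "shop.example.com", new Set());
const backupChoice = selectProxy(
  proxyRules,
  "shop.example.com",
  new Set(["http://proxy-a.internal:8080"]),
);
console.log(primaryChoice, backupChoice);
```

Keeping the failed set outside the selector means the caller decides when a proxy counts as failed and when it becomes eligible again.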

### Software Engineering & Architecture

- [AI-Driven Schema Extractions](https://awesome-repositories.com/f/software-engineering-architecture/content-schemas/ai-driven-schema-extractions.md) — Uses language models and JSON schemas to transform unstructured web content into validated structured data.
- [Crawler Behavior Configurations](https://awesome-repositories.com/f/software-engineering-architecture/crawler-behavior-configurations.md) — Configures proxy servers, stealth modes, and headless browser settings to control how web pages are accessed. ([source](https://docs.anycrawl.dev/en/general/docker))
- [Job Queues](https://awesome-repositories.com/f/software-engineering-architecture/execution-control/asynchronous-task-queueing/job-queues.md) — Implements an asynchronous job queue to manage and monitor concurrent crawling tasks independently.
- [Webhook Event Notifications](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/programmatic-interfaces/webhook-event-notifications.md) — Sends real-time HTTP POST notifications to external endpoints when asynchronous crawling jobs change state or complete.

### Development Tools & Productivity

- [Cron Scheduling](https://awesome-repositories.com/f/development-tools-productivity/cron-scheduling.md) — Automates recurring web scraping and search tasks using cron-based timing and webhook notifications.
- [Document Extraction Post-Processors](https://awesome-repositories.com/f/development-tools-productivity/post-processing-hooks/specification-post-processors/document-extraction-post-processors.md) — Applies custom JavaScript logic to clean, analyze, or transform extracted web content before final delivery. ([source](https://docs.anycrawl.dev/en/general/template))
- [Crawl Depth Limiters](https://awesome-repositories.com/f/development-tools-productivity/search-paging-limits/crawl-depth-limiters.md) — Supports recursive website traversal with depth constraints and domain filters to discover and archive pages.
- [Web Search Integrations](https://awesome-repositories.com/f/development-tools-productivity/web-search-integrations.md) — Integrates with external search engines to discover target URLs before initiating the scraping process. ([source](https://docs.anycrawl.dev/en/general/mcp))

### DevOps & Infrastructure

- [Crawl Job Monitoring](https://awesome-repositories.com/f/devops-infrastructure/crawl-job-monitoring.md) — Monitors asynchronous crawls, retrieves paginated results, and enables the cancellation of pending jobs. ([source](https://docs.anycrawl.dev/en/general/mcp))
- [Concurrent Job Schedulers](https://awesome-repositories.com/f/devops-infrastructure/job-scheduling/concurrent-job-schedulers.md) — Manages the simultaneous execution of multiple crawling tasks through a concurrent job queue to process data at scale. ([source](https://docs.anycrawl.dev/en/general/crawl))
- [Job Templates](https://awesome-repositories.com/f/devops-infrastructure/job-templates.md) — Executes predefined scraping and search templates using dynamic input variables and specific payloads. ([source](https://docs.anycrawl.dev/en/general/scheduled-tasks))
- [Recurring Job Scheduling](https://awesome-repositories.com/f/devops-infrastructure/recurring-job-scheduling.md) — Automates scraping, crawling, and search operations using cron-based timing for regular interval execution. ([source](https://docs.anycrawl.dev/en/general/scheduled-tasks))

### Programming Languages & Runtimes

- [Sandboxed JavaScript Execution](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/javascript-runtimes/sandboxed-javascript-execution.md) — Executes custom JavaScript logic within an isolated sandbox to process and clean scraped output.
