# mishushakov/llm-scraper

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/mishushakov-llm-scraper).**

6,190 stars · 369 forks · TypeScript · MIT

## Links

- GitHub: https://github.com/mishushakov/llm-scraper
- awesome-repositories: https://awesome-repositories.com/repository/mishushakov-llm-scraper.md

## Topics

`ai` `artificial-intelligence` `browser` `browser-automation` `gpt` `gpt-4` `langchain` `llama` `llm` `openai` `playwright` `puppeteer` `scraper`

## Tags

### Artificial Intelligence & ML

- [LLM-Powered Scrapers](https://awesome-repositories.com/f/artificial-intelligence-ml/web-scrapers/llm-powered-scrapers.md) — A tool that uses a language model to extract structured data from web pages based on a user-defined schema.
- [Incremental Result Streaming](https://awesome-repositories.com/f/artificial-intelligence-ml/tool-calling-integration-frameworks/tool-output-processors/incremental-result-streaming.md) — Yields extracted objects incrementally as the language model generates them for early consumption.
- [Extraction Streams](https://awesome-repositories.com/f/artificial-intelligence-ml/tool-calling-integration-frameworks/tool-output-processors/incremental-result-streaming/extraction-streams.md) — Receives partial structured data incrementally as the language model processes a webpage.
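
The incremental streaming idea in the entries above can be sketched without any library. This is a minimal illustration under stated assumptions, not llm-scraper's API: `fakeModelFields`, `streamPartials`, and `collectPartials` are hypothetical names, and the field-by-field source stands in for a language model producing output over time.

```typescript
// Hypothetical types for this sketch; not taken from llm-scraper.
type Field = [string, string | number];
type PartialRecord = Record<string, string | number>;

// Stand-in for a model that emits one extracted field at a time (assumption).
async function* fakeModelFields(): AsyncGenerator<Field> {
  yield ["title", "Show HN: Example"];
  yield ["points", 42];
  yield ["by", "alice"];
}

// Yields a fresh snapshot of the partial object each time a field arrives,
// so a consumer can act on early fields before extraction finishes.
async function* streamPartials(
  fields: AsyncIterable<Field>
): AsyncGenerator<PartialRecord> {
  const partial: PartialRecord = {};
  for await (const [key, value] of fields) {
    partial[key] = value;
    yield { ...partial }; // copy, so earlier snapshots are not mutated later
  }
}

// Collects every snapshot in arrival order.
async function collectPartials(): Promise<PartialRecord[]> {
  const snapshots: PartialRecord[] = [];
  for await (const snapshot of streamPartials(fakeModelFields())) {
    snapshots.push(snapshot);
  }
  return snapshots;
}
```

A consumer would normally process each snapshot inside the `for await` loop rather than collecting them; `collectPartials` exists only to make the behaviour easy to inspect.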

### Data & Databases

- [Schema-Driven Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction/schema-driven-extraction.md) — Defines the shape of data to extract from webpages with Zod or JSON schemas.
- [Webpage Parsers](https://awesome-repositories.com/f/data-databases/json-schema-modeling/llm-driven-schema-generation/webpage-parsers.md) — Uses a language model to parse raw webpage content and return structured data matching a predefined schema.
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Parses webpage content with a language model to return fields defined by a Zod or JSON schema. ([source](https://github.com/mishushakov/llm-scraper#readme))
- [LLM-Powered Webpage Parsers](https://awesome-repositories.com/f/data-databases/structured-data-extraction/llm-powered-webpage-parsers.md) — Uses an LLM to parse a page's content and return typed fields defined by a Zod or JSON Schema. ([source](https://github.com/mishushakov/llm-scraper#readme))
- [Zod Schemas](https://awesome-repositories.com/f/data-databases/data-collection-schemas/standard-schema-validators/zod-schemas.md) — Accepts Zod schemas directly to define the structure and validation rules for extracted data fields.
- [Schema Definitions](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-serialization/json-schema/schema-definitions.md) — Supports JSON Schema as an alternative to Zod for defining extraction schemas.
- [Incremental Result Streams](https://awesome-repositories.com/f/data-databases/structured-data-extraction/incremental-result-streams.md) — Yields partial structured results incrementally as the language model produces them for early consumption. ([source](https://github.com/mishushakov/llm-scraper#readme))
- [Incremental Streams](https://awesome-repositories.com/f/data-databases/structured-data-extraction/incremental-streams.md) — Yields partial structured results incrementally as the language model produces them during extraction.
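
To make the schema-driven entries above concrete, here is a rough, dependency-free sketch of what "data matching a predefined schema" can mean. Everything in it is an assumption for illustration: the `storySchema` fields are hypothetical, and `conformsTo` is a hand-rolled check covering only `required` keys and primitive `type` values, not a full JSON Schema validator and not how llm-scraper validates output.

```typescript
// A JSON Schema written as a plain object. The fields are hypothetical.
const storySchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    points: { type: "number" },
  },
  required: ["title", "points"],
} as const;

type PrimitiveType = "string" | "number" | "boolean";

// The small subset of JSON Schema this sketch understands.
interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: PrimitiveType }>;
  required: readonly string[];
}

// Returns true when `value` is a plain object that has every required key
// and whose declared properties carry the declared primitive type.
function conformsTo(schema: ObjectSchema, value: unknown): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  for (const key of schema.required) {
    if (!(key in record)) return false;
  }
  for (const key of Object.keys(schema.properties)) {
    if (key in record && typeof record[key] !== schema.properties[key].type) {
      return false;
    }
  }
  return true;
}
```

In practice a Zod schema or a complete JSON Schema validator would replace `conformsTo`; the point of the sketch is only that the schema, not the page markup, defines the shape of the result.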

### Development Tools & Productivity

- [Headless Browser Automation](https://awesome-repositories.com/f/development-tools-productivity/headless-browser-automation.md) — Leverages Playwright to programmatically control a browser for automated web content extraction.
- [LLM-Powered Scrapers](https://awesome-repositories.com/f/development-tools-productivity/web-scraping/llm-powered-scrapers.md) — Extracts structured data from web pages using a language model and a user-defined schema.
- [Playwright Scripts](https://awesome-repositories.com/f/development-tools-productivity/shell-command-execution/script-generators/playwright-scripts.md) — Produces a standalone Playwright script that replicates the extraction logic without requiring an LLM at runtime.

### Part of an Awesome List

- [Provider-Agnostic Interfaces](https://awesome-repositories.com/f/awesome-lists/ai/language-models/provider-agnostic-interfaces.md) — Designed to work with various language models, allowing users to choose the provider or model.

### Web Development

- [Playwright Script Generators](https://awesome-repositories.com/f/web-development/data-extraction/playwright-script-generators.md) — Produces a Playwright script that extracts the same schema-defined data without an LLM call. ([source](https://github.com/mishushakov/llm-scraper#readme))
- [Playwright Scripts](https://awesome-repositories.com/f/web-development/web-automation-scraping/web-scraping-automation/generative-scraping-scripts/playwright-scripts.md) — Creates reusable Playwright scripts that extract data without further language model calls.
