5 repos

Awesome GitHub Repositories · Document and LLM Preparation

Targeted pipelines for converting unstructured files into machine-readable formats specifically optimized for AI and search indexing applications.

Explore 5 awesome GitHub repositories matching data & databases · Document and LLM Preparation.

firecrawl/firecrawl
84,034 · GitHub
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveragi…
TypeScript · ai · ai-agents · ai-crawler
unclecode/crawl4ai
60,452 · GitHub
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.
Python
zylon-ai/private-gpt
57,116 · GitHub
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov…
Python
opendatalab/MinerU
54,523 · GitHub
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences w…
Python · ai4science · document-analysis · extract-data
docling-project/docling
53,584 · GitHub
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing…
Python · ai · convert · document-parser
