awesome-repositories.com
© 2026 Bringes Technology SRL · VAT RO45896025 · hello@bringes.io
Document and LLM Preparation · Awesome GitHub Repositories

5 repos

Targeted pipelines for converting unstructured files into machine-readable formats specifically optimized for AI and search indexing applications.

Explore 5 awesome GitHub repositories matching Data & Databases · Document and LLM Preparation.

Awesome Document and LLM Preparation GitHub Repositories

  • firecrawl/firecrawl

    84,034 · GitHub

    Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveragi

    TypeScript · ai · ai-agents · ai-crawler
  • unclecode/crawl4ai

    60,452 · GitHub

    Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.

    Python
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Python
  • opendatalab/MinerU

    54,523 · GitHub

    MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences w

    Python · ai4science · document-analysis · extract-data
  • docling-project/docling

    53,584 · GitHub

    Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing

    Python · ai · convert · document-parser

Explore sub-tags

  • Document Processing Pipelines: Workflows that ingest, parse, and normalize diverse file formats into standardized content for downstream integration.
  • LLM Data Preparation Tools: Tools that convert raw web and unstructured content into clean, structured formats suitable for large language model ingestion.
  • Multi-Stage Pipeline Processing: Frameworks that orchestrate complex data transformations by chaining multiple sequential processing steps together.