awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Document and Unstructured Extraction · Awesome GitHub Repositories

2 repos

Automated processes for parsing unstructured text, documents, or web content into structured, machine-readable formats.

Explore 2 awesome GitHub repositories matching Data & Databases · Document and Unstructured Extraction.

Awesome Document and Unstructured Extraction GitHub Repositories

  • unclecode/crawl4ai

    60,452 · GitHub

    Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.

    Python
  • docling-project/docling

    53,584 · GitHub

    Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing

    Python · ai · convert · document-parser

Explore sub-tags

  • DOM-to-Markdown Transformations: Utilities that parse raw HTML structures into clean, structured text formats for downstream consumption.
  • Extraction Configurations: Configuration tools that define input types and file formats to guide document extraction processes.
  • Schema-Driven Extraction: Tools that map unstructured web content into predefined data structures using automated path selection.
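
To make the DOM-to-Markdown sub-tag concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is not how Crawl4AI or Docling implement conversion; the class name `MarkdownSketch` and the helper `html_to_markdown` are hypothetical names chosen for illustration, and only headings, paragraphs, list items, and links are handled.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical, illustration only)."""

    def __init__(self):
        super().__init__()
        self.parts = []   # accumulated Markdown fragments
        self.href = None  # target of the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # The digit in the tag name gives the number of '#' characters.
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None
        elif tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        # Skip whitespace-only text nodes; collapse runs of whitespace elsewhere.
        if data.strip():
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Drop trailing spaces on each line and surrounding blank lines.
    return "\n".join(line.rstrip() for line in text.splitlines()).strip()


sample = (
    "<h1>Title</h1>"
    "<p>See <a href='https://example.com'>the docs</a> now.</p>"
    "<ul><li>one</li><li>two</li></ul>"
)
print(html_to_markdown(sample))
```

Real converters also have to deal with tables, nested lists, code blocks, and malformed markup, which is why dedicated tools such as the repositories listed above exist.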