Automated processes for parsing unstructured text, documents, or web content into structured, machine-readable formats.
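As a rough illustration of what "unstructured to structured" means here, the following is a minimal sketch using only the Python standard library. It is not how Crawl4AI or Docling work internally; the `OutlineExtractor` class, the `extract` function, and the sample HTML are all hypothetical names made up for this example. It pulls headings and links out of an HTML string into a plain dictionary that can be serialized as JSON.

```python
import json
from html.parser import HTMLParser


class OutlineExtractor(HTMLParser):
    """Collects headings (h1-h3) and links from an HTML string (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.record = {"headings": [], "links": []}
        self._open_tag = None  # heading or anchor tag currently being read
        self._href = None      # href of the anchor being read, if any
        self._text = []        # text fragments seen inside the open tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "a"):
            self._open_tag = tag
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a tracked tag.
        if self._open_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._text).split())
        if tag == "a":
            self.record["links"].append({"text": text, "href": self._href})
        else:
            self.record["headings"].append({"level": int(tag[1]), "text": text})
        self._open_tag = None


def extract(html: str) -> dict:
    """Parse an HTML string and return its headings and links as a dict."""
    parser = OutlineExtractor()
    parser.feed(html)
    parser.close()
    return parser.record


# Hypothetical sample input, not taken from either repository.
sample = "<h1>Report</h1><p>See <a href='/data.csv'>the data</a>.</p><h2>Method</h2>"
print(json.dumps(extract(sample), indent=2))
```

This sketch deliberately ignores nesting (a link inside a heading, for example) and everything a real extractor has to handle, such as JavaScript-rendered pages, PDFs, tables, and layout. Those are the problems the repositories listed below are aimed at.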
Two GitHub repositories in Data & Databases · Document and Unstructured Extraction.
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing