4 repos
Tools that convert unstructured web or document content into clean, typed, and organized data formats.
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows
Converts unstructured web content into clean, typed, and organized data formats through automated extraction routines.
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.
Converts unstructured web content into clean, organized schemas using path selectors and language model interpretation.
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-
Converts unstructured web content into clean, typed, and organized data formats using defined extraction logic.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing
Extracts information from unstructured sources by applying schemas to identify and organize content into clean, typed data formats.
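As a rough illustration of what "applying a schema to unstructured content" can mean in practice, the sketch below uses only the Python standard library and none of the tools listed above. The `Product` schema, the CSS class names, and the `extract_product` helper are all hypothetical and exist only for this example. Real tools resolve fields with far more robust selectors or with language model interpretation.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Product:
    # Hypothetical target schema; the field names are illustrative only.
    name: str
    price: float
    in_stock: bool


class FieldCollector(HTMLParser):
    """Collects the text of elements whose class attribute names a schema field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None   # schema field currently being read, if any
        self.values = {}      # raw string value per field

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.fields:
            self.current = css_class

    def handle_data(self, data):
        # Store the first non-empty text found inside a matching element.
        if self.current and data.strip():
            self.values[self.current] = data.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


def extract_product(html: str) -> Product:
    """Parse raw HTML and coerce the collected strings into typed fields."""
    collector = FieldCollector(["name", "price", "in_stock"])
    collector.feed(html)
    raw = collector.values
    return Product(
        name=raw["name"],
        price=float(raw["price"].lstrip("$")),   # "$19.99" becomes 19.99
        in_stock=raw["in_stock"].lower() == "yes",
    )


page = (
    '<div><span class="name">Widget</span>'
    '<span class="price">$19.99</span>'
    '<span class="in_stock">Yes</span></div>'
)
product = extract_product(page)
print(product)
```

The point of the sketch is the split between locating values (the parser) and typing them (the dataclass constructor). That same split is what lets extraction logic be swapped out while the output schema stays fixed.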