5 repos

Awesome GitHub Repositories · Data Extraction

Tools and techniques for isolating and retrieving specific data points from larger, often unstructured, source datasets.

Explore 5 awesome GitHub repositories matching data & databases · Data Extraction.

browser-use/browser-use
78,576 · GitHub
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows
Python · ai-agents · ai-tools · browser-automation
unclecode/crawl4ai
60,452 · GitHub
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.
Python
scrapy/scrapy
59,824 · GitHub
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-
Python · crawler · crawling · framework
soimort/you-get
56,737 · GitHub
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media f
Python
docling-project/docling
53,584 · GitHub
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing
Python · ai · convert · document-parser


