8 repos

Data Extraction & Ingestion — Data & Databases

We curate 8 GitHub repositories matching Data & Databases · Data Extraction & Ingestion. Refine with filters or upvote what's useful.

jackfrued/Python-100-Days
178,734 · GitHub
This project is a comprehensive, day-by-day curriculum designed to guide learners through the Python programming language and its professional applications. The content spans from fundamental syntax and object-oriented design to advanced topics including database management, web development, data analysis, and machine
Jupyter Notebook
papers-we-love/papers-we-love
103,417 · GitHub
Papers We Love is a community-driven repository and learning network dedicated to the study and discussion of foundational computer science literature. It functions as a centralized educational archive, providing a structured environment where software professionals can engage with academic research to bridge the gap b
Shell · awesome · computer-science · meetup
microsoft/markitdown
87,305 · GitHub
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine
Python · autogen · autogen-extension · langchain
firecrawl/firecrawl
84,034 · GitHub
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveragi
TypeScript · ai · ai-agents · ai-crawler
browser-use/browser-use
78,576 · GitHub
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows
Python · ai-agents · ai-tools · browser-automation
netdata/netdata
77,812 · GitHub
Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across comp
C · ai · alerting · cncf
infiniflow/ragflow
73,425 · GitHub
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin
Python · agent · agentic · agentic-ai
tesseract-ocr/tesseract
72,460 · GitHub
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d
C++ · hacktoberfest · lstm · machine-learning
