20 repos

Awesome GitHub Repositories: Data Extraction & Ingestion

Tools and processes for gathering, parsing, and importing raw data from various external sources into storage systems.

Explore 20 awesome GitHub repositories matching data & databases · Data Extraction & Ingestion.

Significant-Gravitas/AutoGPT
181,891 · GitHub
AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, i
Python · ai · artificial-intelligence · autonomous-agents
jackfrued/Python-100-Days
178,734 · GitHub
This project is a comprehensive, day-by-day curriculum designed to guide learners through the Python programming language and its professional applications. The content spans from fundamental syntax and object-oriented design to advanced topics including database management, web development, data analysis, and machine
Jupyter Notebook
papers-we-love/papers-we-love
103,417 · GitHub
Papers We Love is a community-driven repository and learning network dedicated to the study and discussion of foundational computer science literature. It functions as a centralized educational archive, providing a structured environment where software professionals can engage with academic research to bridge the gap b
Shell · awesome · computer-science · meetup
microsoft/markitdown
87,305 · GitHub
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine
Python · autogen · autogen-extension · langchain
firecrawl/firecrawl
84,034 · GitHub
Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveragi
TypeScript · ai · ai-agents · ai-crawler
browser-use/browser-use
78,576 · GitHub
Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows
Python · ai-agents · ai-tools · browser-automation
netdata/netdata
77,812 · GitHub
Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across comp
C · ai · alerting · cncf
infiniflow/ragflow
73,425 · GitHub
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin
Python · agent · agentic · agentic-ai
tesseract-ocr/tesseract
72,460 · GitHub
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d
C++ · hacktoberfest · lstm · machine-learning
apache/superset
70,587 · GitHub
Superset is a web-based business intelligence platform designed for data exploration, visualization, and interactive dashboarding. It functions as a query-driven analytics engine that connects to various SQL databases, allowing users to perform ad-hoc analysis, define virtual metrics, and build complex data visualizati
TypeScript · analytics · apache · apache-superset
nocodb/nocodb
62,131 · GitHub
NocoDB is a visual platform that transforms relational databases into collaborative, spreadsheet-style workspaces. By acting as a headless database backend, it provides a unified environment for designing database structures, managing record relationships, and interacting with data without requiring manual SQL queries.
TypeScript · airtable · airtable-alternative · automatic-api
unclecode/crawl4ai
60,452 · GitHub
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.
Python
scrapy/scrapy
59,824 · GitHub
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-
Python · crawler · crawling · framework
zylon-ai/private-gpt
57,116 · GitHub
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov
Python
soimort/you-get
56,737 · GitHub
This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media f
Python
Z4nzu/hackingtool
55,016 · GitHub
This project is a comprehensive cybersecurity tool collection designed to support security research, penetration testing, and vulnerability assessment. It functions as a unified penetration testing suite, providing a centralized environment where professionals can access a wide range of offensive security utilities to
Python · allinonehackingtool · besthackingtool · ctf-tools
deepfakes/faceswap
54,974 · GitHub
Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users
Python · deep-face-swap · deep-learning · deep-neural-networks
Mintplex-Labs/anything-llm
54,751 · GitHub
This platform serves as a comprehensive environment for managing private language models, document knowledge bases, and automated agent workflows within secure local infrastructure. It functions as a document-aware workspace that enables users to ingest diverse file formats into searchable repositories, ensuring that a
JavaScript · ai-agents · custom-ai-agents · deepseek
docling-project/docling
53,584 · GitHub
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing
Python · ai · convert · document-parser
WerWolv/ImHex
52,656 · GitHub
ImHex is a professional-grade hex editor and binary data analysis platform designed for inspecting, modifying, and reverse engineering raw file contents. It functions as a schema-driven engine that interprets complex binary structures by applying custom definitions to map and visualize byte-level data. The platform di
C++ · analyzer · binary-analysis · c-plus-plus
