awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Data Extraction & Ingestion · Awesome GitHub Repositories

20 repos

Tools and processes for gathering, parsing, and importing raw data from various external sources into storage systems.

Explore 20 awesome GitHub repositories in Data & Databases · Data Extraction & Ingestion.

Awesome Data Extraction & Ingestion GitHub Repositories

  • Significant-Gravitas/AutoGPT

    181,891 · GitHub

    AutoGPT is an orchestration platform designed for building, managing, and deploying autonomous agents. It provides a visual canvas-based environment where users can assemble agents by connecting modular blocks that represent actions, data flows, and conditional logic. The platform supports the entire agent lifecycle, i

    Python · ai · artificial-intelligence · autonomous-agents
  • jackfrued/Python-100-Days

    178,734 · GitHub

    This project is a comprehensive, day-by-day curriculum designed to guide learners through the Python programming language and its professional applications. The content spans from fundamental syntax and object-oriented design to advanced topics including database management, web development, data analysis, and machine

    Jupyter Notebook
  • papers-we-love/papers-we-love

    103,417 · GitHub

    Papers We Love is a community-driven repository and learning network dedicated to the study and discussion of foundational computer science literature. It functions as a centralized educational archive, providing a structured environment where software professionals can engage with academic research to bridge the gap b

    Shell · awesome · computer-science · meetup
  • microsoft/markitdown

    87,305 · GitHub

    This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine

    Python · autogen · autogen-extension · langchain
  • firecrawl/firecrawl

    84,034 · GitHub

    Firecrawl is a web data extraction platform designed to convert unstructured web content into clean, LLM-ready formats like markdown or JSON. It functions as an autonomous web crawler and scraper, capable of mapping entire domains, performing recursive navigation, and executing complex data gathering tasks. By leveragi

    TypeScript · ai · ai-agents · ai-crawler
  • browser-use/browser-use

    78,576 · GitHub

    Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows

    Python · ai-agents · ai-tools · browser-automation
  • netdata/netdata

    77,812 · GitHub

    Netdata is a distributed observability platform designed for real-time infrastructure monitoring and performance tracking. It functions as a high-frequency agent that collects system, container, and application metrics with per-second precision, providing both local visualization and centralized aggregation across comp

    C · ai · alerting · cncf
  • infiniflow/ragflow

    73,425 · GitHub

    This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin

    Python · agent · agentic · agentic-ai
  • tesseract-ocr/tesseract

    72,460 · GitHub

    Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d

    C++ · hacktoberfest · lstm · machine-learning
  • apache/superset

    70,587 · GitHub

    Superset is a web-based business intelligence platform designed for data exploration, visualization, and interactive dashboarding. It functions as a query-driven analytics engine that connects to various SQL databases, allowing users to perform ad-hoc analysis, define virtual metrics, and build complex data visualizati

    TypeScript · analytics · apache · apache-superset
  • nocodb/nocodb

    62,131 · GitHub

    NocoDB is a visual platform that transforms relational databases into collaborative, spreadsheet-style workspaces. By acting as a headless database backend, it provides a unified environment for designing database structures, managing record relationships, and interacting with data without requiring manual SQL queries.

    TypeScript · airtable · airtable-alternative · automatic-api
  • unclecode/crawl4ai

    60,452 · GitHub

    Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs.

    Python
  • scrapy/scrapy

    59,824 · GitHub

    Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-

    Python · crawler · crawling · framework
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Python
  • soimort/you-get

    56,737 · GitHub

    This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media f

    Python
  • Z4nzu/hackingtool

    55,016 · GitHub

    This project is a comprehensive cybersecurity tool collection designed to support security research, penetration testing, and vulnerability assessment. It functions as a unified penetration testing suite, providing a centralized environment where professionals can access a wide range of offensive security utilities to

    Python · allinonehackingtool · besthackingtool · ctf-tools
  • deepfakes/faceswap

    54,974 · GitHub

    Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users

    Python · deep-face-swap · deep-learning · deep-neural-networks
  • Mintplex-Labs/anything-llm

    54,751 · GitHub

    This platform serves as a comprehensive environment for managing private language models, document knowledge bases, and automated agent workflows within secure local infrastructure. It functions as a document-aware workspace that enables users to ingest diverse file formats into searchable repositories, ensuring that a

    JavaScript · ai-agents · custom-ai-agents · deepseek
  • docling-project/docling

    53,584 · GitHub

    Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing

    Python · ai · convert · document-parser
  • WerWolv/ImHex

    52,656 · GitHub

    ImHex is a professional-grade hex editor and binary data analysis platform designed for inspecting, modifying, and reverse engineering raw file contents. It functions as a schema-driven engine that interprets complex binary structures by applying custom definitions to map and visualize byte-level data. The platform di

    C++ · analyzer · binary-analysis · c-plus-plus

Explore sub-tags

  • Application Metrics Collection: Collection of telemetry from application-level processes via modular, language-agnostic interfaces.
  • Data Collection Tools (2 sub-tags): Utilities designed to gather raw information from external sources, web pages, or user input interfaces.
  • Data Extraction (4 sub-tags): Tools and techniques for isolating and retrieving specific data points from larger, often unstructured, source datasets.
  • Data Import and Export (2 sub-tags): Functionality for moving data between different systems by converting it into compatible formats for transfer.
  • Data Ingestion (6 sub-tags): Processes and services that receive, clean, and prepare raw data for entry into a storage system.
  • Data Parsing (2 sub-tags): Tools that analyze and translate raw data streams or files into structured, machine-readable formats.
  • Document Processing Tools (3 sub-tags): Focuses on the parsing, conversion, and structural extraction of static files and documents rather than live web or telemetry streams.
  • File Upload Configurations: Settings and parameters for managing file upload constraints and batch processing for data ingestion.
  • Modular Data Collectors: Isolated processes for collecting metrics from heterogeneous sources.
  • Table Extraction Utilities: Tools for identifying and converting grid-based document structures into structured data formats.
  • Web Extraction Engines (5 sub-tags): Specializes in retrieving and transforming unstructured web content into structured or machine-readable formats, distinct from general file ingestion.