awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Data Engineering and Infrastructure · Awesome GitHub Repositories

61 repos

Foundational tools for large-scale data collection, ingestion, storage management, and reliability.

61 GitHub repositories matching Data & Databases · Data Engineering and Infrastructure.

Awesome Data Engineering and Infrastructure GitHub Repositories

  • minio/minio

    60,346 · GitHub

    MinIO is a software-defined, cloud-native object storage server designed to manage large volumes of unstructured data. It functions as a distributed storage cluster that aggregates multiple independent nodes into a unified, scalable pool, providing a high-performance infrastructure compatible with standard cloud storag

    Go · amazon-s3 · cloud · cloudnative
  • scrapy/scrapy

    59,824 · GitHub

    Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-

    Python · crawler · crawling · framework
  • pathwaycom/pathway

    59,684 · GitHub

    Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with

    Python · batch-processing · data-analytics · data-pipelines
  • git/git

    59,192 · GitHub

    Git is a distributed version control system and command-line tool designed for tracking changes in source code and coordinating collaborative software development. It functions as a content-addressable storage platform where project data is maintained as immutable objects indexed by cryptographic hashes, ensuring data

    C · c · hacktoberfest · shell
  • Solido/awesome-flutter

    59,015 · GitHub

    This project is a community-curated directory of resources, libraries, and tools designed to support developers working with the Flutter framework. It functions as a centralized knowledge base, organizing high-quality external references into a structured, human-readable format to assist in the discovery of technical m

    Dart · android · awesome · awesome-list
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Python
  • pmndrs/zustand

    57,057 · GitHub

    Zustand is a state management library that provides a centralized store for managing shared application data. It functions as a reactive container that connects application state to components, allowing them to subscribe to specific slices of data and trigger updates automatically. By utilizing selector-based data acce

    TypeScript · hacktoberfest · hooks · react
  • soimort/you-get

    56,737 · GitHub

    This project is a command-line utility designed to fetch video, audio, and image content from a wide range of web platforms. It functions by parsing page metadata and utilizing modular, site-specific scripts to extract direct media stream URLs from complex web structures, enabling the local archiving of digital media f

    Python
  • pathwaycom/llm-app

    56,311 · GitHub

    This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transfo

    Jupyter Notebook · chatbot · hugging-face · llm
  • meilisearch/meilisearch

    55,992 · GitHub

    Meilisearch is a Rust-based search engine providing typo-tolerant full-text and vector-based semantic search with real-time conversational capabilities.

    Rust · ai · api · app-search
  • Z4nzu/hackingtool

    55,016 · GitHub

    This project is a comprehensive cybersecurity tool collection designed to support security research, penetration testing, and vulnerability assessment. It functions as a unified penetration testing suite, providing a centralized environment where professionals can access a wide range of offensive security utilities to

    Python · allinonehackingtool · besthackingtool · ctf-tools
  • deepfakes/faceswap

    54,974 · GitHub

    Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users

    Python · deep-face-swap · deep-learning · deep-neural-networks
  • Mintplex-Labs/anything-llm

    54,751 · GitHub

    This platform serves as a comprehensive environment for managing private language models, document knowledge bases, and automated agent workflows within secure local infrastructure. It functions as a document-aware workspace that enables users to ingest diverse file formats into searchable repositories, ensuring that a

    JavaScript · ai-agents · custom-ai-agents · deepseek
  • go-gitea/gitea

    53,820 · GitHub

    Gitea is a self-hosted service designed for managing version control repositories, project issue tracking, and software artifact distribution. It provides a collaborative platform that enables teams to host their own source code, manage development tasks through integrated project boards, and store container images or

    Go · bitbucket · cicd · devops
  • docling-project/docling

    53,584 · GitHub

    Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing

    Python · ai · convert · document-parser
  • laurent22/joplin

    53,497 · GitHub

    Joplin is an open-source, cross-platform note-taking application designed for secure, private knowledge management. It functions as a local-first productivity platform, maintaining a complete relational database on the user's device to ensure offline availability and high-performance data retrieval. The application pri

    TypeScript · android · dropbox · electron
  • ultralytics/ultralytics

    53,426 · GitHub

    Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification

    Python · cli · computer-vision · deep-learning
  • WerWolv/ImHex

    52,656 · GitHub

    ImHex is a professional-grade hex editor and binary data analysis platform designed for inspecting, modifying, and reverse engineering raw file contents. It functions as a schema-driven engine that interprets complex binary structures by applying custom definitions to map and visualize byte-level data. The platform di

    C++ · analyzer · binary-analysis · c-plus-plus
  • unslothai/unsloth

    52,461 · GitHub

    Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade

    Python · agent · deepseek · deepseek-r1
  • TryGhost/Ghost

    51,857 · GitHub

    Ghost is an open-source publishing platform and headless content management system designed for professional publishers. It provides a decoupled architecture that separates the content management backend from the front-end delivery layer, allowing users to manage editorial workflows and site data through structured web

    JavaScript · blogging · cms · ghost

Explore sub-tags

  • Backup and Recovery Utilities (4 sub-tags): Utilities for automating database dumps, file storage backups, and managing retention policies or recovery operations.
  • Caching and Performance (2 sub-tags): Techniques and implementations focused on reducing latency and improving system throughput by storing frequently accessed data.
  • Data Engineering (7 sub-tags): Infrastructure and frameworks used to build, manage, and scale complex systems for processing and analyzing large datasets.
  • Data Extraction & Ingestion (11 sub-tags): Tools and processes for gathering, parsing, and importing raw data from various external sources into storage systems.
  • Data Persistence and Storage (10 sub-tags): Technologies and architectures dedicated to the durable storage and long-term management of digital information.