awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
MCPSitemapPrivacyTerms
Data Engineering Pipelines · Awesome GitHub Repositories

7 repos

Awesome GitHub RepositoriesData Engineering Pipelines

Systems and workflows designed for the automated collection, transformation, and loading of data across diverse sources.

Explore 7 awesome GitHub repositories matching data & databases · Data Engineering Pipelines. Refine with filters or upvote what's useful.

  1. Home
  2. Data & Databases
  3. Data Engineering and Infrastructure
  4. Data Engineering
  5. Data Pipeline Orchestration
  6. Data Engineering Pipelines

Awesome Data Engineering Pipelines GitHub Repositories

Describe the repository you're looking for…
We'll search the best matching repositories with AI.
  • sindresorhus/awesome

    sindresorhus/awesome

    438,690GitHubView on GitHub↗

    This project is a community-curated knowledge base that organizes vast technical ecosystems into a hierarchical, human-readable directory. It serves as a comprehensive index of libraries, frameworks, and methodologies, designed to facilitate discovery and professional development across the entire spectrum of software

    awesomeawesome-listlists
  • vinta/awesome-python

    vinta/awesome-python

    283,687GitHubView on GitHub↗

    This project is a comprehensive, community-curated directory that organizes a vast landscape of Python software libraries, frameworks, and tools. It serves as a centralized knowledge base designed to facilitate ecosystem navigation and accelerate developer discovery across the entire software development lifecycle. Th

    Pythonawesomecollectionspython
  • tensorflow/tensorflow

    tensorflow/tensorflow

    193,864GitHubView on GitHub↗

    TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The syst

    C++deep-learningdeep-neural-networksdistributed
  • pytorch/pytorch

    pytorch/pytorch

    97,601GitHubView on GitHub↗

    PyTorch is a machine learning framework centered on a GPU-ready tensor library that supports multi-dimensional array operations across both CPU and accelerator hardware. It provides a foundational infrastructure for mathematical computation and dynamic neural network construction, utilizing a tape-based automatic diffe

    Pythonautograddeep-learninggpu
  • microsoft/markitdown

    microsoft/markitdown

    87,305GitHubView on GitHub↗

    This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine

    Pythonautogenautogen-extensionlangchain
  • elastic/elasticsearch

    elastic/elasticsearch

    76,163GitHubView on GitHub↗

    Elasticsearch is a distributed search engine and document store designed for the high-performance indexing and retrieval of massive volumes of unstructured data. It functions as a centralized analytics platform, providing a schema-flexible architecture that organizes information into searchable indices while maintainin

    Javaelasticsearchjavasearch-engine
  • unslothai/unsloth

    unslothai/unsloth

    52,461GitHubView on GitHub↗

    Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade

    Pythonagentdeepseekdeepseek-r1

Explore sub-tags

  • Batched Data LoadingAutomatic collation of data samples into batches.
  • Data Ingestion ToolsUtilities and APIs for importing external data into the system.
  • Data SamplersMechanisms for controlling the sequence, shuffling, and batching of data indices.
  • Dataset Abstractions
Interfaces for defining data sources, supporting both index-based and streaming access patterns.
  • LLM-Integrated Extraction PipelinesOrchestration workflows that chain file ingestion, layout analysis, and model-based generation.
  • Lazy Data Ingestion PipelinesStreams and transforms training data through asynchronous, multi-threaded buffers.
  • Parallel Data LoadersUtilities for fetching and preprocessing data using multi-process parallelism to maximize throughput.