What are the best open-source GitHub repositories for data engineering?

apache/incubator-airflow is the closest match: Airflow is a mature Python-based workflow orchestrator that lets you author, schedule, and monitor complex data pipelines as DAGs, with built-in monitoring, retries, and extensible operators for transformation and connectors, which directly matches the requirement for a data pipeline / ETL framework. Other strong matches: apache/airflow, pathwaycom/pathway, dagster-io/dagster, unstructured-io/unstructured.

Why does apache/incubator-airflow match “data engineering”?

Airflow is a mature Python-based workflow orchestrator that lets you author, schedule, and monitor complex data pipelines as DAGs, with built-in scheduling, monitoring, retries, and extensible operators for transformation and connectors — directly matching the requirement for a data pipeline / ETL…

Why does apache/airflow match “data engineering”?

Apache Airflow is a mature workflow engine for authoring, scheduling, and monitoring data pipelines as DAGs, with native Python-based transformations, extensive connector libraries, error handling and retries — precisely the orchestration core this search targets.

Why does pathwaycom/pathway match “data engineering”?

Pathway is a high-performance data processing framework that unifies batch and streaming pipelines with differential dataflow for incremental processing, orchestrates complex transformations, and supports diverse source connectors and exactly-once semantics — squarely covering the core ETL and orch…

Why does dagster-io/dagster match “data engineering”?

Dagster is a data orchestration platform that treats data assets as first-class primitives, supporting definition, scheduling, monitoring, and error handling of pipelines with Python-based transformations, incremental processing, and extensive connector support—exactly what a data pipeline / ETL fr…

Why does unstructured-io/unstructured match “data engineering”?

Unstructured is a data orchestration engine specialized for transforming unstructured documents into structured formats, with scheduling, monitoring, and error handling features, making it a valid if narrow choice for document-centric data pipelines.

Data Engineering

Discover open-source frameworks and tools for building data pipelines, processing large datasets, and managing infrastructure.


apache/incubator-airflow
45,840
This project is a Python workflow orchestration platform and programmatic data pipeline engine used to author, schedule, and monitor complex data pipelines. It functions as a directed acyclic graph manager and scheduler, allowing users to define data movement and transformation tasks as code to ensure precise execution order and maintainability. The platform distinguishes itself by treating workflows as code, enabling pipelines to be versioned and tested through a standard programming language. It utilizes a system of extensible operators to encapsulate integration logic and employs a templat
Airflow is a mature Python-based workflow orchestrator that lets you author, schedule, and monitor complex data pipelines as DAGs, with built-in scheduling, monitoring, retries, and extensible operators for transformation and connectors — directly matching the requirement for a data pipeline / ETL framework.
Python · Directed Acyclic Graph Engines
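The idea of defining tasks as a DAG and running them in dependency order with retries can be sketched in plain Python. This is a minimal illustration using only the standard library, not Airflow's API; `run_dag`, the task names, and the `max_retries` parameter are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: dict mapping task name -> zero-argument callable (hypothetical names).
    deps:  dict mapping task name -> set of task names it depends on.
    Returns the list of task names in the order they completed.
    """
    # Tasks without dependencies still need a node in the graph.
    graph = {name: deps.get(name, set()) for name in tasks}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole run
    return completed


# Toy pipeline: extract -> transform -> load, with a transform step that fails once.
attempts = {"transform": 0}


def extract():
    pass


def transform():
    attempts["transform"] += 1
    if attempts["transform"] < 2:
        raise RuntimeError("transient failure")


def load():
    pass


order = run_dag(
    {"extract": extract, "transform": transform, "load": load},
    {"transform": {"extract"}, "load": {"transform"}},
)
```

A real orchestrator adds scheduling, persistence, and parallel execution on top of this ordering logic.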
apache/airflow
45,902
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Apache Airflow is a mature workflow engine for authoring, scheduling, and monitoring data pipelines as DAGs, with native Python-based transformations, extensive connector libraries, error handling and retries — precisely the orchestration core this search targets.
Python · Alerting Systems
pathwaycom/pathway
62,959
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Pathway is a high-performance data processing framework that unifies batch and streaming pipelines with differential dataflow for incremental processing, orchestrates complex transformations, and supports diverse source connectors and exactly-once semantics — squarely covering the core ETL and orchestration needs you listed.
Python · Data Processing Frameworks · Data Stream Processors · Declarative Pipeline Construction
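Incremental processing, as described for Pathway, means propagating only changes instead of recomputing the whole result. The toy aggregator below illustrates that idea with insert and retract records; it is not Pathway's engine or API, and `IncrementalSum` and the `(key, value, diff)` change format are assumptions made for this sketch.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintain per-key sums from a stream of (key, value, diff) changes.

    diff is +1 for an inserted row and -1 for a retracted row, so an update
    is expressed as a retraction of the old row followed by an insertion.
    """

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, changes):
        """Apply a batch of changes; return only the keys whose sum changed."""
        before = {}
        for key, value, diff in changes:
            before.setdefault(key, self.totals[key])
            self.totals[key] += value * diff
        return {
            key: self.totals[key]
            for key, old in before.items()
            if self.totals[key] != old
        }


agg = IncrementalSum()
first = agg.apply([("eu", 10, +1), ("us", 5, +1)])
# Update the "eu" row from 10 to 12: retract the old value, insert the new one.
second = agg.apply([("eu", 10, -1), ("eu", 12, +1)])
```

The second batch touches only the `eu` key, so only that key is reported downstream; the `us` total is never recomputed.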
dagster-io/dagster
14,974
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Dagster is a data orchestration platform that treats data assets as first-class primitives, supporting definition, scheduling, monitoring, and error handling of pipelines with Python-based transformations, incremental processing, and extensive connector support—exactly what a data pipeline / ETL framework search is after.
Python · Data Pipeline Orchestration · Declarative Orchestration · Workflow Orchestration Engines
Unstructured-IO/unstructured
14,019
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Unstructured is a data orchestration engine specialized for transforming unstructured documents into structured formats, with scheduling, monitoring, and error handling features, making it a valid if narrow choice for document-centric data pipelines.
HTML · Data Source Connections · Task Status Monitors · Directed Acyclic Graph Engines
airbytehq/airbyte
21,472
Airbyte is a data integration platform designed to synchronize information between diverse applications, databases, and data warehouses. It functions as an extract, transform, and load orchestrator that manages automated data movement workflows across cloud, on-premise, and hybrid environments. The platform provides a standardized interface for connectors, enabling the movement of structured and unstructured data while maintaining stateful checkpoints for reliable incremental syncing. The platform distinguishes itself through a containerized architecture that isolates connectors to prevent de
Airbyte is a data integration platform that orchestrates ETL/ELT pipelines with extensive connectors, incremental syncing, monitoring, and error handling, fitting the data pipeline/ETL framework category for building and managing data workflows.
Python · Data Transformation
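Stateful checkpoints for incremental syncing can be illustrated with a cursor that records how far the last sync got. This is a minimal sketch of the general technique, not Airbyte's connector protocol; `incremental_sync`, the `updated_at` cursor field, and the sample rows are hypothetical.

```python
def incremental_sync(source_rows, state, cursor_field="updated_at"):
    """Return rows newer than the saved cursor, plus the advanced state.

    state is a dict that holds the highest cursor value seen so far;
    an empty dict means no sync has run yet, so every row is returned.
    """
    cursor = state.get("cursor")
    new_rows = [
        row for row in source_rows
        if cursor is None or row[cursor_field] > cursor
    ]
    if new_rows:
        state = {"cursor": max(row[cursor_field] for row in new_rows)}
    return new_rows, state


source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
]
first_batch, state = incremental_sync(source, {})

# A new row arrives; the next sync should pick up only that row.
source.append({"id": 3, "updated_at": "2024-01-09"})
second_batch, state = incremental_sync(source, state)
```

ISO-formatted date strings compare correctly as text, which is why the sketch can use plain string comparison for the cursor.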
apache/beam
8,612
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Apache Beam is a unified batch and stream processing framework that covers data transformation, connectors, and incremental processing well, but it lacks built-in scheduling and monitoring, which you would typically add with an orchestrator like Airflow.
Java · Dead Letter Queues · Directed Acyclic Graph Engines
PrefectHQ/prefect
21,640
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
Prefect is a mature workflow orchestration platform that lets you build, schedule, and monitor data pipelines entirely in Python, with built-in observability, error handling, and retries—exactly the kind of framework this search is after.
Python · Snowflake Connectors
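A state-machine-based orchestration model with a persistent event log can be sketched as follows. The states and allowed transitions here are assumptions chosen for illustration and are not Prefect's actual state set or API; `TaskRun` and `ALLOWED` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions in this toy model (an assumption, not Prefect's real rules).
ALLOWED = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.RUNNING},  # a retry re-enters RUNNING
    State.COMPLETED: set(),
}


class TaskRun:
    def __init__(self, name):
        self.name = name
        self.state = State.PENDING
        self.events = []  # append-only log of (from_state, to_state)

    def transition(self, new_state):
        """Move to new_state if the transition is legal, and log it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.events.append((self.state, new_state))
        self.state = new_state


run = TaskRun("load_orders")
# One failed attempt followed by a successful retry.
for next_state in (State.RUNNING, State.FAILED, State.RUNNING, State.COMPLETED):
    run.transition(next_state)
```

Because every change goes through `transition`, the event log is a complete history of the run, which is what makes progress observable after the fact.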
apache/seatunnel
9,427
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
SeaTunnel is a distributed data integration engine for building and orchestrating data pipelines, supporting batch and streaming, CDC, a visual designer, and multiple execution backends—fitting the requirement for an ETL framework with connectors, incremental processing, and orchestration capabilities.
Java · Backend-Agnostic Execution Layers · Distributed Data Engines · CDC Synchronization
argoproj/argo-workflows
16,466
Argo Workflows is a container-native workflow engine that functions as a Kubernetes custom resource controller. It orchestrates complex sequences of containerized tasks by executing them as directed acyclic graphs, allowing for dependency management and parallel processing within a cluster. The system extends the native Kubernetes control plane to manage the full lifecycle of automated processes, from initial triggering to final resource cleanup. The platform distinguishes itself through its controller-pattern reconciliation, which continuously monitors workflow states to align them with desi
Argo Workflows is a Kubernetes-native workflow engine that orchestrates containerized tasks as directed acyclic graphs, making it a solid fit for building and managing data pipelines and ETL workflows; it handles scheduling, retries, and monitoring, but delegates data transformation (SQL/Python) and source connectors to custom containers rather than providing them built-in.
Go · CI/CD Orchestration Tools · Distributed Task Orchestrators · Workflow Orchestrators
pentaho/pentaho-kettle
8,353
Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that utilizes a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system is distinguished by its distributed data processing engine, which executes workloads across clusters of server nodes to increase throughput. It employs a plugin-based architecture, allowing the platform to be extended via external JAR files to provide connectiv
Pentaho Kettle is a full-featured, open-source ETL platform that uses a visual workflow designer and metadata-driven orchestration to build and manage complex data pipelines, with built-in monitoring, connector plugins, and distributed processing—exactly what a data pipeline/ETL framework should provide.
Java · Data Integration · ETL Workflows · Cross-Source Data Integration
apache/nifi
5,976
Apache NiFi is a flow-based programming platform that enables the visual design, monitoring, and management of data pipelines. At its core, it provides a web-based visual dataflow designer where users build directed graphs of processors to route, transform, and mediate data movement between any source and destination without writing custom code. The system records fine-grained data provenance for every data item from ingestion to delivery, supporting audit, debugging, and replay of data lineage. The platform distinguishes itself through a zero-master cluster architecture that distributes proc
Apache NiFi is a flow-based data pipeline platform with a visual designer, comprehensive monitoring through data provenance, and built-in back-pressure and delivery guarantees, directly covering the core needs of building, managing, and orchestrating ETL pipelines.
Java · Data Pipeline Orchestration · Data Pipeline Orchestrators · Processor Graph Dataflow Models
Netflix/metaflow
9,764
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
Metaflow is a Python framework that orchestrates data pipelines and ML workflows, featuring checkpointing for error recovery, monitoring, and seamless transition from development to production – directly matching the need for a data pipeline and ETL orchestration tool.
Python · Machine Learning Pipelines · ML Workflow Engines · Workflow Orchestration
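Checkpointing for error recovery, where a failed run resumes from the last successful step, can be sketched with a local JSON file standing in for remote artifact storage. This is an illustration of the technique only, not Metaflow's API; `run_with_checkpoints`, the step names, and the file layout are hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, function) steps in order, persisting each result to disk.

    Each function receives the dict of results so far and returns a
    JSON-serializable value. Steps already recorded in the checkpoint
    file are skipped, so a rerun resumes after the last successful step.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for name, fn in steps:
        if name in results:
            continue  # already checkpointed by an earlier run
        results[name] = fn(results)
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)
    return results


calls = []
fail_once = {"armed": True}


def load(_results):
    calls.append("load")
    return [1, 2, 3]


def total(results):
    calls.append("total")
    if fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("worker crashed")
    return sum(results["load"])


steps = [("load", load), ("total", total)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_with_checkpoints(steps, path)
except RuntimeError:
    pass  # the first run fails in "total"; "load" is already checkpointed

final = run_with_checkpoints(steps, path)
```

On the second call the `load` step is skipped because its result was persisted before the simulated crash, so only `total` runs again.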
pathwaycom/llm-app
59,341
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Pathway is a data processing engine for building real-time and batch ETL pipelines with differential dataflow for incremental processing, which squarely fits the data pipeline framework category, though it lacks explicit scheduling, monitoring, and generic data source connectors beyond its AI/LLM focus.
Jupyter Notebook · Data Processing Frameworks · Differential Dataflow Engines · Distributed State Management
dlt-hub/dlt
5,472
dlt is a Python data ingestion tool and ETL pipeline framework designed to fetch data from diverse sources and persist it into structured destinations. It functions as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources to lakehouses, warehouses, or vector databases. The project distinguishes itself through AI-powered pipeline generation, using large language models to scaffold extraction code and connectors for REST APIs. It also supports multimodal vector storage and specialized population of ve
dlt is a Python ETL pipeline framework that ingests data from diverse sources with automatic schema inference and transformation, making it a solid fit for building data pipelines, though it leans toward ingestion and schema management rather than covering scheduling, monitoring, or retry handling as primary features.
Python · Incremental Data Loading · Snowflake Connectors
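Schema inference and the flattening of nested JSON into relational columns can be sketched in a few lines. This is a simplified illustration and does not call dlt; the `flatten` and `infer_schema` helpers, the `__` separator, and the sample records are assumptions, and nested lists (which would need child tables) are deliberately left out.

```python
def flatten(record, parent="", sep="__"):
    """Flatten nested dicts into a single level of column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def infer_schema(rows):
    """Map each column to the Python type name of its first non-null value."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in schema:
                schema[column] = type(value).__name__
    return schema


raw = [
    {"id": 1, "user": {"name": "Ada", "address": {"city": "London"}}},
    {"id": 2, "user": {"name": "Lin", "address": {"city": None}}, "score": 9.5},
]
rows = [flatten(record) for record in raw]
schema = infer_schema(rows)
```

Columns that first appear in a later record, such as `score` here, are added to the schema when they are seen, which mirrors how a schema can evolve as new data arrives.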
apache/flink
26,086
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Apache Flink is a distributed stream and batch processing engine that provides SQL-based data transformation, stateful processing, and exactly-once guarantees, making it a strong fit for building data pipelines even though its built-in scheduling and orchestration capabilities are limited compared to dedicated workflow managers.
Java · Directed Acyclic Graph Engines
elastic/logstash
14,884
Logstash is a JVM-based event processor and extract, transform, load system designed for log data processing pipelines. It functions as a plugin-based data ingestor that collects, transforms, and delivers logs and event data from multiple sources to various destinations. The system utilizes a modular architecture of interchangeable input, filter, and output components to handle real-time data ingestion and enterprise log aggregation. Users can extend the pipeline's functionality by developing custom plugins to support unique data sources or specific transformation logic. The platform covers
Logstash is a genuine data pipeline / ETL framework specialized for log and event data, with a plugin-based architecture for data transformation and connectors, but it lacks built-in scheduling, SQL/Python transformation, and comprehensive error handling, making it a narrower fit for your general pipeline orchestration needs.
Java · Data Transformation
hatchet-dev/hatchet
6,622
Hatchet is an open-source durable workflow engine and task orchestration platform. It provides a framework for building and executing fault-tolerant, multi-step pipelines as directed acyclic graphs (DAGs), with automatic retries, scheduling, and real-time observability. The system is built around durable task checkpointing, which persists execution state after each step so work can resume from the last checkpoint after a worker crash or restart, and it supports event-driven task resumption that pauses a task until a matching external event arrives. The platform distinguishes itself through it
Hatchet is a durable workflow engine for building and orchestrating multi-step pipelines as DAGs, with automatic retries, scheduling, and real-time observability — it fits the need for a data pipeline framework, though you’ll need to implement data transformation and connectors yourself.
Go · DAG Workflow Executions · Retry Policies
datajuicer/data-juicer
6,574
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Data-Juicer is a distributed data pipeline framework that transforms, filters, and cleans multimodal datasets at scale using Ray, making it a direct fit for building and orchestrating ETL pipelines even though its focus is on AI/ML data preparation rather than general-purpose SQL-based transformation.
Python · Data Curation Pipelines · Declarative Data Recipes · Multimodal Data Processing
apache/storm
6,683
Storm is a distributed stream processing framework designed to execute unbounded computations across a cluster to process real-time data streams. It functions as a data pipeline orchestrator that allows users to define and deploy declarative data flow graphs connecting streaming sources to processing components. The system operates as a multi-tenant distributed compute engine that isolates workloads and limits resource usage across shared clusters using dedicated pools and access control. It is also a secure distributed processing engine that employs encrypted node communication and SSL-secur
Apache Storm is a distributed stream processing framework that enables building and orchestrating real-time data pipelines with declarative data flow graphs, making it directly applicable for streaming pipeline workloads, though it lacks native SQL/Python transformation and is focused on real-time rather than batch ETL.
Java · Real-Time Data Streaming · Streaming Data Processing · Cluster Resource Isolation
