# apache/seatunnel

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/apache-seatunnel).**

9,125 stars · 2,180 forks · Java · apache-2.0

## Links

- GitHub: https://github.com/apache/seatunnel
- Homepage: https://seatunnel.apache.org/
- awesome-repositories: https://awesome-repositories.com/repository/apache-seatunnel.md

## Topics

`apache` `batch` `cdc` `change-data-capture` `data-ingestion` `data-integration` `elt` `embeddings` `high-performance` `llm` `multimodal` `offline` `real-time` `streaming`

## Description

SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance.

The project is distinguished by a visual data pipeline designer for configuring workflows without hand-written code and a dedicated change data capture (CDC) tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding models to add semantic vectors to data records.

The engine provides broad capabilities for large-scale data integration, including SQL-based transformations, data quality validation, and multimodal synchronization. It achieves reliability through fault-tolerant checkpointing and distributed consistency snapshots, and it is extensible through a plugin architecture for custom connector development.
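
As a rough illustration of the checkpointing idea mentioned above, the following is a minimal Python sketch, not SeaTunnel's implementation. The function `run_job`, its parameters, and the JSON checkpoint file are all hypothetical names chosen for this example; a local file stands in for distributed checkpoint storage, and a plain list stands in for a sink.

```python
import json
import tempfile
from pathlib import Path


def run_job(records, sink, checkpoint_path, checkpoint_every=2, fail_at=None):
    """Copy records into sink, periodically saving the offset to resume from.

    Hypothetical toy model: `sink` is a list whose length mirrors the number
    of records written, and the checkpoint is a small JSON file.
    """
    path = Path(checkpoint_path)
    # Resume from the last saved offset if a checkpoint already exists.
    offset = json.loads(path.read_text())["offset"] if path.exists() else 0
    # Roll back anything written after the last checkpoint so that replayed
    # records are not duplicated in the sink.
    del sink[offset:]
    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated crash at record {i}")
        sink.append(records[i])
        if (i + 1) % checkpoint_every == 0:
            path.write_text(json.dumps({"offset": i + 1}))
    # Final checkpoint once every record has been written.
    path.write_text(json.dumps({"offset": len(records)}))
    return sink


if __name__ == "__main__":
    records = ["a", "b", "c", "d", "e"]
    sink = []
    checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
    try:
        run_job(records, sink, checkpoint, fail_at=3)
    except RuntimeError as err:
        print("job failed:", err)
    # The second run resumes from the saved offset instead of starting over.
    print(run_job(records, sink, checkpoint))
```

The design point is that recovery needs two pieces of state to agree: the saved offset and the sink contents. This sketch forces agreement by truncating the sink on restart; a real engine would coordinate this across distributed tasks.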

Operational oversight is supported by real-time synchronization progress monitoring, metric tracking, and a REST API for programmatic job submission.

## Tags

### Data & Databases

- [Distributed Data Engines](https://awesome-repositories.com/f/data-databases/distributed-data-engines.md) — Functions as a distributed data integration engine that orchestrates workflows across multiple compute clusters.
- [Change Data Capture Tools](https://awesome-repositories.com/f/data-databases/change-data-capture-tools.md) — Implements a specialized tool for streaming real-time incremental updates from database transaction logs. ([source](https://seatunnel.apache.org/docs/connectors/source))
- [Distributed Computing Engines](https://awesome-repositories.com/f/data-databases/data-engineering/distributed-compute-frameworks/distributed-computing-engines.md) — Functions as a framework that can execute data integration tasks across various distributed computing backends.
- [Data Format Converters](https://awesome-repositories.com/f/data-databases/data-format-converters.md) — Converts raw data into structured formats to ensure compatibility between source and destination systems. ([source](https://seatunnel.apache.org/docs/developer/coding-guide))
- [In-Transit Schema Transformations](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-modeling-schemas/data-schemas/in-transit-schema-transformations.md) — Modifies data formats and column structures during transit to align source and destination schemas. ([source](https://seatunnel.apache.org/docs/introduction/about))
- [Schema Transformation Pipelines](https://awesome-repositories.com/f/data-databases/data-governance-modeling/data-modeling-schemas/schema-mapping/schema-transformation-pipelines.md) — Processes data records through field renames and type conversions to align source and destination schemas.
- [Data Ingestion Sources](https://awesome-repositories.com/f/data-databases/data-ingestion-sources.md) — Ingests structured and unstructured data from databases, cloud storage, messaging queues, and applications. ([source](https://seatunnel.apache.org/docs/connectors/source))
- [Data Integration & Synchronization](https://awesome-repositories.com/f/data-databases/data-integration-synchronization.md) — Provides a high-performance engine for integrating and synchronizing both structured and unstructured data across diverse systems. ([source](https://cdn.jsdelivr.net/gh/apache/seatunnel@dev/README.md))
- [Execution Engine Translation](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/data-engineering-pipelines/execution-engine-translation.md) — Provides a mechanism to adapt data connectors so they can run across various distributed computing engines. ([source](https://seatunnel.apache.org/docs/developer/coding-guide))
- [ETL Workflows](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration/etl-workflows.md) — Provides a framework for extracting raw data, transforming schemas, and loading it into target sinks.
- [Multi-Engine Execution Backends](https://awesome-repositories.com/f/data-databases/data-processing-configurations/execution-engines/multi-engine-execution-backends.md) — Supports running data integration tasks across various processing backends to optimize performance. ([source](https://cdn.jsdelivr.net/gh/apache/seatunnel@dev/README.md))
- [Data Sinking](https://awesome-repositories.com/f/data-databases/data-sinking.md) — Transfers processed data into target destinations such as databases, object storage, and message queues. ([source](https://seatunnel.apache.org/docs/connectors/sink))
- [Cross-Engine Data Synchronization](https://awesome-repositories.com/f/data-databases/data-synchronization-engines/cross-engine-data-synchronization.md) — Enables data movement between sources and sinks utilizing various distributed execution engines. ([source](https://seatunnel.apache.org/docs/getting-started/locally/deployment))
- [CDC Synchronization](https://awesome-repositories.com/f/data-databases/database-migrations/cdc-synchronization.md) — Ships a real-time synchronization tool that captures database transaction logs to stream incremental updates.
- [Large Scale Data Integration Frameworks](https://awesome-repositories.com/f/data-databases/large-scale-data-integration-frameworks.md) — Moves massive volumes of structured and unstructured data between diverse databases, cloud storage, and messaging systems.
- [Unified Data Provider Interfaces](https://awesome-repositories.com/f/data-databases/unified-data-provider-interfaces.md) — Provides a unified connector interface that abstracts diverse data sources and sinks into a common set of operations.
- [Unified Data Connector Interfaces](https://awesome-repositories.com/f/data-databases/unified-storage-interfaces/unified-data-connector-interfaces.md) — Provides a unified connector interface to abstract diverse data sources and sinks for multimodal data movement. ([source](https://seatunnel.apache.org/docs/introduction/about))
- [Multiplexed](https://awesome-repositories.com/f/data-databases/change-data-capture/database-synchronization/multiplexed.md) — Optimizes data movement across multiple tables and databases using JDBC multiplexing and log parsing. ([source](https://cdn.jsdelivr.net/gh/apache/seatunnel@dev/README.md))
- [LLM-Based Vector Enrichment](https://awesome-repositories.com/f/data-databases/data-enrichment/llm-based-vector-enrichment.md) — Implements a specialized pipeline that integrates LLMs and embedding models to enrich data records with semantic vectors.
- [Nodal Workflow Designers](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/nodal-workflow-designers.md) — Offers a canvas-based graphical interface for constructing modular data processing pipelines using visual node connections.
- [Field Transformations](https://awesome-repositories.com/f/data-databases/field-transformations.md) — Supports renaming or replacing specific fields within a record to align source schemas with destination requirements. ([source](https://seatunnel.apache.org/docs/transforms))
- [Job State Persistence](https://awesome-repositories.com/f/data-databases/persistent-application-state/job-state-persistence.md) — Persists distributed map data and job metadata to external storage for automatic task restoration. ([source](https://seatunnel.apache.org/docs/engines/zeta/hybrid-cluster-deployment))
- [Slot-Based Resource Scheduling](https://awesome-repositories.com/f/data-databases/slot-based-resource-scheduling.md) — Controls parallel execution by assigning task groups to specific resource slots on cluster nodes.
- [Streaming SQL Transformations](https://awesome-repositories.com/f/data-databases/streaming-sql-transformations.md) — Allows executing SQL queries against data streams for complex filtering, aggregation, and restructuring. ([source](https://seatunnel.apache.org/docs/transforms))
- [Multi-Table Stream Processors](https://awesome-repositories.com/f/data-databases/table-data-processing/multi-table-stream-processors.md) — Enables transformation logic to be applied across multiple tables simultaneously using a single configuration. ([source](https://seatunnel.apache.org/docs/transforms))
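
Several of the tags above describe a unified connector interface combined with field renames and type conversions in transit. A minimal Python sketch of that pattern follows. It is an illustration only: `Source`, `Sink`, `ListSource`, `ListSink`, `rename_and_cast`, and `run_pipeline` are hypothetical names invented here and are not SeaTunnel's connector API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class Source(ABC):
    """Hypothetical common interface every source connector implements."""

    @abstractmethod
    def read(self) -> Iterator[Row]: ...


class Sink(ABC):
    """Hypothetical common interface every sink connector implements."""

    @abstractmethod
    def write(self, row: Row) -> None: ...


class ListSource(Source):
    """In-memory source used only for this example."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)

    def read(self) -> Iterator[Row]:
        yield from self.rows


class ListSink(Sink):
    """In-memory sink used only for this example."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, row: Row) -> None:
        self.rows.append(row)


def rename_and_cast(row: Row, renames: Dict[str, str],
                    casts: Dict[str, Callable]) -> Row:
    """Rename fields first, then convert types using the new field names."""
    out = {renames.get(name, name): value for name, value in row.items()}
    for field, caster in casts.items():
        if field in out:
            out[field] = caster(out[field])
    return out


def run_pipeline(source: Source, sink: Sink, renames: Dict[str, str],
                 casts: Dict[str, Callable]) -> int:
    """Move every row from source to sink through the transform; return the count."""
    count = 0
    for row in source.read():
        sink.write(rename_and_cast(row, renames, casts))
        count += 1
    return count


if __name__ == "__main__":
    source = ListSource([{"user_id": "7", "nm": "Ada"}])
    sink = ListSink()
    run_pipeline(source, sink, {"user_id": "id", "nm": "name"}, {"id": int})
    print(sink.rows)
```

Because `run_pipeline` depends only on the two abstract interfaces, swapping in a different source or sink would not require changing the transform logic, which is the property the unified-interface tags describe.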

### Software Engineering & Architecture

- [Backend-Agnostic Execution Layers](https://awesome-repositories.com/f/software-engineering-architecture/execution-engines/backend-agnostic-execution-layers.md) — Provides an engine-agnostic execution layer that translates pipeline configurations into plans compatible with multiple distributed backends.
- [Distributed Consistency Snapshots](https://awesome-repositories.com/f/software-engineering-architecture/architectural-design-patterns/state-management/persistence-and-serialization/state-serialization/state-snapshots/distributed-consistency-snapshots.md) — Implements distributed consistency snapshots to ensure exactly-once processing and fault recovery.
- [Fault Tolerance](https://awesome-repositories.com/f/software-engineering-architecture/fault-tolerance.md) — Implements fault-tolerant checkpointing by saving task state to distributed storage for reliable job recovery. ([source](https://seatunnel.apache.org/docs/engines/zeta/hybrid-cluster-deployment))
- [Database Transaction Log Parsers](https://awesome-repositories.com/f/software-engineering-architecture/custom-log-formatting/log-parsing/database-transaction-log-parsers.md) — Optimizes data ingestion by parsing database transaction logs for multiple tables in a single pass. ([source](https://seatunnel.apache.org/docs/introduction/about))
- [Transaction Log Multiplexing](https://awesome-repositories.com/f/software-engineering-architecture/custom-log-formatting/log-parsing/transaction-log-multiplexing.md) — Parses database transaction logs for multiple tables in a single pass to reduce I/O overhead during CDC.
- [SPI-Based Extension Mechanisms](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/extensibility/third-party-plugins/runtime-interface-implementations/spi-based-extension-mechanisms.md) — Utilizes a service provider interface (SPI) to load external connector plugins at runtime.
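
The single-pass log parsing described in the transaction log entries above can be sketched in Python as follows. This is a simplified assumption of how such routing might work, not SeaTunnel's code: the function `demultiplex` and the event shape (a dict with a `table` key) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Event = Dict[str, object]


def demultiplex(log: Iterable[Event], tables: Iterable[str]) -> Dict[str, List[Event]]:
    """Route change events to per-table buffers in one pass over the log.

    The alternative, scanning the log once per table, reads the same
    entries repeatedly; here each entry is inspected exactly once.
    """
    wanted = set(tables)
    per_table: Dict[str, List[Event]] = defaultdict(list)
    for event in log:
        if event["table"] in wanted:
            per_table[event["table"]].append(event)
    return dict(per_table)


if __name__ == "__main__":
    log = [
        {"table": "orders", "op": "insert", "id": 1},
        {"table": "users", "op": "update", "id": 9},
        {"table": "audit", "op": "insert", "id": 3},
        {"table": "orders", "op": "delete", "id": 1},
    ]
    for table, events in demultiplex(log, ["orders", "users"]).items():
        print(table, [e["op"] for e in events])
```

Events keep their original log order within each table, which matters for CDC because applying an update before its insert would corrupt the target.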

### Artificial Intelligence & ML

- [LLM Model Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/llm-model-integrations.md) — Integrates large language models and embedding models to enrich data records with semantic vectors. ([source](https://seatunnel.apache.org/docs/transforms))

### Part of an Awesome List

- [Data Quality and Validation](https://awesome-repositories.com/f/awesome-lists/data/data-quality-and-validation.md) — Includes processes for checking records against predefined rules to ensure data integrity during movement. ([source](https://seatunnel.apache.org/docs/transforms))
- [Data Integration](https://awesome-repositories.com/f/awesome-lists/data/data-integration.md) — High-performance distributed platform for batch and streaming data synchronization.

### Development Tools & Productivity

- [Connector Development Toolkits](https://awesome-repositories.com/f/development-tools-productivity/connector-development-toolkits.md) — Provides a framework for developing custom source and sink plugins to support new data formats. ([source](https://seatunnel.apache.org/docs/introduction/about))

### DevOps & Infrastructure

- [High Availability Clustering](https://awesome-repositories.com/f/devops-infrastructure/high-availability-clustering.md) — Ensures continuous service availability by distributing status data across multiple nodes using synchronous backups. ([source](https://seatunnel.apache.org/docs/engines/zeta/hybrid-cluster-deployment))
- [Job Execution Engines](https://awesome-repositories.com/f/devops-infrastructure/job-execution-engines.md) — Executes data integration tasks using different distributed processing engines to fit existing infrastructure. ([source](https://seatunnel.apache.org/docs/introduction/about))

### System Administration & Monitoring

- [Migration Progress Monitors](https://awesome-repositories.com/f/system-administration-monitoring/data-migration/migration-progress-monitors.md) — Provides a real-time tracking system to monitor the progress and performance of data synchronization tasks. ([source](https://cdn.jsdelivr.net/gh/apache/seatunnel@dev/README.md))
