# dtstack/chunjun

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/dtstack-chunjun).**

4,104 stars · 1,690 forks · Java · Apache-2.0

## Links

- GitHub: https://github.com/DTStack/chunjun
- Homepage: https://dtstack.github.io/chunjun/
- awesome-repositories: https://awesome-repositories.com/repository/dtstack-chunjun.md

## Topics

`bigdata` `data-integration` `flink` `framework` `java`

## Description

ChunJun is a distributed data integration framework, built on Apache Flink, that synchronizes data between heterogeneous sources and supports SQL-based ETL pipelines. It also works as a change data capture (CDC) tool, moving and transforming data across different database types.

The system is distinguished by its plugin-based connector architecture: developers can write custom source and sink plugins to connect data systems that are not yet supported. It captures changes in real time from relational database logs and automatically propagates schema changes from source tables to destination tables.
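To make the plugin idea concrete, here is a minimal sketch of a source/sink contract. The names `SourcePlugin`, `SinkPlugin`, `FixedSource`, `ListSink`, and `PluginDemo` are hypothetical and chosen for illustration; they are not ChunJun's actual interfaces, which should be taken from the project's own documentation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contracts for illustration only; NOT ChunJun's actual plugin API.
interface SourcePlugin {
    List<Map<String, Object>> read();
}

interface SinkPlugin {
    void write(Map<String, Object> row);
}

// Source that replays a fixed set of rows, standing in for a real reader.
final class FixedSource implements SourcePlugin {
    private final List<Map<String, Object>> rows;

    FixedSource(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    @Override
    public List<Map<String, Object>> read() {
        return rows;
    }
}

// Sink that collects rows in memory, standing in for a real writer.
final class ListSink implements SinkPlugin {
    final List<Map<String, Object>> rows = new ArrayList<>();

    @Override
    public void write(Map<String, Object> row) {
        rows.add(row);
    }
}

final class PluginDemo {
    // The copy loop depends only on the two interfaces, never on a concrete system.
    static int copy(SourcePlugin source, SinkPlugin sink) {
        int copied = 0;
        for (Map<String, Object> row : source.read()) {
            sink.write(row);
            copied++;
        }
        return copied;
    }

    // Builds two sample rows and copies them through the interfaces.
    static int runDemo() {
        List<Map<String, Object>> rows = new ArrayList<>();
        Map<String, Object> first = new HashMap<>();
        first.put("id", 1);
        Map<String, Object> second = new HashMap<>();
        second.put("id", 2);
        rows.add(first);
        rows.add(second);
        return copy(new FixedSource(rows), new ListSink());
    }

    public static void main(String[] args) {
        System.out.println("rows copied: " + runDemo());
    }
}
```

Because the copy loop only sees the two interfaces, supporting a new system in this sketch means adding one class per side rather than changing the loop.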

The framework supports incremental data synchronization and cross-source data calculation in SQL. For reliability, checkpoint-based task recovery resumes interrupted transfers, and a dead-letter queue collects dirty (malformed) records so they can be audited later.
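The two reliability mechanisms can be sketched together. The example below is a simplified illustration, not ChunJun's implementation: the class `ResumableSync`, its fields, and the `crashAtIndex` parameter are all hypothetical, the checkpoint is held in memory rather than persisted, and output is committed only at checkpoint boundaries so that a restart never duplicates rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from ChunJun.
final class ResumableSync {
    int checkpointOffset = 0;                              // a real system would persist this
    final List<Integer> sink = new ArrayList<>();          // committed, well-formed records
    final List<String> dirtyRecords = new ArrayList<>();   // stand-in dead-letter store

    // Processes records starting at the last checkpoint.
    // crashAtIndex simulates a failure at that position; pass -1 to disable it.
    void run(List<String> source, int checkpointEvery, int crashAtIndex) {
        List<Integer> pending = new ArrayList<>();
        List<String> pendingDirty = new ArrayList<>();
        for (int i = checkpointOffset; i < source.size(); i++) {
            if (i == crashAtIndex) {
                // Uncommitted work in the pending buffers is lost, as in a real crash.
                throw new IllegalStateException("simulated crash at index " + i);
            }
            String raw = source.get(i);
            try {
                pending.add(Integer.parseInt(raw.trim()));
            } catch (NumberFormatException e) {
                pendingDirty.add(raw); // malformed record goes aside instead of failing the job
            }
            boolean isLast = i == source.size() - 1;
            if ((i + 1) % checkpointEvery == 0 || isLast) {
                // Commit buffered output and advance the checkpoint in one step.
                sink.addAll(pending);
                dirtyRecords.addAll(pendingDirty);
                pending.clear();
                pendingDirty.clear();
                checkpointOffset = i + 1;
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("1", "2", "oops", "4", "5");
        ResumableSync sync = new ResumableSync();
        try {
            sync.run(source, 2, 3);   // first attempt fails part-way through
        } catch (IllegalStateException e) {
            System.out.println("restarting from offset " + sync.checkpointOffset);
        }
        sync.run(source, 2, -1);      // second attempt resumes from the checkpoint
        System.out.println("sink=" + sync.sink + " dirty=" + sync.dirtyRecords);
    }
}
```

In this sketch the malformed value is re-read after the restart because it arrived after the last checkpoint, yet it appears only once in the dead-letter list since uncommitted buffers are discarded on failure.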

Integration tasks can be deployed across standalone clusters, Yarn, or Kubernetes environments, with support for containerized deployment via Docker.

## Tags

### Data & Databases

- [Distributed Data Processing Frameworks](https://awesome-repositories.com/f/data-databases/distributed-data-processing-frameworks.md) — Provides a distributed framework for synchronizing and transforming data between heterogeneous sources using a plugin-based architecture.
- [Heterogeneous Data Synchronization](https://awesome-repositories.com/f/data-databases/heterogeneous-data-synchronization.md) — Transfers and aligns data between different heterogeneous data sources using a distributed integration framework. ([source](https://dtstack.github.io/chunjun/documents/zh/%E5%BF%AB%E9%80%9F%E5%BC%80%E5%A7%8B))
- [Change Data Capture](https://awesome-repositories.com/f/data-databases/change-data-capture.md) — Streams real-time updates from relational database logs to enable low-latency synchronization between heterogeneous systems.
- [Change Data Capture Tools](https://awesome-repositories.com/f/data-databases/change-data-capture-tools.md) — Collects data from relational databases in real-time via logs to facilitate low-latency synchronization. ([source](https://dtstack.github.io/chunjun/))
- [Checkpoints and Recovery](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-persistence-storage/checkpoints-and-recovery.md) — Resumes interrupted data transfers from the last successful checkpoint to ensure disaster recovery and data consistency. ([source](https://cdn.jsdelivr.net/gh/dtstack/chunjun@master/README.md))
- [Distributed Cluster Execution](https://awesome-repositories.com/f/data-databases/distributed-cluster-execution.md) — Spreads data integration workloads across multiple nodes using Yarn or Kubernetes for parallel processing.
- [Incremental Data Synchronization](https://awesome-repositories.com/f/data-databases/incremental-data-synchronization.md) — Transfers only new or changed data records over time instead of performing full dataset copies. ([source](https://cdn.jsdelivr.net/gh/dtstack/chunjun@master/README.md))
- [SQL-Based Pipeline Definitions](https://awesome-repositories.com/f/data-databases/streaming-sql-transformations/sql-based-pipeline-definitions.md) — Allows defining data movement and transformation workflows using SQL declarations and JSON templates.
- [Connector Plugin Development](https://awesome-repositories.com/f/data-databases/connector-plugin-development.md) — The product allows developers to create new source or sink connectors to synchronize data between heterogeneous systems by implementing read and write logic. ([source](https://dtstack.github.io/chunjun/faq))
- [Cross-Source Data Integration](https://awesome-repositories.com/f/data-databases/cross-source-data-integration.md) — Joins and calculates data between diverse sources using a plugin-based architecture to ensure cross-database compatibility. ([source](https://cdn.jsdelivr.net/gh/dtstack/chunjun@master/README.md))
- [Data Quality Monitors](https://awesome-repositories.com/f/data-databases/data-pipelines/data-quality-monitors.md) — Captures failing records and provides metrics to monitor overall data quality during the synchronization process. ([source](https://cdn.jsdelivr.net/gh/dtstack/chunjun@master/README.md))
- [Dirty Data Capture](https://awesome-repositories.com/f/data-databases/data-pipelines/data-quality-monitors/dirty-data-capture.md) — Isolates and stores malformed records that fail processing to prevent pipeline crashes and enable correction. ([source](https://dtstack.github.io/chunjun/))
- [Incremental Sync Checkpointings](https://awesome-repositories.com/f/data-databases/data-synchronization-configurations/sync-endpoint-configurations/unidirectional-sync-configurations/resumable-sync-checkpoints/incremental-sync-checkpointings.md) — Monitors data sources and utilizes checkpoint-based resume to ensure consistency during incremental transfers. ([source](https://dtstack.github.io/chunjun/))
- [Distributed SQL Computations](https://awesome-repositories.com/f/data-databases/distributed-sql-computations.md) — Performs data computation and transformation tasks using SQL logic within a distributed processing environment. ([source](https://dtstack.github.io/chunjun/documents/zh/%E5%BF%AB%E9%80%9F%E5%BC%80%E5%A7%8B))
- [Schema Synchronizers](https://awesome-repositories.com/f/data-databases/schema-synchronizers.md) — Aligns structural definitions of source and destination tables to maintain data integrity across heterogeneous systems. ([source](https://dtstack.github.io/chunjun/))
- [Automated Schema Propagation](https://awesome-repositories.com/f/data-databases/schema-synchronizers/schema-propagation-protocols/automated-schema-propagation.md) — Automatically propagates structural changes from source databases to destination tables.
- [SQL-Based CDC Integrations](https://awesome-repositories.com/f/data-databases/sql-based-cdc-integrations.md) — Enables the definition of data integration and CDC workflows using SQL scripts compatible with streaming syntax. ([source](https://cdn.jsdelivr.net/gh/dtstack/chunjun@master/README.md))
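
As a rough illustration of the incremental synchronization entry above, the sketch below reads only rows whose key exceeds a remembered high-water mark. It assumes a monotonically increasing `id` column, and the `Row` and `IncrementalReader` classes are hypothetical; ChunJun's own incremental configuration is described in its documentation and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row type with a monotonically increasing id (an assumption of this sketch).
final class Row {
    final long id;
    final String value;

    Row(long id, String value) {
        this.id = id;
        this.value = value;
    }
}

// Hypothetical reader that remembers the highest id it has already transferred.
final class IncrementalReader {
    private long lastSeenId;

    IncrementalReader(long startAfterId) {
        this.lastSeenId = startAfterId;
    }

    // Returns only rows above the high-water mark, then advances the mark.
    List<Row> poll(List<Row> table) {
        List<Row> fresh = new ArrayList<>();
        long max = lastSeenId;
        for (Row row : table) {
            if (row.id > lastSeenId) {
                fresh.add(row);
                max = Math.max(max, row.id);
            }
        }
        lastSeenId = max;
        return fresh;
    }

    long lastSeenId() {
        return lastSeenId;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        table.add(new Row(1, "a"));
        table.add(new Row(2, "b"));
        IncrementalReader reader = new IncrementalReader(0);
        System.out.println("first poll: " + reader.poll(table).size() + " rows");
        table.add(new Row(3, "c"));
        System.out.println("second poll: " + reader.poll(table).size() + " rows");
    }
}
```

A mark of this kind only detects inserted rows; picking up updates or deletes needs a modification timestamp or log-based change data capture.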

### DevOps & Infrastructure

- [Data Pipeline Deployments](https://awesome-repositories.com/f/devops-infrastructure/data-pipeline-deployments.md) — Enables the deployment of large-scale data movement tasks across Kubernetes, Yarn, or standalone clusters.
- [Error Tracking and Exception Handling](https://awesome-repositories.com/f/devops-infrastructure/devops/operational-reliability/error-tracking-and-exception-handling.md) — Provides a dead-letter queue to capture and track malformed records that fail during synchronization for later auditing.
- [Data Processing Orchestrators](https://awesome-repositories.com/f/devops-infrastructure/kubernetes-cluster-deployments/data-processing-orchestrators.md) — Orchestrates data processing pipelines as scalable jobs within Kubernetes, Yarn, or standalone environments.
- [Data Source Extensions](https://awesome-repositories.com/f/devops-infrastructure/release-automation/plugin-extensibility/data-source-extensions.md) — Provides mechanisms to extend connectivity to unsupported data systems via custom reader, writer, and lookup plugins. ([source](https://dtstack.github.io/chunjun/))

### Software Engineering & Architecture

- [Plugin-Based Architectures](https://awesome-repositories.com/f/software-engineering-architecture/software-architecture/architectural-patterns/plugin-module-systems/modular-plugin-architectures/plugin-based-architectures/plugin-based-architectures.md) — Provides a plugin-based connector architecture with standardized read and write interfaces for heterogeneous data sources and sinks.
- [Checkpoint-Based Resumptions](https://awesome-repositories.com/f/software-engineering-architecture/checkpoint-based-resumptions.md) — Implements mechanisms to save data offsets, allowing interrupted synchronization tasks to resume from the last successful checkpoint.
- [Declarative Configuration Systems](https://awesome-repositories.com/f/software-engineering-architecture/declarative-configuration-systems.md) — Allows defining data movement workflows and processing pipelines using declarative JSON or SQL scripts.
- [Dead Letter Queues](https://awesome-repositories.com/f/software-engineering-architecture/queue-implementations/dead-letter-queues.md) — Utilizes dead-letter queues to isolate and store malformed records for auditing and manual correction.

### Development Tools & Productivity

- [Data Integration Task Definitions](https://awesome-repositories.com/f/development-tools-productivity/task-configuration-decorators/data-integration-task-definitions.md) — Allows defining data movement jobs and source-to-destination mappings using JSON or SQL declarations. ([source](https://dtstack.github.io/chunjun/))