# alibaba/datax

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/alibaba-datax).**

17,241 stars · 5,658 forks · Java · License: NOASSERTION

## Links

- GitHub: https://github.com/alibaba/DataX
- awesome-repositories: https://awesome-repositories.com/repository/alibaba-datax.md

## Description

DataX is a distributed data integration framework and plugin-based ETL tool for synchronizing large datasets between heterogeneous sources and destinations. It serves as a JDBC data migration engine and offline synchronization tool that moves data between relational databases, NoSQL stores, and object storage.

The system uses a plugin-based connector architecture that decouples reader logic from writer logic. Readers convert source-specific values into a standardized internal representation, and writers map that representation onto compatible target types, so readers and writers can be combined across different storage engines.
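The reader/writer decoupling can be pictured with a small sketch. This is not DataX's actual Java plugin API: `Record`, `ListReader`, `StringWriter`, and `run_job` are hypothetical names, and the set of internal value types is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence, Union

# Hypothetical set of internal value types; the real engine's type set may differ.
Value = Union[int, float, str, bytes, bool, None]


@dataclass
class Record:
    """Engine-neutral row handed from a reader to a writer."""
    columns: List[Value]


class ListReader:
    """Toy reader plugin: turns source rows into internal Records."""

    def __init__(self, rows: Sequence[Sequence[Value]]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            yield Record(columns=list(row))


class StringWriter:
    """Toy writer plugin: maps every internal value onto a string target type."""

    def __init__(self) -> None:
        self.out: List[List[str]] = []

    def write(self, records: Iterable[Record]) -> None:
        for record in records:
            self.out.append(["" if v is None else str(v) for v in record.columns])


def run_job(reader, writer) -> None:
    """The engine only sees Records, so the two plugins never reference each other."""
    writer.write(reader.read())


writer = StringWriter()
run_job(ListReader([(1, "a", None)]), writer)
print(writer.out)
```

Because each plugin only depends on `Record`, adding a new source or target in this sketch means writing one class rather than one converter per source/target pair.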

The framework's extraction features include support for columnar formats, incremental synchronization via SQL filtering, and archive decompression. On the write side, it offers batch commits, idempotent write strategies that keep data consistent across retries, and execution of SQL scripts before and after synchronization.
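As a rough illustration of incremental filtering combined with pre- and post-synchronization SQL, the sketch below builds a job description as a Python dictionary and prints it as JSON. The key names (`where`, `preSql`, `postSql`, `writeMode`, `speed.channel`) are written from memory of the linked mysqlreader and mysqlwriter documents and should be verified there; the hosts, tables, and credentials are placeholders, and `build_incremental_job` is a hypothetical helper.

```python
import json


def build_incremental_job(since: str) -> dict:
    """Build a job dict that copies only rows modified on or after `since`.

    Key names are assumptions to check against the reader/writer docs.
    `since` is interpolated directly into SQL, so it must be a trusted value.
    """
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": "reader_user",          # placeholder
            "password": "CHANGE_ME",            # placeholder
            "column": ["id", "name", "updated_at"],
            # Incremental sync: a WHERE filter on a timestamp column.
            "where": f"updated_at >= '{since}'",
            "connection": [
                {"table": ["orders"], "jdbcUrl": ["jdbc:mysql://src-host:3306/shop"]}
            ],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": "writer_user",          # placeholder
            "password": "CHANGE_ME",            # placeholder
            "writeMode": "replace",
            "column": ["id", "name", "updated_at"],
            # Runs before the load: clear the slice that is about to be rewritten.
            "preSql": [f"DELETE FROM orders_copy WHERE updated_at >= '{since}'"],
            # Runs after the load: refresh statistics on the target table.
            "postSql": ["ANALYZE TABLE orders_copy"],
            "connection": [
                {"table": ["orders_copy"], "jdbcUrl": "jdbc:mysql://dst-host:3306/warehouse"}
            ],
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


print(json.dumps(build_incremental_job("2024-01-01 00:00:00"), indent=2))
```

Deleting the same time slice in `preSql` that the reader's `where` clause selects is one way to make a rerun of the job safe to repeat.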

Performance is managed through task-level parallelism, throughput control to regulate memory and network traffic, and batch-based write buffering to increase ingestion speed.
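Batch-based write buffering can be sketched in a few lines. `BatchBuffer` is a hypothetical helper, not a DataX class, and the `flush` callback stands in for whatever the real writer does with a batch (for example, one JDBC batch insert inside a single transaction).

```python
from typing import Callable, List, Tuple

Row = Tuple


class BatchBuffer:
    """Collects rows and hands them to `flush` in groups of `batch_size`."""

    def __init__(self, flush: Callable[[List[Row]], None], batch_size: int = 1024) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._flush = flush
        self._batch_size = batch_size
        self._pending: List[Row] = []

    def add(self, row: Row) -> None:
        self._pending.append(row)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call once at the end for the final partial batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)


batches: List[List[Row]] = []
buffer = BatchBuffer(batches.append, batch_size=2)
for i in range(5):
    buffer.add((i,))
buffer.flush()
print(batches)
```

A larger `batch_size` means fewer round trips to the target but more rows held in memory, which is the trade-off that throughput controls have to balance.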

## Tags

### Data & Databases

- [Data Extraction](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-extraction.md) — Provides core tools for isolating and retrieving specific data points from relational database sources. ([source](https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md))
- [Large Scale Data Integration Frameworks](https://awesome-repositories.com/f/data-databases/large-scale-data-integration-frameworks.md) — Functions as a distributed framework for synchronizing massive volumes of data between heterogeneous sources and destinations.
- [Analytical Data Loads from Object Storage](https://awesome-repositories.com/f/data-databases/cloud-storage-definition-loading/analytical-data-loads-from-object-storage.md) — Loads data from cloud object storage into a transportable format for analytical processing. ([source](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md))
- [Relational Data Synchronization](https://awesome-repositories.com/f/data-databases/collaborative-data-synchronization-apis/relational-data-synchronization.md) — Synchronizes relational database records from sources like MySQL, PostgreSQL, and Oracle into target tables via JDBC.
- [Cross-Database Data Migrations](https://awesome-repositories.com/f/data-databases/cross-database-data-migrations.md) — Migrates large datasets between disparate database engines using a standardized internal representation for cross-platform compatibility.
- [Database-Specific Extractions](https://awesome-repositories.com/f/data-databases/data-engineering-infrastructure/data-extraction-ingestion/data-extraction/database-specific-extractions.md) — Enables record extraction from OceanBase databases via JDBC and SQL queries for migration. ([source](https://github.com/alibaba/DataX/blob/master/oceanbasev10reader/doc/oceanbasev10reader.md))
- [Data Insertion Interfaces](https://awesome-repositories.com/f/data-databases/data-insertion-interfaces.md) — Implements a programmatic interface for inserting records into target systems via standard JDBC statements. ([source](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md))
- [Data Integration Pipelines](https://awesome-repositories.com/f/data-databases/data-integration-pipelines.md) — Orchestrates the movement and routing of data between object storage, graph databases, and analytical warehouses via a plugin architecture.
- [Intermediate Representations](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/intermediate-representations.md) — Employs internal data models that normalize diverse input formats into a consistent structure for uniform processing across different storage engines.
- [Data Type Mappings](https://awesome-repositories.com/f/data-databases/data-type-mappings.md) — Translates source data types into compatible target database-specific column formats. ([source](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md))
- [Data Extraction](https://awesome-repositories.com/f/data-databases/database-connectivity/mysql-connectors/data-extraction.md) — Implements specialized extraction logic to read records from MySQL databases using JDBC and SQL queries. ([source](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md))
- [Data Extraction](https://awesome-repositories.com/f/data-databases/database-connectivity/oracle-connectors/data-extraction.md) — Retrieves records from remote Oracle databases using JDBC and generated or custom SQL statements. ([source](https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md))
- [Distributed Batch Processing](https://awesome-repositories.com/f/data-databases/distributed-batch-processing.md) — Transfers terabyte-scale datasets using parallel extraction and distributed writes to maximize system throughput.
- [Heterogeneous Data Pipelines](https://awesome-repositories.com/f/data-databases/heterogeneous-data-pipelines.md) — Implements pipelines that transform and map data types across different storage engines to ensure cross-platform compatibility.
- [Heterogeneous Data Synchronization](https://awesome-repositories.com/f/data-databases/heterogeneous-data-synchronization.md) — Enables data migration between diverse storage types such as relational databases and NoSQL stores using a standardized internal format. ([source](https://github.com/alibaba/DataX/blob/master/userGuid.md))
- [Incremental Data Synchronization](https://awesome-repositories.com/f/data-databases/incremental-data-synchronization.md) — Synchronizes only new or modified records by filtering data using WHERE clauses based on timestamps or IDs. ([source](https://github.com/alibaba/DataX/blob/master/sqlserverreader/doc/sqlserverreader.md))
- [JDBC Migration Engines](https://awesome-repositories.com/f/data-databases/jdbc-migration-engines.md) — Leverages JDBC drivers to extract and load records across a wide variety of relational database management systems.
- [Plugin-Based ETL Frameworks](https://awesome-repositories.com/f/data-databases/plugin-based-etl-frameworks.md) — Uses a plugin-based connector architecture to decouple reader and writer logic, allowing extensions for new heterogeneous data sources.
- [Pre and Post Load SQL Execution](https://awesome-repositories.com/f/data-databases/raw-sql-execution/pre-and-post-load-sql-execution.md) — Executes custom SQL statements immediately before or after synchronization tasks to prepare or finalize data. ([source](https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md))
- [Relational Database Writers](https://awesome-repositories.com/f/data-databases/relational-database-writers.md) — Inserts data into target relational database tables using JDBC connections and custom drivers. ([source](https://github.com/alibaba/DataX/blob/master/rdbmswriter/doc/rdbmswriter.md))
- [Data Extraction](https://awesome-repositories.com/f/data-databases/sql-server-persistence/sql-server-data-sources/data-extraction.md) — Reads records from remote SQL Server databases using JDBC connections and SQL SELECT statements. ([source](https://github.com/alibaba/DataX/blob/master/sqlserverreader/doc/sqlserverreader.md))
- [Data Extraction](https://awesome-repositories.com/f/data-databases/sqlite-drivers/sqlite-storage-adapters/sqlite-or-postgresql-storage/postgresql-data-sources/data-extraction.md) — Extracts data from remote PostgreSQL databases using JDBC connections and SQL select statements. ([source](https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md))
- [Batch Write Buffering](https://awesome-repositories.com/f/data-databases/batch-write-buffering.md) — Groups multiple record writes into a single transaction to increase data ingestion speed and reduce network overhead. ([source](https://github.com/alibaba/DataX/blob/master/gdbwriter/doc/gdbwriter.md))
- [Custom SQL Execution](https://awesome-repositories.com/f/data-databases/custom-sql-execution.md) — Allows the use of custom SQL queries for data extraction to enable complex operations like multi-table joins. ([source](https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md))
- [Export Throughput Limiters](https://awesome-repositories.com/f/data-databases/data-export/export-throughput-limiters.md) — Regulates memory usage and network traffic by capping batch sizes and thread counts during data import. ([source](https://github.com/alibaba/DataX/blob/master/oceanbasev10writer/doc/oceanbasev10writer.md))
- [Data Synchronization Tools](https://awesome-repositories.com/f/data-databases/data-management/data-migration-synchronization/data-synchronization-tools.md) — Provides a tool for bulk data migrations and incremental synchronizations between relational databases and NoSQL stores.
- [Data Transformation Functions](https://awesome-repositories.com/f/data-databases/data-transformation-functions.md) — Provides built-in functions and custom rules for masking, completing, and filtering data during the migration process. ([source](https://github.com/alibaba/DataX/blob/master/introduction.md))
- [Unstructured Text Converters](https://awesome-repositories.com/f/data-databases/data-type-mappings/unstructured-text-converters.md) — Translates raw string values from object storage into structured types such as decimals and integers. ([source](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md))
- [Dirty Data Captures](https://awesome-repositories.com/f/data-databases/dirty-data-captures.md) — Provides a dirty-data capture mechanism to intercept and isolate records failing type conversion, ensuring pipeline stability.
- [Graph Database Writers](https://awesome-repositories.com/f/data-databases/graph-database-writers.md) — Synchronizes data into Neo4j using Cypher queries to create nodes and relationships. ([source](https://github.com/alibaba/DataX/blob/master/neo4jwriter/doc/neo4jwriter.md))
- [Graph Database Exporters](https://awesome-repositories.com/f/data-databases/graph-databases/graph-database-exporters.md) — Exports data into graph databases by converting source records into vertices and edges. ([source](https://github.com/alibaba/DataX/blob/master/gdbwriter/doc/gdbwriter.md))
- [Idempotent Write Strategies](https://awesome-repositories.com/f/data-databases/idempotent-write-strategies.md) — Implements idempotent write strategies that clear target partitions before writing to maintain consistency during failed task retries.
- [Bulk Load Optimizations](https://awesome-repositories.com/f/data-databases/large-scale-dataset-management/bulk-load-optimizations.md) — Moves terabyte-scale data by leveraging temporary distributed storage before triggering optimized bulk load commands. ([source](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md))
- [Columnar Tabular Storage](https://awesome-repositories.com/f/data-databases/large-scale-dataset-management/columnar-tabular-storage.md) — Extracts data from optimized columnar storage files including Parquet and ORC for large-scale processing. ([source](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md))
- [NoSQL Data Writers](https://awesome-repositories.com/f/data-databases/nosql-data-writers.md) — Transfers structured data into OTS NoSQL databases with multi-version record support. ([source](https://github.com/alibaba/DataX/blob/master/otswriter/doc/otswriter.md))
- [Parallel Storage Writing](https://awesome-repositories.com/f/data-databases/parallel-storage-writing.md) — Distributes data across multiple threads to write different sub-files simultaneously, improving overall ingestion throughput. ([source](https://github.com/alibaba/DataX/blob/master/osswriter/doc/osswriter.md))
- [Relational Database Drivers](https://awesome-repositories.com/f/data-databases/relational-database-drivers.md) — Integrates various relational database types by registering JDBC drivers and adding necessary driver files. ([source](https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md))
- [Schema Column Mapping](https://awesome-repositories.com/f/data-databases/schema-column-mapping.md) — Selects specific columns for import and rearranges their order to align with the destination schema. ([source](https://github.com/alibaba/DataX/blob/master/odpswriter/doc/odpswriter.md))
- [SQL Data Retrieval](https://awesome-repositories.com/f/data-databases/sql-data-retrieval.md) — Implements techniques for filtering and extracting specific data from relational tables using SQL WHERE clauses. ([source](https://github.com/alibaba/DataX/blob/master/oceanbasev10reader/doc/oceanbasev10reader.md))
- [Structured Data File Extractors](https://awesome-repositories.com/f/data-databases/structured-data-extraction/structured-data-file-extractors.md) — Extracts text and field names from structured data files such as CSV and TXT using custom delimiters. ([source](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md))
- [Tabular Object Storage](https://awesome-repositories.com/f/data-databases/tabular-object-storage.md) — Transfers tabular data into object storage using structured formats like Parquet, ORC, and CSV. ([source](https://github.com/alibaba/DataX/blob/master/osswriter/doc/osswriter.md))
- [Throughput Controls](https://awesome-repositories.com/f/data-databases/throughput-controls.md) — Provides mechanisms to cap transfer rates using concurrency channels or byte limits to prevent target system overloading. ([source](https://github.com/alibaba/DataX/blob/master/introduction.md))
- [Column Projection](https://awesome-repositories.com/f/data-databases/wide-column-stores/column-oriented-disk-storage/column-projection.md) — Prunes data during export by selecting a specific subset of columns and defining their output order. ([source](https://github.com/alibaba/DataX/blob/master/oceanbasev10reader/doc/oceanbasev10reader.md))
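Two of the tags above, converting raw strings into typed values and capturing dirty data, fit together naturally. The following sketch assumes a small, hypothetical type vocabulary (`long`, `double`, `decimal`, `string`) and a `max_dirty` threshold; neither is taken from the DataX source.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type names mapped to Python converters.
CONVERTERS: Dict[str, Callable[[str], object]] = {
    "long": int,
    "double": float,
    "decimal": Decimal,
    "string": str,
}


def convert_rows(
    rows: Sequence[Sequence[str]],
    types: Sequence[str],
    max_dirty: int = 0,
) -> Tuple[List[list], List[Tuple[Sequence[str], str]]]:
    """Convert each row to typed values; set aside rows that fail conversion.

    Raises RuntimeError once more than `max_dirty` rows have failed.
    """
    clean: List[list] = []
    dirty: List[Tuple[Sequence[str], str]] = []
    for row in rows:
        try:
            clean.append([CONVERTERS[t](v) for t, v in zip(types, row)])
        except (ValueError, InvalidOperation) as exc:
            # Keep the offending row and the reason instead of aborting at once.
            dirty.append((row, repr(exc)))
            if len(dirty) > max_dirty:
                raise RuntimeError(f"dirty record limit exceeded: {len(dirty)} > {max_dirty}")
    return clean, dirty


clean, dirty = convert_rows([["1", "2.5"], ["x", "3"]], ["long", "decimal"], max_dirty=1)
print(clean)
print(dirty)
```

Isolating bad rows this way lets a long transfer finish while still failing fast when the error count suggests a schema mismatch.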

### Software Engineering & Architecture

- [Plugin-Based Architectures](https://awesome-repositories.com/f/software-engineering-architecture/software-architecture/architectural-patterns/plugin-module-systems/modular-plugin-architectures/plugin-based-architectures.md) — Built on a plugin-based architecture that decouples reader and writer logic to support heterogeneous system integration.
- [Hub-and-Spoke Data Flows](https://awesome-repositories.com/f/software-engineering-architecture/hub-and-spoke-data-flows.md) — Utilizes a hub-and-spoke data flow to route information from specialized readers through a central engine to specialized writers.
- [Data Component Plugins](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/extensibility/plugin-architectures/developer-authoring-interfaces/custom-module-implementations/module-functionality-extenders/plugin-extenders/data-component-plugins.md) — Integrates new data sources through the implementation of customizable reader and writer plugins. ([source](https://github.com/alibaba/datax#readme))
- [Parallel Data Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/task-scheduling/parallel-task-executors/parallel-task-spawning/parallel-data-pipelines.md) — Splits large datasets into concurrent tasks based on primary keys to increase synchronization speed. ([source](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md))

### Development Tools & Productivity

- [Data Partition Parallelism](https://awesome-repositories.com/f/development-tools-productivity/parallel-task-execution/data-partition-parallelism.md) — Supports task-level parallelism by splitting large datasets into independent chunks for concurrent extraction across threads.
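Splitting a table into independent chunks by primary key can be sketched as follows. This assumes an integer key with a known minimum and maximum; `split_pk_range` and `chunk_where_clauses` are hypothetical helpers, and the real splitting logic in DataX may differ.

```python
from typing import List, Tuple


def split_pk_range(min_pk: int, max_pk: int, parts: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_pk, max_pk] into up to `parts` contiguous chunks."""
    if parts <= 0 or max_pk < min_pk:
        raise ValueError("need parts > 0 and max_pk >= min_pk")
    total = max_pk - min_pk + 1
    parts = min(parts, total)          # never create empty chunks
    base, extra = divmod(total, parts)
    ranges: List[Tuple[int, int]] = []
    start = min_pk
    for i in range(parts):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def chunk_where_clauses(pk: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Turn each chunk into a WHERE fragment that one task can extract on its own."""
    return [f"{pk} >= {lo} AND {pk} <= {hi}" for lo, hi in ranges]


ranges = split_pk_range(1, 10, 3)
print(ranges)
print(chunk_where_clauses("id", ranges))
```

Even-sized key ranges only balance the work when keys are spread evenly; a skewed key distribution would leave some tasks with far more rows than others.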

### Part of an Awesome List

- [Object Selection Filters](https://awesome-repositories.com/f/awesome-lists/security/file-encryption/remote-file-managers/object-storage-file-operations/object-operation-filtering/object-selection-filters.md) — Supports isolating specific objects for extraction through name-based filtering and directory traversal. ([source](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md))

### Networking & Communication

- [Throughput Controllers](https://awesome-repositories.com/f/networking-communication/connection-managers/throughput-controllers.md) — Regulates memory usage and network traffic by capping concurrency levels and batch sizes during the transfer process.
- [Database Batch Writes](https://awesome-repositories.com/f/networking-communication/socket-stream-writing/general-write-buffering/database-batch-writes.md) — Implements batch-based write buffering to group records into single transactions, reducing network overhead and increasing ingestion speed.
- [Transfer Retry Mechanisms](https://awesome-repositories.com/f/networking-communication/transfer-retry-mechanisms.md) — Automatically retries failed operations at the thread or task level to ensure completion during network instability. ([source](https://github.com/alibaba/DataX/blob/master/introduction.md))
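A retry wrapper around a single task might look like the sketch below. The source only says that failed operations are retried at the thread or task level; the exponential backoff, the `attempts` and `base_delay` parameters, and the `retry` helper itself are assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `op` until it succeeds or `attempts` tries have failed.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries
    and re-raises the last exception when all attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky_write() -> str:
    """Simulated task that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


print(retry(flaky_write, attempts=3, base_delay=0.01))
```

Retrying a write is only safe when the write itself can be repeated without duplicating data, which is where the idempotent write strategies listed earlier come in.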