1 repository
High-throughput processing of massive datasets using parallel extraction and distributed writing.
Distinct from Large-Scale Dataset Management: this category covers moving terabyte-scale data through parallelism, not image processing or spreadsheet streaming.
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Transfers terabyte-scale datasets using parallel extraction and distributed writes to maximize system throughput.
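The decoupled reader/writer design and parallel extraction described above can be illustrated with a small sketch. This is not DataX's API (DataX plugins are written in Java and jobs are configured in JSON); it is a hypothetical Python illustration in which `run_job`, `read_range`, `write_to_sink`, and the tuple-based `Record` type are all assumed names. Readers and writers share only a bounded queue of records in a common format, so either side could be swapped without touching the other.

```python
import queue
import threading
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical standardized internal representation: one row as a tuple.
Record = Tuple
_DONE = object()  # sentinel telling a writer thread to stop


def run_job(
    splits: Sequence,
    read_split: Callable[[object], Iterable[Record]],
    write_batch: Callable[[List[Record]], None],
    writers: int = 2,
    batch_size: int = 2,
) -> None:
    """Run one reader thread per split and `writers` writer threads."""
    channel: queue.Queue = queue.Queue(maxsize=100)  # bounded: applies backpressure

    def reader(split) -> None:
        # Extract records for one split and push them into the shared channel.
        for record in read_split(split):
            channel.put(record)

    def writer() -> None:
        # Drain the channel, flushing records to the destination in batches.
        batch: List[Record] = []
        while True:
            item = channel.get()
            if item is _DONE:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                write_batch(batch)
                batch = []
        if batch:
            write_batch(batch)

    reader_threads = [threading.Thread(target=reader, args=(s,)) for s in splits]
    writer_threads = [threading.Thread(target=writer) for _ in range(writers)]
    for t in reader_threads + writer_threads:
        t.start()
    for t in reader_threads:
        t.join()
    for _ in writer_threads:
        channel.put(_DONE)  # one sentinel per writer, sent after all reads finish
    for t in writer_threads:
        t.join()


# Example source: each split is a (start, stop) range of row ids.
def read_range(split) -> Iterable[Record]:
    start, stop = split
    return ((i, f"row-{i}") for i in range(start, stop))


# Example destination: an in-memory list guarded by a lock.
sink: List[Record] = []
sink_lock = threading.Lock()


def write_to_sink(batch: List[Record]) -> None:
    with sink_lock:
        sink.extend(batch)


run_job([(0, 5), (5, 10)], read_range, write_to_sink)
```

The order in which records arrive at the sink is not deterministic, since several threads read and write concurrently. A real transfer would replace `read_range` and `write_to_sink` with source-specific and destination-specific logic while keeping the channel in between unchanged.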