5 repositories
ETL systems that use a plugin architecture for readers and writers to extend connectivity to new data sources.
Distinct from ETL Workflows: this category focuses on the plugin-based extensibility of the ETL process, whereas entries under ETL Workflows focus on specific ETL types such as Reverse ETL or Vector ETL.
Explore 5 GitHub repositories matching data & databases · Plugin-Based ETL Frameworks.
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Uses a plugin-based connector architecture to decouple reader and writer logic, allowing extensions for new heterogeneous data sources.
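The decoupled reader/writer design described above can be illustrated with a minimal sketch. It is not DataX's actual Java plugin API: the `Record` class, the `READERS`/`WRITERS` registries, the `register` decorator, and the `run_job` function are all hypothetical names, and the job dictionary is only loosely modeled on a reader/writer configuration. The point is that readers and writers only share an internal record type, so either side can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical internal representation shared by every plugin."""
    columns: List[Any]


# Hypothetical plugin registries: plugin name -> plugin class.
READERS: Dict[str, type] = {}
WRITERS: Dict[str, type] = {}


def register(registry: Dict[str, type], name: str):
    """Class decorator that adds a plugin class to a registry under `name`."""
    def decorator(cls: type) -> type:
        registry[name] = cls
        return cls
    return decorator


@register(READERS, "memory")
class MemoryReader:
    """Reads rows from an in-memory list and emits Records."""
    def __init__(self, rows: List[tuple]):
        self.rows = rows

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            yield Record(list(row))


@register(WRITERS, "list")
class ListWriter:
    """Appends each Record to a target list as a tuple."""
    def __init__(self, target: list):
        self.target = target

    def write(self, records: Iterable[Record]) -> None:
        for record in records:
            self.target.append(tuple(record.columns))


def run_job(config: Dict[str, Any]) -> None:
    """Look up the configured plugins by name and stream records between them."""
    reader = READERS[config["reader"]["name"]](**config["reader"]["parameter"])
    writer = WRITERS[config["writer"]["name"]](**config["writer"]["parameter"])
    writer.write(reader.read())


sink: list = []
run_job({
    "reader": {"name": "memory", "parameter": {"rows": [(1, "a"), (2, "b")]}},
    "writer": {"name": "list", "parameter": {"target": sink}},
})
```

Adding a new source in this sketch means registering one more reader class; the writer side and `run_job` stay unchanged.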
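Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between different source and target databases. It acts as a metadata-driven orchestrator, using a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system is characterized by its distributed data processing engine, which executes workloads across a cluster of server nodes to increase throughput. It uses a plugin-based architecture that allows the platform to be extended through external JAR files, providing connectivity to a variety of databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure transformation. It provides tools for data quality validation, pipeline automation, and job lifecycle management, as well as monitoring utilities for tracking server health and real-time execution status.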
Provides an ETL system using a plugin architecture for readers and writers to extend connectivity to new data sources.
This project is a streaming data integration framework that captures real-time database changes and synchronizes them with downstream systems. It operates as a distributed streaming ETL and database synchronizer, reading database logs and snapshots to propagate row-level modifications to target sinks. The system supports declarative data integration, allowing users to define source-to-sink data flows using SQL or YAML configurations. It distinguishes itself by automating schema evolution to maintain synchronization when source structures change and ensuring exactly-once delivery and processin
Implements a distributed streaming ETL framework for filtering, transforming, and routing data in flight.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Connects distributed processing frameworks to the datastore to enable reading and writing data within complex streaming pipelines.
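dlt is a Python data ingestion tool and ETL pipeline framework designed to fetch data from different sources and persist it to structured destinations. It acts as a schema inference engine that automatically detects data types and flattens nested JSON structures into relational tables, moving data from sources into data lakes, data warehouses, or vector databases. The project stands out through AI-driven pipeline generation, using large language models to build extraction code and connectors for REST APIs. It also supports specialized population of multimodal vector stores and vector databases for AI and machine learning applications. The framework covers a broad range of functionality, including automated schema evolution, incremental data loading through state tracking, and data quality validation through enforced data contracts. It provides tools for relational data normalization and pre- and post-load transformations, as well as multiple destination adapters for SQL databases and cloud object storage. Observability is handled through pipeline execution dashboards, column lineage tracking, and schema version validation using content-based hashes.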
Provides a pluggable framework that automates schema evolution, incremental loading, and normalization for ETL workflows.
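The flattening of nested JSON into relational tables mentioned above can be sketched in plain Python. This does not use dlt's API and is not its implementation: the `normalize` function, the `_id`/`_parent_id` key columns, and the double-underscore naming for nested fields and child tables are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def normalize(table: str, rows: List[Row]) -> Dict[str, List[Row]]:
    """Flatten nested dicts into columns and move lists into child tables."""
    tables: Dict[str, List[Row]] = {}

    def flatten(obj: Row, prefix: str = "") -> Tuple[Row, Dict[str, list]]:
        # Nested dicts become prefixed columns; lists are set aside.
        flat: Row = {}
        lists: Dict[str, list] = {}
        for key, value in obj.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                sub_flat, sub_lists = flatten(value, prefix=f"{name}__")
                flat.update(sub_flat)
                lists.update(sub_lists)
            elif isinstance(value, list):
                lists[name] = value
            else:
                flat[name] = value
        return flat, lists

    def emit(table_name: str, obj: Row, parent_id: Optional[int] = None) -> None:
        flat, lists = flatten(obj)
        # Hypothetical surrogate key: the row's position within its table.
        row_id = len(tables.setdefault(table_name, []))
        row: Row = {"_id": row_id, **flat}
        if parent_id is not None:
            row["_parent_id"] = parent_id
        tables[table_name].append(row)
        # Each list becomes a child table linked back through _parent_id.
        for name, items in lists.items():
            for item in items:
                child = item if isinstance(item, dict) else {"value": item}
                emit(f"{table_name}__{name}", child, parent_id=row_id)

    for r in rows:
        emit(table, r)
    return tables


result = normalize(
    "users",
    [{"id": 1, "address": {"city": "Oslo"}, "tags": ["a", "b"]}],
)
```

Here `result["users"]` holds one row with an `address__city` column, and `result["users__tags"]` holds two child rows that point back to it through `_parent_id`.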