Open-source data pipeline tools for syncing and replicating data between databases and cloud services.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Unstructured is a specialized ETL engine for document ingestion and transformation in AI workflows. It provides the orchestration, connectivity, and self-hostable architecture that complex data pipelines require.
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Airflow is a comprehensive, self-hostable orchestration platform that provides the necessary scheduling, monitoring, and provider-based integration framework to build and manage complex ETL/ELT pipelines.
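The idea of deriving execution order from a directed acyclic graph of task dependencies can be sketched with Python's standard library alone. This is a minimal illustration, not Airflow's API: the task names and the `execution_order` helper are hypothetical, and a real scheduler also handles retries, timing, and distributed workers.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order for the given dependency graph.

    Raises graphlib.CycleError (a ValueError subclass) if the graph
    contains a cycle, which is why workflows must be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because every task here has exactly one predecessor, only one ordering is valid; graphs with independent branches admit several valid orders.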
Illa-builder is a low-code internal tool builder and API integration platform used to create business applications and admin panels. It functions as a database GUI dashboard and visual workflow automator, allowing users to connect to databases and external APIs to manage data and automate business processes. The platform provides a self-hosted app framework that can be deployed on private infrastructure via Docker. It enables the creation of custom dashboards and CRMs while maintaining full control over data and hosting. The system includes a visual drag-and-drop canvas for designing user in
This is a low-code internal tool builder designed for creating custom admin panels and business applications, rather than a dedicated ETL/ELT platform for automated data pipeline orchestration.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
RisingWave is a streaming database that natively supports streaming ETL pipelines and incremental data processing, making it a powerful tool for real-time data integration despite its primary focus on stream processing rather than batch-oriented ELT.
Debezium is a distributed change data capture platform that streams row-level database modifications as real-time events. By parsing database transaction logs, the system broadcasts structural and data changes to message brokers, enabling reactive processing and data integration across distributed architectures. The platform utilizes log-based capture to extract modifications directly from transaction logs, ensuring minimal impact on source system performance while maintaining the original commit order of operations. It employs database-specific connector adapters to translate proprietary bin
This is a specialized change data capture (CDC) platform designed to stream database logs into message brokers, serving as a foundational building block for ETL pipelines rather than a complete, self-contained data integration platform with built-in orchestration and destination management.
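The principle behind log-based capture, replaying row-level changes in their original commit order, can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `ChangeEvent` fields, the operation names, and the `replay` helper are hypothetical and do not reflect Debezium's actual event format or connector API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    position: int            # hypothetical offset in the transaction log
    op: str                  # "create", "update", or "delete" (assumed names)
    key: int                 # primary key of the affected row
    before: Optional[dict]   # row state before the change, if any
    after: Optional[dict]    # row state after the change, if any


def replay(events, replica):
    """Apply change events to `replica` in log order and return it."""
    for ev in sorted(events, key=lambda e: e.position):
        if ev.op == "delete":
            replica.pop(ev.key, None)
        else:
            replica[ev.key] = ev.after
    return replica


# Events may arrive out of order; sorting by log position restores commit order.
events = [
    ChangeEvent(2, "update", 1, {"name": "a"}, {"name": "b"}),
    ChangeEvent(1, "create", 1, None, {"name": "a"}),
    ChangeEvent(3, "delete", 1, {"name": "b"}, None),
    ChangeEvent(4, "create", 2, None, {"name": "z"}),
]
replica = replay(events, {})
```

Applying the same events in arrival order instead would leave row 1 in the replica with stale data, which is why preserving commit order matters.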
Redash is a self-hosted analytics platform and SQL data visualization tool. It provides a web-based SQL query editor for writing, executing, and scheduling database queries, and functions as a business intelligence dashboard for monitoring metrics via visual widgets. The platform distinguishes itself through its data source connectors, which integrate with various SQL, NoSQL, and API-based stores to retrieve information for analysis. It enables self-service analytics by allowing users to run queries with dynamic parameters and supports shared data reporting via public links or embedded dashbo
Redash is a business intelligence and data visualization platform designed for querying and dashboarding, rather than an ETL/ELT tool for moving and transforming data between systems.
Audited is a Ruby on Rails audit log library and change data capture framework. It tracks model changes by recording previous and current attribute values during create, update, and destroy operations to maintain a complete history of database modifications. The system functions as a database versioning tool and user activity tracker. It allows for the retrieval of historical record states by timestamp or index, enables reverting models to previous versions, and associates record modifications with specific user identities and remote IP addresses. The library includes capabilities for sensit
This is a database auditing and change-tracking library for Ruby on Rails applications, not a general-purpose ETL or data integration platform for moving data between external sources and destinations.
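The mechanism of recording previous and current attribute values, then reconstructing an earlier state by index, can be sketched as follows. This is a language-neutral illustration written in Python, not the Ruby library's API: the `AuditedRecord` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AuditedRecord:
    attributes: dict
    audits: list = field(default_factory=list)

    def update(self, user, **changes):
        """Apply changes and log (old, new) pairs for attributes that differ."""
        diff = {
            name: (self.attributes.get(name), value)
            for name, value in changes.items()
            if self.attributes.get(name) != value
        }
        if diff:
            self.audits.append({"user": user, "changes": diff})
            self.attributes.update(changes)

    def revision(self, index):
        """Return the attributes as they were after the first `index` audits."""
        state = dict(self.attributes)
        # Undo later audits newest-first by restoring each old value.
        for audit in reversed(self.audits[index:]):
            for name, (old, _new) in audit["changes"].items():
                state[name] = old
        return state


record = AuditedRecord({"name": "widget", "price": 10})
record.update("alice", price=12)
record.update("bob", name="gadget")
```

Here `record.revision(0)` yields the original attributes and `record.revision(1)` the state after the first change. The sketch restores a previously absent attribute as `None` rather than removing it, a simplification a real implementation would need to address.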
LoopBack is a Node.js API framework used to build RESTful services and backend applications. It functions as a model-driven API generator that automatically maps predefined data models to network endpoints to create standardized web interfaces. The project features a database abstraction layer that unifies access across diverse SQL databases, NoSQL stores, and remote data sources. It includes a backend application scaffolder using command-line generators to automate the creation of project structures and data connectors. Additionally, it provides an API authentication system to manage applica
LoopBack is a backend API framework designed for building RESTful services and managing data models, rather than an ETL/ELT platform for orchestrating data pipelines between sources and destinations.
Canal is a database replication middleware that performs change data capture by simulating a database replica. It monitors transaction logs to stream incremental data modifications to downstream systems in real time, acting as an event streaming infrastructure that transforms low-level binary logs into structured, consumable message streams. The project distinguishes itself through a high-throughput architecture that utilizes concurrent multi-threaded parsing and stateful log position tracking to ensure reliable data delivery. It employs a pluggable sink architecture that decouples data extra
Canal is a specialized change data capture tool that functions as a core component for real-time ETL pipelines by streaming database transaction logs to downstream systems, though it focuses more on log-based replication than full-featured orchestration or schema mapping.
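Stateful log position tracking can be illustrated with a consumer that advances its position only after the downstream sink accepts a batch, so a failed delivery is retried rather than lost. This is a sketch under stated assumptions: `LogConsumer`, its `deliver` method, and the list standing in for a binary log are hypothetical and are not Canal's API.

```python
class LogConsumer:
    def __init__(self, log, position=0):
        self.log = log            # a list standing in for the parsed log
        self.position = position  # index of the next undelivered entry

    def deliver(self, sink, batch_size=2):
        """Send the next batch to `sink`; advance only if the sink accepts it."""
        batch = self.log[self.position:self.position + batch_size]
        if not batch:
            return 0
        sink(batch)  # may raise; the position then stays unchanged
        self.position += len(batch)
        return len(batch)


def failing_sink(batch):
    raise RuntimeError("downstream unavailable")


log = ["e1", "e2", "e3"]
received = []
consumer = LogConsumer(log)

try:
    consumer.deliver(failing_sink)
except RuntimeError:
    pass  # the position was not advanced, so the batch will be retried

first = consumer.deliver(received.extend)
second = consumer.deliver(received.extend)
```

This gives at-least-once delivery: a sink that fails after partially processing a batch sees those entries again on retry, so downstream writes should be idempotent.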
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This is a data processing framework that supports real-time ETL pipelines and incremental stream processing, though it is optimized primarily for AI and RAG workflows rather than general-purpose data integration.
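The core idea of propagating only changes rather than recomputing entire datasets can be shown with a grouped count maintained from a stream of deltas. This is a simplified sketch, not the project's API or a full differential dataflow implementation: the `apply_deltas` helper and the `(key, diff)` encoding, with +1 for an insertion and -1 for a retraction, are assumptions made for illustration.

```python
from collections import Counter


def apply_deltas(counts, deltas):
    """Update `counts` in place from (key, diff) pairs.

    Returns only the keys whose count changed, mapped to their new value,
    so downstream steps can process just the affected results.
    """
    changed = {}
    for key, diff in deltas:
        counts[key] += diff
        if counts[key] == 0:
            del counts[key]  # drop keys whose rows were all retracted
        changed[key] = counts.get(key, 0)
    return changed


counts = Counter()
first = apply_deltas(counts, [("error", 1), ("info", 1), ("error", 1)])
second = apply_deltas(counts, [("info", -1), ("warn", 1)])
```

After the second batch only `info` and `warn` are reported as changed; the count for `error` is neither recomputed nor re-emitted.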