The visitor is looking for an open-source, self-hostable data integration platform to automate ETL/ELT pipelines between various data sources and destinations.

unstructured-io/unstructured is the closest match: it functions as a specialized ETL engine focused on document ingestion and transformation for AI workflows, providing the orchestration, connectivity, and self-hostable architecture required for complex data pipelines. Other strong matches: apache/airflow, illacloud/illa-builder, risingwavelabs/risingwave, debezium/debezium.

Why does unstructured-io/unstructured match “a self-hosted Fivetran alternative”?

This platform functions as a specialized ETL engine focused on document ingestion and transformation for AI workflows, providing the orchestration, connectivity, and self-hostable architecture required for complex data pipelines.

Why does apache/airflow match “a self-hosted Fivetran alternative”?

Airflow is a comprehensive, self-hostable orchestration platform that provides the necessary scheduling, monitoring, and provider-based integration framework to build and manage complex ETL/ELT pipelines.

Why does illacloud/illa-builder match “a self-hosted Fivetran alternative”?

This is a low-code internal tool builder designed for creating custom admin panels and business applications, rather than a dedicated ETL/ELT platform for automated data pipeline orchestration.

Why does risingwavelabs/risingwave match “a self-hosted Fivetran alternative”?

RisingWave is a streaming database that natively supports streaming ETL pipelines and incremental data processing, making it a powerful tool for real-time data integration despite its primary focus on stream processing rather than batch-oriented ELT.

Why does debezium/debezium match “a self-hosted Fivetran alternative”?

This is a specialized change data capture (CDC) platform designed to stream database logs into message brokers, serving as a foundational building block for ETL pipelines rather than a complete, self-contained data integration platform with built-in orchestration and destination management.

Self-Hosted Data Integration Platforms

Open-source data pipeline tools for syncing and replicating information between various databases and cloud services.


Unstructured-IO/unstructured
14,019
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
This platform functions as a specialized ETL engine focused on document ingestion and transformation for AI workflows, providing the orchestration, connectivity, and self-hostable architecture required for complex data pipelines.
HTML, Data Connectors, Data Destination Connectors, Data Source Connections

apache/airflow
45,902
Airflow is a platform for programmatically authoring, scheduling, and monitoring complex data pipelines. It functions as a workflow automation engine that manages the lifecycle of recurring business processes by executing code-defined task dependencies. By representing workflows as directed acyclic graphs, the system ensures that task execution order and data flow are explicitly defined and reliably maintained across distributed computing environments. The platform distinguishes itself through a highly modular, provider-based architecture that decouples core orchestration logic from external
Airflow is a comprehensive, self-hostable orchestration platform that provides the necessary scheduling, monitoring, and provider-based integration framework to build and manage complex ETL/ELT pipelines.
Python, Data Pipeline Orchestrators, Workflow Orchestration, Workflow Orchestration Engines

illacloud/illa-builder
12,268
Illa-builder is a low-code internal tool builder and API integration platform used to create business applications and admin panels. It functions as a database GUI dashboard and visual workflow automator, allowing users to connect to databases and external APIs to manage data and automate business processes. The platform provides a self-hosted app framework that can be deployed on private infrastructure via Docker. It enables the creation of custom dashboards and CRMs while maintaining full control over data and hosting. The system includes a visual drag-and-drop canvas for designing user in
This is a low-code internal tool builder designed for creating custom admin panels and business applications, rather than a dedicated ETL/ELT platform for automated data pipeline orchestration.
TypeScript, On-Premise Deployment, Data Source Connections

risingwavelabs/risingwave
9,093
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
RisingWave is a streaming database that natively supports streaming ETL pipelines and incremental data processing, making it a powerful tool for real-time data integration despite its primary focus on stream processing rather than batch-oriented ELT.
Rust, Change Data Capture, Data Sinking

debezium/debezium
12,421
Debezium is a distributed change data capture platform that streams row-level database modifications as real-time events. By parsing database transaction logs, the system broadcasts structural and data changes to message brokers, enabling reactive processing and data integration across distributed architectures. The platform utilizes log-based capture to extract modifications directly from transaction logs, ensuring minimal impact on source system performance while maintaining the original commit order of operations. It employs database-specific connector adapters to translate proprietary bin
This is a specialized change data capture (CDC) platform designed to stream database logs into message brokers, serving as a foundational building block for ETL pipelines rather than a complete, self-contained data integration platform with built-in orchestration and destination management.
Java, Change Data Capture

getredash/redash
28,653
Redash is a self-hosted analytics platform and SQL data visualization tool. It provides a web-based SQL query editor for writing, executing, and scheduling database queries, and functions as a business intelligence dashboard for monitoring metrics via visual widgets. The platform distinguishes itself through its data source connectors, which integrate with various SQL, NoSQL, and API-based stores to retrieve information for analysis. It enables self-service analytics by allowing users to run queries with dynamic parameters and supports shared data reporting via public links or embedded dashbo
Redash is a business intelligence and data visualization platform designed for querying and dashboarding, rather than an ETL/ELT tool for moving and transforming data between systems.
Python, Data Source Connections, Remote Data Source Connectors

collectiveidea/audited
3,491
Audited is a Ruby on Rails audit log library and change data capture framework. It tracks model changes by recording previous and current attribute values during create, update, and destroy operations to maintain a complete history of database modifications. The system functions as a database versioning tool and user activity tracker. It allows for the retrieval of historical record states by timestamp or index, enables reverting models to previous versions, and associates record modifications with specific user identities and remote IP addresses. The library includes capabilities for sensit
This is a database auditing and change-tracking library for Ruby on Rails applications, not a general-purpose ETL or data integration platform for moving data between external sources and destinations.
Ruby, Change Data Capture

strongloop/loopback
13,159
LoopBack is a Node.js API framework used to build RESTful services and backend applications. It functions as a model-driven API generator that automatically maps predefined data models to network endpoints to create standardized web interfaces. The project features a database abstraction layer that unifies access across diverse SQL databases, NoSQL stores, and remote data sources. It includes a backend application scaffolder using command-line generators to automate the creation of project structures and data connectors. Additionally, it provides an API authentication system to manage applica
LoopBack is a backend API framework designed for building RESTful services and managing data models, rather than an ETL/ELT platform for orchestrating data pipelines between sources and destinations.
JavaScript, Data Source Connections, Remote Data Source Connectors

alibaba/canal
29,697
Canal is a database replication middleware that performs change data capture by simulating a database replica. It monitors transaction logs to stream incremental data modifications to downstream systems in real time, acting as an event streaming infrastructure that transforms low-level binary logs into structured, consumable message streams. The project distinguishes itself through a high-throughput architecture that utilizes concurrent multi-threaded parsing and stateful log position tracking to ensure reliable data delivery. It employs a pluggable sink architecture that decouples data extra
Canal is a specialized change data capture tool that functions as a core component for real-time ETL pipelines by streaming database transaction logs to downstream systems, though it focuses more on log-based replication than full-featured orchestration or schema mapping.
Java, Change Data Capture Services, Change Data Capture Tools, Database Change Subscriptions

pathwaycom/llm-app
59,341
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This is a data processing framework capable of building real-time ETL pipelines and incremental stream processing, though it is primarily optimized for AI and RAG workflows rather than general-purpose data integration.
Jupyter Notebook, Data Processing Frameworks, Differential Dataflow Engines, Distributed State Management
