10 Repos
Mechanisms for writing large-scale query results to external storage systems.
Distinguishing note: Specifically addresses the output phase of data pipelines into cloud storage.
Explore 10 awesome GitHub repositories matching data & databases · Data Sinking. Refine with filters or upvote what's useful.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Saves large-scale query results directly to cloud storage to support automated data pipelines.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Provides interfaces to define how records are written to non-standard external storage systems.
Rerun is a multimodal data visualizer and robotics data logger designed for rendering synchronized streams of 3D spatial data, images, and time-series metrics. It functions as a tool for capturing high-frequency sensor data and AI outputs into a queryable columnar format, providing a dedicated interface for viewing MCAP recording files and analyzing physical environments. The project distinguishes itself as a machine learning dataset streamer, capable of feeding logged recordings directly into GPU buffers and PyTorch training pipelines without intermediate exports. It supports a high-performa
Provides mechanisms for directing logged sensor data to various external destinations including gRPC servers and binary files.
SeaTunnel is a distributed data integration engine designed to synchronize structured and unstructured data across diverse sources and sinks. It functions as a multi-engine execution framework that can run data integration tasks across different distributed computing backends to optimize workload performance. The project is distinguished by a visual data pipeline designer for configuring workflows without manual code and a specialized change data capture tool for streaming incremental database updates. It also includes an enrichment pipeline that integrates large language models and embedding
Transfers processed data into target destinations such as databases, object storage, and message queues.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Continuously pushes processed data streams into downstream databases, data lakes, or message queues.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Routes processed data from pipelines to local or remote storage systems using built-in connectors.
Ibis is a portable Python dataframe library and multi-backend query engine that provides a unified interface for executing data transformations across diverse compute engines. It functions as a Python SQL expression compiler and dialect transpiler, allowing users to define data logic once and execute it across cloud warehouses, embedded databases, and distributed clusters without rewriting code. The project distinguishes itself through a database backend abstraction that decouples transformation logic from the underlying execution engine. It enables polyglot data workflows by mixing raw SQL s
Implements mechanisms for writing large-scale query results to external destination systems.
This project is a streaming data integration framework that captures real-time database changes and synchronizes them with downstream systems. It operates as a distributed streaming ETL and database synchronizer, reading database logs and snapshots to propagate row-level modifications to target sinks. The system supports declarative data integration, allowing users to define source-to-sink data flows using SQL or YAML configurations. It distinguishes itself by automating schema evolution to maintain synchronization when source structures change and ensuring exactly-once delivery and processin
Maps specific source tables to designated sink tables to organize data distribution across multiple target systems.
Zero-cost asynchronous programming in Rust
Sends a sequence of values to a destination using the Sink trait and its adapters.
ZIO is a functional effect system for the JVM that models asynchronous and concurrent programs as pure, composable values with typed error handling and dependency injection. Its core identity is built on fiber-based concurrency, where lightweight, non-blocking fibers execute millions of concurrent tasks with structured lifecycle management, and a dual-channel error model that separates expected business failures from unexpected system defects at compile time. The system provides effect-typed dependency injection through a layer-based dependency graph, pull-based reactive stream processing with
Provides sink type transformation as part of its stream processing capabilities.