14 dépôts
Processes for writing continuous data streams into tabular formats with transactional guarantees.
Distinct from Table Data Processing: Focuses on the ingestion of live streams into tables, whereas Table Data Processing is general manipulation.
Explore 14 awesome GitHub repositories matching data & databases · Stream Ingestion. Refine with filters or upvote what's useful.
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Writes continuous record flows to tables with exactly-once processing guarantees during concurrent operations.
Moto is a cloud service mockery framework and API mock server that simulates AWS infrastructure locally. It allows developers to test cloud-dependent code and verify infrastructure-as-code templates without deploying real resources or incurring costs. The project functions as an SDK interceptor that can patch existing service clients to redirect requests to a local mock environment. It can also be run as a standalone HTTP server, enabling any programming language to interact with the simulated endpoints. The framework covers a vast array of simulated capabilities, including data storage, com
Simulates the ingestion of continuous data streams into delivery targets.
Feast is a machine learning feature store and MLOps data infrastructure layer. It provides a centralized system for managing and serving features across offline training and online production environments, utilizing an online feature serving layer for low-latency retrieval. The project centers on a feature registry that acts as a central catalog for defining, governing, and discovering feature services. It employs a unified data access layer to decouple feature retrieval from physical storage and includes a point-in-time data generator to create historically accurate training datasets that pr
Implements a real-time pipeline that processes event streams and updates the online feature store.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Consumes data from stream sources and loads it into online and offline stores for feature serving.
Materialize is a streaming SQL database that continuously ingests live data from sources such as Kafka, Redpanda, PostgreSQL, and MySQL, and incrementally maintains materialized views. It provides a PostgreSQL-compatible query engine that accepts standard SQL over the PostgreSQL wire protocol, enabling any existing SQL client or BI tool to query real-time data. The system also includes a Model Context Protocol (MCP) server that exposes live materialized view data to AI agents, providing fresh context without polling. Materialize distinguishes itself through its ability to offer configurable c
Reads live data directly from PostgreSQL/MySQL replication streams, Kafka, or webhooks to keep the database continuously updated.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Processes continuous data streams from real-time sources into queryable tabular formats.
Apache Hudi is an open-source table format that brings ACID transactions, incremental processing, and multi-modal indexing to data lakes. It provides atomic commits with snapshot isolation, rollback, and optimistic concurrency control for reliable data lake operations, while supporting upserts, record-level updates, and deletions in large analytical datasets. The project distinguishes itself through a timeline-based architecture that coordinates all write operations, enabling features like time-travel querying, incremental change streaming, and multi-modal query views that include snapshot, i
Ingests both streaming and batch data from Spark, Flink, and Kafka into a transactional data lake.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Reads metrics formatted in InfluxDB line protocol from Kafka topics and ingests them.
dlt est un outil d'ingestion de données Python et un framework de pipeline ETL conçu pour récupérer des données depuis diverses sources et les persister dans des destinations structurées. Il fonctionne comme un moteur d'inférence de schéma qui détecte automatiquement les types de données et aplatit les structures JSON imbriquées en tables relationnelles, déplaçant les données des sources vers des lakehouses, des entrepôts ou des bases de données vectorielles. Le projet se distingue par une génération de pipeline alimentée par l'IA, utilisant de grands modèles de langage pour échafauder le code d'extraction et les connecteurs pour les API REST. Il prend également en charge le stockage vectoriel multimodal et la population spécialisée de bases de données vectorielles pour prendre en charge les applications d'IA et de machine learning. Le framework couvre un large éventail de capacités, incluant l'évolution automatique du schéma, le chargement incrémentiel de données via le suivi d'état et la validation de la qualité des données par l'application de contrats de données. Il fournit des outils pour la normalisation des données relationnelles, les transformations pré- et post-chargement, et une variété d'adaptateurs de destination pour les bases de données SQL et les magasins d'objets cloud. L'observabilité est gérée via des tableaux de bord d'exécution de pipeline, le suivi de lignage des colonnes et la vérification de version de schéma utilisant des hachages basés sur le contenu.
Processes streams with multiple item types using discriminated unions to dispatch data to specific tables.
OpenPanel is a self-hosted product analytics platform designed for tracking user behavior and visualizing product metrics on private infrastructure. It provides a comprehensive system for collecting events across web, mobile, and server environments while ensuring complete ownership of data. The platform distinguishes itself through a privacy-first approach, utilizing cookieless event tracking and regional data residency to simplify regulatory compliance. It integrates large language models via the Model Context Protocol, enabling users to query behavioral data and analyze trends using natura
Ingests high-throughput event data through a Kafka-compatible broker for parallel processing.
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
Consumes and produces messages from Kafka topics with exactly-once semantics using SQL queries.
IntelOwl is a threat intelligence platform and security orchestration engine designed to aggregate, analyze, and enrich security observables. It functions as a security incident investigation tool and a threat intelligence aggregator, collecting data on files, domains, and IP addresses from diverse internal and external sources. The system differentiates itself through playbook-based workflow automation, allowing users to define reusable sequences of analysis tasks that trigger subsequent jobs based on prior outputs. It unifies disparate security data into a common schema and utilizes protoco
Imports streams of observables or files automatically for immediate processing and analysis.
ChatLab is a self-hosted chat database and data pipeline designed to normalize, store, and analyze large-scale social conversation histories. It functions as an analytics platform that uses large language models to extract patterns and insights from messaging data imported from multiple platforms. The system distinguishes itself through an AI-powered analysis engine that utilizes vector-based history analysis and agent-based function calling to summarize conversation trends. It further identifies behavioral patterns by generating visual analytics, including heatmaps, word clouds, and activity
Processes large chat imports by streaming data through multiple worker threads to maintain responsiveness at scale.
Uptrace is an OpenTelemetry-based observability platform designed to collect, store, and analyze distributed traces, metrics, and logs. It functions as a centralized logging backend, a distributed tracing system, and a metrics engine to monitor application performance and system health. The platform is distinguished by AI-powered operational capabilities, allowing users to query telemetry data and manage monitoring dashboards using natural language. It specifically includes specialized monitoring for generative AI pipelines, tracking token usage and response quality for LLM interactions and r
Uses a message queue to retain telemetry data during database downtime and optimize insert throughput.