Why is delta-io/delta a recommended Stream Ingestion GitHub Repositories repository?

Writes continuous record flows to tables with exactly-once processing guarantees during concurrent operations.

Why is getmoto/moto a recommended Stream Ingestion GitHub Repositories repository?

Simulates the ingestion of continuous data streams into delivery targets.

Why is gojek/feast a recommended Stream Ingestion GitHub Repositories repository?

Implements a real-time pipeline that processes event streams and updates the online feature store.

Why is feast-dev/feast a recommended Stream Ingestion GitHub Repositories repository?

Consumes data from stream sources and loads it into online and offline stores for feature serving.

Why is materializeinc/materialize a recommended Stream Ingestion GitHub Repositories repository?

Reads live data directly from PostgreSQL/MySQL replication streams, Kafka, or webhooks to keep the database continuously updated.

Why is apache/pinot a recommended Stream Ingestion GitHub Repositories repository?

Processes continuous data streams from real-time sources into queryable tabular formats.

Why is apache/hudi a recommended Stream Ingestion GitHub Repositories repository?

Ingests both streaming and batch data from Spark, Flink, and Kafka into a transactional data lake.

Why is greptimeteam/greptimedb a recommended Stream Ingestion GitHub Repositories repository?

Reads metrics formatted in InfluxDB line protocol from Kafka topics and ingests them.

Why is dlt-hub/dlt a recommended Stream Ingestion GitHub Repositories repository?

Processes streams with multiple item types using discriminated unions to dispatch data to specific tables.

Why is openpanel-dev/openpanel a recommended Stream Ingestion GitHub Repositories repository?

Ingests high-throughput event data through a Kafka-compatible broker for parallel processing.

14 dépôts

Awesome GitHub RepositoriesStream Ingestion

Processes for writing continuous data streams into tabular formats with transactional guarantees.

Distinct from Table Data Processing: Focuses on the ingestion of live streams into tables, whereas Table Data Processing is general manipulation.

Explore 14 awesome GitHub repositories matching data & databases · Stream Ingestion. Refine with filters or upvote what's useful.

Trouvez les meilleurs dépôts grâce à l'IA.Nous recherchons les dépôts les plus pertinents grâce à l'IA.

delta-io/delta
delta-io/delta
8,596Voir sur GitHub
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Writes continuous record flows to tables with exactly-once processing guarantees during concurrent operations.
Scalaacidanalyticsbig-data
Voir sur GitHub8,596
getmoto/moto
getmoto/moto
8,550Voir sur GitHub
Moto is a cloud service mockery framework and API mock server that simulates AWS infrastructure locally. It allows developers to test cloud-dependent code and verify infrastructure-as-code templates without deploying real resources or incurring costs. The project functions as an SDK interceptor that can patch existing service clients to redirect requests to a local mock environment. It can also be run as a standalone HTTP server, enabling any programming language to interact with the simulated endpoints. The framework covers a vast array of simulated capabilities, including data storage, com
Simulates the ingestion of continuous data streams into delivery targets.
Pythonawsbotoec2
Voir sur GitHub8,550
gojek/feast
gojek/feast
7,095Voir sur GitHub
Feast is a machine learning feature store and MLOps data infrastructure layer. It provides a centralized system for managing and serving features across offline training and online production environments, utilizing an online feature serving layer for low-latency retrieval. The project centers on a feature registry that acts as a central catalog for defining, governing, and discovering feature services. It employs a unified data access layer to decouple feature retrieval from physical storage and includes a point-in-time data generator to create historically accurate training datasets that pr
Implements a real-time pipeline that processes event streams and updates the online feature store.
Python
Voir sur GitHub7,095
feast-dev/feast
feast-dev/feast
6,727Voir sur GitHub
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Consumes data from stream sources and loads it into online and offline stores for feature serving.
Pythonbig-datadata-engineeringdata-quality
Voir sur GitHub6,727
materializeinc/materialize
MaterializeInc/materialize
6,314Voir sur GitHub
Materialize is a streaming SQL database that continuously ingests live data from sources such as Kafka, Redpanda, PostgreSQL, and MySQL, and incrementally maintains materialized views. It provides a PostgreSQL-compatible query engine that accepts standard SQL over the PostgreSQL wire protocol, enabling any existing SQL client or BI tool to query real-time data. The system also includes a Model Context Protocol (MCP) server that exposes live materialized view data to AI agents, providing fresh context without polling. Materialize distinguishes itself through its ability to offer configurable c
Reads live data directly from PostgreSQL/MySQL replication streams, Kafka, or webhooks to keep the database continuously updated.
Rust
Voir sur GitHub6,314
apache/pinot
apache/pinot
6,098Voir sur GitHub
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Processes continuous data streams from real-time sources into queryable tabular formats.
Java
Voir sur GitHub6,098
apache/hudi
apache/hudi
6,097Voir sur GitHub
Apache Hudi is an open-source table format that brings ACID transactions, incremental processing, and multi-modal indexing to data lakes. It provides atomic commits with snapshot isolation, rollback, and optimistic concurrency control for reliable data lake operations, while supporting upserts, record-level updates, and deletions in large analytical datasets. The project distinguishes itself through a timeline-based architecture that coordinates all write operations, enabling features like time-travel querying, incremental change streaming, and multi-modal query views that include snapshot, i
Ingests both streaming and batch data from Spark, Flink, and Kafka into a transactional data lake.
Javaapacheflinkapachehudiapachespark
Voir sur GitHub6,097
greptimeteam/greptimedb
GreptimeTeam/greptimedb
5,968Voir sur GitHub
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Reads metrics formatted in InfluxDB line protocol from Kafka topics and ingests them.
Rustanalyticscloud-nativedatabase
Voir sur GitHub5,968
dlt-hub/dlt
dlt-hub/dlt
5,472Voir sur GitHub
dlt est un outil d'ingestion de données Python et un framework de pipeline ETL conçu pour récupérer des données depuis diverses sources et les persister dans des destinations structurées. Il fonctionne comme un moteur d'inférence de schéma qui détecte automatiquement les types de données et aplatit les structures JSON imbriquées en tables relationnelles, déplaçant les données des sources vers des lakehouses, des entrepôts ou des bases de données vectorielles. Le projet se distingue par une génération de pipeline alimentée par l'IA, utilisant de grands modèles de langage pour échafauder le code d'extraction et les connecteurs pour les API REST. Il prend également en charge le stockage vectoriel multimodal et la population spécialisée de bases de données vectorielles pour prendre en charge les applications d'IA et de machine learning. Le framework couvre un large éventail de capacités, incluant l'évolution automatique du schéma, le chargement incrémentiel de données via le suivi d'état et la validation de la qualité des données par l'application de contrats de données. Il fournit des outils pour la normalisation des données relationnelles, les transformations pré- et post-chargement, et une variété d'adaptateurs de destination pour les bases de données SQL et les magasins d'objets cloud. L'observabilité est gérée via des tableaux de bord d'exécution de pipeline, le suivi de lignage des colonnes et la vérification de version de schéma utilisant des hachages basés sur le contenu.
Processes streams with multiple item types using discriminated unions to dispatch data to specific tables.
Pythondatadata-engineeringdata-lake
Voir sur GitHub5,472
openpanel-dev/openpanel
Openpanel-dev/openpanel
5,349Voir sur GitHub
OpenPanel is a self-hosted product analytics platform designed for tracking user behavior and visualizing product metrics on private infrastructure. It provides a comprehensive system for collecting events across web, mobile, and server environments while ensuring complete ownership of data. The platform distinguishes itself through a privacy-first approach, utilizing cookieless event tracking and regional data residency to simplify regulatory compliance. It integrates large language models via the Model Context Protocol, enabling users to query behavioral data and analyze trends using natura
Ingests high-throughput event data through a Kafka-compatible broker for parallel processing.
TypeScriptalternativeanalyticsopen-source
Voir sur GitHub5,349
arroyosystems/arroyo
ArroyoSystems/arroyo
4,819Voir sur GitHub
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
Consumes and produces messages from Kafka topics with exactly-once semantics using SQL queries.
Rustdatadata-stream-processingdev-tools
Voir sur GitHub4,819
intelowlproject/intelowl
intelowlproject/IntelOwl
4,605Voir sur GitHub
IntelOwl is a threat intelligence platform and security orchestration engine designed to aggregate, analyze, and enrich security observables. It functions as a security incident investigation tool and a threat intelligence aggregator, collecting data on files, domains, and IP addresses from diverse internal and external sources. The system differentiates itself through playbook-based workflow automation, allowing users to define reusable sequences of analysis tasks that trigger subsequent jobs based on prior outputs. It unifies disparate security data into a common schema and utilizes protoco
Imports streams of observables or files automatically for immediate processing and analysis.
Pythoncyber-securitycyber-threat-intelligencecybersecurity
Voir sur GitHub4,605
hellodigua/chatlab
hellodigua/ChatLab
4,522Voir sur GitHub
ChatLab is a self-hosted chat database and data pipeline designed to normalize, store, and analyze large-scale social conversation histories. It functions as an analytics platform that uses large language models to extract patterns and insights from messaging data imported from multiple platforms. The system distinguishes itself through an AI-powered analysis engine that utilizes vector-based history analysis and agent-based function calling to summarize conversation trends. It further identifies behavioral patterns by generating visual analytics, including heatmaps, word clouds, and activity
Processes large chat imports by streaming data through multiple worker threads to maintain responsiveness at scale.
TypeScriptaichat-analysischat-history
Voir sur GitHub4,522
uptrace/uptrace
uptrace/uptrace
4,098Voir sur GitHub
Uptrace is an OpenTelemetry-based observability platform designed to collect, store, and analyze distributed traces, metrics, and logs. It functions as a centralized logging backend, a distributed tracing system, and a metrics engine to monitor application performance and system health. The platform is distinguished by AI-powered operational capabilities, allowing users to query telemetry data and manage monitoring dashboards using natural language. It specifically includes specialized monitoring for generative AI pipelines, tracking token usage and response quality for LLM interactions and r
Uses a message queue to retain telemetry data during database downtime and optimize insert throughput.
Goapmapplication-monitoringclickhouse
Voir sur GitHub4,098

Awesome Stream Ingestion GitHub Repositories

delta-io/delta

getmoto/moto

gojek/feast

feast-dev/feast

MaterializeInc/materialize

apache/pinot

apache/hudi

GreptimeTeam/greptimedb

dlt-hub/dlt

Openpanel-dev/openpanel

ArroyoSystems/arroyo

intelowlproject/IntelOwl

hellodigua/ChatLab

uptrace/uptrace

Explorer les sous-tags