14 مستودعات
Processes data based on timestamps embedded within records to handle out-of-order events.
Distinct from System Time Base Definitions: Existing candidates focus on Windows OS events or trading windows, not stream processing time semantics
Explore 14 awesome GitHub repositories matching data & databases · Event-Time Processing. Refine with filters or upvote what's useful.
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Groups data using timestamps embedded in records to accurately process out-of-order events.
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Guides the use of event-time processing to group data using system, ingestion, or event timestamps.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Computes running totals and metrics using various time-windowing strategies based on event-time progress.
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Groups data elements by the time they occurred rather than processing time to handle out-of-order data.
ElastAlert is an alerting framework and query monitor for Elasticsearch. It functions as a real-time log monitoring tool and event notification engine that scans indices for specific patterns to trigger automated alerts when predefined rules are matched. The system distinguishes itself through specialized detection logic, including event spike detection, event frequency monitoring, field change tracking, and the identification of new terms within data fields. It handles notification noise via stateful alert suppression to prevent redundant messages and provides time-windowed aggregation to gr
Buffers matching documents over a set period to send a single summary report instead of individual alerts.
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Implements time-window aggregations to generate temporal features and prevent data leakage.
Faust is a Python library for building distributed stream processing applications that integrate with Kafka. It functions as an asynchronous stream processor designed to handle high-throughput event streams and real-time data analysis using asynchronous functions. The system operates as a distributed stream processor and state store, utilizing sharding and partitioned topics to scale processing workloads horizontally across multiple worker nodes. It maintains state through a replicated key-value storage system backed by local databases to ensure high availability and fast recovery. The frame
Tracks data summaries over sliding or tumbling time intervals to analyze trends within continuous streams.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast computes batch or streamable aggregations such as sums or averages on data within a specified time period.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Provides time-windowed aggregations for unbounded event streams to enable real-time analytics.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Computes rankings, rolling aggregates, and frame-based calculations over ordered data partitions.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
Partitions time-series data into fixed intervals for grouped computation using a date_bin function.
هذا المشروع عبارة عن مجموعة من أطر عمل وخطوط أنابيب البيانات الضخمة، بما في ذلك إطار عمل تحليل Apache Hive، ومنصة تحليلات سلوكية، ومحرك تحليلات تنبؤية، وخطوط أنابيب بيانات في الوقت الفعلي. يوفر البنية التحتية لبناء سير عمل الاستخراج والتحويل والتحميل (ETL) لمعالجة مجموعات البيانات الكبيرة للتخزين الموزع والتحليل القائم على SQL. يدعم النظام تطبيقات تحليلية متنوعة، مثل محرك تنبؤي يستخدم الانحدار الخطي لتوقع القيم، وبنية في الوقت الفعلي تنقل البيانات عبر وسطاء الرسائل للتقارير الفورية. يتضمن قدرات متخصصة لتحليلات سلوك المستخدم، وقياس أداء التجارة الإلكترونية، وتحليل بيانات النقل الحضري. يغطي الكود المصدري نطاقاً واسعاً من هندسة وتحليل البيانات، بما في ذلك تنظيف البيانات وتحويلها، واستيعاب البيانات الموزع، ومعالجة التدفق القائم على النوافذ، وتصور النتائج من خلال أدوات ذكاء الأعمال. كما يتيح حساب مقاييس أعمال محددة مثل معدلات التحويل، وأداء تحقيق الدخل، ومستويات تفاعل المستخدم.
Calculates real-time metrics by grouping continuous data streams into discrete time intervals using windowing functions.
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
Handles out-of-order events using watermarks and supports tumbling, sliding, and session windows.
This project is a reference library of architectural blueprints, study materials, and design patterns for building scalable, high-availability distributed systems. It serves as a technical guide for scalability engineering, providing structural solutions for common engineering challenges. The repository focuses on distributed systems design, covering essential patterns for data replication, consensus algorithms, and transaction management. It distinguishes itself by offering detailed blueprints for specialized domains, including real-time data streaming, large-scale data storage, and high-ava
Implements a system for calculating ad click metrics over tumbling or session windows for billing reports.