24 repositories
Processing systems that ingest and transform data streams in real time for continuous analytics and event handling.
Explore 24 awesome GitHub repositories matching data & databases · Real-Time Data Processors.
This project is a community-driven directory that aggregates essential software projects and educational content for the Node.js ecosystem. It functions as a centralized knowledge base and discovery index, designed to simplify the navigation of a fragmented technical landscape by providing a structured collection of high-quality links, tools, and learning materials. The repository distinguishes itself through a decentralized, peer-reviewed curation model. By utilizing standard version control workflows and pull requests, the community ensures that all listed resources undergo human verificati
Identify high-performance frameworks capable of ingesting and transforming data streams in real time.
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Processes continuous data streams in real time to facilitate immediate event-driven analytics.
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Ingests and processes information from diverse sources in real time to ensure continuous visibility into changing data.
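The change-propagation idea in the description above (pushing only changes through a pipeline rather than recomputing whole datasets) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the API of any project listed here: the `IncrementalSum` class and the `(key, value, diff)` update format are hypothetical names chosen for the example.

```python
from collections import defaultdict


class IncrementalSum:
    """Keeps per-key sums current from a stream of changes (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, updates):
        """Apply (key, value, diff) changes; diff is +1 to insert, -1 to retract.

        Returns the keys touched by this batch, mapped to their new totals.
        """
        touched = {}
        for key, value, diff in updates:
            self.totals[key] += value * diff
            touched[key] = self.totals[key]
        return touched


agg = IncrementalSum()
first = agg.apply([("a", 5, +1), ("b", 3, +1)])
# Correct an earlier record: retract a=5 and insert a=7.
second = agg.apply([("a", 5, -1), ("a", 7, +1)])
```

Only key `a` is reported by the second batch; the total for `b` is never revisited, which is the point of processing deltas instead of full datasets.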
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Ships a processing system that ingests and transforms real-time data streams for continuous analytics.
Zstandard is a lossless data compression library and archive format designed for high compression ratios and fast real-time processing. It functions as a real-time data compressor and multi-threaded compression engine capable of distributing workloads across multiple CPU cores to increase throughput. The system features a dictionary-based compressor that trains on sample data to improve the compression ratio and speed of small files. It also provides long distance pattern matching to identify repeated sequences across large files. The library covers a broad range of capabilities including st
Enables high-throughput real-time decompression to restore data quickly for immediate application use.
VLC is a cross-platform multimedia player and framework designed to decode and render virtually any audio or video format, network stream, or physical disc without requiring external codecs. It functions as both a standalone application and a portable library, providing a modular architecture that allows developers to integrate playback, filtering, and streaming capabilities into third-party software. The project distinguishes itself through a highly modular plugin-based engine that supports real-time media processing, including format transcoding and the application of audio and video filter
Applies audio and video transformations sequentially to raw data streams before final rendering.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Supports continuous real-time data ingestion to ensure new information is immediately available for analysis.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Processes metadata updates in real time using an event-driven architecture to maintain current data context.
Perspective is a columnar data analytics library and streaming data visualization engine. It provides an interactive data grid component and notebook analytics widgets designed for processing high-volume data and rendering interactive charts and grids. The system utilizes a high-performance query engine to enable real-time data analysis and streaming dataset visualization. It supports the creation of customizable dashboards and reports that update automatically as new data arrives without requiring full dataset reloads. The project covers large-scale dataset analytics through a schema-driven
Processes and transforms data streams in real time to provide continuous analytics and visual updates.
Quantaxis is a quantitative trading framework designed for building, backtesting, and executing automated strategies across global equities, futures, and cryptocurrencies. It integrates an event-driven backtesting engine, a multi-market execution gateway for order routing, and a quantitative data pipeline for ingesting and storing multi-asset market data. The system features a Rust-accelerated financial library that utilizes Apache Arrow for high-performance technical indicator calculation and zero-copy data processing. It provides a containerized infrastructure model designed for orchestrati
Processes live financial data feeds in real time to retrieve current prices, spreads, and changes.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
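Ingests and transforms data streams in real time using SQL for continuous analytics and event handling.
Fluent Bit is a cloud-native log forwarder and unified telemetry collector designed as a resource-efficient data pipeline. It ingests logs, metrics, and traces from multiple sources and processes them in real time before routing the data to external storage backends. The project functions as a real-time stream processor and OpenTelemetry log processor that can transform and filter data using SQL and conditional logic. It also acts as a distributed tracing agent that can sample traces to reduce data volume while preserving complete request paths. The system provides reliable data delivery through filesystem-based buffering and stateful retry logic to prevent data loss during outages. Its modular architecture supports pluggable input and output plugins, metadata-driven routing, and the ability to extend functionality through shared libraries. The software can be deployed as a container across different CPU architectures and operating systems.
Ingests and transforms telemetry data streams in real time using conditional logic for continuous analytics.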
litegraph.js is a JavaScript dataflow framework and visual node graph engine used to define programmable logic and data flow. It provides a node-based visual programming tool for designing complex logic through connected functional blocks. The library allows for the creation of hierarchical logic by nesting multiple nodes into recursive subgraphs. It also supports the development of custom node types with unique inputs and outputs, as well as custom widgets and live views that can hide the underlying graph structure to present a visual interface. The engine enables the execution of logic gra
Executes logic graphs across browser or server environments to process and route data in real time.
Apache Storm is a distributed stream processing framework and real-time data processing engine. It functions as a fault-tolerant distributed computing system designed to analyze data in motion across a cluster of machines for continuous stream computation. The system enables the creation of fault-tolerant data pipelines and scalable event processing by distributing workloads across a network of computing nodes. This architecture ensures low latency and high throughput for live data while allowing the system to recover automatically from individual node failures. The framework provides capabi
Ingests and transforms data streams in real time for continuous analytics and event handling.
Storm is a distributed stream processing framework designed to execute unbounded computations across a cluster to process real-time data streams. It functions as a data pipeline orchestrator that allows users to define and deploy declarative data flow graphs connecting streaming sources to processing components. The system operates as a multi-tenant distributed compute engine that isolates workloads and limits resource usage across shared clusters using dedicated pools and access control. It is also a secure distributed processing engine that employs encrypted node communication and SSL-secur
Orchestrates declarative data flow graphs that connect streaming sources to processing components.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Handles both endless streams of event data and finite static datasets for unified processing.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
GreptimeDB processes incoming data incrementally and continuously, updating results as new data arrives for immediate analytics.
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed for collecting, persisting, and replicating real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and sinks. The platform is distinguished by its use of WebAssembly to execute compiled modules for in-line data transformations and filtering. This allows for the execution of custom business logic to reshape information in motion without requiring a restart of the cluster.
Implements a framework for building stateful workflows that ingest, enrich, and export data.
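RxPY is a functional reactive programming library for Python and the ReactiveX observable library. It serves as an asynchronous stream processor and event-driven coordination framework for building data pipelines that react to state changes or to streams of events over time. The library provides a set of tools for writing asynchronous and event-based programs using observable sequences and operators. It distinguishes itself by using configurable schedulers to manage concurrency, timing, and subscription lifecycles. The project covers a broad range of stream processing capabilities, including data aggregation, filtering, and combination. It provides event broadcasting, sequence buffering, and error handling mechanisms, as well as tools for coordinating observable streams with asynchronous event loops. Through virtual time simulation, marble diagram modeling, and emission verification, the library offers thorough testing and quality assurance support.
Processes live data streams in real time by chaining operators to aggregate, buffer, or merge values.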
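Operator chaining of this kind can be illustrated with standard-library generators. The sketch below is an assumption-laden analogy, not RxPY's API: `pipe`, `map_`, `filter_`, and `buffer_with_count` are hypothetical helpers written for this example, and they pull values from an iterable, whereas observables push values to subscribers.

```python
from itertools import islice


def pipe(source, *operators):
    """Feed an iterable through each operator in order."""
    for op in operators:
        source = op(source)
    return source


def map_(fn):
    """Operator that transforms every value with fn."""
    return lambda items: (fn(x) for x in items)


def filter_(pred):
    """Operator that keeps only values for which pred is true."""
    return lambda items: (x for x in items if pred(x))


def buffer_with_count(n):
    """Operator that groups values into lists of up to n items."""
    def op(items):
        it = iter(items)
        while True:
            chunk = list(islice(it, n))
            if not chunk:
                return
            yield chunk
    return op


readings = [1, 2, 3, 4, 5, 6, 7]
result = list(pipe(
    readings,
    map_(lambda x: x * 10),
    filter_(lambda x: x % 20 != 0),
    buffer_with_count(2),
))
# result == [[10, 30], [50, 70]]
```

Each operator takes a stream and returns a new stream, so stages compose without any stage knowing about the others.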
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
An open-source system for building fault-tolerant, stateful pipelines that process millions of events per second with subsecond latency.
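As a rough illustration of event-time windowed aggregation, the following sketch assigns each event to a tumbling window by its own timestamp rather than by arrival order. It is a batch simplification under stated assumptions, not Arroyo's implementation: `tumbling_window_counts` is a hypothetical function, and a streaming engine would additionally need a mechanism such as watermarks to decide when a window is complete.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using each event's own timestamp."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the event time down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (event_time_seconds, key) pairs, deliberately out of arrival order.
events = [(3, "click"), (12, "click"), (7, "view"), (1, "click"), (14, "view")]
counts = tumbling_window_counts(events, window_seconds=10)
```

The late-arriving event at time 1 still lands in the window starting at 0, which is what distinguishes event-time grouping from grouping by processing time.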