24 repositories
Processing systems that ingest and transform data streams in real-time for continuous analytics and event handling.
Explore 24 GitHub repositories matching Data & Databases · Real-Time Data Processors.
This project is a community-driven directory that aggregates essential software projects and educational content for the Node.js ecosystem. It functions as a centralized knowledge base and discovery index, designed to simplify the navigation of a fragmented technical landscape by providing a structured collection of high-quality links, tools, and learning materials. The repository distinguishes itself through a decentralized, peer-reviewed curation model. By utilizing standard version control workflows and pull requests, the community ensures that all listed resources undergo human verificati
Identify high-performance frameworks capable of ingesting and transforming data streams in real time.
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Processes continuous data streams in real-time to facilitate immediate event-driven analytics.
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Ingests and processes information from diverse sources in real-time to ensure continuous visibility into changing data.
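The idea of propagating only changes, rather than recomputing a whole dataset, can be illustrated with a small sketch. This is plain Python and not Pathway's API: the `IncrementalSum` class and the `(key, value, diff)` change format are hypothetical, chosen only to show how an aggregate can be kept current by applying insertions (`+1`) and retractions (`-1`).

```python
from collections import defaultdict


class IncrementalSum:
    """Keeps per-key totals current by applying deltas only (illustrative, not a real library API)."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, changes):
        """Apply (key, value, diff) changes, where diff is +1 for an insert and -1 for a retraction.

        Returns only the keys whose totals were touched, mapped to their new totals.
        """
        touched = {}
        for key, value, diff in changes:
            self.totals[key] += value * diff
            touched[key] = self.totals[key]
        return touched


agg = IncrementalSum()
# Initial batch: three inserted rows.
first = agg.apply([("a", 10, +1), ("b", 5, +1), ("a", 7, +1)])
# Later update: the row ("a", 10) is corrected to ("a", 12), expressed as retract + insert.
second = agg.apply([("a", 10, -1), ("a", 12, +1)])
```

The second call touches only key `"a"`; key `"b"` is never revisited, which is the property that makes incremental processing cheaper than a full recomputation when updates are small.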
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Ships a processing system that ingests and transforms real-time data streams for continuous analytics.
Zstandard is a lossless data compression library and archive format designed for high compression ratios and fast real-time processing. It functions as a real-time data compressor and multi-threaded compression engine capable of distributing workloads across multiple CPU cores to increase throughput. The system features a dictionary-based compressor that trains on sample data to improve the compression ratio and speed of small files. It also provides long distance pattern matching to identify repeated sequences across large files. The library covers a broad range of capabilities including st
Enables high-throughput real-time decompression to restore data quickly for immediate application use.
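Dictionary-based compression of small payloads can be sketched with Python's standard `zlib` module, used here as a stand-in and not as Zstandard's API. `zlib` accepts a preset dictionary through its `zdict` parameter. Unlike the trained dictionaries described above, the dictionary below is just a hand-written sample record, and `PRESET_DICT` and both helper functions are hypothetical names.

```python
import zlib

# Assumption: small records share this general shape, so one sample serves as the preset dictionary.
PRESET_DICT = b'{"user_id": 0, "event": "click", "ts": 1700000000}'


def compress_with_dict(payload: bytes, zdict: bytes) -> bytes:
    """Compress a payload using a preset dictionary shared with the reader."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(payload) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress a blob; the same dictionary used for compression must be supplied."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


record = b'{"user_id": 42, "event": "scroll", "ts": 1700000123}'
blob = compress_with_dict(record, PRESET_DICT)
restored = decompress_with_dict(blob, PRESET_DICT)
```

Because the compressor can reference byte sequences in the dictionary, a short record that resembles it typically compresses better than it would alone. The trade-off is that writer and reader must agree on the exact dictionary.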
VLC is a cross-platform multimedia player and framework designed to decode and render virtually any audio or video format, network stream, or physical disc without requiring external codecs. It functions as both a standalone application and a portable library, providing a modular architecture that allows developers to integrate playback, filtering, and streaming capabilities into third-party software. The project distinguishes itself through a highly modular plugin-based engine that supports real-time media processing, including format transcoding and the application of audio and video filter
Applies audio and video transformations sequentially to raw data streams before final rendering.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Supports continuous real-time data ingestion to ensure new information is immediately available for analysis.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Processes metadata updates in real-time using an event-driven architecture to maintain current data context.
Perspective is a columnar data analytics library and streaming data visualization engine. It provides an interactive data grid component and notebook analytics widgets designed for processing high-volume data and rendering interactive charts and grids. The system utilizes a high-performance query engine to enable real-time data analysis and streaming dataset visualization. It supports the creation of customizable dashboards and reports that update automatically as new data arrives without requiring full dataset reloads. The project covers large-scale dataset analytics through a schema-driven
Processes and transforms data streams in real-time to provide continuous analytics and visual updates.
Quantaxis is a quantitative trading framework designed for building, backtesting, and executing automated strategies across global equities, futures, and cryptocurrencies. It integrates an event-driven backtesting engine, a multi-market execution gateway for order routing, and a quantitative data pipeline for ingesting and storing multi-asset market data. The system features a Rust-accelerated financial library that utilizes Apache Arrow for high-performance technical indicator calculation and zero-copy data processing. It provides a containerized infrastructure model designed for orchestrati
Processes live financial data feeds in real-time to retrieve current prices, spreads, and changes.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Ingests and transforms data streams in real-time using SQL for continuous analytics and event handling.
Fluent Bit is a cloud-native log shipper and unified telemetry collector designed as a resource-efficient data pipeline. It ingests logs, metrics, and traces from multiple sources, processing them in real-time before routing the data to external storage backends. The project functions as a real-time stream processor and OpenTelemetry log processor, capable of transforming and filtering data using SQL and conditional logic. It also acts as a distributed tracing agent that can sample traces to reduce data volume while preserving full request paths. The system provides reliable data delivery th
Ingests and transforms telemetry data streams in real-time using conditional logic for continuous analytics.
litegraph.js is a JavaScript dataflow framework and visual node graph engine used to define programmable logic and data flow. It provides a node-based visual programming tool for designing complex logic through connected functional blocks. The library allows for the creation of hierarchical logic by nesting multiple nodes into recursive subgraphs. It also supports the development of custom node types with unique inputs and outputs, as well as custom widgets and live views that can hide the underlying graph structure to present a visual interface. The engine enables the execution of logic gra
Executes logic graphs across browser or server environments to process and route data in real-time.
Apache Storm is a distributed stream processing framework and real-time data processing engine. It functions as a fault-tolerant distributed computing system designed to analyze data in motion across a cluster of machines for continuous stream computation. The system enables the creation of fault-tolerant data pipelines and scalable event processing by distributing workloads across a network of computing nodes. This architecture ensures low latency and high throughput for live data while allowing the system to recover automatically from individual node failures. The framework provides capabi
Ingests and transforms data streams in real-time for continuous analytics and event handling.
Storm is a distributed stream processing framework designed to execute unbounded computations across a cluster to process real-time data streams. It functions as a data pipeline orchestrator that allows users to define and deploy declarative data flow graphs connecting streaming sources to processing components. The system operates as a multi-tenant distributed compute engine that isolates workloads and limits resource usage across shared clusters using dedicated pools and access control. It is also a secure distributed processing engine that employs encrypted node communication and SSL-secur
Orchestrates declarative data flow graphs that connect streaming sources to processing components.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Handles both endless streams of event data and finite static datasets for unified processing.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
GreptimeDB processes incoming data incrementally and continuously, updating results as new data arrives for immediate analytics.
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed to collect, persist, and replicate real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and sinks. The platform is distinguished by its use of WebAssembly to execute compiled modules for in-line data transformations and filtering. This allows custom business logic to reshape information without requiring a cluster restart. The system covers a broad range of capabilities, including connector-based data injection from external protocols, log-structured immutable storage with zero-copy IO, and horizontal cluster scaling. It supports the construction of complex event-driven pipelines that use stateful processing, window-based aggregation, and partition-based data distribution. The engine can be deployed as a lightweight binary on diverse system architectures, including ARM64 IoT devices for edge data processing.
Implements a framework for building stateful workflows that ingest, enrich, and export data.
RxPY is a functional reactive programming library and a ReactiveX observable library for Python. It serves as an asynchronous stream processor and event-driven coordination framework used to build data pipelines that react to changes in state or streams of events over time. The library provides a toolkit for composing asynchronous and event-based programs using observable sequences and operators. It distinguishes itself through the use of configurable schedulers to manage concurrency, timing, and subscription lifecycles. The project covers a wide range of stream processing capabilities, incl
Processes live data streams in real-time by chaining operators to aggregate, buffer, or merge values.
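Operator chaining can be sketched in plain Python with generators. This is not RxPY's API: `map_op`, `filter_op`, `buffer_with_count`, and `pipe` are hypothetical helpers. The sketch is also synchronous and pull-based, whereas observables push values to subscribers, but composing small operators into a pipeline follows the same pattern.

```python
from functools import reduce


def map_op(fn):
    """Operator that transforms each item of the stream."""
    def operator(source):
        for item in source:
            yield fn(item)
    return operator


def filter_op(predicate):
    """Operator that keeps only items satisfying the predicate."""
    def operator(source):
        for item in source:
            if predicate(item):
                yield item
    return operator


def buffer_with_count(n):
    """Operator that groups items into lists of up to n elements."""
    def operator(source):
        batch = []
        for item in source:
            batch.append(item)
            if len(batch) == n:
                yield batch
                batch = []
        if batch:  # emit the final partial batch, if any
            yield batch
    return operator


def pipe(source, *operators):
    """Feed the source through each operator in order."""
    return reduce(lambda stream, op: op(stream), operators, source)


readings = iter(range(1, 8))  # stands in for a live stream of values 1..7
result = list(pipe(
    readings,
    filter_op(lambda x: x % 2 == 1),   # keep odd values
    map_op(lambda x: x * 10),          # scale them
    buffer_with_count(2),              # group into pairs
))
```

Each operator takes a stream and returns a new stream, so a pipeline is just function composition. Adding a step means adding one more argument to `pipe`.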
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
An open-source system for building fault-tolerant, stateful pipelines that process millions of events per second with subsecond latency.
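Event-time windowed aggregation can be illustrated with a minimal tumbling-window count in plain Python. This is a batch sketch under simplifying assumptions and not how Arroyo is implemented: the `tumbling_window_counts` function and the `(event_time, key)` tuple format are hypothetical. A real streaming engine also needs watermarks to decide when a window is complete, which is omitted here.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key), assigning windows by event time, not arrival order."""
    counts = Counter()
    for event_time, key in events:
        # Align the event's own timestamp to the start of its fixed-size window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Events arrive out of order; each carries its own timestamp in seconds.
events = [(3, "click"), (12, "view"), (7, "click"), (11, "click"), (1, "view")]
result = tumbling_window_counts(events, window_seconds=10)
```

Because the window is derived from each event's timestamp, the late-arriving `(1, "view")` event is still counted in the window starting at 0, even though it arrives after events from the following window.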