24 repositories
Processing systems that ingest and transform data streams in real-time for continuous analytics and event handling.
Explore 24 awesome GitHub repositories matching data & databases · Real-Time Data Processors.
This project is a community-driven directory that aggregates essential software projects and educational content for the Node.js ecosystem. It functions as a centralized knowledge base and discovery index, designed to simplify the navigation of a fragmented technical landscape by providing a structured collection of high-quality links, tools, and learning materials. The repository distinguishes itself through a decentralized, peer-reviewed curation model. By utilizing standard version control workflows and pull requests, the community ensures that all listed resources undergo human verificati
Identify high-performance frameworks capable of ingesting and transforming data streams in real time.
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Processes continuous data streams in real-time to facilitate immediate event-driven analytics.
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Ingests and processes information from diverse sources in real-time to ensure continuous visibility into changing data.
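The idea of propagating only changes, rather than recomputing whole datasets, can be illustrated with a minimal sketch. This is plain Python and not the API of the project described above; the class name `IncrementalSum` and the `(key, value, diff)` change format are assumptions made for illustration, where a diff of +1 stands for an insertion and -1 for a retraction.

```python
from collections import defaultdict


class IncrementalSum:
    """Keeps per-key totals current from a stream of (key, value, diff) changes.

    Hypothetical helper for illustration only; not part of any real library.
    """

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, changes):
        """Apply a batch of changes and return the totals of the keys touched."""
        touched = {}
        for key, value, diff in changes:
            # diff is +1 for an insertion and -1 for a retraction,
            # so a correction is a retraction followed by an insertion.
            self.totals[key] += value * diff
            touched[key] = self.totals[key]
        return touched


agg = IncrementalSum()
# Initial load: two rows arrive.
first = agg.apply([("a", 10, +1), ("b", 5, +1)])
# Update: the row for "a" changes from 10 to 12; "b" is never revisited.
second = agg.apply([("a", 10, -1), ("a", 12, +1)])
```

In this sketch the second batch only touches key "a", which is the property the description attributes to incremental execution: work is proportional to the size of the change, not the size of the dataset.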
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Ships a processing system that ingests and transforms real-time data streams for continuous analytics.
Zstandard is a lossless data compression library and archive format designed for high compression ratios and fast real-time processing. It functions as a real-time data compressor and multi-threaded compression engine capable of distributing workloads across multiple CPU cores to increase throughput. The system features a dictionary-based compressor that trains on sample data to improve the compression ratio and speed of small files. It also provides long distance pattern matching to identify repeated sequences across large files. The library covers a broad range of capabilities including st
Enables high-throughput real-time decompression to restore data quickly for immediate application use.
VLC is a cross-platform multimedia player and framework designed to decode and render virtually any audio or video format, network stream, or physical disc without requiring external codecs. It functions as both a standalone application and a portable library, providing a modular architecture that allows developers to integrate playback, filtering, and streaming capabilities into third-party software. The project distinguishes itself through a highly modular plugin-based engine that supports real-time media processing, including format transcoding and the application of audio and video filter
Applies audio and video transformations sequentially to raw data streams before final rendering.
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Supports continuous real-time data ingestion to ensure new information is immediately available for analysis.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Processes metadata updates in real-time using an event-driven architecture to maintain current data context.
Perspective is a columnar data analytics library and streaming data visualization engine. It provides an interactive data grid component and notebook analytics widgets designed for processing high-volume data and rendering interactive charts and grids. The system utilizes a high-performance query engine to enable real-time data analysis and streaming dataset visualization. It supports the creation of customizable dashboards and reports that update automatically as new data arrives without requiring full dataset reloads. The project covers large-scale dataset analytics through a schema-driven
Processes and transforms data streams in real-time to provide continuous analytics and visual updates.
Quantaxis is a quantitative trading framework designed for building, backtesting, and executing automated strategies across global equities, futures, and cryptocurrencies. It integrates an event-driven backtesting engine, a multi-market execution gateway for order routing, and a quantitative data pipeline for ingesting and storing multi-asset market data. The system features a Rust-accelerated financial library that utilizes Apache Arrow for high-performance technical indicator calculation and zero-copy data processing. It provides a containerized infrastructure model designed for orchestrati
Processes live financial data feeds in real-time to retrieve current prices, spreads, and changes.
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Ingests and transforms data streams in real-time using SQL for continuous analytics and event handling.
Fluent Bit is a unified, cloud-native log and telemetry collector designed as a resource-efficient data pipeline. It ingests logs, metrics, and traces from multiple sources and processes them in real time before routing the data to external storage backends. The project functions as a real-time stream processor and an OpenTelemetry log processor, capable of transforming and filtering data using SQL and conditional logic. It also acts as a distributed tracing agent that can sample traces to reduce data volume while preserving complete request paths. The system provides reliable data delivery through filesystem-based buffering and stateful retry logic to avoid data loss during outages. Its modular architecture supports pluggable input and output plugins, metadata-based routing, and the ability to extend functionality through shared libraries. The software can be deployed as a container on different CPU architectures and operating systems.
Ingests and transforms telemetry data streams in real-time using conditional logic for continuous analytics.
litegraph.js is a JavaScript dataflow framework and visual node graph engine used to define programmable logic and data flow. It provides a node-based visual programming tool for designing complex logic through connected functional blocks. The library allows for the creation of hierarchical logic by nesting multiple nodes into recursive subgraphs. It also supports the development of custom node types with unique inputs and outputs, as well as custom widgets and live views that can hide the underlying graph structure to present a visual interface. The engine enables the execution of logic gra
Executes logic graphs across browser or server environments to process and route data in real-time.
Apache Storm is a distributed stream processing framework and real-time data processing engine. It functions as a fault-tolerant distributed computing system designed to analyze data in motion across a cluster of machines for continuous stream computation. The system enables the creation of fault-tolerant data pipelines and scalable event processing by distributing workloads across a network of computing nodes. This architecture ensures low latency and high throughput for live data while allowing the system to recover automatically from individual node failures. The framework provides capabi
Ingests and transforms data streams in real-time for continuous analytics and event handling.
Storm is a distributed stream processing framework designed to execute unbounded computations across a cluster to process real-time data streams. It functions as a data pipeline orchestrator that allows users to define and deploy declarative data flow graphs connecting streaming sources to processing components. The system operates as a multi-tenant distributed compute engine that isolates workloads and limits resource usage across shared clusters using dedicated pools and access control. It is also a secure distributed processing engine that employs encrypted node communication and SSL-secur
Orchestrates declarative data flow graphs that connect streaming sources to processing components.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Handles both endless streams of event data and finite static datasets for unified processing.
GreptimeDB is a distributed, open-source time-series database built for unified observability. It stores and queries metrics, logs, and traces together in a single columnar engine, supporting both SQL and PromQL for analysis. The database is designed as a Kubernetes-native operator with a decoupled compute and storage architecture, enabling horizontal scaling and multi-region deployment. What distinguishes GreptimeDB is its role as a multi-protocol ingestion gateway, accepting data through OpenTelemetry, Prometheus Remote Write, InfluxDB, Loki, Elasticsearch, Kafka, and MQTT protocols without
GreptimeDB processes incoming data incrementally and continuously, updating results as new data arrives for immediate analytics.
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed to collect, persist, and replicate real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and destinations. The platform distinguishes itself through its use of WebAssembly to run compiled modules for inline data transformation and filtering. This allows custom business logic to reshape data in motion without requiring a cluster restart. The system covers a broad range of capabilities, including connector-based data ingestion from external protocols, immutable log-structured storage with zero-copy I/O, and horizontal cluster scaling. It supports the creation of complex event-driven pipelines that use stateful processing, windowed aggregations, and partition-based data distribution. The engine can be deployed as a lightweight binary on various system architectures, including ARM64 IoT devices for edge data processing.
Implements a framework for building stateful workflows that ingest, enrich, and export data.
RxPY is a functional reactive programming library and ReactiveX observables library for Python. It serves as an asynchronous stream processor and event-driven coordination framework, used to build data pipelines that react to state changes or event streams over time. The library provides a toolkit for composing asynchronous and event-based programs using observable sequences and operators. It distinguishes itself through configurable schedulers for managing concurrency, timing, and subscription lifecycles. The project covers a broad range of stream processing capabilities, including aggregating, filtering, and combining data. It provides mechanisms for event broadcasting, sequence buffering, and error handling, as well as tools for coordinating observable streams with asynchronous event loops. Testing and quality assurance are supported through virtual time simulation, marble diagram modeling, and emission verification.
Processes live data streams in real-time by chaining operators to aggregate, buffer, or merge values.
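To make "chaining operators" concrete, here is a minimal sketch of the pattern using only standard-library generators. It deliberately does not use RxPY's own API: `pipe`, `filter_op`, `map_op`, and `buffer_op` are hypothetical names chosen for illustration, and the sketch is synchronous, so it leaves out the schedulers and subscriptions that the library itself provides.

```python
from itertools import islice


def filter_op(predicate):
    """Operator factory: keep only the items for which predicate(item) is true."""
    def op(source):
        return (x for x in source if predicate(x))
    return op


def map_op(fn):
    """Operator factory: transform every item with fn."""
    def op(source):
        return (fn(x) for x in source)
    return op


def buffer_op(count):
    """Operator factory: group items into lists of at most `count` elements."""
    def op(source):
        it = iter(source)
        while True:
            chunk = list(islice(it, count))
            if not chunk:
                return
            yield chunk
    return op


def pipe(source, *operators):
    """Feed the source through each operator in order, left to right."""
    for operator in operators:
        source = operator(source)
    return source


readings = [3, 8, 1, 9, 4, 7, 6]
# Keep readings above 3, group them in pairs, then sum each group.
result = list(pipe(readings, filter_op(lambda x: x > 3), buffer_op(2), map_op(sum)))
```

Each operator takes a stream and returns a new stream, which is why they compose freely: the filter leaves 8, 9, 4, 7, 6, the buffer groups them as [8, 9], [4, 7], [6], and the final map yields one sum per group.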
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
An open-source system for building fault-tolerant, stateful pipelines that process millions of events per second with subsecond latency.
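As a rough illustration of what an event-time windowed aggregation computes, the sketch below assigns events to fixed-size (tumbling) windows by their own timestamps rather than by arrival order. It is plain Python, not Arroyo's SQL or API; the function name `tumbling_window_counts` and the `(timestamp, key)` event shape are assumptions, and the sketch omits the watermarks, checkpointing, and distribution a real engine needs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key), bucketing by event timestamp.

    Hypothetical helper for illustration; events are (timestamp, key) pairs
    with integer timestamps, and windows are [start, start + window_size).
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Floor the timestamp to the start of its window.
        window_start = (timestamp // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


# The event at t=9 arrives after the event at t=12, but it still
# lands in the window starting at 0 because bucketing uses event time.
events = [(1, "click"), (12, "click"), (4, "view"), (9, "click"), (15, "view")]
counts = tumbling_window_counts(events, 10)
```

Because the window is derived from the event's timestamp, late or out-of-order arrivals do not change the result, which is the point of event-time semantics.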