What are the best open-source GitHub repositories for an engine to process streaming data?

apache/flink is the closest match: Apache Flink is a comprehensive distributed stream processing engine that natively supports stateful transformations, exactly-once semantics, and SQL-based interfaces for high-volume, low-latency data pipelines. Other strong matches: arroyosystems/arroyo, pathwaycom/pathway, risingwavelabs/risingwave, robinhood/faust.

Why does apache/flink match “an engine to process streaming data”?

Apache Flink is a comprehensive distributed stream processing engine that natively supports stateful transformations, exactly-once semantics, and SQL-based interfaces for high-volume, low-latency data pipelines.
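
To make "exactly-once semantics" for stateful transformations more concrete, here is a minimal pure-Python sketch of the general checkpoint-and-replay idea: operator state and the input offset are snapshotted together, and after a crash both are restored from the same snapshot so that no event is counted twice or skipped. This is not Flink's API or implementation; the names `Checkpoint`, `run_with_recovery`, `checkpoint_every`, and `fail_at` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Snapshot of the input offset and operator state, taken together."""
    offset: int = 0
    counts: dict = field(default_factory=dict)


def run_with_recovery(events, checkpoint_every=3, fail_at=None):
    """Count events per key, restoring from the last checkpoint on failure.

    `events` is a replayable list of keys (a stand-in for a durable log).
    `fail_at` is a hypothetical offset at which one crash is simulated.
    """
    checkpoint = Checkpoint()
    failed_once = False
    while True:
        # Restore: state and input position come from the same snapshot.
        offset = checkpoint.offset
        counts = dict(checkpoint.counts)
        try:
            while offset < len(events):
                if fail_at is not None and offset == fail_at and not failed_once:
                    failed_once = True
                    raise RuntimeError("simulated worker crash")
                key = events[offset]
                counts[key] = counts.get(key, 0) + 1
                offset += 1
                if offset % checkpoint_every == 0:
                    # Snapshot offset and state atomically (here: one object).
                    checkpoint = Checkpoint(offset, dict(counts))
            return counts
        except RuntimeError:
            # Discard in-flight state and replay from the last checkpoint.
            continue


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "b", "c"]
    print(run_with_recovery(stream, fail_at=5))
```

With the sample stream, the run that crashes at offset 5 produces the same per-key counts as a run with no crash, which is the property the term refers to for internal state. End-to-end guarantees in a real engine also depend on replayable sources and transactional or idempotent sinks.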

Why does arroyosystems/arroyo match “an engine to process streaming data”?

Arroyo is a distributed, high-performance stream processing engine that natively supports SQL-based transformations, stateful windowing, exactly-once semantics, and a broad ecosystem of connectors for real-time data pipelines.
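
As a rough illustration of what "stateful windowing" involves, the following pure-Python sketch groups events into tumbling event-time windows and closes a window once a watermark passes its end. It is not Arroyo code (Arroyo expresses this in SQL); the function name `windowed_counts` and the parameters `window_size` and `allowed_lateness` are assumptions made for this example.

```python
def windowed_counts(events, window_size, allowed_lateness=0):
    """Count events per key in tumbling event-time windows.

    `events` is an iterable of (event_time, key) pairs in arrival order,
    which may differ from event-time order. Returns a list of
    (window_start, {key: count}) in the order the windows were closed.
    """
    open_windows = {}            # window_start -> {key: count}
    emitted = []
    watermark = float("-inf")    # "no event older than this is expected"

    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            continue  # Window already closed: this late event is dropped.

        bucket = open_windows.setdefault(window_start, {})
        bucket[key] = bucket.get(key, 0) + 1

        # Advance the watermark, trailing the newest event time seen.
        watermark = max(watermark, event_time - allowed_lateness)

        # Close every window whose end is at or before the watermark.
        closable = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in closable:
            emitted.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        emitted.append((start, open_windows.pop(start)))
    return emitted


if __name__ == "__main__":
    arrivals = [(1, "a"), (4, "b"), (3, "a"), (11, "a"), (2, "b"), (12, "a")]
    print(windowed_counts(arrivals, window_size=10, allowed_lateness=2))
```

In the sample input, the event at time 2 arrives after the event at time 11 but is still counted in the first window, because the watermark trails the newest event time by `allowed_lateness`. An event arriving after its window has closed would be dropped in this sketch; real engines offer other policies for late data.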

Why does pathwaycom/pathway match “an engine to process streaming data”?

Pathway is a distributed stream processing engine built on differential dataflow; it handles real-time transformations and stateful updates with exactly-once semantics, making it a strong fit for high-volume data pipelines.
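
The core idea behind differential-style incremental processing can be sketched in a few lines of pure Python: inputs arrive as (record, diff) changes, and the operator updates only the affected results and emits the resulting output changes instead of recomputing from scratch. This is a simplified illustration, not Pathway's API or its actual engine; the class `IncrementalWordCount` and its `apply` method are hypothetical names.

```python
from collections import Counter


class IncrementalWordCount:
    """Maintain word counts by applying input changes instead of recomputing."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, deltas):
        """Apply (word, diff) input changes; diff is +1 to insert, -1 to retract.

        Returns the output changes as (word, old_count, new_count) tuples,
        covering only the words whose count actually changed.
        """
        # Consolidate the batch so opposing changes cancel out first.
        net = Counter()
        for word, diff in deltas:
            net[word] += diff

        changes = []
        for word, diff in net.items():
            if diff == 0:
                continue  # Insert and retraction cancelled: nothing to emit.
            old = self.counts[word]
            new = old + diff
            if new:
                self.counts[word] = new
            else:
                del self.counts[word]  # Drop keys whose count reaches zero.
            changes.append((word, old, new))
        return changes


if __name__ == "__main__":
    wc = IncrementalWordCount()
    print(wc.apply([("a", 1), ("b", 1), ("a", 1)]))
    print(wc.apply([("a", -1), ("c", 1), ("c", -1)]))
```

In the second batch, the insert and retraction of "c" cancel, so only the change to "a" is propagated downstream. The work done per batch depends on the size of the change, not on the size of the accumulated data.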

Why does risingwavelabs/risingwave match “an engine to process streaming data”?

RisingWave is a distributed, stateful stream processing engine that provides exactly-once semantics and a SQL-based interface, making it a comprehensive solution for real-time data transformation and analytics.

Why does robinhood/faust match “an engine to process streaming data”?

Faust is a distributed stream processing engine that provides stateful transformations, windowing, and horizontal scaling, though it lacks a native SQL-based interface for querying streams.

Stream Processing Engines

High-performance frameworks for real-time data transformation and complex event processing across distributed computing environments.


apache/flink
26,086
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Apache Flink is a comprehensive distributed stream processing engine that natively supports stateful transformations, exactly-once semantics, and SQL-based interfaces for high-volume, low-latency data pipelines.
Java, Exactly-Once Processing Semantics, Streaming SQL, External Data Connectors
ArroyoSystems/arroyo
4,819
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
Arroyo is a distributed, high-performance stream processing engine that natively supports SQL-based transformations, stateful windowing, exactly-once semantics, and a broad ecosystem of connectors for real-time data pipelines.
Rust, Horizontal Scaling, Real-Time Data Processors, Streaming SQL
pathwaycom/pathway
62,959
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Pathway is a distributed stream processing engine built on differential dataflow; it handles real-time transformations and stateful updates with exactly-once semantics, making it a strong fit for high-volume data pipelines.
Python, Exactly-Once Processing Semantics, Real-Time Data Processors, Stream Processing Engines
risingwavelabs/risingwave
9,093
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
RisingWave is a distributed, stateful stream processing engine that provides exactly-once semantics and a SQL-based interface, making it a comprehensive solution for real-time data transformation and analytics.
Rust, Exactly-Once Processing Semantics, Real-Time Data Processors, Real-Time Event Processing
robinhood/faust
6,822
Faust is a Python library for building distributed stream processing applications that integrate with Kafka. It functions as an asynchronous stream processor designed to handle high-throughput event streams and real-time data analysis using asynchronous functions. The system operates as a distributed stream processor and state store, utilizing sharding and partitioned topics to scale processing workloads horizontally across multiple worker nodes. It maintains state through a replicated key-value storage system backed by local databases to ensure high availability and fast recovery. The frame
Faust is a distributed stream processing engine that provides stateful transformations, windowing, and horizontal scaling, though it lacks a native SQL-based interface for querying streams.
Python, Horizontal Scaling, Real-Time Event Processing, Distributed State Management
apache/beam
8,612
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Apache Beam is a comprehensive distributed processing framework that provides a unified model for both batch and real-time stream processing, supporting stateful transformations, SQL-based querying, and a wide range of connectors.
Java, Stateful Processing Patterns
apache/kafka
32,846
Kafka is a distributed event streaming platform designed for capturing, storing, and processing real-time data streams across interconnected nodes. It functions as a distributed commit log, providing a fault-tolerant storage mechanism that records state changes sequentially to ensure data consistency and durability across distributed environments. The platform distinguishes itself through a partitioned commit log architecture that enables horizontal scaling and parallel processing of data streams. It integrates a stream processing engine for continuous transformations and aggregations, while
Apache Kafka is a foundational distributed event streaming platform that provides the core architecture, stateful processing capabilities, and connector ecosystem required for high-volume, real-time data stream transformation.
Java, Stream Processing Engines
pathwaycom/llm-app
59,341
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
This is a distributed stream processing engine that utilizes differential dataflow for low-latency, stateful transformations, though it is primarily positioned as a platform for building real-time AI and RAG workflows rather than a general-purpose SQL-based stream processor.
Jupyter Notebook, Real-Time Data Processors, Distributed State Management
apache/spark
43,467
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Apache Spark is a comprehensive distributed processing engine that natively supports real-time stream processing, stateful transformations, and SQL-based analytics, making it a flagship solution for high-volume data pipelines.
Scala, Real-Time Data Processors
apache/rocketmq
22,461
RocketMQ is a cloud-native distributed messaging platform and streaming engine. It functions as a distributed transactional queue that ensures atomicity between local transactions and message delivery, and serves as an MQTT IoT message broker to bridge lightweight device traffic into high-performance data streams. The system is distinguished by a Kubernetes-native architecture that decouples compute from storage to allow independent scaling of traffic and data retention. It utilizes a tiered storage model to offload older data to remote storage and employs quorum-based replication and automat
This is a distributed messaging and streaming platform that provides the core infrastructure for high-volume data pipelines, though it focuses more on message queuing and transactional delivery than on complex SQL-based stream transformations.
Java, External Data Connectors, Stream Processing Engines, Stream Processing
apache/pulsar
15,276
Apache Pulsar is a cloud-native distributed pub-sub messaging system designed for high-performance data ingestion. It functions as a geo-replicated data streamer and a multi-tenant event streaming platform, providing a serverless stream processing engine and a tiered storage messaging broker. The system distinguishes itself by separating serving layers from storage layers to allow independent scaling of compute and data retention. It features native geo-replication to synchronize messages across different geographical regions and employs a multi-layered tenant isolation model using authentica
Apache Pulsar is a distributed messaging and event streaming platform that includes a built-in serverless stream processing engine, making it a robust choice for high-volume data pipelines despite its primary focus on messaging and storage.
Java, Horizontal Scaling, Stream Processing Engines
redpanda-data/connect
8,681
Connect is a Kafka data integration platform and stream processing engine used to build declarative pipelines that move and transform messages between Kafka topics and external sources. It functions as a Kafka Connect framework and a change data capture tool, streaming real-time database modifications to synchronize data across distributed environments. The project differentiates itself through a dedicated mapping language for mutating and reshaping message payloads and the ability to execute custom processing logic within a sandboxed WebAssembly runtime. It also provides an observability pip
This is a stream processing engine focused on data integration and pipeline transformation, providing the necessary distributed architecture and stateful processing capabilities to handle high-volume message streams.
Go, Stream Processing Engines
apache/nifi
5,976
Apache NiFi is a flow-based programming platform that enables the visual design, monitoring, and management of data pipelines. At its core, it provides a web-based visual dataflow designer where users build directed graphs of processors to route, transform, and mediate data movement between any source and destination without writing custom code. The system records fine-grained data provenance for every data item from ingestion to delivery, supporting audit, debugging, and replay of data lineage. The platform distinguishes itself through a zero-master cluster architecture that distributes proc
Apache NiFi is a robust data integration and flow-based orchestration platform that handles distributed data movement and transformation, though it focuses more on visual pipeline management than the SQL-based stream processing typical of engines like Flink or Kafka Streams.
Java, Horizontal Scaling
vectordotdev/vector
22,071
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Vector is a high-performance stream processing engine specifically optimized for observability data pipelines, offering distributed deployment and robust transformation capabilities, though it lacks a SQL-based interface.
Rust, Exactly-Once Processing Semantics, Stream Processing
ray-project/ray
42,895
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Ray is a general-purpose distributed computing framework that provides the primitives for stateful, low-latency data processing and streaming, though it functions as a foundational execution engine rather than a specialized SQL-based stream processing platform.
Python, Actor Models, Distributed Computing Frameworks, Distributed Datasets
