30 open-source projects similar to apache/storm, ranked by how many features they share with it. Compare stars, activity, and what each one does to find the best Storm alternative.
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
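To make the idea of a partitioned, replicated key-value store concrete, here is a minimal sketch in Python. It is not Hazelcast's implementation: the partition count, the `partition_for` and `owners_for` helpers, and the "next node holds the backup" rule are all assumptions chosen for illustration.

```python
import zlib

PARTITION_COUNT = 271  # assumed, illustrative partition count


def partition_for(key: str) -> int:
    """Map a key to a partition id with a hash that is stable across runs."""
    # crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % PARTITION_COUNT


def owners_for(key: str, nodes: list[str]) -> tuple[str, str]:
    """Return the (primary, backup) nodes for a key (hypothetical placement rule)."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to place a backup")
    pid = partition_for(key)
    primary = nodes[pid % len(nodes)]
    backup = nodes[(pid + 1) % len(nodes)]  # the neighbouring node holds the replica
    return primary, backup


if __name__ == "__main__":
    print(owners_for("user:42", ["node-a", "node-b", "node-c"]))
```

Because every client computes the same partition for a given key, a read can go straight to the owning node, and the backup copy lets the data survive the loss of the primary.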
Fluvio is a distributed event streaming platform and cloud-native streaming engine designed for collecting, persisting, and replicating real-time data streams across a distributed cluster. It functions as a real-time data pipeline for building stateful workflows that ingest, enrich, and export data between external sources and sinks. The platform is distinguished by its use of WebAssembly to execute compiled modules for in-line data transformations and filtering. This allows for the execution of custom business logic to reshape information in motion without requiring a restart of the cluster.
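The in-line transformation idea can be sketched without any Fluvio APIs. The Python below is only a stand-in for compiled WebAssembly modules: `apply_inline`, `drop_debug`, and `add_celsius` are hypothetical names, and records are assumed to be JSON-encoded bytes.

```python
import json
from typing import Callable, Iterable, Iterator, Optional

# A transform returns a (possibly modified) record, or None to drop it.
Transform = Callable[[dict], Optional[dict]]


def apply_inline(records: Iterable[bytes], transforms: list[Transform]) -> Iterator[bytes]:
    """Run each record through the transform chain while it is in motion."""
    for raw in records:
        record: Optional[dict] = json.loads(raw)
        for transform in transforms:
            record = transform(record)
            if record is None:  # a filter rejected the record
                break
        if record is not None:
            yield json.dumps(record).encode("utf-8")


def drop_debug(record: dict) -> Optional[dict]:
    """Filter: discard debug-level events."""
    return None if record.get("level") == "debug" else record


def add_celsius(record: dict) -> Optional[dict]:
    """Map: enrich records that carry a Fahrenheit reading."""
    if "temp_f" in record:
        record = {**record, "temp_c": (record["temp_f"] - 32) * 5 / 9}
    return record


if __name__ == "__main__":
    stream = [b'{"level": "debug"}', b'{"level": "info", "temp_f": 212}']
    for out in apply_inline(stream, [drop_debug, add_celsius]):
        print(out)
```

Swapping the list of transforms changes the pipeline's behaviour without touching the code that moves records, which is the property the description attributes to pluggable modules.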
RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open table formats. The system is distinguished by its use of the PostgreSQL wire protocol, allowing it to integrate with existing SQL tools and drivers. It employs a decoupled compute and storage architecture, persisting streaming state and materialized views in cloud object storage to enable independen
Apache Storm is a distributed stream processing framework and real-time data processing engine. It functions as a fault-tolerant distributed computing system designed to analyze data in motion across a cluster of machines for continuous stream computation. The system enables the creation of fault-tolerant data pipelines and scalable event processing by distributing workloads across a network of computing nodes. This architecture ensures low latency and high throughput for live data while allowing the system to recover automatically from individual node failures. The framework provides capabi
Storm is a distributed stream processing framework and fault-tolerant compute engine designed for executing real-time continuous computations across a cluster of machines. It functions as a stateful stream processor and cluster topology manager, enabling the deployment and monitoring of distributed data flow configurations. The system supports exactly-once processing semantics through transactional state management, so that each message in a data stream is reflected in state exactly once even when it is replayed after a failure. It further operates as a distributed RPC system, allowing for the integration of non-native languages throu
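One common way to get exactly-once state updates from transactional state management is to store a transaction id next to the state and skip batches that were already applied. The Python below is a minimal sketch of that idea, not Storm's code: `TransactionalCounter` and `apply_batch` are hypothetical names, and it assumes batches arrive in increasing transaction-id order and that a replayed batch has the same contents as the original.

```python
from typing import Optional


class TransactionalCounter:
    """Counts messages; replayed batches do not change the count."""

    def __init__(self) -> None:
        self.value = 0
        self.last_txid: Optional[int] = None  # id of the last batch applied

    def apply_batch(self, txid: int, messages: list) -> bool:
        """Apply a batch once; return False if it was already applied."""
        if self.last_txid is not None and txid <= self.last_txid:
            return False  # replay after a failure: state already includes it
        self.value += len(messages)
        self.last_txid = txid  # in a real store, value and txid are written atomically
        return True


if __name__ == "__main__":
    counter = TransactionalCounter()
    counter.apply_batch(1, ["a", "b"])
    counter.apply_batch(1, ["a", "b"])  # replayed batch is ignored
    counter.apply_batch(2, ["c"])
    print(counter.value)
```

The messages themselves may be delivered more than once; it is the state update that happens exactly once.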
This project is a collection of educational resources and reference implementations for the Apache Flink stream processing framework. It provides a learning resource focused on mastering distributed stream processing through implementation guides, performance tuning tutorials, and practical examples. The repository features detailed walkthroughs for building real-time data pipelines using the DataStream and Table APIs. It includes specific integration examples for connecting Apache Flink with Kafka brokers and Elasticsearch indices, as well as reference implementations for real-time deduplica
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Perspective is a columnar data analytics library and streaming data visualization engine. It provides an interactive data grid component and notebook analytics widgets designed for processing high-volume data and rendering interactive charts and grids. The system utilizes a high-performance query engine to enable real-time data analysis and streaming dataset visualization. It supports the creation of customizable dashboards and reports that update automatically as new data arrives without requiring full dataset reloads. The project covers large-scale dataset analytics through a schema-driven
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
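Event-time semantics means a record is assigned to a window by the timestamp it carries, not by when it arrives. The Python below is a simplified sketch of a tumbling-window count under that rule. It is not Arroyo's engine: `tumbling_window_counts` is a hypothetical helper, timestamps are assumed to be integer seconds, and watermarks and late-data handling are left out.

```python
from collections import defaultdict
from typing import Iterable


def tumbling_window_counts(
    events: Iterable[tuple[int, str]], window_seconds: int
) -> dict[tuple[int, str], int]:
    """Count events per (window_start, key), bucketing by event time."""
    counts: dict[tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        # Align the timestamp down to the start of its fixed-size window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    # The third event arrives out of order but still lands in the first window.
    events = [(1, "a"), (12, "a"), (3, "a"), (11, "b")]
    print(tumbling_window_counts(events, window_seconds=10))
```

A production engine also needs a rule for when a window is complete enough to emit, which this sketch deliberately omits.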
Orleans is a .NET distributed actor framework designed for building scalable, cloud-native applications. It implements a virtual actor model where entities with stable identities manage their own state and lifecycle across a cluster of servers. The framework provides a distributed state management system with ACID transaction support and a distributed pub/sub streaming engine for real-time data processing. It distinguishes itself through location-transparent routing, automatic actor activation and deactivation, and elastic cluster scaling that redistributes workloads during node failures. Th
YouPlot is a command line plotting utility and terminal data visualization tool used to render statistical plots and charts directly within a terminal interface using Unicode characters. It functions as a Unix pipeline plotter, allowing users to visualize numerical data without leaving the shell. The project operates as a real-time data visualizer, drawing plots progressively as data streams into the system. It integrates into command line pipelines by reading data from standard input to provide real-time stream monitoring and data analysis. The tool covers a variety of rendering capabilitie
ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving. The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
Apache Mesos is a distributed systems kernel and cluster resource manager that abstracts CPU, memory, and storage across a pool of nodes. It functions as a distributed infrastructure orchestrator, providing a layer to run multiple orchestration frameworks on a shared set of physical or virtual machines. The system acts as a resource isolation engine, dividing a shared cluster into isolated containers to run diverse workloads concurrently. It enables multi-framework orchestration, allowing different distributed application frameworks to share a single infrastructure to maximize hardware utiliz
Otter is a distributed database synchronization system and change data capture tool designed to replicate data between databases across multiple geographic regions. It functions as a synchronization orchestrator and ETL data pipeline that mirrors records and associated files in real time. The system employs incremental log parsing to capture database changes and utilizes a consistency-based convergence algorithm and loop-avoidance logic to manage bi-directional replication. It processes data through a pipeline of selection, extraction, transformation, and loading to handle joins and format co
Faust is a Python library for building distributed stream processing applications that integrate with Kafka. It functions as an asynchronous stream processor designed to handle high-throughput event streams and real-time data analysis using asynchronous functions. The system operates as a distributed stream processor and state store, utilizing sharding and partitioned topics to scale processing workloads horizontally across multiple worker nodes. It maintains state through a replicated key-value storage system backed by local databases to ensure high availability and fast recovery. The frame
Pipeline is a Kubernetes native CI/CD framework and cloud native pipeline orchestrator. It functions as a custom resource controller that translates declarative pipeline definitions into coordinated pod executions and managed workloads. The system acts as a containerized task runner, allowing for the execution of standalone build steps and reusable tasks that process specific inputs to produce defined outputs. It enables the orchestration of complex workflows by running a sequence of independent containers as modular components within a cloud environment. The platform covers automated softwa
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Fluent Bit is a cloud-native log shipper and unified telemetry collector designed as a resource-efficient data pipeline. It ingests logs, metrics, and traces from multiple sources, processing them in real-time before routing the data to external storage backends. The project functions as a real-time stream processor and OpenTelemetry log processor, capable of transforming and filtering data using SQL and conditional logic. It also acts as a distributed tracing agent that can sample traces to reduce data volume while preserving full request paths. The system provides reliable data delivery th
This project is a reference library of architectural blueprints, study materials, and design patterns for building scalable, high-availability distributed systems. It serves as a technical guide for scalability engineering, providing structural solutions for common engineering challenges. The repository focuses on distributed systems design, covering essential patterns for data replication, consensus algorithms, and transaction management. It distinguishes itself by offering detailed blueprints for specialized domains, including real-time data streaming, large-scale data storage, and high-ava
Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools. The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
Swift Build is a modular build system designed to orchestrate the compilation of software projects. It functions as a low-level engine that manages the entire build lifecycle, including dependency resolution, task scheduling, and the generation of executable binaries or libraries. By utilizing a decoupled client-server architecture, the system separates the build engine from the interface to facilitate consistent and isolated task execution. The system distinguishes itself through a graph-based approach to task scheduling and a persistent database that tracks file states to ensure incremental
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
This project is an agile project management platform designed to centralize task tracking, workflow organization, and team productivity monitoring. It provides a unified workspace where users can manage projects, tickets, and milestones through visual boards, while simultaneously recording time spent on specific activities to generate detailed performance reports. The platform distinguishes itself through its ability to consolidate data from multiple third-party management tools into a single, normalized schema. It incorporates a locale-aware interface framework that supports global teams by
Kubero is a self-hosted Platform as a Service (PaaS) that simplifies the deployment, scaling, and management of containerized applications on Kubernetes. It functions as an application manager, CI/CD orchestrator, and multi-tenant manager, allowing users to run workloads without writing manual configuration files. The platform distinguishes itself through automated image synthesis, transforming source code from Git repositories into deployable containers via buildpacks, Dockerfiles, or nixpacks. It implements a GitOps delivery model with automated pipelines that trigger builds on push events
Azkaban is a distributed workflow manager and DAG-based job orchestrator designed as an enterprise batch processor. It serves as a Java-based workflow engine that schedules and executes complex job sequences across a cluster of executor servers, with specific functionality for managing big data workloads on Hadoop clusters. The system distinguishes itself through a distributed executor model that coordinates state via a shared database to ensure high availability. It employs a plugin-based architecture that allows for custom job types and system functionality extensions, including the ability
This project provides a comprehensive implementation of the AT Protocol, serving as a framework for building decentralized social networking applications. It enables the creation of distributed data repositories where users maintain cryptographic ownership of their identity and content, allowing for portable accounts that can be migrated between independent servers without central authority intervention. The platform distinguishes itself by decoupling content hosting from discovery through modular algorithmic curation. Users can select third-party services to filter and organize their feeds,
NATS Server is a high-performance, lightweight messaging system designed for cloud-native applications, edge computing, and distributed microservices. It functions as a distributed publish-subscribe broker that routes messages using hierarchical, dot-separated subject strings, enabling decoupled communication between services without requiring centralized broker lookups. The system supports core messaging patterns including asynchronous publish-subscribe, request-reply, and load-balanced queue processing. The platform distinguishes itself through a decentralized architecture that eliminates t
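Routing on hierarchical, dot-separated subjects can be sketched as token-by-token matching. The description above does not mention wildcards; the sketch assumes the usual NATS convention in which `*` matches exactly one token and `>` matches one or more trailing tokens. The `subject_matches` and `route` helpers are hypothetical and are not part of any NATS client library.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a dot-separated subject matches a subscription pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject is shorter than the pattern
        if token != "*" and token != s_tokens[i]:
            return False  # literal token mismatch
    return len(p_tokens) == len(s_tokens)


def route(subscriptions: dict[str, list[str]], subject: str) -> list[str]:
    """Collect the subscribers whose pattern matches the published subject."""
    matched: list[str] = []
    for pattern, subscribers in subscriptions.items():
        if subject_matches(pattern, subject):
            matched.extend(subscribers)
    return matched


if __name__ == "__main__":
    subs = {"orders.*.created": ["billing"], "orders.>": ["audit"], "users.>": ["crm"]}
    print(route(subs, "orders.eu.created"))
```

Publishers only name a subject and subscribers only declare patterns, so neither side needs to know the other's address.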
Iggy is a distributed message streaming platform and multi-protocol message broker that functions as a persistent distributed log store. It provides infrastructure for publishing and consuming binary messages using an append-only log, ensuring high availability and data consistency across nodes through Viewstamped Replication. The platform is distinguished by its specialized LLM streaming infrastructure, which uses a server protocol to connect large language models to streaming data and system controls. This includes standardized protocols for context management and data bridging via HTTP or
RStudio is a specialized integrated development environment for the R programming language and statistical computing. It provides a workbench for writing, debugging, and executing R code, offering both a desktop application and a server-hosted collaborative platform for managing data science projects. The platform enables the creation of interactive data applications, AI-powered dashboards, and technical reports. It facilitates the sharing of analysis results through a centralized publishing platform and supports the rendering of notebooks and markdown into multiple file formats. The environ