What are the best open-source alternatives to Spark?

30 open-source projects similar to apache/spark, ranked by shared features. Top picks: apache/flink, hazelcast/hazelcast, dask/dask, risingwavelabs/risingwave, apache/beam, apache/hadoop, apache/hive, trinodb/trino, h2oai/h2o-3, prestodb/presto.

Is apache/flink a good alternative to Spark?

Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transform…

Is hazelcast/hazelcast a good alternative to Spark?

Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency acc…

Is dask/dask a good alternative to Spark?

Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acy…

Is risingwavelabs/risingwave a good alternative to Spark?

RisingWave is a cloud-native streaming database and real-time analytics engine that uses standard SQL to process continuous data streams. It functions as a streaming data lakehouse, combining the capabilities of a streaming SQL database with a platform that integrates streaming ingestion with open…

Is apache/beam a good alternative to Spark?

Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming mode…

Is apache/hadoop a good alternative to Spark?

Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing…

Is apache/hive a good alternative to Spark?

Apache Hive is a SQL-on-Hadoop data warehouse that enables querying and managing petabytes of data stored in distributed storage such as HDFS and cloud storage services. It provides a familiar SQL interface for batch analytics and reporting, supported by a core set of components including the HiveS…

Is trinodb/trino a good alternative to Spark?

Trino is a distributed SQL query engine designed for large-scale data analytics. It functions as a data federation platform, providing a unified interface that allows users to execute complex analytical queries across multiple heterogeneous data sources simultaneously without requiring data movemen…

Is h2oai/h2o-3 a good alternative to Spark?

h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernet…

Is prestodb/presto a good alternative to Spark?

Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without…


Open-source alternatives to Spark

30 open-source projects similar to apache/spark, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Spark alternative.

apache/flink
Stars: 26,086
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve…
Language: Java
hazelcast/hazelcast
Stars: 6,570
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis…
Language: Java
Topics: big-data, caching, data-in-motion
dask/dask
Stars: 13,746
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl…
Language: Python
Topics: dask, numpy, pandas
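
The Dask entry describes tasks and their dependencies as a directed acyclic graph that is evaluated lazily. A minimal pure-Python sketch of that idea follows. It is not Dask's API or implementation: `LazyTask`, `lazy`, and `run` are hypothetical names made up for illustration, and the sketch runs on a single thread with no scheduling across machines.

```python
from typing import Any, Callable, Tuple


class LazyTask:
    """Hypothetical graph node: a function plus the inputs it depends on."""

    def __init__(self, func: Callable[..., Any], args: Tuple[Any, ...]):
        self.func = func
        self.args = args


def lazy(func, *args):
    """Record a call as a graph node without running it."""
    return LazyTask(func, args)


def run(task, cache=None):
    """Evaluate the graph depth-first, computing each node at most once."""
    if cache is None:
        cache = {}
    if not isinstance(task, LazyTask):
        return task  # plain value, nothing to compute
    key = id(task)
    if key not in cache:
        # Resolve dependencies first, then apply this node's function.
        resolved = [run(arg, cache) for arg in task.args]
        cache[key] = task.func(*resolved)
    return cache[key]


calls = []  # records execution order so the laziness is observable


def load(n):
    calls.append("load")
    return list(range(n))


def total(xs):
    calls.append("total")
    return sum(xs)


def scale(x, factor):
    calls.append("scale")
    return x * factor


# Building the graph runs nothing; work happens only when run() is called.
data = lazy(load, 5)
summed = lazy(total, data)
graph = lazy(scale, summed, 2)
calls_before_run = list(calls)
result = run(graph)
```

Because results are cached per node, a node that feeds several downstream tasks is computed once per `run` call.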

All 30 open-source alternatives to Spark

apache/flink

hazelcast/hazelcast

dask/dask

risingwavelabs/risingwave

apache/beam

apache/hadoop

apache/hive

trinodb/trino

h2oai/h2o-3

prestodb/presto

apache/doris

apache/pinot

citusdata/citus

apache/storm

ray-project/ray

ArroyoSystems/arroyo

boto/boto3

alteryx/featuretools

osquery/osquery

pathwaycom/pathway

google-deepmind/sonnet

ivy-llc/ivy

dbt-labs/dbt-core

lukasmasuch/best-of-ml-python

sindresorhus/awesome

apache/kafka

MaterializeInc/materialize

spotify/luigi

modin-project/modin

infinyon/fluvio