What are the best open-source alternatives to Hadoop?

30 open-source projects similar to apache/hadoop, ranked by shared features. Top picks: apache/flink, apache/hbase, apache/spark, hazelcast/hazelcast, jerrylead/sparkinternals, gluster/glusterfs, apache/beam, ceph/ceph, donnemartin/data-science-ipython-notebooks, alteryx/featuretools.

Is apache/flink a good alternative to Hadoop?

Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transform…

Is apache/hbase a good alternative to Hadoop?

HBase is a distributed, wide-column NoSQL store and big data storage engine designed for sparse datasets. It functions as a scalable columnar database built on top of the Hadoop Distributed File System to provide real-time read and write access to massive volumes of structured and unstructured data…

Is apache/spark a good alternative to Hadoop?

Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the executio…

Is hazelcast/hazelcast a good alternative to Hadoop?

Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency acc…

Is jerrylead/sparkinternals a good alternative to Hadoop?

SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study of big data engine analysis, focusing on how the system manages cluster execution and the interaction between driver…

Is gluster/glusterfs a good alternative to Hadoop?

GlusterFS is a software-defined distributed file system and scale-out storage cluster that aggregates disk resources from multiple servers into a single global namespace. It functions as a unified storage platform, allowing the same underlying data to be exposed through file, block, and object stor…

Is apache/beam a good alternative to Hadoop?

Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming mode…

Is ceph/ceph a good alternative to Hadoop?

Ceph is a unified, software-defined storage platform designed to provide object, block, and file storage services from a single distributed cluster. By decoupling data management from physical hardware, it enables elastic scaling across commodity hardware, allowing organizations to build large-scal…

Is donnemartin/data-science-ipython-notebooks a good alternative to Hadoop?

This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The r…

Is alteryx/featuretools a good alternative to Hadoop?

Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distribut…


Open-source alternatives to Hadoop

30 open-source projects similar to apache/hadoop, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Hadoop alternative.
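The page does not say how the ranking is computed. As a rough illustration only, ranking by the number of features in common could look like the sketch below. The function name `rank_by_shared_features`, the feature tags, and the `example/...` project names are all hypothetical and are not taken from the site's data.

```python
def rank_by_shared_features(target, candidates):
    """Return (name, shared_count) pairs, most shared features first."""
    # Count how many of the target's features each candidate also has.
    scored = [(name, len(target & feats)) for name, feats in candidates.items()]
    # Sort by descending overlap; break ties alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical feature tags, invented for illustration only.
hadoop = {"batch-processing", "distributed-storage", "cluster-computing"}
candidates = {
    "example/engine": {"batch-processing", "cluster-computing", "stream-processing"},
    "example/store": {"distributed-storage"},
    "example/notebooks": {"tutorials"},
}

ranking = rank_by_shared_features(hadoop, candidates)
for name, shared in ranking:
    print(f"{name}: {shared} shared feature(s)")
```

A plain overlap count favors projects with many tags. A real ranking might normalize by the size of each feature set, but that is a design choice this page does not describe.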

apache/flink
26,086 stars
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve…
Java
apache/hbase
5,540 stars
HBase is a distributed, wide-column NoSQL store and big data storage engine designed for sparse datasets. It functions as a scalable columnar database built on top of the Hadoop Distributed File System to provide real-time read and write access to massive volumes of structured and unstructured data. The system acts as a cross-language database gateway, offering connectivity through native remote procedure calls, REST, and Thrift interfaces. It distinguishes itself through a master-worker coordination model that enables horizontal scaling and fault tolerance across a cluster. The project cove…
Java
apache/spark
43,467 stars
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e…
Scala
Topics: big-data, java, jdbc

Open-source alternatives to Hadoop

apache/flink

apache/hbase

apache/spark

hazelcast/hazelcast

JerryLead/SparkInternals

gluster/glusterfs

apache/beam

ceph/ceph

donnemartin/data-science-ipython-notebooks

alteryx/featuretools

dagster-io/dagster

shekhargulati/52-technologies-in-2016

nrwl/nx

quarkusio/quarkus

databricks/Spark-The-Definitive-Guide

databricks/learning-spark

h2oai/h2o-3

deepseek-ai/3FS

modin-project/modin

dask/dask

azkaban/azkaban

apache/iotdb

featuretools/featuretools

Vonng/ddia

e2b-dev/code-interpreter

linkedin/school-of-sre

jupyter/docker-stacks

kananinirav/AWS-Certified-Cloud-Practitioner-Notes

microsoft/rushstack

AnswerDotAI/nbdev