Hadoop

Features

Big Data Processing - Provides the primary infrastructure for managing, storing, and processing massive volumes of data across distributed systems.
Distributed Computing - Executes large-scale data analytics and processing tasks in parallel across distributed computing clusters.
Distributed Data Processing Frameworks - Provides a framework for partitioning, transforming, and processing large-scale datasets across distributed clusters.
Distributed File Systems - Implements a scalable distributed file system that splits files into blocks across commodity hardware nodes.
Large-Scale Data Computation - Implements a distributed framework for executing complex data analysis and computation across large clusters.
MapReduce Processing Engines - Provides a parallel computing engine based on the MapReduce programming model for processing massive datasets.
Distributed Storage Clusters - Creates a scalable storage system by aggregating multiple nodes into a unified distributed storage cluster.
Distributed File Systems - Implements a scalable distributed file system that manages large files across multiple nodes for fault tolerance.
Data-Locality Scheduling - Implements scheduling that minimizes network traffic by executing logic on the physical node where the required data resides.
Fault Tolerance - Ensures high data availability and resilience by replicating data blocks across multiple physical nodes.
Cluster Resource Managers - Provides a mechanism to allocate computing resources and schedule jobs across distributed network nodes to optimize hardware usage.
Master-Worker Coordination - Uses a central node to manage metadata and orchestrate task assignment to worker nodes across the cluster.
Big Data Frameworks - Framework for distributed processing of large datasets.
Data Processing - Distributed processing framework for big data workloads.
Data Processing and Analysis - Foundation for distributed storage and large-scale data processing.
Distributed Filesystems - Distributed filesystem for high-throughput application data.
Data Engineering - Framework for distributed processing of large datasets.
Data Infrastructure Management - Framework for distributed processing of large datasets across compute clusters.
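
The MapReduce programming model named in the feature list can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration. Each inner list of input stands in for one block of a file that a separate worker would process in a real cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle, and reduce over a list of input splits.

    Hypothetical driver: each split is a list of text records and stands in
    for one file block. Here the splits are processed sequentially; a real
    cluster would run the map calls in parallel near the stored data.
    """
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_phase(record))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    example_splits = [["big data big clusters"], ["data locality", "big data"]]
    print(run_job(example_splits))
```

The map and reduce functions never share state, which is what lets a framework distribute them across nodes and rerun a failed task on another node without affecting the rest of the job.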

Open-source alternatives to Hadoop

Similar open-source projects, ranked by how many features they share with Hadoop.

apache/flink
GitHub stars: 26,086
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Java
apache/hbase
GitHub stars: 5,540
HBase is a distributed, wide-column NoSQL store and big data storage engine designed for sparse datasets. It functions as a scalable columnar database built on top of the Hadoop Distributed File System to provide real-time read and write access to massive volumes of structured and unstructured data. The system acts as a cross-language database gateway, offering connectivity through native remote procedure calls, REST, and Thrift interfaces. It distinguishes itself through a master-worker coordination model that enables horizontal scaling and fault tolerance across a cluster. The project cove
Java
apache/spark
GitHub stars: 43,467
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Scala
big-data, java, jdbc
hazelcast/hazelcast
GitHub stars: 6,570
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Java
big-data, caching, data-in-motion

See all 30 alternatives to Hadoop
