# heibaiying/bigdata-notes

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/heibaiying-bigdata-notes).**

16,912 stars · 4,283 forks · Java

## Links

- GitHub: https://github.com/heibaiying/BigData-Notes
- awesome-repositories: https://awesome-repositories.com/repository/heibaiying-bigdata-notes.md

## Topics

`azkaban` `big-data` `bigdata` `flume` `hadoop` `hbase` `hdfs` `hive` `kafka` `mapreduce` `phoenix` `scala` `spark` `sqoop` `storm` `yarn` `zookeeper`

## Description

BigData-Notes is a big data learning resource and data engineering knowledge base. It provides a collection of guides, technical references, and documentation focused on the installation and configuration of distributed data processing technologies.

The project lays out a learning path for distributed systems, including the setup of large-scale data storage and computing clusters. It covers both batch and stream processing workflows, as well as how to implement and use APIs for interacting with distributed messaging and storage systems.

The materials are written in Markdown and organized into a hierarchy of categories that separates the technology stacks. Within this structure, step-by-step configuration guides walk through deploying distributed computing environments.

## Tags

### Education & Learning Resources

- [Big Data Learning Paths](https://awesome-repositories.com/f/education-learning-resources/big-data-learning-paths.md) — Provides a structured educational path for learning the foundational concepts of big data and distributed systems. ([source](https://github.com/heibaiying/bigdata-notes#readme))
- [Distributed Systems Study Guides](https://awesome-repositories.com/f/education-learning-resources/distributed-systems-study-guides.md) — Includes detailed study guides for installing and configuring a variety of distributed system tools. ([source](https://github.com/heibaiying/bigdata-notes#readme))
- [Engineering Knowledge Bases](https://awesome-repositories.com/f/education-learning-resources/engineering-knowledge-bases.md) — Curates a technical knowledge base of materials and workflows specifically for data engineers.
- [Installation Guides](https://awesome-repositories.com/f/education-learning-resources/installation-guides.md) — Provides linear, step-by-step configuration flows for deploying complex distributed computing environments.

### Data & Databases

- [Deployment Guides](https://awesome-repositories.com/f/data-databases/big-data-processing/deployment-guides.md) — Provides detailed instructions for setting up big data software stacks on servers for large-scale processing.
- [Unified Batch and Stream Processing Engines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing-frameworks/unified-batch-and-stream-processing-engines.md) — Documents the use of unified engines for processing both historical batch data and live data streams.
- [Data Processing Workflows](https://awesome-repositories.com/f/data-databases/data-processing-workflows.md) — Covers the execution and definition of batch and stream processing tasks using distributed computing engines. ([source](https://github.com/heibaiying/bigdata-notes#readme))
- [API Reference Guides](https://awesome-repositories.com/f/data-databases/big-data-processing/api-reference-guides.md) — Offers reference materials for using programming interfaces to interact with distributed storage and messaging systems. ([source](https://github.com/heibaiying/bigdata-notes#readme))
- [Distributed Data API Implementations](https://awesome-repositories.com/f/data-databases/distributed-data-api-implementations.md) — Demonstrates how to implement and use APIs for interacting with distributed messaging and storage systems.

### Software Engineering & Architecture

- [Distributed Storage Clusters](https://awesome-repositories.com/f/software-engineering-architecture/distributed-systems/distributed-data-management/distributed-storage-clusters.md) — Provides technical references and setup instructions for managing large-scale distributed storage clusters.
- [Distributed System API References](https://awesome-repositories.com/f/software-engineering-architecture/api-specification-versions/api-specification-references/distributed-system-api-references.md) — Indexes detailed reference material for the programming interfaces used to interact with distributed storage and messaging systems.
- [Technology Stack Modularization](https://awesome-repositories.com/f/software-engineering-architecture/technology-stack-modularization.md) — Groups related big data tools into modular sections to separate batch, stream, and storage concepts.

### Content Management & Publishing

- [Hierarchical Navigations](https://awesome-repositories.com/f/content-management-publishing/category-organizations/hierarchical-navigations.md) — Organizes learning paths through nested category structures that guide users from basic to advanced distributed tools.
- [Markdown-Based Knowledge Bases](https://awesome-repositories.com/f/content-management-publishing/content-management-systems/content-architecture-modeling/markdown-ecosystem-tools/markdown-based-knowledge-bases.md) — Uses markdown-based files to structure technical documentation for version control and static site generation.
