What are the best Distributed Processing Frameworks repositories on GitHub?

Systems designed for parallel execution and large-scale batch or event-driven data computation across clusters. Explore 59 awesome GitHub repositories matching data & databases · Distributed Processing Frameworks. Refine with filters or upvote what's useful. Top picks: josephmisiti/awesome-machine-learning, sindresorhus/awesome-nodejs, pathwaycom/pathway, xingshaocheng/architect-awesome, pathwaycom/llm-app, apache/spark, donnemartin/data-science-ipython-notebooks, facebook/zstd, vonng/ddia, vi…



josephmisiti/awesome-machine-learning
72,867 stars
This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical domain to simplify discovery across the artificial intelligence ecosystem. The collection distinguishes itself by providing a cross-language development index that spans diverse programming environments, including C, C++, Rust, Clojure, and Python. It covers a wide range of specialized capabilities, fr
Enables large-scale computation through distributed frameworks designed for parallelized data processing and analytics.
Python
sindresorhus/awesome-nodejs
65,973 stars
This project is a community-driven directory that aggregates essential software projects and educational content for the Node.js ecosystem. It functions as a centralized knowledge base and discovery index, designed to simplify the navigation of a fragmented technical landscape by providing a structured collection of high-quality links, tools, and learning materials. The repository distinguishes itself through a decentralized, peer-reviewed curation model. By utilizing standard version control workflows and pull requests, the community ensures that all listed resources undergo human verificati
Identifies high-performance frameworks capable of ingesting and transforming data streams in real time.
awesome, awesome-list, javascript
pathwaycom/pathway
62,959 stars
Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources. The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
Processes continuous data streams in real time to support immediate event-driven analytics.
Python · batch-processing, data-analytics, data-pipelines
xingshaocheng/architect-awesome
60,821 stars
This project serves as a comprehensive knowledge base and reference for distributed systems engineering and enterprise software architecture. It provides a structured collection of technical resources, design patterns, and methodologies intended to assist in the design, maintenance, and scaling of complex, high-performance software environments. The repository distinguishes itself by offering deep dives into core architectural concepts such as actor-based concurrency, aspect-oriented interception, and inversion-of-control containers. It emphasizes the practical application of distributed syst
Executes large-scale data analytics across distributed clusters to derive insights from high-volume information sources.
pathwaycom/llm-app
59,341 stars
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Ingests and processes information from diverse sources in real time to ensure continuous visibility into changing data.
Jupyter Notebook · chatbot, hugging-face, llm
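
The description above says that only changes are propagated through a pipeline rather than recomputing entire datasets. As a rough illustration of that idea, here is a minimal sketch in plain Python. It is not the project's API: the class `IncrementalCount` and its `apply` method are hypothetical names invented for this example, and the sketch covers a single counting operator only.

```python
from collections import Counter


class IncrementalCount:
    """Hypothetical operator: keeps per-key counts current by applying deltas."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, changes):
        """Apply (key, diff) pairs and return only the keys whose count changed."""
        touched = {}
        for key, diff in changes:
            self.counts[key] += diff
            touched[key] = self.counts[key]
            if self.counts[key] == 0:
                # A key whose count drops to zero is retracted from the state.
                del self.counts[key]
        return touched


agg = IncrementalCount()
# First batch: three insertions.
first = agg.apply([("spark", +1), ("dask", +1), ("spark", +1)])
# Second batch: one retraction and one insertion; "spark" is never revisited.
second = agg.apply([("dask", -1), ("flink", +1)])
```

The second call touches only the keys named in the incoming changes, which is the point of the technique: work is proportional to the size of the update, not the size of the dataset.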
apache/spark
43,467 stars
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Ships a processing system that ingests and transforms real-time data streams for continuous analytics.
Scala · big-data, java, jdbc
donnemartin/data-science-ipython-notebooks
29,166 stars
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Includes tutorials on executing MapReduce jobs and in-memory cluster computing across distributed file systems.
Python · aws, big-data, caffe
facebook/zstd
27,259 stars
Zstandard is a lossless data compression library and archive format designed for high compression ratios and fast real-time processing. It functions as a real-time data compressor and multi-threaded compression engine capable of distributing workloads across multiple CPU cores to increase throughput. The system features a dictionary-based compressor that trains on sample data to improve the compression ratio and speed of small files. It also provides long distance pattern matching to identify repeated sequences across large files. The library covers a broad range of capabilities including st
Enables high-throughput real-time decompression to restore data quickly for immediate application use.
C
Vonng/ddia
22,648 stars
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Provides frameworks for executing large-scale data processing and computation across distributed clusters.
Python · book, database, ddia
videolan/vlc
18,717 stars
VLC is a cross-platform multimedia player and framework designed to decode and render virtually any audio or video format, network stream, or physical disc without requiring external codecs. It functions as both a standalone application and a portable library, providing a modular architecture that allows developers to integrate playback, filtering, and streaming capabilities into third-party software. The project distinguishes itself through a highly modular plugin-based engine that supports real-time media processing, including format transcoding and the application of audio and video filter
Applies audio and video transformations sequentially to raw data streams before final rendering.
C · c, framework, gplv2
pingcap/tikv
16,724 stars
TiKV is a cloud-native distributed transactional key-value store and storage engine. It provides a distributed database designed for horizontal scalability and strong consistency across a cluster of physical nodes. The system uses a Raft-based consensus mechanism to maintain data availability and state synchronization. It ensures ACID compliance for distributed transactions through a two-phase commit workflow and manages data distribution via multi-Raft sharding. The engine handles massive datasets using automated range splitting and cluster load balancing to distribute data across different
Executes filtering and aggregation logic directly on storage nodes to reduce data transfer over the network.
Rust
piskvorky/gensim
16,361 stars
Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms. The project is distinguished by its ability to handle datasets that exceed available system memory through incremental corpus streaming, which processes documents one at a time from disk. It utilizes sparse vector representations and dictionary-based token mapping to
Distributes heavy computational tasks across multiple processor cores or clusters to accelerate data operations.
Python · data-mining, data-science, document-similarity
apache/hadoop
15,567 stars
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Executes large-scale data analytics and processing tasks in parallel across distributed computing clusters.
Java
apache/doris
15,526 stars
Doris is a distributed SQL data warehouse designed for high-performance analytical workloads and real-time data processing. It functions as a unified platform that integrates traditional relational warehousing with lakehouse query capabilities, allowing users to execute analytical operations directly against external data lakes without requiring data migration. The system distinguishes itself through a shared-nothing, massively parallel processing architecture that utilizes vectorized query execution and columnar storage to maintain sub-second latency. It supports dynamic schema evolution, en
Supports continuous real-time data ingestion to ensure new information is immediately available for analysis.
Java · agent, ai, bigquery
dagster-io/dagster
14,974 stars
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
Launches and manages code execution across cloud-native, serverless, and containerized infrastructure to scale processing power.
Python · analytics, dagster, data-engineering
oxnr/awesome-bigdata
14,454 stars
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
Executes data processing tasks across interconnected nodes to handle massive datasets through parallel computation.
awesome, awesome-list, bigdata
dask/dask
13,746 stars
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computation logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project is distinguished by a lazy evaluation engine that defers data operations until they are explicitly requested, which enables global graph optimization and efficient resource allocation. It integrates memory-aware data spilling to prevent crashes when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a broad surface for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across a range of infrastructure, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Triggers the execution of lazy operations across a cluster to return final results to the local environment.
Python · dask, numpy, pandas
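
The Dask entry describes lazy operations that run only when a result is explicitly requested. The sketch below illustrates that pattern in plain Python so it runs without any dependencies. It is not Dask's implementation: the `Lazy` class and the helper functions `load` and `total` are hypothetical names made up for this example. Dask's own interface for the same idea is `dask.delayed` together with `.compute()`.

```python
class Lazy:
    """Hypothetical deferred call: records a function and its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        """Resolve dependencies first, then run this node of the task graph."""
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which functions have actually executed


def load(n):
    calls.append("load")
    return list(range(n))


def total(xs):
    calls.append("total")
    return sum(xs)


# Building the graph runs nothing; it only records what to do.
graph = Lazy(total, Lazy(load, 5))
ran_before = list(calls)

# Execution happens here, dependencies first.
result = graph.compute()
```

Because nothing runs until `compute()` is called, a scheduler has the whole graph available before execution, which is what makes optimizations such as fusing adjacent steps possible.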
ydataai/ydata-profiling
13,388 stars
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Distributes heavy computational tasks across multiple machines to profile massive datasets.
Python · big-data-analytics, data-analysis, data-exploration
vesoft-inc/nebula
12,239 stars
Nebula is a distributed graph database designed for storing and querying massive volumes of interconnected vertices and edges across a horizontally scalable cluster. It functions as a Kubernetes-native database and a distributed graph analytics engine, utilizing a Raft-based distributed store to ensure strong consistency and high availability. The system features an OpenCypher query engine for performing complex graph traversals and pattern matching. It distinguishes itself with a decoupled compute-storage architecture and a shared-nothing distributed design, allowing query processing and dat
Enables the execution of complex graph algorithms on dataframes via a distributed computing engine.
C++ · big-data, cpp, database
datahub-project/datahub
12,141 stars
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Processes metadata updates in real time using an event-driven architecture to maintain current data context.
Python · data-catalog, data-discovery, data-governance