13 repositories

Awesome GitHub Repositories: Distributed Data Processing Frameworks

Systems for partitioning, transforming, and processing large-scale datasets across distributed computing clusters.

Distinguishing note: Specifically targets lazy, partitioned data processing rather than general database management or storage.

Explore 13 awesome GitHub repositories in Data & Databases · Distributed Data Processing Frameworks.


sindresorhus/awesome
476,211 stars
This project is a community-maintained directory that serves as a comprehensive index of software tools, frameworks, and educational materials. It functions as an open-source knowledge base, organizing diverse engineering domains and technical resources into a structured taxonomy to help developers discover high-quality content. The directory is distinguished by a decentralized peer-review model in which independent contributors curate, verify, and update entries to ensure accuracy and relevance. All information is stored in a version-controlled, flat-file markdown format, which ensures platform independence, transparency, and auditability of the entire collection. The project covers a broad range of capabilities, including technical resource discovery, professional career advancement, and software development knowledge management. It offers access to structured learning paths, infrastructure and security tools, data management utilities, and specialized resources for domains ranging from healthcare to digital humanities. The repository is maintained as a public, version-controlled collection, allowing programmatic access and community-driven updates to its structured data.
Provides frameworks for partitioning and processing large-scale datasets across distributed clusters.
awesome, awesome-list, lists
apache/spark
43,467 stars
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Functions as a unified engine for partitioning, transforming, and processing massive datasets across distributed clusters.
Scala · big-data, java, jdbc
ray-project/ray
42,895 stars
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
A framework that represents data as partitioned blocks to support incremental transformations and parallel execution across large clusters.
Python · data-science, deep-learning, deployment
apache/hadoop
15,567 stars
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Provides a framework for partitioning, transforming, and processing large-scale datasets across distributed clusters.
Java
dask/dask
13,746 stars
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while handling complex execution requirements. The project is distinguished by a lazy evaluation engine that defers operations on data until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It incorporates memory-aware data spilling to prevent system crashes when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a comprehensive set of capabilities for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tools for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on a variety of infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Creates parallel collections from sequences, files, or URLs to enable distributed processing of unstructured data.
Python · dask, numpy, pandas
modin-project/modin
10,389 stars
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Partitions, transforms, and processes large-scale Pandas dataframes across distributed computing clusters.
Python · analytics, data-science, dataframe
apache/beam
8,612 stars
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Provides a system for partitioning, transforming, and processing large-scale datasets across distributed computing clusters.
Java
featuretools/featuretools
7,655 stars
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Integrates with distributed computing frameworks to maintain performance when processing large volumes of data.
Python
hazelcast/hazelcast
6,570 stars
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Redistributes data across cluster members to prevent processing bottlenecks.
Java · big-data, caching, data-in-motion
JerryLead/SparkInternals
5,363 stars
SparkInternals is a technical reference and architecture guide that details the internal design and implementation of the Apache Spark distributed computing engine. It serves as an analytical study of big data engines, focusing on how the system manages cluster execution and the interaction between driver, executor, and worker nodes. The project breaks down how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of data shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation covers a wide range of distributed computing capabilities, including query execution planning, data dependency management, and in-memory caching strategies. It also examines task distribution, parallel execution, and the processes used for failure recovery and data persistence.
Analyzes the systems used for partitioning, transforming, and processing large-scale datasets across clusters.
DTStack/chunjun
4,104 stars
Chunjun is a distributed data integration framework and SQL-based ETL pipeline designed to synchronize data between heterogeneous sources. It functions as a change data capture tool and heterogeneous data synchronizer, using a distributed processing environment to move and transform data between different types of databases. The system is distinguished by its plugin-based connector architecture, which allows custom source and sink plugins to be developed to extend connectivity to unsupported data systems. It supports real-time change data capture from relational database logs and implements schema evolution propagation to automatically apply structural changes from source tables to destination tables. The framework provides capabilities for incremental data synchronization and cross-source data computation using SQL logic. Reliability is handled through checkpoint-based task recovery to resume interrupted transfers and through dead-letter message queues for dirty data, which allow malformed records to be audited. Integration tasks can be deployed on standalone clusters, Yarn, or Kubernetes environments, with support for containerized deployment via Docker.
Provides a distributed framework for synchronizing and transforming data between heterogeneous sources using a plugin-based architecture.
Java · bigdata, data-integration, flink
databricks/learning-spark
3,899 stars
This project is a learning curriculum and programming guide for Apache Spark, providing a structured set of educational resources and practical code examples for mastering distributed data processing. It serves as a course for building scalable data workflows and big data engineering pipelines. The repository provides practical source code and project layouts that demonstrate how to connect external data stores, process streaming data, and organize code for distributed environments. It includes implementation examples for scaling machine learning algorithms across clusters to handle large tra
Implements systems for partitioning, transforming, and processing large-scale datasets across compute clusters.
Java
kananinirav/AWS-Certified-Cloud-Practitioner-Notes
3,829 stars
This project is a collection of structured study notes and conceptual breakdowns designed for the AWS Certified Cloud Practitioner exam. It serves as a technical reference and study guide, organizing cloud service details and architectural principles to assist in certification preparation. The knowledge base is built using markdown files and includes curated cheat sheets and interactive mind-map visualizations. These tools map complex certification topics into visual hierarchies to enable drill-down study paths and rapid revision. The materials cover a wide range of cloud capabilities, inclu
Explains the use of distributed frameworks for data transformation and machine learning across compute clusters.
HTML · amazon-web-services, aws, aws-certified-cloud-practitioner