Why is ray-project/ray a recommended Distributed Data Processing GitHub repository?

Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.

Why is pola-rs/polars a recommended Distributed Data Processing GitHub repository?

Scales data processing workflows from local machines to multi-node clusters for parallelized execution.

Why is donnemartin/data-science-ipython-notebooks a recommended Distributed Data Processing GitHub repository?

Includes instructional materials on scaling data operations and processing across multiple compute nodes.

Why is ipfs/ipfs a recommended Distributed Data Processing GitHub repository?

Queries distributed hash tables to identify which peers are hosting specific content identifiers.

Why is prestodb/presto a recommended Distributed Data Processing GitHub repository?

Generates quantile sketches to approximate the distribution of values for efficient rank calculation.

Why is oxnr/awesome-bigdata a recommended Distributed Data Processing GitHub repository?

Catalogs tools that execute batch and real-time data workflows across computing clusters using parallel programming models.

Why is ydataai/ydata-profiling a recommended Distributed Data Processing GitHub repository?

Scales heavy computational analysis across multiple machines to profile massive datasets.

Why is citusdata/citus a recommended Distributed Data Processing GitHub repository?

Identifies the specific worker node and shard containing data for a given tenant or distribution key.

Why is aws/amazon-sagemaker-examples a recommended Distributed Data Processing GitHub repository?

Runs distributed preprocessing and feature transformation workloads using containerized tools to prepare large datasets.

Why is modin-project/modin a recommended Distributed Data Processing GitHub repository?

Scales data operations across multiple compute nodes to increase performance and throughput.

34 repositories

Awesome GitHub Repositories: Distributed Data Processing

Frameworks and utilities for scaling data operations across multiple compute nodes.

Distinguishing note: Focuses on distributed data conversion and processing rather than general database management.

Explore 34 awesome GitHub repositories matching data & databases · Distributed Data Processing. Refine with filters or upvote what's useful.


ray-project/ray
42,895 stars
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.
Python, data-science, deep-learning, deployment
pola-rs/polars
38,855 stars
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Scales data processing workflows from local machines to multi-node clusters for parallelized execution.
Rust, arrow, dataframe, dataframe-library
donnemartin/data-science-ipython-notebooks
29,166 stars
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Includes instructional materials on scaling data operations and processing across multiple compute nodes.
Python, aws, big-data, caffe
ipfs/ipfs
23,137 stars
IPFS is a peer-to-peer hypermedia protocol and content-addressed storage system that identifies data by cryptographic hashes rather than network locations. It enables the creation of a decentralized web by organizing files and directories as directed acyclic graphs of linked content identifiers. The project differentiates itself through the use of a distributed hash table for locating peers and a system of signed records to map human-readable names to changing content. It also provides HTTP gateways that translate standard web requests into peer-to-peer queries, allowing decentralized data to
Queries distributed hash tables to identify which peers are hosting specific content identifiers.
ipfs, ipfs-protocol, ipfs-web
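The IPFS entry above describes data identified by cryptographic hash and a distributed hash table used to find which peers host a given identifier. A minimal single-process sketch of that idea follows. It is not the IPFS API: `content_id`, `ProviderIndex`, and the peer names are hypothetical, and a plain hex SHA-256 digest stands in for a real content identifier.

```python
import hashlib
from collections import defaultdict


def content_id(data: bytes) -> str:
    """Hypothetical identifier: hex SHA-256 of the bytes (not a real CID encoding)."""
    return hashlib.sha256(data).hexdigest()


class ProviderIndex:
    """In-memory stand-in for a distributed hash table of provider records."""

    def __init__(self):
        # Maps a content identifier to the set of peers that announced it.
        self._providers = defaultdict(set)

    def announce(self, peer: str, data: bytes) -> str:
        """Record that `peer` hosts `data` and return the derived identifier."""
        cid = content_id(data)
        self._providers[cid].add(peer)
        return cid

    def find_providers(self, cid: str) -> set:
        """Return a copy of the peers known to host `cid` (empty if unknown)."""
        return set(self._providers.get(cid, set()))


index = ProviderIndex()
cid = index.announce("peer-a", b"hello")
index.announce("peer-b", b"hello")   # same bytes, so same identifier
index.announce("peer-c", b"other")   # different bytes, different identifier
providers = index.find_providers(cid)
```

Because the identifier is derived from the bytes alone, two peers holding the same content announce under the same key, which is what lets a lookup return every host of that content.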
prestodb/presto
16,711 stars
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Generates quantile sketches to approximate the distribution of values for efficient rank calculation.
Java, big-data, data, hadoop
oxnr/awesome-bigdata
14,454 stars
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
Catalogs tools that execute batch and real-time data workflows across computing clusters using parallel programming models.
awesome, awesome-list, bigdata
ydataai/ydata-profiling
13,388 stars
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Scales heavy computational analysis across multiple machines to profile massive datasets.
Python, big-data-analytics, data-analysis, data-exploration
citusdata/citus
12,562 stars
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Identifies the specific worker node and shard containing data for a given tenant or distribution key.
C, citus, citus-extension, database
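The Citus entry above mentions locating the worker node and shard that hold data for a given tenant or distribution key. The sketch below illustrates the general idea of hash-based routing only. It is not how the extension is implemented: the shard count, the worker names, the use of `crc32`, and the modulo placement rule are all assumptions made for illustration, whereas the real system tracks placement in its own metadata.

```python
import zlib

# Hypothetical cluster layout used only for this illustration.
SHARD_COUNT = 8
WORKERS = ["worker-1", "worker-2", "worker-3"]


def shard_for(distribution_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a distribution key to a shard id with a stable hash.

    crc32 is a stand-in hash chosen because it is deterministic across runs.
    """
    return zlib.crc32(distribution_key.encode("utf-8")) % shard_count


def worker_for(shard_id: int, workers=WORKERS) -> str:
    """Assumed placement rule: shards are spread over workers round-robin."""
    return workers[shard_id % len(workers)]


def route(tenant_id: str):
    """Return the (shard id, worker) pair that would hold this tenant's rows."""
    shard_id = shard_for(tenant_id)
    return shard_id, worker_for(shard_id)


shard_id, worker = route("tenant-42")
```

The property that matters is determinism: every query carrying the same tenant key resolves to the same shard, so a coordinator can send it to a single worker instead of broadcasting it.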
aws/amazon-sagemaker-examples
10,958 stars
This repository is a collection of Jupyter notebooks providing reference implementations and templates for building, training, and deploying machine learning models using Amazon SageMaker. It serves as an example library for implementing model architectures and automating the machine learning lifecycle. The library provides practical patterns for machine learning training, data engineering, and model deployment. It includes implementation guides for MLOps, including workflows for model monitoring, lineage tracking, and hyperparameter tuning. The examples cover a broad range of capabilities i
Runs distributed preprocessing and feature transformation workloads using containerized tools to prepare large datasets.
Jupyter Notebook, aws, data-science, deep-learning
modin-project/modin
10,389 stars
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Scales data operations across multiple compute nodes to increase performance and throughput.
Python, analytics, data-science, dataframe
Jeffallan/claude-skills
9,935 stars
This project is an AI agent workflow orchestrator and automated software lifecycle manager designed to sequence specialized AI personas for end-to-end software development. It serves as a prompt engineering library and a full-stack development toolkit that guides the process from initial discovery and specification through to deployment and code review. The system features a context management framework that utilizes progressive loading and routing tables to fetch reference files on-demand, reducing token consumption within the model context window. It employs a definition-based routing syste
Enables manipulation and cleaning of data at scale using distributed processing tools.
Python, ai-agents, claude, claude-code
thoughtbot/guides
9,556 stars
This project is a software engineering style guide and a curated collection of architectural patterns and coding standards. It provides a multi-language coding standard to ensure maintainable software across Ruby, Python, JavaScript, and Swift. The project establishes a development workflow specification for version control, continuous integration, and peer review to maintain a linear project history. It also includes a web accessibility framework based on ARIA and WCAG standards, using design tokens and semantic HTML patterns to build inclusive interfaces. The guides cover a broad range of
Defines mechanisms for partitioning large datasets across multiple machines to increase processing throughput.
Ruby
jupyter/docker-stacks
8,432 stars
This project is a collection of pre-configured Docker images that provide ready-to-run environments for interactive computing and data science. It functions as a scientific computing stack and a polyglot notebook server, bundling language interpreters and libraries for Python, R, and Julia within a containerized system to ensure reproducible research environments. The collection uses a layered image hierarchy to provide versioned software dependencies and support for hardware acceleration across different CPU architectures. It allows for the creation of custom images based on a foundation of
Integrates Spark clusters and distributed binaries into containers for large-scale data processing.
Python, docker, ipython, ipython-notebook
pentaho/pentaho-kettle
8,353 stars
Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that uses a visual workflow designer to create and manage complex sequences of data jobs and transformation pipelines. The system distinguishes itself through its distributed data processing engine, which runs workloads on clusters of server nodes to increase throughput. It employs a plugin-based architecture, allowing the platform to be extended through external JAR files to provide connectivity to various databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure transformation. It provides tools for data quality validation, pipeline automation, and job lifecycle management, as well as monitoring utilities for tracking server health and execution status in real time.
Provides frameworks and utilities for scaling data operations across multiple compute nodes to increase throughput.
Java
linkedin/school-of-sre
8,093 stars
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
Explains frameworks and utilities for scaling data operations and analyzing high-volume streams across multiple nodes.
HTML, git, hadoop, linux
datajuicer/data-juicer
6,574 stars
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Scales data processing across multiple machines to handle large datasets efficiently.
Python, data, data-analysis, data-pipeline
apache/pinot
6,098 stars
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Reduces network traffic during joins by partitioning data across servers based on equality conditions.
Java
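The Pinot entry above notes that partitioning data across servers on the join's equality condition reduces network traffic. A minimal sketch of that co-location idea, in plain Python rather than Pinot itself: both inputs are partitioned with the same hash of the join key, so each simulated server can join its own rows without fetching any from other servers. The function names, the `crc32` hash, and the sample rows are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_servers: int) -> int:
    """Stable hash partitioner (crc32 is a stand-in, not Pinot's partition function)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_servers


def partition_rows(rows, key_field: str, num_servers: int):
    """Group rows by the server that owns their join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_field], num_servers)].append(row)
    return parts


def colocated_join(left, right, key_field: str, num_servers: int):
    """Equi-join where each server only reads its own partitions of both inputs."""
    left_parts = partition_rows(left, key_field, num_servers)
    right_parts = partition_rows(right, key_field, num_servers)
    joined = []
    for server in range(num_servers):
        # Build a local lookup table from this server's right-side rows.
        lookup = defaultdict(list)
        for r in right_parts.get(server, []):
            lookup[r[key_field]].append(r)
        # Probe it with this server's left-side rows; no cross-server reads.
        for l in left_parts.get(server, []):
            for r in lookup.get(l[key_field], []):
                joined.append({**l, **r})
    return joined


orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 2, "total": 12},
    {"user_id": 1, "total": 5},
]
users = [{"user_id": 1, "name": "ana"}, {"user_id": 3, "name": "bo"}]
result = colocated_join(orders, users, "user_id", num_servers=4)
```

Rows with equal keys always hash to the same server, so the per-server joins together produce the same matches as a join over the full inputs.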
Doubiiu/ToonCrafter
5,972 stars
ToonCrafter is a model that combines latent diffusion, reference-based colorization, and sketch-guided control for cartoon animation and interpolation. It functions as a cartoon video interpolation model, a reference-based colorization model, and a sketch-guided animation tool, all built on a latent diffusion animation framework. The project distinguishes itself by integrating three core capabilities into a single pipeline: generating smooth intermediate frames between two cartoon images using diffusion-based priors, transferring color and style from a reference image onto black-and-white ske
Ships a pipeline that uses sparse sketch outlines to steer the interpolation process and shape resulting video frames.
Python
rolando/scrapy-redis
5,639 stars
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data processing by pushing scraped items into a shared queue for parallel consumption by separate processing workers.
Facilitates distributed data processing by pushing scraped items into shared queues for parallel worker consumption.
Python
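The scrapy-redis entry above describes three shared structures: a single request queue, a duplicate filter, and an item queue consumed by separate workers. The sketch below models that coordination pattern with in-memory containers in place of Redis. It does not use the scrapy-redis or Redis APIs: `SharedStore`, `schedule`, `crawl_one`, and the SHA-1 URL fingerprint are hypothetical names and choices made for illustration.

```python
import hashlib
import json
from collections import deque


class SharedStore:
    """In-memory stand-in for the shared store that all workers would connect to."""

    def __init__(self):
        self.requests = deque()  # shared request queue (JSON strings)
        self.seen = set()        # duplicate filter (request fingerprints)
        self.items = deque()     # shared queue of scraped items (JSON strings)


def fingerprint(url: str) -> str:
    """Assumed fingerprint: SHA-1 of the URL; real fingerprints may cover more fields."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def schedule(store: SharedStore, url: str, meta=None) -> bool:
    """Enqueue a request unless any worker has already scheduled the same URL."""
    fp = fingerprint(url)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    store.requests.append(json.dumps({"url": url, "meta": meta or {}}))
    return True


def crawl_one(store: SharedStore, worker_name: str):
    """Pop one request, pretend to scrape it, and push the item for later processing."""
    if not store.requests:
        return None
    request = json.loads(store.requests.popleft())
    item = {"url": request["url"], "crawled_by": worker_name}
    store.items.append(json.dumps(item))
    return item


store = SharedStore()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
accepted = [schedule(store, u) for u in urls]  # the repeated URL is rejected
crawl_one(store, "worker-1")
crawl_one(store, "worker-2")
```

Keeping requests and items as JSON strings mirrors the entry's point that state is serialized into the shared store, which is what allows workers on different machines to pick up each other's work.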
JerryLead/SparkInternals
5,363 stars
SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study in big data engine analysis, focusing on cluster execution management and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation covers a broad range of distributed computing capabilities, including query execution planning, data dependency management, and in-memory caching strategies. It also examines task distribution, parallel execution, and the processes used for error recovery and data persistence.
Retrieves distributed data segments from multiple worker nodes using a tracker to locate and fetch blocks.

Awesome Distributed Data Processing GitHub Repositories

ray-project/ray

pola-rs/polars

donnemartin/data-science-ipython-notebooks

ipfs/ipfs

prestodb/presto

oxnr/awesome-bigdata

ydataai/ydata-profiling

citusdata/citus

aws/amazon-sagemaker-examples

modin-project/modin

Jeffallan/claude-skills

thoughtbot/guides

jupyter/docker-stacks

pentaho/pentaho-kettle

linkedin/school-of-sre

datajuicer/data-juicer

apache/pinot

Doubiiu/ToonCrafter

rolando/scrapy-redis

JerryLead/SparkInternals
