34 repositories
Frameworks and utilities for scaling data operations across multiple compute nodes.
Distinguishing note: Focuses on distributed data conversion and processing rather than general database management.
Explore 34 awesome GitHub repositories matching data & databases · Distributed Data Processing.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Scales data processing workflows from local machines to multi-node clusters for parallelized execution.
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Includes instructional materials on scaling data operations and processing across multiple compute nodes.
IPFS is a peer-to-peer hypermedia protocol and content-addressed storage system that identifies data by cryptographic hashes rather than network locations. It enables the creation of a decentralized web by organizing files and directories as directed acyclic graphs of linked content identifiers. The project differentiates itself through the use of a distributed hash table for locating peers and a system of signed records to map human-readable names to changing content. It also provides HTTP gateways that translate standard web requests into peer-to-peer queries, allowing decentralized data to
Queries distributed hash tables to identify which peers are hosting specific content identifiers.
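As a rough illustration of the content-addressing idea in the IPFS entry above, the sketch below derives an identifier from the bytes themselves and looks up which peers announced it. It is a simplification under stated assumptions: the plain SHA-256 hex digest stands in for a real content identifier, and a single in-memory dictionary (`provider_table`, a hypothetical name) stands in for a hash table that would really be spread across many peers.

```python
import hashlib
from collections import defaultdict


def content_id(data: bytes) -> str:
    """Identify data by a hash of its bytes (assumption: plain SHA-256 hex,
    not the real IPFS identifier encoding)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the distributed hash table:
# content identifier -> set of peers that host the content.
provider_table = defaultdict(set)


def announce(peer: str, data: bytes) -> str:
    """Record that `peer` hosts `data`; return the content identifier."""
    cid = content_id(data)
    provider_table[cid].add(peer)
    return cid


def find_providers(cid: str) -> list:
    """Return the peers known to host the given content identifier."""
    return sorted(provider_table.get(cid, set()))


# Two peers hosting identical bytes end up under the same identifier.
cid_a = announce("peer-1", b"hello")
cid_b = announce("peer-2", b"hello")
cid_c = announce("peer-3", b"other")
print(find_providers(cid_a))
```

Because the identifier depends only on the content, the lookup never needs to know where the data lives ahead of time.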
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Generates quantile sketches to approximate the distribution of values for efficient rank calculation.
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
Executes batch and real-time data workflows across computing clusters using parallel programming models.
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Scales heavy computational analysis across multiple machines to profile massive datasets.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Identifies the specific worker node and shard containing data for a given tenant or distribution key.
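The routing step described in the Citus entry above can be pictured with a small sketch: hash the distribution key to a shard, then consult coordinator metadata to find the worker holding that shard. Everything here is an assumption made for illustration, including the CRC32 hash, the shard count, the worker names, and the round-robin placement; it is not the extension's actual hashing scheme or metadata layout.

```python
import zlib

# Hypothetical cluster layout, chosen only for the example.
SHARD_COUNT = 8
WORKERS = ["worker-1", "worker-2", "worker-3"]

# Coordinator-style metadata: shard id -> worker node (round-robin placement).
shard_placement = {shard: WORKERS[shard % len(WORKERS)] for shard in range(SHARD_COUNT)}


def shard_for(distribution_key: str) -> int:
    """Map a distribution key (for example a tenant id) to a shard id."""
    return zlib.crc32(distribution_key.encode("utf-8")) % SHARD_COUNT


def route(distribution_key: str):
    """Return the (shard id, worker) pair that holds rows for this key."""
    shard = shard_for(distribution_key)
    return shard, shard_placement[shard]


shard, worker = route("tenant-42")
print(f"tenant-42 -> shard {shard} on {worker}")
```

Since the mapping is deterministic, every query for the same tenant lands on the same shard, which is what lets a coordinator send a tenant-scoped query to a single node.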
This repository is a collection of Jupyter notebooks providing reference implementations and templates for building, training, and deploying machine learning models using Amazon SageMaker. It serves as an example library for implementing model architectures and automating the machine learning lifecycle. The library provides practical patterns for machine learning training, data engineering, and model deployment. It includes implementation guides for MLOps, including workflows for model monitoring, lineage tracking, and hyperparameter tuning. The examples cover a broad range of capabilities i
Runs distributed preprocessing and feature transformation workloads using containerized tools to prepare large datasets.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Scales data operations across multiple compute nodes to increase performance and throughput.
This project is an AI agent workflow orchestrator and automated software lifecycle manager designed to sequence specialized AI personas for end-to-end software development. It serves as a prompt engineering library and a full-stack development toolkit that guides the process from initial discovery and specification through to deployment and code review. The system features a context management framework that utilizes progressive loading and routing tables to fetch reference files on-demand, reducing token consumption within the model context window. It employs a definition-based routing syste
Enables manipulation and cleaning of data at scale using distributed processing tools.
This project is a software engineering style guide and a curated collection of architectural patterns and coding standards. It provides a multi-language coding standard to ensure maintainable software across Ruby, Python, JavaScript, and Swift. The project establishes a development workflow specification for version control, continuous integration, and peer review to maintain a linear project history. It also includes a web accessibility framework based on ARIA and WCAG standards, using design tokens and semantic HTML patterns to build inclusive interfaces. The guides cover a broad range of
Defines mechanisms for partitioning large datasets across multiple machines to increase processing throughput.
This project is a collection of pre-configured Docker images that provide ready-to-run environments for interactive computing and data science. It functions as a scientific computing stack and a polyglot notebook server, bundling language interpreters and libraries for Python, R, and Julia within a containerized system to ensure reproducible research environments. The collection uses a layered image hierarchy to provide versioned software dependencies and support for hardware acceleration across different CPU architectures. It allows for the creation of custom images based on a foundation of
Integrates Spark clusters and distributed binaries into containers for large-scale data processing.
Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that uses a visual workflow designer to create and manage complex sequences of data jobs and transformation pipelines. The system distinguishes itself through its distributed data processing engine, which runs workloads on clusters of server nodes to increase throughput. It employs a plugin-based architecture, allowing the platform to be extended via external JAR files to provide connectivity to various databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure transformation. It provides tools for data quality validation, pipeline automation, and job lifecycle management, as well as monitoring utilities for tracking server health and execution status in real time.
Provides frameworks and utilities for scaling data operations across multiple compute nodes to increase throughput.
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
Explains frameworks and utilities for scaling data operations and analyzing high-volume streams across multiple nodes.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Scales data processing across multiple machines to handle large datasets efficiently.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Reduces network traffic during joins by partitioning data across servers based on equality conditions.
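To show why partitioning on the join key reduces network traffic, the sketch below places rows from both tables with the same hash function so that matching rows share a server, then joins each server's rows locally. It is a toy model under assumptions: the table contents, `NUM_SERVERS`, and the CRC32 partition function are invented for the example and do not reflect Pinot's own partitioning functions or query planner.

```python
import zlib
from collections import defaultdict

NUM_SERVERS = 3  # hypothetical cluster size


def server_for(key) -> int:
    """Assign a join-key value to a server; both tables use the same function."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SERVERS


def partition(rows, key_field):
    """Group rows by the server that owns their join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[server_for(row[key_field])].append(row)
    return parts


def local_join(left, right, key_field):
    """Hash join of two row lists on an equality condition."""
    index = defaultdict(list)
    for r in right:
        index[r[key_field]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key_field], [])]


orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
    {"order_id": 3, "customer_id": "c1"},
    {"order_id": 4, "customer_id": "c9"},  # no matching customer
]
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Lin"},
    {"customer_id": "c3", "name": "Sam"},
]

order_parts = partition(orders, "customer_id")
customer_parts = partition(customers, "customer_id")

# Each server joins only its own partitions; no rows cross servers.
joined = []
for server in range(NUM_SERVERS):
    joined.extend(
        local_join(order_parts.get(server, []), customer_parts.get(server, []), "customer_id")
    )
print(sorted(row["order_id"] for row in joined))
```

The per-server results add up to the same rows a single global join would produce, but no server ever needs to fetch another server's rows.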
ToonCrafter is a model that combines latent diffusion, reference-based colorization, and sketch-guided control for cartoon animation and interpolation. It functions as a cartoon video interpolation model, a reference-based colorization model, and a sketch-guided animation tool, all built on a latent diffusion animation framework. The project distinguishes itself by integrating three core capabilities into a single pipeline: generating smooth intermediate frames between two cartoon images using diffusion-based priors, transferring color and style from a reference image onto black-and-white ske
Ships a pipeline that uses sparse sketch outlines to steer the interpolation process and shape resulting video frames.
This project is a distributed web crawling framework that enables horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and uses a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data processing by pushing scraped items into a shared queue for parallel consumption by separate processing workers.
Facilitates distributed data processing by pushing scraped items into shared queues for parallel worker consumption.
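The coordination pattern in the crawler entry above (one shared request queue, a duplicate filter, and a shared item queue) can be sketched in a single process. This is only an illustration under assumptions: `SharedStore` is a hypothetical in-memory stand-in for the Redis instance, the fingerprint is a simple SHA-1 of method and URL, and the "scrape" step fabricates a placeholder item instead of fetching anything.

```python
import hashlib
import json
from collections import deque


class SharedStore:
    """Hypothetical in-memory stand-in for the shared remote store."""

    def __init__(self):
        self.request_queue = deque()  # requests waiting to be crawled
        self.seen = set()             # duplicate-filter fingerprints
        self.item_queue = deque()     # scraped items awaiting processing


def fingerprint(request: dict) -> str:
    """Stable fingerprint of a request (assumption: method plus URL)."""
    key = f"{request['method']} {request['url']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def enqueue_request(store: SharedStore, request: dict) -> bool:
    """Queue a request unless an identical one was already seen."""
    fp = fingerprint(request)
    if fp in store.seen:
        return False
    store.seen.add(fp)
    # Request state and metadata are persisted as a JSON string.
    store.request_queue.append(json.dumps(request, sort_keys=True))
    return True


def crawl_worker(store: SharedStore, worker_id: int) -> bool:
    """Take one request from the shared queue and push the resulting item."""
    if not store.request_queue:
        return False
    request = json.loads(store.request_queue.popleft())
    item = {"url": request["url"], "worker": worker_id}  # placeholder "scrape"
    store.item_queue.append(json.dumps(item))
    return True


store = SharedStore()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
accepted = [
    enqueue_request(store, {"method": "GET", "url": u, "meta": {"depth": 0}})
    for u in urls
]

# Two simulated workers alternate pulls from the same queue.
turn = 0
while crawl_worker(store, turn % 2):
    turn += 1

items = [json.loads(raw) for raw in store.item_queue]
print(accepted, items)
```

In a real deployment the three structures would live in the shared store so that crawler instances on different machines see the same queue and the same set of fingerprints.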
SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study in big data engine analysis, focusing on cluster execution management and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation covers a broad range of distributed computing capabilities, including query execution planning, data dependency management, and in-memory caching strategies. It also examines task distribution, parallel execution, and the processes used for error recovery and data persistence.
Retrieves distributed data segments from multiple worker nodes using a tracker to locate and fetch blocks.