34 Repos
Frameworks and utilities for scaling data operations across multiple compute nodes.
Distinguishing note: Focuses on distributed data conversion and processing rather than general database management.
34 GitHub repositories matching data & databases · Distributed Data Processing.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Scales data processing workflows from local machines to multi-node clusters for parallelized execution.
This project is a collection of interactive Python notebooks and educational resources designed for mastering data science, machine learning, and numerical computing. It provides a series of practical guides and tutorials covering deep learning, big data processing, and statistical analysis. The repository features specialized instructional suites for implementing classical machine learning algorithms, building deep learning model architectures, and managing AWS cloud infrastructure. It includes dedicated notebooks for data visualization and numerical computing exercises. The project covers
Includes instructional materials on scaling data operations and processing across multiple compute nodes.
IPFS is a peer-to-peer hypermedia protocol and content-addressed storage system that identifies data by cryptographic hashes rather than network locations. It enables the creation of a decentralized web by organizing files and directories as directed acyclic graphs of linked content identifiers. The project differentiates itself through the use of a distributed hash table for locating peers and a system of signed records to map human-readable names to changing content. It also provides HTTP gateways that translate standard web requests into peer-to-peer queries, allowing decentralized data to
Queries distributed hash tables to identify which peers are hosting specific content identifiers.
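The content-addressing idea in the IPFS entry can be illustrated with a toy store. This is a minimal sketch, not the IPFS API: the `ContentStore` class and its methods are hypothetical, and a plain SHA-256 hex digest stands in for a real content identifier, which uses a richer encoding.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not the IPFS API)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the bytes themselves, not from a location.
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Re-hash on read so a corrupted or substituted block is detected.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("block does not match its content hash")
        return data


store = ContentStore()
key_a = store.put(b"hello")
key_b = store.put(b"hello")   # identical bytes map to the same key
key_c = store.put(b"hello!")  # any change produces a different key
```

Because the key depends only on the content, any peer holding the same bytes can serve them, and the reader can verify what it received without trusting the sender.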
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Generates quantile sketches to approximate the distribution of values for efficient rank calculation.
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
Executes batch and real-time data workflows across computing clusters using parallel programming models.
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Scales heavy computational analysis across multiple machines to profile massive datasets.
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Identifies the specific worker node and shard containing data for a given tenant or distribution key.
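Routing a distribution key to a shard and worker, as described in the Citus entry, might look roughly like the following. This is a simplified sketch under stated assumptions: the shard count, worker names, and modulo-based placement are all hypothetical and are not taken from Citus, which keeps its own shard metadata on the coordinator.

```python
import zlib

SHARD_COUNT = 8                                   # assumed value
WORKERS = ["worker-1", "worker-2", "worker-3"]    # hypothetical node names


def shard_for(key: str) -> int:
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % SHARD_COUNT


def worker_for(shard_id: int) -> str:
    # Round-robin shard placement; a real system would look this up in metadata.
    return WORKERS[shard_id % len(WORKERS)]


def route(key: str):
    """Return the (shard id, worker name) pair responsible for a key."""
    shard_id = shard_for(key)
    return shard_id, worker_for(shard_id)


tenant_route = route("tenant-42")
```

The point of the sketch is that the same key always resolves to the same shard, so every query for one tenant can be sent to a single node.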
This repository is a collection of Jupyter notebooks providing reference implementations and templates for building, training, and deploying machine learning models using Amazon SageMaker. It serves as an example library for implementing model architectures and automating the machine learning lifecycle. The library provides practical patterns for machine learning training, data engineering, and model deployment. It includes implementation guides for MLOps, including workflows for model monitoring, lineage tracking, and hyperparameter tuning. The examples cover a broad range of capabilities i
Runs distributed preprocessing and feature transformation workloads using containerized tools to prepare large datasets.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Scales data operations across multiple compute nodes to increase performance and throughput.
This project is an AI agent workflow orchestrator and automated software lifecycle manager designed to sequence specialized AI personas for end-to-end software development. It serves as a prompt engineering library and a full-stack development toolkit that guides the process from initial discovery and specification through to deployment and code review. The system features a context management framework that utilizes progressive loading and routing tables to fetch reference files on-demand, reducing token consumption within the model context window. It employs a definition-based routing syste
Enables manipulation and cleaning of data at scale using distributed processing tools.
This project is a software engineering style guide and a curated collection of architectural patterns and coding standards. It provides a multi-language coding standard to ensure maintainable software across Ruby, Python, JavaScript, and Swift. The project establishes a development workflow specification for version control, continuous integration, and peer review to maintain a linear project history. It also includes a web accessibility framework based on ARIA and WCAG standards, using design tokens and semantic HTML patterns to build inclusive interfaces. The guides cover a broad range of
Defines mechanisms for partitioning large datasets across multiple machines to increase processing throughput.
This project is a collection of pre-configured Docker images that provide ready-to-run environments for interactive computing and data science. It functions as a scientific computing stack and a polyglot notebook server, bundling language interpreters and libraries for Python, R, and Julia within a containerized system to ensure reproducible research environments. The collection uses a layered image hierarchy to provide versioned software dependencies and support for hardware acceleration across different CPU architectures. It allows for the creation of custom images based on a foundation of
Integrates Spark clusters and distributed binaries into containers for large-scale data processing.
Pentaho Kettle is an enterprise ETL data integration platform designed to extract, transform, and load data between disparate sources and target databases. It functions as a metadata-driven orchestrator that uses a visual workflow designer to create and manage complex sequences of data tasks and transformation pipelines. The system distinguishes itself through its distributed data processing engine, which executes workloads across clusters of server nodes to increase throughput. It uses a plugin-based architecture that allows the platform to be extended through external JAR files to provide connectivity to diverse databases and cloud services. The platform covers a broad range of data integration capabilities, including bulk loading, remote file management, and data structure transformation. It offers tools for data quality validation, pipeline automation, and job lifecycle management, as well as monitoring utilities for tracking server health and real-time execution status.
Provides frameworks and utilities for scaling data operations across multiple compute nodes to increase throughput.
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
Explains frameworks and utilities for scaling data operations and analyzing high-volume streams across multiple nodes.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Scales data processing across multiple machines to handle large datasets efficiently.
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Reduces network traffic during joins by partitioning data across servers based on equality conditions.
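The partitioned join mentioned in the Pinot entry rests on a general idea: if both tables are partitioned by the join key with the same function, rows that match always land on the same server, so each server can join locally. The sketch below shows that idea only; the server count, sample rows, and helper names are assumptions, and none of it reflects Pinot's actual implementation.

```python
import zlib
from collections import defaultdict

NUM_SERVERS = 3  # assumed cluster size


def partition(rows, key_index):
    """Assign each row to a server by hashing its join key."""
    parts = defaultdict(list)
    for row in rows:
        server = zlib.crc32(str(row[key_index]).encode("utf-8")) % NUM_SERVERS
        parts[server].append(row)
    return parts


def local_join(left_rows, right_rows):
    """Equality join on the first column of each row, using a hash index."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[0]].append(row)
    return [l + r[1:] for l in left_rows for r in index.get(l[0], [])]


# Hypothetical sample data: (user_id, item) and (user_id, name).
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "ana"), (2, "bo"), (4, "cy")]

order_parts = partition(orders, 0)
user_parts = partition(users, 0)

# Each "server" joins only its own partitions; no rows cross servers.
joined = []
for server in range(NUM_SERVERS):
    joined.extend(local_join(order_parts.get(server, []), user_parts.get(server, [])))
```

The per-server results, taken together, equal the result of joining the full tables in one place.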
ToonCrafter is a model that combines latent diffusion, reference-based colorization, and sketch-guided control for cartoon animation and interpolation. It functions as a cartoon video interpolation model, a reference-based colorization model, and a sketch-guided animation tool, all built on a latent diffusion animation framework. The project distinguishes itself by integrating three core capabilities into a single pipeline: generating smooth intermediate frames between two cartoon images using diffusion-based priors, transferring color and style from a reference image onto black-and-white ske
Ships a pipeline that uses sparse sketch outlines to steer the interpolation process and shape resulting video frames.
This project is a distributed web crawling framework that enables the horizontal scaling of scraping tasks. It uses Redis as a centralized request queue manager and state store to coordinate crawl progress and request metadata across multiple server instances. The system distributes crawling workloads by sharing a single request queue and utilizes a distributed duplicate filter to prevent multiple workers from visiting the same page. It persists complex request state and metadata as JSON strings within the shared remote store. The framework also provides capabilities for distributed data pro
Facilitates distributed data processing by pushing scraped items into shared queues for parallel worker consumption.
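The shared request queue and duplicate filter described in this entry can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: the `SharedQueue` class is hypothetical, a Python set and deque take the place of the Redis structures, and the fingerprint scheme is a simplification rather than the framework's own.

```python
import hashlib
import json
from collections import deque


class SharedQueue:
    """In-memory stand-in for a shared store such as Redis (hypothetical class)."""

    def __init__(self):
        self._seen = set()      # fingerprints of requests already enqueued
        self._queue = deque()   # pending requests, serialized as JSON strings

    @staticmethod
    def fingerprint(method: str, url: str) -> str:
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def push(self, url: str, method: str = "GET", meta=None) -> bool:
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False  # duplicate: this request was already queued
        self._seen.add(fp)
        payload = {"url": url, "method": method, "meta": meta or {}}
        self._queue.append(json.dumps(payload))
        return True

    def pop(self):
        return json.loads(self._queue.popleft()) if self._queue else None


queue = SharedQueue()
first = queue.push("https://example.com/page", meta={"depth": 1})
second = queue.push("https://example.com/page")   # rejected as a duplicate
request = queue.pop()
```

With several workers, the check and the insert would need to be a single atomic operation on the shared store; otherwise two workers could both pass the check before either records the fingerprint.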
SparkInternals is a technical reference and architecture guide that describes in detail the internal design and implementation of the Apache Spark distributed computing engine. It serves as an analysis of big data engines and focuses on how the system manages cluster execution and the interplay between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of data shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation covers a broad range of distributed computing capabilities, including query execution planning, data dependency management, and in-memory caching strategies. It also examines task distribution, parallel execution, and processes for fault recovery and data persistence.
Retrieves distributed data segments from multiple worker nodes using a tracker to locate and fetch blocks.