Ray

Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls.
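
The actor idea described above can be sketched in plain Python. This is a stand-in built on the standard library, not Ray's API: a worker thread takes the place of the dedicated process, and the names CounterActor, increment, and stop are hypothetical. It only illustrates that one worker owns the state and that callers get futures back instead of blocking.

```python
import queue
import threading
from concurrent.futures import Future


class CounterActor:
    """Toy actor: a single dedicated worker thread owns the state."""

    def __init__(self):
        self._count = 0                  # mutated only by the worker thread
        self._mailbox = queue.Queue()    # pending method calls, in arrival order
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Handle one message at a time, so state updates never race.
        while True:
            item = self._mailbox.get()
            if item is None:             # shutdown sentinel
                break
            amount, future = item
            self._count += amount
            future.set_result(self._count)

    def increment(self, amount=1):
        """Enqueue a call and return a future immediately."""
        future = Future()
        self._mailbox.put((amount, future))
        return future

    def stop(self):
        self._mailbox.put(None)
        self._worker.join()


actor = CounterActor()
futures = [actor.increment() for _ in range(5)]   # five asynchronous calls
results = [f.result() for f in futures]           # wait for each result
actor.stop()
print(results)
```

Because the mailbox is processed in order by one worker, each call observes the state left by the previous one, which is the property the actor model relies on.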

Ray includes a cross-language interoperability layer that lets functions and objects be invoked across language runtimes: Python code can call Java static methods and create Java actors, and Java code can do the same with Python remote functions and actors. It also builds and runs directed acyclic graphs of tasks, optimizing dependency chains to speed up execution. Additionally, Ray includes a distributed data processing engine that uses lazy evaluation and partitioned blocks to handle large-scale data transformations, ingestion, and streaming workflows across heterogeneous clusters.
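
Lazy evaluation over partitioned blocks can be sketched as follows. This is a simplified model in standard-library Python, not Ray's data engine: LazyDataset, from_items, and take_all are hypothetical names, and a thread pool stands in for cluster workers. Transformations are only recorded until results are requested, and each block is then processed independently.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyDataset:
    """Toy dataset: rows are split into blocks, operations are deferred."""

    def __init__(self, blocks, ops=()):
        self._blocks = blocks
        self._ops = tuple(ops)           # recorded plan, not yet executed

    @classmethod
    def from_items(cls, items, block_size):
        blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
        return cls(blocks)

    def map(self, fn):
        return LazyDataset(self._blocks, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._blocks, self._ops + (("filter", pred),))

    def _apply(self, block):
        # Run the whole recorded plan against one block.
        rows = block
        for kind, fn in self._ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def take_all(self):
        # Execution happens here: blocks are processed in parallel.
        with ThreadPoolExecutor() as pool:
            out_blocks = list(pool.map(self._apply, self._blocks))
        return [row for block in out_blocks for row in block]


evaluated = []                           # records which rows were actually processed


def double(x):
    evaluated.append(x)
    return x * 2


ds = LazyDataset.from_items(list(range(10)), block_size=4)
ds = ds.map(double).filter(lambda x: x % 3 == 0)
ran_before = len(evaluated)              # nothing has executed yet
result = ds.take_all()
print(ran_before, result)
```

Deferring execution lets the engine see the whole plan before running it, and working block by block means no single worker needs the full dataset in memory.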

Beyond its core execution primitives, the project provides comprehensive capabilities for distributed machine learning inference and stateful service hosting. It includes built-in tools for cluster observability, such as execution tracing, memory inspection, and real-time status monitoring, which assist in diagnosing performance bottlenecks and managing resource allocation. The system also offers specialized support for managing runtime environments and dependencies to ensure consistent execution across distributed nodes.
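
As a rough illustration of the runtime environment support mentioned above, dependencies are usually described as a plain dictionary. The keys pip, env_vars, and working_dir are runtime environment fields in Ray; the package names and values below are placeholders chosen for this sketch, and the call that would consume the dictionary is shown only as a comment so the snippet runs without Ray installed.

```python
# Hypothetical job-level runtime environment; package names and values are placeholders.
runtime_env = {
    "pip": ["requests", "numpy"],          # packages to install on each node
    "env_vars": {"APP_MODE": "staging"},   # environment variables for every worker
    "working_dir": ".",                    # local directory uploaded to the cluster
}

# With Ray installed, this dictionary would typically be passed at startup:
#   import ray
#   ray.init(runtime_env=runtime_env)
print(sorted(runtime_env))
```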

Technical documentation and educational resources are available at docs.ray.io, covering architectural patterns, design templates, and common implementation strategies for distributed systems.

Features

Distributed Datasets - Creates and controls data collections that support lazy transformations and parallel processing across various storage sources.
Distributed Shared Memory - Ray provides a shared memory space to store and retrieve objects, enabling efficient data sharing and asynchronous processing across workers.
Distributed Computing Frameworks - A programming model that scales Python and Java applications across clusters by abstracting task scheduling and resource management.
Distributed Task Orchestration - Ray supports creating remote functions and actor classes to execute code across a cluster while managing resource requirements and lifecycles.
Distributed Task Orchestrators - Scaling Python functions and classes across a cluster to execute parallel workloads with fine-grained resource and dependency management.
Distributed Task Schedulers - A resource-aware execution engine that manages task dependencies, placement, and fault tolerance across a pool of networked compute nodes.
Actor Models - A model where stateful objects run in dedicated processes to maintain and mutate internal state across remote method calls.
Stateful Distributed Actors - Ray supports defining classes that run in dedicated processes to maintain and mutate internal state across multiple remote method calls.
Inference Pipeline Orchestrators - Executes multi-stage inference pipelines that handle preprocessing, tokenization, and accelerated GPU inference.
Model Serving Frameworks - Deploying and scaling complex model pipelines across multiple GPUs to handle high-throughput requests with automatic resource autoscaling.
Data Processing Frameworks - Transforming and analyzing massive datasets in parallel using lazy evaluation, distributed shuffles, and efficient memory management.
Dataset Transformations - Applies functions to rows or batches to filter, map, or manipulate data for downstream processing tasks.
Distributed Data Engines - A library for parallelizing large-scale data transformations, ingestion, and streaming workflows across heterogeneous compute clusters.
Fault Tolerance - Ensures distributed tasks and actors remain resilient through automated failure handling and object ownership management.
Graph Compilation - Builds and runs directed acyclic graphs to optimize task performance and inspect dependencies.
Inference Scaling Frameworks - Distributes inference workloads across multiple GPUs or nodes by configuring concurrency and parallel strategies.
Dataset Aggregations - Computes custom or built-in aggregations on datasets by passing functions to grouping operations for efficient data analysis.
Distributed Data Processing - Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.
Distributed Data Processing Frameworks - A framework that represents data as partitioned blocks to support incremental transformations and parallel execution across large clusters.
Distributed Object Stores - A shared memory system that enables efficient data sharing and asynchronous communication between workers across a cluster.
Resource Management Policies - Enforces global CPU, GPU, and memory limits to prevent resource contention during concurrent job execution.
Runtime Environment Configuration - Configures dependencies and packages for applications to ensure consistent execution across distributed clusters.
Scheduling Strategies - Creates custom placement rules for tasks and actors to pin work to specific nodes or group resources together.
Stateful Service Runtimes - Building long-running, fault-tolerant services that maintain internal state and handle concurrent requests across a distributed infrastructure.
High-Performance Data Transfer - Moves tensors between actors using specialized libraries to avoid expensive serialization.
Concurrency Models - Ray allows creating actors with asynchronous methods to execute multiple tasks concurrently on a single event loop during I/O operations.
Foreign Function Interfaces - A serialization and communication layer that allows functions and objects to be invoked across different programming language runtimes.
Runtime Environments - Integrating components written in different programming languages into a single application by sharing data and execution handles seamlessly.
Fault Tolerance Policies - Ray enables defining restart limits and retry counts for actors to handle unexpected crashes and maintain high service availability.
Cluster Monitoring - Provides real-time cluster health displays, including resource allocation and autoscaling information.
Automated Machine Learning - Scalable framework for distributed hyperparameter tuning.
Optimization Tools - Framework for building and running distributed applications.
Perception and Machine Learning - Distributed framework for scaling machine learning applications.
Reinforcement Learning - Scalable industry-level library for distributed reinforcement learning.
Data Science and Databases - Framework for building distributed applications.
Distributed Programming - Fast framework for building and running distributed applications.
Distributed Computing - Distributed system for parallel Python and ML.
Computation and Optimization - Distributed execution framework for machine learning workloads.
Developer Tools - Distributed application framework.
Graph Computation - Framework for building distributed applications.
Scientific Computing Libraries - Framework for building distributed applications.
Workflow Frameworks - High-performance distributed execution framework for Python.
Data Checkpointing - Sets storage backends and persistence settings to manage the retrieval of checkpoint files during distributed processing.
Data Writers - Persists datasets to local or cloud storage using standard URI schemes to ensure data availability across nodes.
Incremental Data Streaming - Processes data blocks incrementally to handle datasets that exceed total cluster memory capacity.
Memory Optimization Strategies - Monitors heap memory and adjusts block size targets to prevent out-of-memory errors during task execution.
Parallel Data Transformation - Applies user-defined functions to dataset rows, automatically parallelizing work across the cluster.
Storage File Readers - Ingests files from local or cloud storage in various formats with support for column pruning and parallel processing.
Vectorized Data Processing - Processes datasets in vectorized batches to achieve higher performance compared to row-by-row operations.
Job Environment Management - Defines a runtime environment for an entire job to ensure all tasks share the same dependencies.
Resource Placement - Organizes clusters of resources to ensure tasks and actors are co-located or distributed according to specific requirements.
Resource Scheduling Policies - Ray enables assigning specific hardware resources like CPUs or GPUs to an actor during instantiation to ensure sufficient processing capacity.
Execution Graphs - Supports binding actor methods and configuring transport settings to prepare complex task chains.
Task Orchestration Engines - A system that builds and optimizes task dependency chains to enable accelerated execution paths across distributed nodes.
Distributed Model Orchestration - Scales complex transformations across nodes using placement groups to manage model replicas.
Inference Configuration Engines - Sets model sources and engine parameters for text generation and multimodal inference tasks.
Data Ingestion Tuning - Adjusts output block counts during data reads to balance parallelism and memory overhead for efficient processing.
Dataset Iterators - Reads dataset records as individual rows or batches to prepare data for machine learning training workflows.
Distributed Debugging - Identifies performance bottlenecks by setting breakpoints, inspecting serializability, and generating profiling timelines for distributed code.
Task Schedulers - Ray Core Scheduling Capabilities, a named example documented in this learning resource.
Java Ecosystem - Allows invoking Java static methods and instantiating Java actors directly from Python code.
Python Tooling - Allows invoking Python remote functions and instantiating Python actors directly from Java code.
Environment Isolation - Provides isolated runtime environments for distributed tasks to prevent dependency conflicts.
Performance Tuning Utilities - Uses vectorized processing for data transformations to improve performance when working with numerical data.
Query Optimization Engines - Translates high-level operations into optimized physical execution plans by applying custom rules.
Execution Tracing - Generates visual execution timelines to identify bottlenecks and analyze task dependencies within distributed workflows.
Data Processing Configurations - Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
Data Processing Engines - Utilizes high-performance engines for internal sorting operations to improve performance on large tabular datasets.
Data Shuffling Algorithms - Redistributes data across the cluster using hash or range algorithms to support joins and group-by operations.
Database Connectors - Queries SQL databases using standard connectors to ingest data directly into distributed datasets for large-scale processing.
File Synchronization - Automatically uploads local source files and configuration directories to remote cluster nodes.
Interoperability - Propagates stack traces across language boundaries to debug errors occurring in remote tasks.
Asynchronous Execution Patterns - Executes asynchronous operations within transforms to handle I/O-bound tasks efficiently.
Concurrency Control Policies - Ray enables grouping actor methods to limit concurrent executions, preventing resource-intensive tasks from overwhelming the actor's processing capacity.
Dynamic Task Scheduling - Ray allows assigning actor methods to specific concurrency groups at runtime to override default settings for individual task invocations.
Graph Orchestration - Enables setting entry and exit points for directed acyclic graphs to manage data flow.
Service Discovery Mechanisms - Ray provides mechanisms to retrieve a handle to an existing actor by name or create a new one if the name is currently unavailable.
Software Design Patterns - Ray Core Design Patterns, a named example documented in this learning resource.
Memory Inspection - Enables analyzing object references held in a cluster to identify memory leaks or high usage.
Performance Metrics - Monitors application performance using counters and histograms to track state changes across distributed tasks.
Performance Profiling Tools - Retrieves detailed timing and memory usage statistics for operators to identify performance bottlenecks.
Execution Tracers - Collects stack traces from all local workers to diagnose performance issues or deadlocks.
Model Evaluation - Enables row-level error handling and automatic recovery to maintain pipeline reliability for inference jobs.
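
The hash-based shuffling listed under Data Shuffling Algorithms can be sketched in plain Python. This is a single-process illustration, not Ray's implementation: hash_shuffle and group_sum are hypothetical helpers, and lists stand in for blocks held on different nodes. The point is that every row with the same key lands in the same partition, so each partition can be aggregated on its own.

```python
from collections import defaultdict


def hash_shuffle(blocks, key_fn, num_partitions):
    """Route every row to the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for block in blocks:
        for row in block:
            partitions[hash(key_fn(row)) % num_partitions].append(row)
    return partitions


def group_sum(partition, key_fn, value_fn):
    """Aggregate one partition independently of all the others."""
    totals = defaultdict(int)
    for row in partition:
        totals[key_fn(row)] += value_fn(row)
    return dict(totals)


# Three input blocks of (key, value) rows, as if read by different workers.
blocks = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]

partitions = hash_shuffle(blocks, key_fn=lambda r: r[0], num_partitions=2)

# Each key lives in exactly one partition, so partial results never overlap.
merged = {}
for partition in partitions:
    merged.update(group_sum(partition, lambda r: r[0], lambda r: r[1]))
print(merged)
```

A range-based shuffle would replace the hash with a comparison against sorted boundary keys, which keeps partitions ordered at the cost of needing to sample the key distribution first.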

dask/dask

13,746 stars on GitHub

Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl

modin-project/modin

10,389 stars on GitHub

Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h

JerryLead/SparkInternals

5,363 stars on GitHub

SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study of big data engine analysis, focusing on how the system manages cluster execution and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of data shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation co

Vonng/ddia

22,648 stars on GitHub

This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi

ray-project/ray

Features