awesome-repositories.com
Ray | Awesome Repository

ray-project/ray

41,400 stars·7,239 forks·Python·apache-2.0·0 views·ray.io↗

Ray

Features

  • Distributed Datasets - Creates and controls data collections that support lazy transformations and parallel processing across various storage sources.
  • Distributed Shared Memory - Ray provides a shared memory space to store and retrieve objects, enabling efficient data sharing and asynchronous processing across workers.
  • Distributed Task Orchestration - Ray supports creating remote functions and actor classes to execute code across a cluster while managing resource requirements and lifecycles.
  • Distributed Task Orchestrators - Scaling Python functions and classes across a cluster to execute parallel workloads with fine-grained resource and dependency management.
  • Distributed Task Schedulers - A resource-aware execution engine that manages task dependencies, placement, and fault tolerance across a pool of networked compute nodes.
  • Concurrency Models - A runtime model that uses asynchronous event loops within workers to handle multiple tasks or I/O operations concurrently.
  • Distributed Computing Frameworks - A programming model that scales Python and Java applications across clusters by abstracting task scheduling and resource management.
  • Actor Models - A model where stateful objects run in dedicated processes to maintain and mutate internal state across remote method calls.
  • Stateful Distributed Actors - Ray supports defining classes that run in dedicated processes to maintain and mutate internal state across multiple remote method calls.
  • Inference Pipeline Orchestrators - Executes multi-stage inference pipelines that handle preprocessing, tokenization, and accelerated GPU inference.
  • Model Serving Frameworks - Deploying and scaling complex model pipelines across multiple GPUs to handle high-throughput requests with automatic resource autoscaling.
  • Data Processing Frameworks - Transforming and analyzing massive datasets in parallel using lazy evaluation, distributed shuffles, and efficient memory management.
  • Dataset Transformations - Applies functions to rows or batches to filter, map, or manipulate data for downstream processing tasks.
  • Distributed Data Engines - A library for parallelizing large-scale data transformations, ingestion, and streaming workflows across heterogeneous compute clusters.
  • Fault Tolerance - Ensures distributed tasks and actors remain resilient through automated failure handling and object ownership management.
  • Graph Compilation - Builds and runs directed acyclic graphs to optimize task performance and inspect dependencies.
  • Inference Scaling Frameworks - Distributes inference workloads across multiple GPUs or nodes by configuring concurrency and parallel strategies.
  • Dataset Aggregations - Computes custom or built-in aggregations on datasets by passing functions to grouping operations for efficient data analysis.
  • Distributed Data Processing - Converts datasets into distributed formats to enable interoperability with large-scale data processing libraries.
  • Distributed Data Processing Frameworks - A framework that represents data as partitioned blocks to support incremental transformations and parallel execution across large clusters.
  • Distributed Object Stores - A shared memory system that enables efficient data sharing and asynchronous communication between workers across a cluster.
  • Resource Management Policies - Enforces global CPU, GPU, and memory limits to prevent resource contention during concurrent job execution.
  • Runtime Environment Configuration - Configures dependencies and packages for applications to ensure consistent execution across distributed clusters.
  • Scheduling Strategies - Creates custom placement rules for tasks and actors to pin work to specific nodes or group resources together.
  • Stateful Service Runtimes - Building long-running, fault-tolerant services that maintain internal state and handle concurrent requests across a distributed infrastructure.
  • High-Performance Data Transfer - Moves tensors between actors using specialized libraries to avoid expensive serialization.
  • Asynchronous Concurrency Models - Ray allows creating actors with asynchronous methods to execute multiple tasks concurrently on a single event loop during I/O operations.
  • Foreign Function Interfaces - A serialization and communication layer that allows functions and objects to be invoked across different programming language runtimes.
  • Polyglot Runtimes - Integrating components written in different programming languages into a single application by sharing data and execution handles seamlessly.
  • Fault Tolerance Policies - Ray enables defining restart limits and retry counts for actors to handle unexpected crashes and maintain high service availability.
  • Cluster Monitoring - Provides real-time cluster health displays, including resource allocation and autoscaling information.
  • Data Checkpointing - Sets storage backends and persistence settings to manage the retrieval of checkpoint files during distributed processing.
  • Data Writers - Persists datasets to local or cloud storage using standard URI schemes to ensure data availability across nodes.
  • Incremental Data Streaming - Processes data blocks incrementally to handle datasets that exceed total cluster memory capacity.
  • Memory Optimization Strategies - Monitors heap memory and adjusts block size targets to prevent out-of-memory errors during task execution.
  • Parallel Data Transformation - Applies user-defined functions to dataset rows, automatically parallelizing work across the cluster.
  • Storage File Readers - Ingests files from local or cloud storage in various formats with support for column pruning and parallel processing.
  • Vectorized Data Processing - Processes datasets in vectorized batches to achieve higher performance compared to row-by-row operations.
  • Job Environment Management - Defines a runtime environment for an entire job to ensure all tasks share the same dependencies.
  • Resource Placement - Organizes clusters of resources to ensure tasks and actors are co-located or distributed according to specific requirements.
  • Resource Scheduling Policies - Ray enables assigning specific hardware resources like CPUs or GPUs to an actor during instantiation to ensure sufficient processing capacity.
  • Cross-Language Serialization - Automatically converts primitive and container data types when passing arguments between different language environments.
  • Distributed Future Handles - Ray allows awaiting remote object references as standard futures to integrate distributed results seamlessly into existing event-loop applications.
  • Execution Graphs - Supports binding actor methods and configuring transport settings to prepare complex task chains.
  • Task Orchestration Engines - A system that builds and optimizes task dependency chains to enable accelerated execution paths across distributed nodes.
  • Distributed Model Orchestration - Scales complex transformations across nodes using placement groups to manage model replicas.
  • Inference Configuration Engines - Sets model sources and engine parameters for text generation and multimodal inference tasks.
  • Data Ingestion Tuning - Adjusts output block counts during data reads to balance parallelism and memory overhead for efficient processing.
  • Dataset Iterators - Reads dataset records as individual rows or batches to prepare data for machine learning training workflows.
  • Distributed Debugging - Identifies performance bottlenecks by setting breakpoints, inspecting serializability, and generating profiling timelines for distributed code.
  • Task Schedulers - Ray Core scheduling capabilities, a named example documented in this learning resource.
  • Java Interoperability - Allows invoking Java static methods and instantiating Java actors directly from Python code.
  • Python Interoperability - Allows invoking Python remote functions and instantiating Python actors directly from Java code.
  • Environment Isolation - Provides isolated runtime environments for distributed tasks to prevent dependency conflicts.
  • Performance Tuning Utilities - Uses vectorized processing for data transformations to improve performance when working with numerical data.
  • Query Optimization Engines - Translates high-level operations into optimized physical execution plans by applying custom rules.
  • Execution Tracing - Generates visual execution timelines to identify bottlenecks and analyze task dependencies within distributed workflows.
  • Data Processing Configurations - Sets global parameters for block sizes and shuffle strategies to control data operations across the cluster.
  • Data Processing Engines - Utilizes high-performance engines for internal sorting operations to improve performance on large tabular datasets.
  • Data Shuffling Algorithms - Redistributes data across the cluster using hash or range algorithms to support joins and group-by operations.
  • Database Connectors - Queries SQL databases using standard connectors to ingest data directly into distributed datasets for large-scale processing.
  • File Synchronization - Automatically uploads local source files and configuration directories to remote cluster nodes.
  • Cross-Language Debugging - Propagates stack traces across language boundaries to debug errors occurring in remote tasks.
  • Asynchronous Execution Patterns - Executes asynchronous operations within transforms to handle I/O-bound tasks efficiently.
  • Concurrency Control Policies - Ray enables grouping actor methods to limit concurrent executions, preventing resource-intensive tasks from overwhelming the actor's processing capacity.
  • Dynamic Task Scheduling - Ray allows assigning actor methods to specific concurrency groups at runtime to override default settings for individual task invocations.
  • Graph Orchestration - Enables setting entry and exit points for directed acyclic graphs to manage data flow.
  • Service Discovery Mechanisms - Ray provides mechanisms to retrieve a handle to an existing actor by name or create a new one if the name is currently unavailable.
  • Software Design Patterns - Ray Core design patterns, a named example documented in this learning resource.
  • Memory Inspection - Enables analyzing object references held in a cluster to identify memory leaks or high usage.
  • Performance Metrics - Monitors application performance using counters and histograms to track state changes across distributed tasks.
  • Performance Profiling Tools - Retrieves detailed timing and memory usage statistics for operators to identify performance bottlenecks.
  • Inference Reliability Tools - Enables row-level error handling and automatic recovery to maintain pipeline reliability for inference jobs.
  • Stack Tracing - Collects stack traces from all local workers to diagnose performance issues or deadlocks.
  • Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls.
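
The stateful actor idea described above can be sketched with the Python standard library alone. This is a minimal illustration, not Ray's API: the `CounterActor` class, its `call` method, and the `_do_` method prefix are hypothetical names, and a dedicated thread stands in for the dedicated process that a real actor would run in. The point it shows is that one worker owns the state and handles method calls one at a time, while callers get back futures.

```python
import queue
import threading
from concurrent.futures import Future


class CounterActor:
    """Toy actor (hypothetical): one worker thread owns the state."""

    def __init__(self):
        self._count = 0                # state touched only by the worker thread
        self._mailbox = queue.Queue()  # pending (method name, args, future) calls
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Handle calls strictly in arrival order, so state needs no locking.
        while True:
            message = self._mailbox.get()
            if message is None:        # shutdown sentinel
                break
            name, args, future = message
            try:
                future.set_result(getattr(self, "_do_" + name)(*args))
            except Exception as exc:   # surface failures to the caller
                future.set_exception(exc)

    def _do_increment(self, amount):
        self._count += amount
        return self._count

    def _do_get(self):
        return self._count

    def call(self, name, *args):
        """Enqueue a method call and return a Future for its result."""
        future = Future()
        self._mailbox.put((name, args, future))
        return future

    def stop(self):
        self._mailbox.put(None)
        self._worker.join()


def demo_counter():
    actor = CounterActor()
    futures = [actor.call("increment", 1) for _ in range(5)]
    results = [f.result() for f in futures]
    total = actor.call("get").result()
    actor.stop()
    return results, total
```

Calling `demo_counter()` submits five increments without waiting for any of them, then collects the results. Because a single worker drains the mailbox in order, the internal count is mutated consistently across calls, which is the property the actor model is meant to provide.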

Ray also includes a cross-language interoperability layer that lets functions and actors be invoked between Python and Java runtimes. It supports distributed workflows through directed acyclic graph execution, which builds and optimizes task dependency chains to enable accelerated execution paths. Its distributed data processing engine uses lazy evaluation and partitioned blocks to handle large-scale data transformations, ingestion, and streaming workflows across heterogeneous clusters.
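
To make "lazy evaluation and partitioned blocks" concrete, here is a small standard-library sketch. It is an assumption-laden toy rather than Ray's data engine: `LazyDataset`, `from_items`, and `take_all` are hypothetical names chosen for this example, and a thread pool stands in for a cluster. Transformations are only recorded when declared, and the recorded plan runs block by block when results are requested.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyDataset:
    """Toy dataset (hypothetical): a list of blocks plus a recorded plan."""

    def __init__(self, blocks, ops=()):
        self._blocks = blocks
        self._ops = tuple(ops)

    @classmethod
    def from_items(cls, items, num_blocks):
        items = list(items)
        size = max(1, -(-len(items) // num_blocks))  # ceiling division
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    def map(self, fn):
        # Record the step only; nothing is executed yet.
        return LazyDataset(self._blocks, self._ops + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._blocks, self._ops + (("filter", fn),))

    def _apply(self, block):
        # Run every recorded step against one block.
        rows = block
        for kind, fn in self._ops:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

    def take_all(self):
        # Execution happens here: blocks are processed in parallel and
        # results are concatenated in the original block order.
        with ThreadPoolExecutor() as pool:
            processed = list(pool.map(self._apply, self._blocks))
        return [row for block in processed for row in block]


def demo_lazy():
    calls = []

    def square(x):
        calls.append(x)
        return x * x

    ds = (LazyDataset.from_items(range(10), num_blocks=3)
          .map(square)
          .filter(lambda value: value % 2 == 0))
    calls_before = len(calls)  # the plan is built but has not run
    return calls_before, ds.take_all(), len(calls)
```

In `demo_lazy()`, `square` is not called while the pipeline is being declared; all ten calls happen inside `take_all()`. Keeping the plan separate from the blocks is what lets an engine reorder or fuse steps and process one block at a time instead of materializing the whole dataset.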

Beyond its core execution primitives, the project supports distributed machine learning inference and stateful service hosting. Built-in tools for cluster observability, including execution tracing, memory inspection, and real-time status monitoring, help diagnose performance bottlenecks and manage resource allocation. Ray also manages runtime environments and dependencies so that code executes consistently across distributed nodes.
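
As a rough illustration of what execution tracing records, the sketch below collects one timed span per unit of work and serializes the spans as JSON. It uses only the standard library and is not Ray's tooling: the `Tracer` class and its methods are hypothetical, and the event fields (`ph`, `ts`, `dur`, in microseconds) follow the Chrome trace-event layout purely as an assumed output format.

```python
import json
import threading
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer (hypothetical): records one complete event per span."""

    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            event = {
                "name": name,
                "ph": "X",                    # "complete" event: start plus duration
                "ts": start * 1e6,            # microseconds
                "dur": (end - start) * 1e6,   # microseconds
                "pid": 0,
                "tid": threading.get_ident(),
            }
            with self._lock:                  # spans may close on any thread
                self._events.append(event)

    def events(self):
        with self._lock:
            return list(self._events)

    def to_json(self):
        return json.dumps(self.events())


def demo_trace():
    tracer = Tracer()
    with tracer.span("load"):
        sum(range(1000))
    with tracer.span("transform"):
        sorted(range(1000), reverse=True)
    return tracer
```

Sorting the recorded events by `ts` and comparing `dur` values gives the kind of timeline view used to spot which step dominates a workflow.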

Technical documentation and educational resources are available at docs.ray.io, covering architectural patterns, design templates, and common implementation strategies for distributed systems.