26 repositories
Systems designed to distribute computational workloads across multiple networked machines.
Distinguishing note: Focuses on workload distribution and parallel processing across a cluster rather than general cluster management.
Explore 26 GitHub repositories matching DevOps & Infrastructure · Distributed Computing Frameworks.
Exo is a distributed inference engine designed to run machine learning models across local hardware. It functions as a network orchestration layer that automatically discovers available devices to form a unified computing cluster, allowing users to scale artificial intelligence workloads by distributing computational tasks across multiple machines. The platform distinguishes itself through its ability to manage the entire lifecycle of local models while providing a standardized gateway for external applications. By translating local model outputs into industry-standard formats, it enables exi
Distributes large computational workloads across multiple local devices to improve processing performance.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
A programming model that scales Python and Java applications across clusters by abstracting task scheduling and resource management.
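The stateful actor idea described for Ray can be illustrated with a minimal standard-library sketch. This is not Ray's API: the `Actor` wrapper and the `Counter` class are hypothetical names, and a single worker thread stands in for the dedicated process that the description mentions.

```python
from concurrent.futures import ThreadPoolExecutor


class Actor:
    """Hypothetical wrapper: owns one instance and serializes all calls to it."""

    def __init__(self, cls, *args):
        # One worker means calls run one at a time, in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        # Build the wrapped instance on the worker so its state lives there.
        self._instance = self._executor.submit(cls, *args).result()

    def call(self, method, *args):
        # Return a future immediately; the caller decides when to wait.
        return self._executor.submit(getattr(self._instance, method), *args)

    def stop(self):
        self._executor.shutdown()


class Counter:
    """Example state that persists across calls."""

    def __init__(self):
        self.value = 0

    def increment(self, by=1):
        self.value += by
        return self.value


counter = Actor(Counter)
futures = [counter.call("increment") for _ in range(3)]
totals = [f.result() for f in futures]
counter.stop()
```

Because every call goes through one queue, the counter never sees two increments at once, which is the property that makes actor state safe to mutate without explicit locks.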
Puter is a browser-based desktop environment and cloud-native development platform that provides a virtualized graphical workspace. It enables developers to build and deploy full-stack web applications by integrating cloud storage, authentication, and serverless backend logic directly into the browser, eliminating the need for traditional server infrastructure. The platform distinguishes itself through a unified cloud storage layer and a distributed network runtime that facilitates peer-to-peer communication and cross-origin resource fetching. It features a sophisticated cross-window orchestr
Provides a browser-native execution environment for peer-to-peer communication and decentralized applications.
Anoma is a distributed operating system designed to abstract the complexities of blockchain networks into a unified interface for cross-chain coordination. At its core, the platform utilizes a resource-based state machine and an intent-centric execution model, where user-defined goals are processed and settled by decentralized solvers rather than through direct, manual execution. This architecture enables the creation of applications that operate across heterogeneous distributed networks while maintaining a consistent developer and user experience. The platform distinguishes itself through a
Abstracts blockchain complexities to provide a unified interface for users and developers.
This project is a comprehensive microservices development framework designed to build scalable, resilient backend systems. It provides a production-ready runtime that integrates stability patterns directly into the service architecture, ensuring consistent performance and reliability for both web and remote procedure call services even under heavy traffic conditions. The framework centers on an interface-first development model, utilizing a domain-specific language to define service contracts that serve as the single source of truth. This approach powers an extensive code generation ecosystem
Provides a production-ready runtime environment designed for high performance and reliability under heavy network traffic.
Linera is a multi-chain smart contract platform designed for horizontal scalability through a microchain-based distributed ledger. By partitioning state into independent, parallel chains that share a common validator set, the protocol enables high-performance execution of modular applications. The system utilizes a WebAssembly-based runtime to ensure secure, platform-independent execution of contract logic across the network. The platform distinguishes itself through an asynchronous messaging framework that coordinates state changes between chains by queuing messages for execution in subseque
Interact with applications using operations for local chain execution and messages for cross-chain communication to ensure atomicity through bundled message groups.
Hyperframes is an HTML-to-video rendering engine and composition tool that transforms web layouts and CSS into encoded video files. It functions as a headless browser video pipeline and a distributed video rendering framework, allowing users to create seekable animations and programmatic motion designs using HTML, CSS, and JavaScript. The project differentiates itself as an AI agent video orchestrator, enabling the automation of video scripts and compositions through natural language prompts. It supports distributed video encoding by splitting rendering tasks across multiple serverless functi
Implements a cloud-native infrastructure for splitting video encoding tasks across serverless functions and worker processes.
Dapr is a distributed application runtime that provides a sidecar-based infrastructure layer for building resilient microservices and event-driven applications. By utilizing a sidecar proxy pattern, it abstracts complex infrastructure tasks into standardized, network-accessible APIs, allowing developers to focus on application logic while the runtime handles service discovery, state management, and secure communication. The platform distinguishes itself through a pluggable component architecture and language-agnostic design, enabling services written in any programming language to interact wi
Write distributed applications using language-specific tools that provide simple interfaces for interacting with runtime building blocks and underlying infrastructure services during the development process.
This project serves as a comprehensive, community-driven directory of high-quality open-source Python libraries and tools for machine learning, data science, and artificial intelligence. It functions as a centralized resource for developers to discover, evaluate, and track the maintenance status of software packages across the entire machine learning ecosystem. The platform distinguishes itself through automated popularity tracking and data-driven content curation, which programmatically validate and rank projects based on community activity and development velocity. By organizing these tools
Parallelizes training and inference workloads across large-scale compute infrastructure.
This project is a functional programming library and toolkit for building production TypeScript applications. It provides a system for managing concurrency, error handling, and resource lifecycles using functional effects. The project distinguishes itself through a comprehensive suite of specialized toolkits, including a dependency injection framework for decoupling service implementations, a workflow orchestrator for coordinating durable processes, and a SQL database toolkit for consistent data operations across multiple dialects. It also implements an OpenTelemetry instrumentation library f
Spreads heavy workloads across multiple worker nodes to process data in parallel.
Bullet3 is a professional physics simulation engine designed for calculating rigid body, soft body, and collision dynamics within 3D environments and robotics applications. It functions as a computational framework for determining complex geometric intersections and contact manifolds between objects in simulated space. The library distinguishes itself through a distributed rendering framework that scales heavy graphical workloads and scene generation tasks across large clusters of machines. This capability enables the production of massive datasets by distributing complex scene generation acr
Scales heavy graphical workloads and scene generation tasks across large clusters of machines.
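Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from a single machine to large clusters. It acts as a cluster resource manager, orchestrating computation by representing tasks and their dependencies as a directed acyclic graph. This architecture lets the system automatically distribute workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until explicitly requested, enabling global graph optimization and efficient resource allocation. It incorporates memory-aware data spilling to prevent crashes when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a comprehensive feature set for large-scale data analytics, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on a variety of infrastructure, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.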
Provides a framework for scaling Python workflows from single machines to distributed clusters by orchestrating task graphs.
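To make the task-graph and lazy-evaluation description concrete, here is a minimal standard-library sketch. It is not Dask's implementation: the dictionary layout only loosely resembles a task graph, and `compute`, `add`, and `double` are hypothetical names chosen for illustration. Nothing runs while the graph is being built; work happens only when `compute` is called.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def add(a, b):
    return a + b


def double(x):
    return 2 * x


# Hypothetical graph: each key maps to a literal value or to a tuple of
# (function, *arguments), where string arguments that name other keys
# are dependencies.
graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "result": (double, "sum"),
}


def compute(graph, target):
    """Run tasks in dependency order and return the value of `target`.

    For simplicity this evaluates the whole graph, not just the part
    that `target` depends on.
    """
    deps = {}
    for key, task in graph.items():
        if isinstance(task, tuple):
            deps[key] = {a for a in task[1:] if isinstance(a, str) and a in graph}
        else:
            deps[key] = set()

    results = {}
    for key in TopologicalSorter(deps).static_order():
        task = graph[key]
        if isinstance(task, tuple):
            func, *args = task
            resolved = [
                results[a] if isinstance(a, str) and a in graph else a
                for a in args
            ]
            results[key] = func(*resolved)
        else:
            results[key] = task
    return results[target]


value = compute(graph, "result")
```

A real scheduler would additionally prune unused tasks, fuse chains of operations, and hand independent tasks to different workers; the sketch shows only the deferred, dependency-ordered execution.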
Meshroom is a node-based photogrammetry software designed to transform collections of two-dimensional images into three-dimensional models and scene geometry. It provides a visual interface for constructing and managing modular data pipelines, allowing users to automate complex computer vision tasks such as feature extraction, depth map estimation, and mesh generation. The software distinguishes itself through a distributed computational framework that dispatches resource-intensive tasks across local hardware or remote render farms. By utilizing a directed acyclic graph execution model, it en
Dispatches resource-intensive reconstruction tasks across local hardware or remote render farms to optimize processing performance.
QuantAxis is a quantitative trading platform and algorithmic trading framework. It provides a comprehensive local environment for backtesting strategies, managing financial market data, and executing trades across stocks, futures, and options markets. The system distinguishes itself through a distributed task scheduler that spreads asynchronous computations and heavy mathematical workloads across a network of remote agents. It incorporates a multi-account trading interface to standardize the monitoring of positions and the execution of orders across various brokerage accounts. The platform c
Distributes asynchronous computational workloads across a local network of remote agents.
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
Distributes computational workloads across cloud CPUs and GPUs using ephemeral clusters and spot instances.
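The checkpoint-and-resume behavior described for Metaflow can be sketched in a few lines. This is an illustration under stated assumptions, not Metaflow's API: `run_pipeline` is a hypothetical helper, and a plain dictionary stands in for remote artifact storage.

```python
def run_pipeline(steps, store, data=None):
    """Run (name, function) steps in order, reusing persisted artifacts.

    `store` is a hypothetical artifact store (a dict here). A step whose
    artifact is already stored is skipped, so a rerun resumes after the
    last step that succeeded.
    """
    for name, func in steps:
        if name in store:
            data = store[name]  # resume: reuse the persisted artifact
            continue
        data = func(data)
        store[name] = data  # persist only after the step succeeds
    return data


calls = []
attempts = {"n": 0}


def load(_):
    calls.append("load")
    return [1, 2, 3]


def flaky_sum(xs):
    calls.append("sum")
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")
    return sum(xs)


steps = [("load", load), ("sum", flaky_sum)]
store = {}

try:
    run_pipeline(steps, store)  # first run fails in the second step
except RuntimeError:
    pass

result = run_pipeline(steps, store)  # second run skips "load"
```

On the second run the `load` step is not executed again, because its artifact was persisted before the failure.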
Hyperopt is a Python library for hyperparameter optimization designed to minimize scalar-valued objective functions. It operates as a stochastic search space engine that finds optimal input parameters by searching through real-valued, discrete, and conditional spaces. The framework distinguishes itself through its support for complex search space configurations, allowing for conditional parameter hierarchies where specific hyperparameters are sampled only if their parent parameters meet certain criteria. It is built as an asynchronous optimization framework, decoupling the generation of searc
Parallelizes the hyperparameter search process across multiple machines using external clusters or database backends.
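A conditional search space, where a child parameter is sampled only under a particular parent choice, can be sketched with the standard library alone. This is plain random search with hypothetical names (`sample_config`, `objective`, `random_search`) and a toy loss; it does not use Hyperopt's API or its search algorithms.

```python
import random


def sample_config(rng):
    """Draw one configuration; child parameters depend on the parent choice."""
    model = rng.choice(["svm", "forest"])
    if model == "svm":
        # "C" exists only when the parent choice is "svm".
        return {"model": "svm", "C": rng.uniform(0.1, 10.0)}
    # "n_trees" exists only when the parent choice is "forest".
    return {"model": "forest", "n_trees": rng.randint(10, 200)}


def objective(config):
    """Toy scalar loss (hypothetical), lower is better."""
    if config["model"] == "svm":
        return abs(config["C"] - 1.0)
    return abs(config["n_trees"] - 100) / 100.0


def random_search(n_trials, seed=0):
    """Sample `n_trials` configurations and return the one with the lowest loss."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=objective)


best = random_search(50)
```

Because each trial is independent, the evaluation of `objective` is the part that could be farmed out to several machines, which is the parallelization the summary line refers to.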
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
Executes parallel or distributed computing tasks by initializing frameworks like Spark, Ray, or Dask directly within pipeline steps.
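Apache Mesos is a distributed systems kernel and cluster resource manager that abstracts CPU, memory, and storage across a pool of nodes. As a distributed infrastructure orchestrator, it provides a layer for running multiple orchestration frameworks on a shared set of physical or virtual machines. The system acts as a resource isolation engine, partitioning a shared cluster into isolated containers to run diverse workloads concurrently. It implements multi-framework orchestration, allowing different distributed application frameworks to share a single infrastructure and thereby maximize hardware utilization. The project covers large-scale compute distribution and distributed cluster management. Its capabilities include managing distributed resources and isolating compute capacity across multiple applications to prevent interference and ensure stable performance on shared servers.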
Provides a distributed infrastructure for running multiple computing frameworks across networked machines.
Volcano is a Kubernetes-native batch scheduler specialized for AI, machine learning, and high-performance computing workloads. It provides gang scheduling to atomically allocate resources for all tasks of a distributed job, preventing deadlocks from partial allocation, and supports hierarchical queue management for multi-tenant resource isolation with configurable quotas, borrowing, and preemption. Topology-aware placement optimizes communication-intensive workloads by modeling network hierarchy to minimize cross-switch latency. Volcano differentiates itself with automated orchestration of di
Runs batch jobs from popular data processing, ML, and streaming frameworks without custom integration.
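Gang scheduling, as described for Volcano, means a job's tasks are placed all together or not at all. The following is a minimal sketch of that all-or-nothing rule, not Volcano's scheduler: `gang_schedule` is a hypothetical function, resources are reduced to integer slot counts, and placement uses a simple first-fit heuristic that can miss a placement a smarter search would find.

```python
def gang_schedule(free_slots, task_demands):
    """Place every task of a job, or none of them.

    free_slots: dict mapping node name -> free slot count (updated on success)
    task_demands: list of slot counts, one per task
    Returns {task_index: node}, or None if the whole job does not fit.
    """
    remaining = dict(free_slots)  # tentative copy; real state untouched until commit
    placement = {}
    # Place the largest tasks first (first-fit decreasing heuristic).
    order = sorted(range(len(task_demands)), key=lambda i: -task_demands[i])
    for idx in order:
        demand = task_demands[idx]
        node = next((n for n, free in remaining.items() if free >= demand), None)
        if node is None:
            return None  # partial allocation is discarded, nothing is committed
        remaining[node] -= demand
        placement[idx] = node
    free_slots.update(remaining)  # commit only when every task has a node
    return placement


cluster = {"n1": 4, "n2": 2}
placed = gang_schedule(cluster, [3, 2, 1])

busy_cluster = {"n1": 4, "n2": 2}
rejected = gang_schedule(busy_cluster, [3, 3])
```

The second job is rejected as a whole and `busy_cluster` is left unchanged, so no slots are held by a job that cannot start. That is how the all-or-nothing rule avoids the partial-allocation deadlock mentioned above.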
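statsforecast is a high-performance statistical time series forecasting library designed to produce point forecasts and prediction intervals. It works as a distributed time series framework, using a C-based forecasting engine and an automatic model selector to identify and fit the best statistical model for each unique series in a dataset. The system also includes a time series anomaly detector that flags anomalous data points by comparing observations against probabilistic prediction intervals. The project stands out for its ability to run large-scale parallel forecasting over millions of independent series. It achieves this through distributed computing frameworks, multi-core parallel execution, and compiled C kernels that accelerate the core ARIMA and exponential smoothing logic. The system further uses a long-format data layout and a lazily evaluated data pipeline to optimize large-scale processing and reduce memory overhead. The library provides a comprehensive set of models, including AutoARIMA, various exponential smoothing methods for intermittent or seasonal demand, Theta decomposition, and GARCH volatility modeling for financial risk. It covers broader functionality such as multivariate forecasting with exogenous variables, time series decomposition, and model evaluation through historical cross-validation and sliding-window analysis. The library integrates with high-performance data structures such as Polars and provides utilities for serving saved models as REST endpoints for network-accessible forecasts.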
Scales forecasting workloads across server clusters using distributed computing and parallel execution.