30 open-source projects similar to cortexlabs/cortex, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Cortex alternative.
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
BentoML is a machine learning model serving framework and GPU-accelerated inference server designed to package, deploy, and scale AI models as production-ready REST APIs. It functions as an AI model lifecycle manager and an inference graph orchestrator, enabling the chaining of multiple models and custom logic into complex pipelines for advanced task sequences. The framework distinguishes itself through a dynamic batching engine that optimizes GPU throughput and an artifact-based packaging system that bundles model weights and dependencies into immutable archives for consistent deployment. It
Nebullvm is an AI inference accelerator, GPU resource orchestrator, and performance optimization library for large language models. It functions as an optimization layer designed to lower operational costs by aligning model execution with underlying hardware architectures. The system maximizes cluster efficiency through real-time dynamic partitioning and elastic quotas for shared hardware resources. It employs alignment methods and techniques to reduce the hardware and data requirements necessary for tuning large language models. The project covers broad capability areas including AI infrast
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
PyCaret is a Python AutoML platform and MLOps lifecycle manager designed to automate machine learning workflows. It functions as a low-code environment that leverages a scikit-learn native engine to execute preprocessing, training, and evaluation for tabular data. The platform distinguishes itself as an LLM-powered ML copilot, using large language model agents to analyze datasets, design experiment configurations, and explain model results. It also serves as a Kubernetes ML orchestrator and model registry, enabling the versioning of trained pipelines and their promotion to production API endp
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
Cube Studio is a cloud-native MLOps platform and Kubernetes-based AI orchestrator designed for the entire machine learning lifecycle. It provides a distributed training framework for large-scale model fine-tuning, a GPU resource manager for hardware virtualization, and an ML pipeline orchestrator that uses visual directed acyclic graphs to manage end-to-end workflows. The platform distinguishes itself through its specialized LLM inference server, which supports retrieval-augmented generation and the construction of private knowledge bases. It features a dedicated system for supervised fine-tu
ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving. The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
Olares is a comprehensive suite of self-hosted identity, storage, AI, and orchestration services designed for private infrastructure management. It functions as a Kubernetes home server orchestrator, enabling the deployment of containerized applications, AI models, and GPU resources on local hardware to replace third-party cloud services. The platform distinguishes itself through a combination of self-hosted AI infrastructure for running large language models and image generators, alongside a decentralized identity manager that uses cryptographic keys and OIDC for trustless authentication. It
PyTorch Lightning is a high-level deep learning framework for PyTorch that automates training loops and removes repetitive engineering boilerplate. It functions as a structured pipeline for managing machine learning experiments, providing a distributed training orchestrator and tools for mixed-precision training. The framework decouples scientific model architecture from the engineering required for infrastructure and scaling. This separation allows the same model code to execute across CPUs, GPUs, or TPUs through a hardware-agnostic execution engine and a centralized trainer that manages the
The Accidental CTO is a comprehensive collection of guides and frameworks focused on distributed systems architecture, resilience engineering, and system observability. It provides strategies for scaling applications from thousands to millions of users while maintaining high availability. The project offers specific methodologies for managing data volume through replication, sharding, and caching. It includes a framework for analyzing cloud infrastructure spending and evaluating transitions to self-hosted environments to reduce operational expenses. The resource covers the implementation of
This project is a Kubernetes serverless framework and OCI container function platform. It provides a system for deploying event-driven functions and microservices as compatible container images onto a Kubernetes cluster. The platform includes an event-driven function orchestrator that triggers executions via HTTP requests or message streams. It features an auto-scaling function manager that adjusts the number of active instances based on real-time demand and scales down to zero during inactivity. A background queuing system is included to process asynchronous tasks and maintain application re
OpenTelemetry Go is a framework for generating and collecting distributed traces, metrics, and logs from Go applications. It provides a standardized telemetry instrumentation API for adding observability markers to code and a corresponding SDK for processing and emitting these signals. The project utilizes a configurable observability pipeline to sample and export telemetry data to external backends using the OTLP wire protocol. It features a pluggable export system and a separation between the public API and the SDK implementation, allowing telemetry to be routed to third-party platforms wit
Petals is a decentralized framework and inference engine for running large language models across a peer-to-peer network. It enables the execution of models that exceed the memory of any single machine by splitting computations and model layers across a collaborative swarm of GPUs. The system functions as a collaborative compute network where participants share local GPU resources and host model weights. It supports distributed prompt-tuning to adapt massive models to specific tasks and allows for the establishment of private compute swarms to process sensitive data within restricted, trusted
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
River is a transactional job queue and distributed job scheduler for Go that uses PostgreSQL for persistence and state management. It functions as a resumable task framework, allowing long-running background work to be broken into persisted steps that can resume from the last saved checkpoint after a failure. The system ensures strict data consistency by allowing background tasks to be enqueued and completed within the same database transaction as the primary application data. It distinguishes itself through a coordinator model that employs leader election to manage periodic and delayed tasks
This project is a comprehensive educational resource and curriculum focused on site reliability engineering, distributed systems, and infrastructure operations. It provides technical guides, a systems engineering course, and instructional manuals designed to teach the principles of managing large-scale computing environments. The curriculum covers high-level architectural design for scalability and resilience, including fault-tolerant infrastructure, high-availability patterns, and microservices decomposition. It emphasizes the practical application of site reliability engineering through the
PyTorch is a machine learning framework centered on a GPU-ready tensor library that supports multi-dimensional array operations across both CPU and accelerator hardware. It provides a foundational infrastructure for mathematical computation and dynamic neural network construction, utilizing a tape-based automatic differentiation system that allows for flexible, non-static graph execution. The framework is designed for deep integration with Python, enabling natural usage alongside standard scientific computing ecosystems. It distinguishes itself through a comprehensive distributed training sui
Apache MXNet is a deep learning framework and distributed machine learning library designed for training and deploying neural networks across distributed systems, mobile devices, and hardware accelerators. It functions as a cross-platform runtime and a dynamic dataflow scheduler that optimizes neural network execution. The framework provides a multi-language API, enabling the development of machine learning models using Python, R, Julia, Scala, Go, and JavaScript. It supports high-performance model training and the scaling of workloads across multiple GPUs and machines. The system covers cap
SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
Wasp is a declarative full-stack web framework that enables developers to build and deploy applications by defining their architecture in a centralized configuration. By using a high-level specification, the framework automates the orchestration of frontend, backend, and database components, ensuring that infrastructure concerns like routing, authentication, and data modeling are handled consistently across the entire stack. The framework distinguishes itself through its compiler-driven approach, which translates declarative configurations into cohesive, production-ready codebases. It provide
This project is a comprehensive educational repository designed to teach DevOps practices through structured learning paths and hands-on exercises. It focuses on mastering infrastructure management, container orchestration, and system administration by providing a curriculum that covers the full lifecycle of cloud-native environments, from initial provisioning to ongoing maintenance and security. The repository distinguishes itself by offering a practical, task-based approach to complex operational domains. It guides users through the implementation of infrastructure-as-code, the configuratio
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
Dkron is a distributed, fault-tolerant system designed for scheduling and executing recurring tasks across a cluster of nodes. It functions as a cron-based orchestrator that manages job lifecycles, including automatic retries, timeouts, and complex dependencies, while ensuring state consistency through a consensus protocol. By coordinating remote task execution across infrastructure, it enables the automation of background operations and the management of distributed workflows. The system distinguishes itself through a modular architecture that supports pluggable storage backends and a plugin
This project is a comprehensive educational resource and tutorial handbook for building, training, and deploying machine learning models using TensorFlow 2. It serves as a structured learning guide covering core deep learning concepts, including neural network architectures, automatic differentiation, and tensor operations. The handbook provides technical guidance on optimizing execution efficiency through GPU memory management, distributed training, and model quantization. It also includes detailed manuals for constructing high-performance data pipelines and exporting models for production s
Nomad is a distributed workload orchestrator and infrastructure automation platform designed to manage the lifecycle of applications across large-scale, heterogeneous environments. It functions as a multi-cloud orchestration engine, providing a unified control plane to deploy, scale, and govern containers, virtual machines, and legacy applications. By utilizing declarative job specifications, the system ensures infrastructure convergence and maintains the desired state across distributed data centers and geographic regions. The platform distinguishes itself through a flexible, plugin-based ar
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
This project is a distributed computing platform designed to orchestrate containerized workloads across heterogeneous hardware clusters. It functions as a centralized control plane that manages resource allocation, scheduling, and execution environments, enabling organizations to share high-performance computing infrastructure securely among multiple users and projects. The platform distinguishes itself through advanced hardware virtualization and multi-tenant management capabilities. It supports the partitioning of physical graphics processing units into fractional slices, allowing multiple
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere