# tencentmusic/cube-studio

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/tencentmusic-cube-studio).**

5,062 stars · 885 forks · Python · License: NOASSERTION (SPDX placeholder; no standard license identified)

## Links

- GitHub: https://github.com/tencentmusic/cube-studio
- awesome-repositories: https://awesome-repositories.com/repository/tencentmusic-cube-studio.md

## Topics

`ai` `aihub` `argo` `automl` `deepseek` `gpt` `inference` `kubeflow` `kubernetes` `llmops` `mlops` `notebook` `pipeline` `pytorch` `spark` `vgpu` `workflow`

## Description

Cube Studio is a cloud-native MLOps platform and Kubernetes-based AI orchestrator designed for the entire machine learning lifecycle. It provides a distributed training framework for large-scale model fine-tuning, a GPU resource manager for hardware virtualization, and an ML pipeline orchestrator that uses visual directed acyclic graphs to manage end-to-end workflows.

Its LLM inference server supports retrieval-augmented generation (RAG) and the construction of private knowledge bases. The platform also includes a dedicated system for supervised fine-tuning and reinforcement learning of large language models, along with visual hyperparameter search tools.

The system covers a broad range of operational capabilities, including multimodal data labeling, distributed data pipelines, and multi-cluster workload scheduling. It also provides browser-based interactive development environments, container image management, and a model registry for versioning and deploying scalable inference APIs with traffic splitting.

The infrastructure includes integrated cluster health monitoring and role-based access control with single sign-on integration.
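
To make the "visual directed acyclic graph" idea above concrete, here is a minimal sketch of how a pipeline's task dependencies determine a run order. It uses only the Python standard library (`graphlib`, Python 3.9+). The task names and the `execution_order` helper are hypothetical and are not taken from Cube Studio's code or API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "label_data": set(),
    "extract_features": {"label_data"},
    "train_model": {"extract_features"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}


def execution_order(dag):
    """Return one valid run order for the tasks in `dag`.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is what makes the graph "acyclic" in practice.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

A real orchestrator would additionally run independent branches in parallel and retry failed tasks; this sketch only shows the ordering constraint that the DAG encodes.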

## Tags

### DevOps & Infrastructure

- [Kubernetes Orchestrators](https://awesome-repositories.com/f/devops-infrastructure/gpu-resource-orchestrators/kubernetes-orchestrators.md) — Schedules containerized AI workloads and hardware accelerators across multiple Kubernetes clusters and edge nodes.
- [Model Inference Deployment](https://awesome-repositories.com/f/devops-infrastructure/deployment-management/model-inference-deployment.md) — Publishes trained models as scalable APIs with support for traffic splitting and high-performance inference engines.
- [Multi-Tenant GPU Workload Isolation](https://awesome-repositories.com/f/devops-infrastructure/gpu-infrastructure-lifecycle-management/multi-tenant-gpu-workload-isolation.md) — Isolates GPU and NPU resources across different projects using a hierarchical permission and quota system.
- [GPU Resource Allocators](https://awesome-repositories.com/f/devops-infrastructure/gpu-resource-allocators.md) — Virtually allocates and isolates GPU compute and memory resources across multi-tenant projects and edge nodes.
- [Kubernetes ML Platforms](https://awesome-repositories.com/f/devops-infrastructure/kubernetes-ml-platforms.md) — Provides a cloud-native environment for managing the entire machine learning lifecycle on Kubernetes.
- [Cloud Native Orchestration](https://awesome-repositories.com/f/devops-infrastructure/cloud-native-orchestration.md) — Manages compute allocation and lifecycle of services across multiple clusters and edge nodes using containers.
- [ML Workload Schedulers](https://awesome-repositories.com/f/devops-infrastructure/cluster-job-schedulers/ml-workload-schedulers.md) — Balances ML training and evaluation workloads across multiple clusters and resource groups with dynamic load balancing. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README.md))
- [Container Image Management](https://awesome-repositories.com/f/devops-infrastructure/container-image-management.md) — Manages private and public container image repositories to pull specific environments for distributed training. ([source](https://github.com/tencentmusic/cube-studio/tree/master/job-template))
- [Custom Container Images](https://awesome-repositories.com/f/devops-infrastructure/container-orchestration/image-management-tools/custom-container-images.md) — Creates customized container images online to ensure consistent runtime configurations for all processing and training tasks. ([source](https://github.com/tencentmusic/cube-studio/tree/master/install))
- [GPU Resource Orchestrators](https://awesome-repositories.com/f/devops-infrastructure/gpu-resource-orchestrators.md) — Dynamically orchestrates and partitions CPU, GPU, and NPU resources across multiple clusters and edge nodes. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Resource Quotas](https://awesome-repositories.com/f/devops-infrastructure/infrastructure/configuration-policy-enforcement/resource-quotas.md) — Enforces resource limits and budgets for tenants and projects to prevent compute overconsumption. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Job Templates](https://awesome-repositories.com/f/devops-infrastructure/job-templates.md) — Provides reusable configurations for startup commands and environment variables to standardize task execution. ([source](https://github.com/tencentmusic/cube-studio/tree/master/job-template))
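
As a rough illustration of the quota enforcement described in the list above, the sketch below admits or rejects a GPU request against a per-project limit. The `Quota` class, its fields, and the fractional GPU units are assumptions made for this example; they do not reflect Cube Studio's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical per-project GPU quota (fractions model virtual GPU slices)."""

    gpu_limit: float
    gpu_used: float = 0.0

    def try_reserve(self, gpus: float) -> bool:
        """Reserve `gpus` if the project stays within its limit; otherwise refuse."""
        if gpus <= 0:
            raise ValueError("requested GPU amount must be positive")
        if self.gpu_used + gpus > self.gpu_limit:
            return False  # request would exceed the project's quota
        self.gpu_used += gpus
        return True


project_quota = Quota(gpu_limit=2.0)
print(project_quota.try_reserve(1.5))  # fits within the limit
print(project_quota.try_reserve(1.0))  # would exceed the limit
```

On Kubernetes, limits of this kind are typically enforced by the cluster itself rather than by application code; the sketch only shows the admission decision.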

### Artificial Intelligence & ML

- [LLM Inference Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/agent-deployment-servers/llm-inference-servers.md) — Serves large language models using vLLM and Ollama with integrated support for RAG and private knowledge bases.
- [Distributed Deep Learning Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-deep-learning-frameworks.md) — Provides a unified platform for large-scale model training and fine-tuning across multiple compute nodes.
- [Distributed Training Coordination](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-coordination.md) — Synchronizes large-scale model training across multiple compute nodes using high-speed communication and priority scheduling.
- [End-to-End Training Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/end-to-end-training-pipelines.md) — Provides integrated workflows for managing the ML lifecycle from data labeling and training to model deployment.
- [LLM Fine-Tuning](https://awesome-repositories.com/f/artificial-intelligence-ml/full-parameter-fine-tuning/custom-data-fine-tunings/llm-fine-tuning.md) — Adapts pre-trained large language models through supervised fine-tuning and reinforcement learning for specialized tasks.
- [Large-Scale Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training.md) — Provides specialized methodologies and acceleration frameworks for training massive models that exceed single-device capacity. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README.md))
- [Distributed Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training/distributed-training.md) — Executes large-scale deep learning tasks across multiple nodes using distributed frameworks and high-speed networking. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [ML Pipeline Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/ml-pipeline-orchestration.md) — Provides a visual interface to manage complex machine learning task dependencies and end-to-end workflow scheduling. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [RAG Context Retrieval](https://awesome-repositories.com/f/artificial-intelligence-ml/rag-context-retrieval.md) — Combines semantic embeddings with vector retrieval to provide domain-specific context for grounding large language model responses.
- [Data Labeling Platforms](https://awesome-repositories.com/f/artificial-intelligence-ml/data-labeling-platforms.md) — Provides a platform to annotate images, text, and audio using both manual entry and automated assistant tools. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Dataset Management](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-management.md) — Organizes and stores structured and media datasets, managing ground truth and metadata through a unified interface. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README_EN.md))
- [Hardware Acceleration Abstractions](https://awesome-repositories.com/f/artificial-intelligence-ml/hardware-acceleration-abstractions.md) — Abstracts compute backends including GPU and NPU to provide unified interfaces for diverse deep learning frameworks.
- [Model Registries](https://awesome-repositories.com/f/artificial-intelligence-ml/model-architecture-registries/model-registries.md) — Implements a centralized version control system for storing, tracking, and managing trained models across different environments. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README.md))
- [Visual Hyperparameter Search](https://awesome-repositories.com/f/artificial-intelligence-ml/model-fine-tuning-resources/hyperparameter-tuning/hyperparameter-search-strategies/visual-hyperparameter-search.md) — Provides a graphical interface for analyzing and executing hyperparameter search strategies. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Hyperparameter Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/optimization-algorithms/hyperparameter-optimizers.md) — Automates the search for optimal model configurations to improve overall accuracy and performance. ([source](https://github.com/tencentmusic/cube-studio#readme))
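
The RAG entry above mentions combining embeddings with vector retrieval. A minimal sketch of that retrieval step follows, assuming tiny hand-written vectors in place of real embeddings and a brute-force cosine-similarity search in place of a vector database. The document IDs and the `top_k` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the IDs of the k documents most similar to the query vector."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical knowledge base: document ID -> toy 3-dimensional "embedding".
index = {
    "gpu_quota_policy": [0.9, 0.1, 0.0],
    "notebook_howto": [0.1, 0.9, 0.1],
    "inference_rollout": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]  # stands in for the embedded user question
context_ids = top_k(query, index, k=2)
print(context_ids)
```

In a full RAG flow, the text of the retrieved documents would then be added to the prompt sent to the language model.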

### Data & Databases

- [Knowledge Base Construction](https://awesome-repositories.com/f/data-databases/index-construction/knowledge-base-construction.md) — Integrates domain-specific data using embeddings and semantic retrieval to build private knowledge bases. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Distributed Data Processing Engines](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/stream-pipeline-orchestration/distributed-data-processing-engines.md) — Executes distributed jobs to import heterogeneous data and extract features using big data processing engines. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README_EN.md))

### Development Tools & Productivity

- [Browser-Based Execution Environments](https://awesome-repositories.com/f/development-tools-productivity/browser-based-execution-environments.md) — Provides an integrated notebook environment to author and execute code directly within the web browser. ([source](https://github.com/tencentmusic/cube-studio/tree/master/install))
- [Browser-Based IDEs](https://awesome-repositories.com/f/development-tools-productivity/development-environment-management/development-environments/cloud-remote-workspaces/browser-based-ides.md) — Provides full-featured code editors and notebooks accessible through a web browser, bound to cluster hardware.
- [IDE Provisioning](https://awesome-repositories.com/f/development-tools-productivity/notebook-environments/environment-provisioning/ide-provisioning.md) — Deploys browser-based editors and notebooks integrated with real-time hardware monitoring and version control. ([source](https://github.com/tencentmusic/cube-studio/blob/master/README.md))

### Programming Languages & Runtimes

- [Visual Pipeline DAG Executors](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/runtime-environments/runtimes/graph-symbolic-execution-engines/directed-acyclic-graph-execution-engines/visual-pipeline-dag-executors.md) — Defines machine learning workflows using a visual directed acyclic graph interface to manage task dependencies and sequencing.

### Software Engineering & Architecture

- [DAG Workflow Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/parallel-processing-pipelines/dag-workflow-pipelines.md) — Defines machine learning workflows using directed acyclic graphs to manage complex task dependencies.
- [ML Pipeline Orchestrators](https://awesome-repositories.com/f/software-engineering-architecture/reusable-component-architectures/ml-pipeline-orchestrators.md) — Provides a visual DAG-based interface for designing and managing end-to-end ML workflows and task dependencies.
- [Container-Based Isolation](https://awesome-repositories.com/f/software-engineering-architecture/software-architecture/architectural-patterns/modular-decoupled-design/decoupled-architectures/container-based-isolation.md) — Uses container images to encapsulate runtimes, ensuring consistent library and language versions across development and training.

### Web Development

- [Model Inference APIs](https://awesome-repositories.com/f/web-development/service-hosting/model-inference-apis.md) — Publishes trained models as scalable APIs with integrated traffic splitting and automatic scaling. ([source](https://github.com/tencentmusic/cube-studio#readme))

### Networking & Communication

- [Revision Traffic Splits](https://awesome-repositories.com/f/networking-communication/traffic-shaping/scaling/traffic-distribution/weighted-traffic-splitting/revision-traffic-splits.md) — Serves models as scalable APIs with a routing layer for canary and blue-green releases across immutable revisions.
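
Weighted traffic splitting for a canary release can be sketched as below. This is an illustrative assumption about how such a routing layer might assign requests, not a description of Cube Studio's implementation: the `pick_revision` function and the revision names are hypothetical. Hashing the request ID keeps each caller pinned to the same revision across requests.

```python
import hashlib


def pick_revision(request_id: str, weights: dict[str, int]) -> str:
    """Map a request to a revision in proportion to integer weights.

    The request ID is hashed, so the same ID always lands on the same revision.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for revision, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return revision
    raise AssertionError("unreachable: bucket is always below the total weight")


# Hypothetical canary: roughly 10% of requests go to the new revision.
canary_weights = {"model-v1": 90, "model-v2": 10}
print(pick_revision("user-42", canary_weights))
```

Moving the weights to `{"model-v1": 0, "model-v2": 100}` corresponds to completing the rollout; a blue-green release is the case where the weights switch from one extreme to the other in a single step.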

### Security & Cryptography

- [Cluster Resource Isolation](https://awesome-repositories.com/f/security-cryptography/multi-tenant-isolation/cluster-resource-isolation.md) — Isolates users and projects through a hierarchical permission system and dedicated cluster resource assignments.
- [Role-Based Access Control](https://awesome-repositories.com/f/security-cryptography/role-based-access-control.md) — Manages user identities and granular permissions using a hierarchical role-based access control system. ([source](https://github.com/tencentmusic/cube-studio/tree/master/install))

### System Administration & Monitoring

- [Cluster Health Monitoring](https://awesome-repositories.com/f/system-administration-monitoring/cluster-health-monitoring.md) — Tracks host, process, and GPU health metrics across distributed clusters with custom notifications. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [Resource Monitoring](https://awesome-repositories.com/f/system-administration-monitoring/cluster-monitoring/resource-monitoring.md) — Monitors CPU and GPU utilization across the cluster via integrated telemetry and dashboards. ([source](https://github.com/tencentmusic/cube-studio/tree/master/install))
- [Monitoring and Observability](https://awesome-repositories.com/f/system-administration-monitoring/monitoring-and-observability.md) — Observes real-time host and process loads using integrated observability and monitoring tools. ([source](https://github.com/tencentmusic/cube-studio#readme))
- [In-Browser Metric Visualization](https://awesome-repositories.com/f/system-administration-monitoring/monitoring-and-observability/observability-platforms/metric-performance-monitors/model-training-metrics/in-browser-metric-visualization.md) — Renders machine learning training and performance metrics as charts and images within a web interface. ([source](https://github.com/tencentmusic/cube-studio/tree/master/job-template))
