30 open-source projects similar to gpustack/gpustack, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best GPUStack alternative.
FedML is a distributed machine learning training library, federated learning framework, and GPU workload orchestrator. It provides the core system components necessary to execute large-scale model training and fine-tuning across multi-cloud, on-premise, and decentralized GPU clusters, while offering a dedicated engine for scalable model serving and an MLOps pipeline manager for end-to-end lifecycle management. The platform distinguishes itself by enabling privacy-preserving federated learning across decentralized edge devices and organizational silos, keeping raw data on local hardware. It al
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
pysheeet is a technical reference library providing a curated collection of code snippets and implementation patterns for advanced Python development, system integration, and high-performance computing. It serves as a comprehensive guide for implementing low-level network programming, native C extensions, and asynchronous and concurrent programming. The project provides specialized frameworks for the development and deployment of large language models, including tools for distributed GPU inference and high-performance serving. It also includes detailed patterns for high-performance computing
Serving is a high-performance framework designed for deploying and scaling machine learning models as production services. It functions as a distributed inference engine that enables the execution of complex data processing workflows by chaining multiple models into directed acyclic graphs. The platform distinguishes itself through its ability to manage the entire production model lifecycle, allowing for hot-swappable versioning that updates services without downtime. It supports horizontal scaling through distributed model sharding and optimizes high-dimensional data retrieval via specialize
PaddleX is a PaddlePaddle-based framework for building, deploying, and fine-tuning AI model pipelines, with pre-built support for computer vision, OCR, document analysis, and time series tasks. It offers a toolkit of ready-to-use pipelines for image classification, object detection, segmentation, and pose estimation, alongside an end-to-end OCR document analysis pipeline that extracts text, tables, formulas, and layout information. The platform also includes a dedicated time series forecasting pipeline for analyzing historical data to detect anomalies, classify patterns, and predict future val
Cube Studio is a cloud-native MLOps platform and Kubernetes-based AI orchestrator designed for the entire machine learning lifecycle. It provides a distributed training framework for large-scale model fine-tuning, a GPU resource manager for hardware virtualization, and an ML pipeline orchestrator that uses visual directed acyclic graphs to manage end-to-end workflows. The platform distinguishes itself through its specialized LLM inference server, which supports retrieval-augmented generation and the construction of private knowledge bases. It features a dedicated system for supervised fine-tu
SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Formbricks is an open-source survey and feedback platform designed to help teams capture and analyze user insights through targeted, in-app, and website-based interactions. It functions as a comprehensive customer experience analytics system that allows organizations to maintain full control over their data, user attributes, and survey workflows. The platform distinguishes itself through its event-driven architecture, which enables precise behavioral targeting by triggering surveys based on specific user actions or application events. It supports deep integration with external ecosystems by a
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
This project is a distributed computing platform designed to orchestrate containerized workloads across heterogeneous hardware clusters. It functions as a centralized control plane that manages resource allocation, scheduling, and execution environments, enabling organizations to share high-performance computing infrastructure securely among multiple users and projects. The platform distinguishes itself through advanced hardware virtualization and multi-tenant management capabilities. It supports the partitioning of physical graphics processing units into fractional slices, allowing multiple
Triton Inference Server is a high-performance AI model inference server and multi-framework model runtime designed for deploying machine learning models across cloud, data center, and embedded edge infrastructure. It serves as an execution engine that allows for the concurrent running of models from various frameworks to optimize hardware utilization. The project features a dynamic batching inference engine that groups individual requests into larger batches to increase total processing throughput. It also provides a model ensemble pipeline, which enables the chaining of multiple models toget
Text Embeddings Inference is a high-performance inference server designed to host text embedding and sequence classification models as scalable API endpoints. It provides a vector embedding API to convert text into dense representations and a cross-encoder reranking server for scoring the relevance of document sequences against a query. The project features a GPU-accelerated inference engine that utilizes dynamic batching and specialized kernels to maximize throughput. It offers a high-performance binary interface via gRPC as an alternative to standard HTTP to reduce network latency and seria
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
KubeOperator is a comprehensive Kubernetes cluster management platform, infrastructure orchestrator, and multi-cluster manager. It functions as an enterprise Kubernetes distribution designed to automate the deployment, scaling, and lifecycle management of production clusters across diverse cloud platforms and physical machines. The platform distinguishes itself with specialized capabilities for air-gapped environments, including an offline installation engine that generates software archives and manages private registries for secure, non-internet deployments. It also provides a centralized da
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Cerebro is an administration tool for OpenSearch and Elasticsearch clusters, providing a web-based graphical interface to monitor health and manage performance. It serves as a central console for cluster administration, including the creation and organization of indices, aliases, and index templates. The project distinguishes itself through integrated directory authentication, utilizing LDAP services to manage user identities and access permissions. It also includes a dedicated REST client console for sending manual requests to clusters, featuring autocompletion and the ability to export requ
Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients. The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
Zeebe is a cloud-native workflow engine and distributed state machine designed for business process orchestration using BPMN and DMN standards. It operates as a high-performance gRPC workflow runtime that executes complex business processes through a partitioned event-streaming architecture. The system also functions as an orchestrator for large language model agents, coordinating AI reasoning and tool use within deterministic business processes. The engine is distinguished by its peer-to-peer broker networking and a consensus-based data replication model that ensures high availability and fa
Olares is a comprehensive suite of self-hosted identity, storage, AI, and orchestration services designed for private infrastructure management. It functions as a Kubernetes home server orchestrator, enabling the deployment of containerized applications, AI models, and GPU resources on local hardware to replace third-party cloud services. The platform distinguishes itself through a combination of self-hosted AI infrastructure for running large language models and image generators, alongside a decentralized identity manager that uses cryptographic keys and OIDC for trustless authentication. It
This project is a PyTorch model serving framework designed to deploy and scale machine learning models in production via scalable network endpoints. It functions as a high-performance inference server, optimizer, and model lifecycle manager that handles model loading, request batching, and hardware acceleration. The system distinguishes itself through advanced orchestration and optimization capabilities, such as chaining multiple models into sequential workflows using execution graphs and employing dynamic batching to improve throughput and latency. It provides specialized support for generat
This project is a comprehensive educational resource and tutorial handbook for building, training, and deploying machine learning models using TensorFlow 2. It serves as a structured learning guide covering core deep learning concepts, including neural network architectures, automatic differentiation, and tensor operations. The handbook provides technical guidance on optimizing execution efficiency through GPU memory management, distributed training, and model quantization. It also includes detailed manuals for constructing high-performance data pipelines and exporting models for production s
This project provides a comprehensive guide and set of scripts for deploying and configuring a production-ready Kubernetes cluster from scratch. It centers on establishing a functional environment by installing core management components, storage, and networking across multiple nodes. The implementation emphasizes high availability for the control plane, utilizing layer-4 load balancing and leader election for the API server, scheduler, and controller manager. It further ensures reliability through the deployment of a distributed key-value store for persistent runtime data. The project cover
This project is a comprehensive platform for hosting and interacting with large language models directly on local hardware. It provides a web-based graphical interface that allows users to manage model loading, configure generation parameters, and execute text or chat interactions entirely offline. By running models locally, the software ensures complete data privacy and eliminates reliance on external cloud services for generative tasks. Beyond basic inference, the platform functions as a versatile workbench for generative AI development. It includes an integrated pipeline for fine-tuning mo
HAMi is a hardware orchestration and virtualization system designed to manage accelerators within Kubernetes. It functions as a device plugin that partitions physical hardware into isolated virtual slices, enabling multiple containers to share a single device through enforced memory limits and compute quotas. The project provides a virtualization manager and a heterogeneous compute scheduler that distributes tasks across diverse accelerator types. It uses packing and topology policies to optimize workload placement and allows for specific hardware targeting using unique device identifiers. T
LMDeploy is a high-performance inference engine and deployment framework for large language models and vision models. It functions as a multi-modal model server and compression toolkit designed to serve models with high throughput and low latency. The system enables the distribution of model services across multiple machines using request-based load balancing and tensor parallelism. It includes specialized tools for model quantization and compression to reduce the memory footprint of weights and caches. The framework covers broad capability areas including production deployment, distributed
NATS Server is a high-performance, lightweight messaging system designed for cloud-native applications, edge computing, and distributed microservices. It functions as a distributed publish-subscribe broker that routes messages using hierarchical, dot-separated subject strings, enabling decoupled communication between services without requiring centralized broker lookups. The system supports core messaging patterns including asynchronous publish-subscribe, request-reply, and load-balanced queue processing. The platform distinguishes itself through a decentralized architecture that eliminates t
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
Open WebUI is a self-hosted, web-based platform designed for interacting with local and remote artificial intelligence models. It functions as a unified interface and orchestration suite, enabling users to build, deploy, and manage specialized AI agents equipped with custom instructions, external tool access, and private knowledge bases. The platform distinguishes itself through a modular architecture that supports complex AI workflows. It features a plugin-based framework for custom logic and pipeline-based request processing, allowing developers to filter or transform data streams before th
OpenLLM is a framework for deploying, managing, and scaling open-source large language models