6 Repos
Hosting models from diverse deep learning frameworks across varied hardware accelerators.
Distinct from Model Serving Frameworks: Specifically addresses the ability to serve models from multiple different frameworks simultaneously.
Explore 6 awesome GitHub repositories matching artificial intelligence & ml · Multi-Framework Model Serving. Refine with filters or upvote what's useful.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Serves models from multiple frameworks across diverse hardware accelerators and CPUs using optimized configurations.
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Supports serving models from TensorFlow, PyTorch, Scikit-Learn, XGBoost, ONNX, and Hugging Face with standardized inference protocols.
KServe is an open platform for deploying and serving generative and predictive AI models on Kubernetes. It defines inference services as custom resources with declarative YAML specifications, enabling a Kubernetes-native approach to model deployment and lifecycle management. The platform leverages Knative-based serverless scaling for automatic scale-to-zero and revision management, and supports a pluggable serving runtime architecture that maps model formats to containerized execution environments. KServe distinguishes itself through model-aware autoscaling that scales replicas based on token
Runs exported models from TensorFlow, PyTorch, Scikit-learn, XGBoost, and others behind a unified inference endpoint.
SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs. The project distinguishes itself by supporting multiple high-performance inference backends
Loads full-precision models using the vLLM backend with PagedAttention and tensor parallel multi-GPU acceleration.
Dieses Projekt ist eine umfassende Bildungsressource und ein Tutorial-Handbuch für das Erstellen, Trainieren und Bereitstellen von Machine-Learning-Modellen mit TensorFlow 2. Es dient als strukturierter Lernleitfaden für grundlegende Deep-Learning-Konzepte, einschließlich neuronaler Netzwerkarchitekturen, automatischer Differenzierung und Tensor-Operationen. Das Handbuch bietet technische Anleitungen zur Optimierung der Ausführungseffizienz durch GPU-Speicherverwaltung, verteiltes Training und Modellquantisierung. Es enthält zudem detaillierte Anleitungen für den Aufbau leistungsfähiger Datenpipelines und den Export von Modellen für Produktionsserver, mobile Geräte und Webbrowser. Das Material deckt ein breites Spektrum an Funktionen ab, darunter die Modellentwicklung mit konvolutionellen und rekurrenten Netzwerken, die Implementierung benutzerdefinierter Verlustfunktionen und Layer sowie die Nutzung vortrainierter Modelle für Transfer Learning. Zudem werden Bereitstellungsstrategien für Edge-Geräte und die Nutzung cloudbasierter Runtimes zur Hardwarebeschleunigung behandelt. Die Ressource ist als Sammlung von Jupyter Notebooks implementiert.
Explains how to load specific model versions and automatically update to the latest deployment version.
vllm-omni is a high-throughput serving engine and distributed inference framework designed for omni-modal models. It serves as a multi-modal model API server capable of generating text, image, video, and audio data, providing a standardized interface for remote client access. The system features a non-autoregressive generation engine for parallel media production and a robot policy inference server that acts as a real-time communication bridge to robotic hardware using specialized protocols. It supports hybrid execution models that combine sequential token generation with parallelized media g
Serves as a high-throughput runtime for omni-modal models using vLLM's PagedAttention and tensor parallelism.