awesome-repositories.com
© 2026 Bringes Technology SRL·VAT RO45896025·hello@bringes.io
Inference Servers and Runtimes · Awesome GitHub Repositories

14 repos

Explore 14 awesome GitHub repositories matching Artificial Intelligence & ML · Inference Servers and Runtimes.

Awesome Inference Servers and Runtimes GitHub Repositories

  • tensorflow/tensorflow

    193,864 stars · GitHub

    TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The syst…

    Deploys models into production environments to handle scalable requests while maintaining consistent inference latency.

    C++ · deep-learning · deep-neural-networks · distributed
  • huggingface/transformers

    156,730 stars · GitHub

    Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering…

    Exports models into a portable format with ahead-of-time memory planning and hardware-specific operation dispatch for edge device inference.

    Python · audio · deep-learning · deepseek
  • Comfy-Org/ComfyUI

    103,654 stars · GitHub

    ComfyUI is a node-based generative AI orchestration engine designed for constructing, testing, and executing complex image and video synthesis pipelines. By utilizing a directed acyclic graph execution model, the platform allows users to build reproducible workflows through modular, interconnected processing blocks wit…

    Serves visual, node-based generative pipelines as programmable API endpoints for integration into external software.

    Python · ai · comfy · comfyui
  • deepseek-ai/DeepSeek-V3

    101,631 stars · GitHub

    DeepSeek-V3 is a large language model that provides comprehensive resources for model utilization, including technical specifications, pre-trained weights, and evaluation benchmarks. The project details the core transformer architecture, including parameter counts and multi-token prediction modules, while supporting na…

    Handles high-performance serving through multi-machine tensor parallelism and mixed-precision execution for large-scale language models.

    Python
  • ggml-org/llama.cpp

    95,400 stars · GitHub

    Llama.cpp is an inference engine designed for the local execution of text-based and multimodal language models on consumer hardware. It provides a core environment for running models that process both text and image inputs, utilizing hardware-accelerated backends to optimize performance across diverse CPU and GPU archi…

    Executes large language models locally on standard consumer hardware with high performance.

    C++ · ggml
  • hacksider/Deep-Live-Cam

    79,568 stars · GitHub

    Deep-Live-Cam is a generative video transformation tool designed for real-time facial manipulation and cinematic enhancement. It functions as a local-first AI runtime, performing all media processing directly on the user's hardware to ensure complete data privacy without external network dependencies. By utilizing a hi…

    Optimizes generative models for low-latency, real-time inference on consumer-grade hardware.

    Python · ai · ai-deep-fake · ai-face
  • nomic-ai/gpt4all

    77,146 stars · GitHub

    GPT4All is a cross-platform runtime environment designed to execute large language models directly on local consumer hardware. By leveraging an optimized C++ inference backend, it enables private, offline AI interactions without requiring an internet connection or external cloud services. The project provides a compreh…

    Delivers a cross-platform execution environment for running large language models locally on consumer hardware.

    C++ · ai-chat · llm-inference
  • mlabonne/llm-course

    75,340 stars · GitHub

    This project is a comprehensive educational curriculum and engineering handbook focused on the lifecycle of large language models. It serves as a structured knowledge base for machine learning practitioners, covering the fundamental mathematical and architectural principles of transformer-based sequence modeling, as we…

    Architectural patterns for scaling model inference range from simple local setups to complex multi-GPU cluster configurations.

    course · large-language-models · llm
  • PaddlePaddle/PaddleOCR

    70,931 stars · GitHub

    PaddleOCR is a comprehensive optical character recognition framework designed for detecting and transcribing text from images and documents into structured, machine-readable formats. It provides a modular computer vision pipeline that decouples image preprocessing, text detection, and character recognition into indepen…

    Facilitates the deployment of text extraction models as scalable services across various hardware environments.

    Python · ai4science · chineseocr · document-parsing
  • vllm-project/vllm

    70,745 stars · GitHub

    vLLM is a high-throughput inference engine designed for the efficient serving and execution of large language models. It functions as a production-ready distributed model server, providing standard API protocols for online serving while also supporting offline batch processing. The system is built to maximize token gen…

    Scales large language model inference to handle high volumes of concurrent requests with minimal latency.

    Python · amd · blackwell · cuda
  • hiyouga/LlamaFactory

    67,386 stars · GitHub

    LlamaFactory is a unified framework for fine-tuning and adapting large language models. It provides a comprehensive platform that standardizes training workflows across diverse machine learning architectures, allowing users to execute both full-tuning and parameter-efficient methods through a single interface. The pro…

    Wraps model execution in a web-accessible interface to provide consistent endpoints for client-side requests.

    Python · agent · ai · deepseek
  • meta-llama/llama

    59,157 stars · GitHub

    Llama is a computational framework and runtime environment designed for executing transformer-based neural networks locally. It functions as a generative AI inference engine, enabling the processing of input sequences through pre-trained model weights to produce text completions and structured data outputs directly on…

    Executes model checkpoints locally with configurable parameters like sequence length and batch size to optimize performance.

    Python
  • ultralytics/yolov5

    56,830 stars · GitHub

    YOLOv5 is a comprehensive computer vision framework designed for end-to-end deep learning, specializing in real-time object detection, image classification, and instance segmentation. It provides a unified toolkit that manages the entire lifecycle of a model, from initial dataset configuration and hyperparameter tuning…

    Executes high-speed visual inference using hardware-accelerated processing and test-time augmentation.

    Python · coreml · deep-learning · ios
  • tensorflow/tfjs-examples

    6,783 stars · GitHub

    This repository provides a collection of practical demonstrations and implementation guides for machine learning tasks using TensorFlow.js. It serves as a resource for developers to explore model architectures, training workflows, and data manipulation techniques across domains such as computer vision, natural language…

    Low-level interfaces allow for precise weight initialization and the construction of custom model architectures using granular tensor operations.

    JavaScript

Explore sub-tags

  • Distributed Model Servers: Services that expose generative model capabilities over network protocols for integration into external applications.
  • High-Throughput Model Serving: Architectures designed to handle large volumes of concurrent inference requests with low latency.
  • Inference API Servers (1 sub-tag): Network services that expose model inference capabilities through standardized web APIs to support automated application workflows.
  • Inference Frameworks: Software libraries for deploying and serving machine learning models.
  • Inference Runtimes (6 sub-tags): Execution environments designed to load and run machine learning models for real-time or high-performance inference tasks.
  • Inference Servers: Services that provide standardized API endpoints for model execution.
  • LLM Serving Architectures: High-performance systems and engineering architectures designed to deploy and serve large language models at scale.
  • Machine Learning Model APIs (1 sub-tag): Standardized programming interfaces that allow applications to interact with and query machine learning models.
  • Model Execution APIs: Interfaces for loading and running pre-trained model assets.
  • Model Inference APIs: Standardized interfaces for serving model predictions via local or remote endpoints.
  • Model Inference Accelerators: Hardware or software components that increase the speed and efficiency of model inference tasks.
  • Multimodal Inference Engines: Software engines capable of processing and generating outputs from multiple data types, such as text, images, and audio, simultaneously.
  • Online Model Servers: Services that provide real-time model inference and chat completions via standard API protocols.
  • Text-Only Inference Engines: Specialized engines optimized exclusively for processing and generating natural language text sequences.