# LLM inference and serving

> Search results for `LLM inference and serving` on awesome-repositories.com. 114 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/llm-inference-and-serving

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/llm-inference-and-serving).**

## Results

- [llm-d/llm-d](https://awesome-repositories.com/repository/llm-d-llm-d.md) (2,514 ⭐) — llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.

The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
- [jina-ai/serve](https://awesome-repositories.com/repository/jina-ai-serve.md) (21,859 ⭐) — Serve is a multimodal AI orchestrator and inference server designed for deploying and scaling machine learning models as cloud-native services. It functions as a containerized workflow engine and distributed service mesh that routes multimodal data through connected execution units.

The framework provides specialized capabilities for large language models, including a token streaming gateway that delivers generated text incrementally to reduce perceived latency. It distinguishes itself by enabling the chaining of executors into complex data processing pipelines and the orchestration of these
- [huggingface/text-generation-inference](https://awesome-repositories.com/repository/huggingface-text-generation-inference.md) (10,775 ⭐) — Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments.

The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom com
- [dusty-nv/jetson-inference](https://awesome-repositories.com/repository/dusty-nv-jetson-inference.md) (8,734 ⭐) — jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput.

The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory.

The codebase covers a broad surface of capabiliti
- [kvcache-ai/ktransformers](https://awesome-repositories.com/repository/kvcache-ai-ktransformers.md) (17,288 ⭐) — KTransformers is a framework for running, fine-tuning, and serving large language models. It functions as a heterogeneous inference engine and quantized execution runtime, enabling the deployment of massive models by distributing computational workloads across both CPU and GPU resources. This architecture allows users to work around local memory constraints, making it possible to run and train models that exceed the capacity of a single device.

The project distinguishes itself through specialized support for sparse architectures, particularly mixture-of-experts mode
- [knative/serving](https://awesome-repositories.com/repository/knative-serving.md) (6,064 ⭐) — Kubernetes-based, scale-to-zero, request-driven compute
- [ai-dynamo/dynamo](https://awesome-repositories.com/repository/ai-dynamo-dynamo.md) (6,112 ⭐) — Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients.

The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and
- [briland/llm-security-and-privacy](https://awesome-repositories.com/repository/briland-llm-security-and-privacy.md) (54 ⭐) — LLM security and privacy
- [livekit/livekit](https://awesome-repositories.com/repository/livekit-livekit.md) (19,358 ⭐) — LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections.

The platform distinguishes itself through it
- [aria42/infer](https://awesome-repositories.com/repository/aria42-infer.md) (176 ⭐) — Inference and machine learning in Clojure
- [nvidia/isaac-gr00t](https://awesome-repositories.com/repository/nvidia-isaac-gr00t.md) (6,222 ⭐)
- [lmcache/lmcache](https://awesome-repositories.com/repository/lmcache-lmcache.md) (6,909 ⭐) — LMCache is a distributed key-value cache manager and tiering system designed to accelerate large language model inference. It functions as a tiered storage layer that offloads tensors from GPU memory to CPU RAM, local disks, or remote object stores, enabling the reuse of cached prefixes across different inference sessions and serving engines.

The system differentiates itself through a disaggregated prefill-decode model, which separates prompt processing from token generation by transferring caches between distributed compute nodes. It utilizes peer-to-peer orchestration to share and retrieve
- [zeit/serve](https://awesome-repositories.com/repository/zeit-serve.md) (9,870 ⭐) — Static file serving and directory listing
- [pytorch/serve](https://awesome-repositories.com/repository/pytorch-serve.md) (4,354 ⭐) — Serve, optimize and scale PyTorch models in production
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and
- [thu-pacman/chitu](https://awesome-repositories.com/repository/thu-pacman-chitu.md) (3,265 ⭐) — Chitu is a distributed serving platform and orchestrator for large language model inference. It functions as a compute manager designed to deploy and scale model workloads across diverse hardware architectures, including GPUs, CPUs, and heterogeneous hardware clusters.

The platform enables model deployment across a wide range of targets, including NVIDIA GPUs, regional chipsets, and legacy hardware. It manages the execution of models across these varying environments to increase available computing capacity and optimize resource utilization.

The system includes capabilities for distributed i
- [honojs/hono](https://awesome-repositories.com/repository/honojs-hono.md) (30,994 ⭐) — Hono is a lightweight web framework built on Web Standard APIs that executes across JavaScript runtimes including Cloudflare Workers, Deno, Bun, and Node.js.
- [tensorflow/serving](https://awesome-repositories.com/repository/tensorflow-serving.md) (6,351 ⭐) — TensorFlow Serving is a high-performance machine learning inference server designed to deploy TensorFlow models to production environments. It functions as a complete serving system that executes predictions on input data through a graph executor, providing network endpoints that eliminate the need for a separate runtime environment for client applications.

The system is distinguished by its model version manager, which organizes and selects specific model versions within a directory hierarchy. It uses a filesystem watcher to detect new model versions and trigger automatic updates without int
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [mlcommons/inference](https://awesome-repositories.com/repository/mlcommons-inference.md) (1,582 ⭐) — Reference implementations of MLPerf® inference benchmarks
- [tensorflow/tensorflow](https://awesome-repositories.com/repository/tensorflow-tensorflow.md) (195,697 ⭐) — TensorFlow is a comprehensive machine learning framework designed for the construction, training, and deployment of complex mathematical models. It utilizes a graph-based execution model that represents operations as directed acyclic graphs, enabling automatic differentiation and efficient parallel processing. The system provides high-level interfaces for defining neural network architectures, alongside a robust engine for managing multidimensional array structures and tensor mathematics.

The framework distinguishes itself through a scalable distributed runtime that orchestrates workloads acr
- [facebook/react](https://awesome-repositories.com/repository/facebook-react.md) (245,669 ⭐) — React is a JavaScript library for building user interfaces based on a component-driven architecture and unidirectional data flow.
- [mlabonne/llm-course](https://awesome-repositories.com/repository/mlabonne-llm-course.md) (80,178 ⭐) — This project is a comprehensive educational curriculum and engineering handbook focused on the lifecycle of large language models. It serves as a structured knowledge base for machine learning practitioners, covering the fundamental mathematical and architectural principles of transformer-based sequence modeling, as well as the practical implementation of supervised instruction fine-tuning and preference-based model alignment.

The repository distinguishes itself by providing a deep dive into advanced model composition and optimization techniques. It details methodologies for weight-space mode
- [developmentseed/fastai-serving](https://awesome-repositories.com/repository/developmentseed-fastai-serving.md) (121 ⭐) — A Docker image for serving fastai models, mimicking the API of TensorFlow Serving. It is designed for running batch inference at scale. It is not optimized for performance (but it's not that slow).
- [skyzh/tiny-llm](https://awesome-repositories.com/repository/skyzh-tiny-llm.md) (4,304 ⭐) — A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
- [modeltc/lightllm](https://awesome-repositories.com/repository/modeltc-lightllm.md) (3,901 ⭐) — LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images.

The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs specula
- [11ty/eleventy](https://awesome-repositories.com/repository/11ty-eleventy.md) (19,670 ⭐) — Eleventy is a JavaScript-based static site generator designed to transform templates, data files, and markdown into optimized HTML. It functions as a versatile template rendering engine and content management framework, allowing developers to aggregate data from diverse sources—including local files, databases, and external APIs—to populate structured web content.

The project is distinguished by its template-engine-agnostic pipeline, which decouples the build process from specific rendering languages. This allows users to integrate multiple template formats, such as Liquid, Nunjucks, Handleba
- [crystal-lang/crystal](https://awesome-repositories.com/repository/crystal-lang-crystal.md) (20,299 ⭐) — Crystal is a statically typed, compiled programming language designed for high performance and memory safety. It leverages an LLVM-based compiler to translate source code into optimized machine-executable binaries, while its type-inference-based static analysis enforces strict safety rules during the build process.

The language distinguishes itself through a fiber-based concurrent runtime that manages lightweight execution units for asynchronous input and output without blocking the main process. It also features a powerful compile-time macro system that allows for the inspection and transfor
- [xorbitsai/inference](https://awesome-repositories.com/repository/xorbitsai-inference.md) (9,358 ⭐) — This project is a platform for the deployment of open source large language and multimodal models. It provides a unified interface to serve text, image, and speech models across local or cloud hardware.

The system enables distributed AI inference by orchestrating model workloads across multiple nodes and devices. It includes a unified API adapter layer to standardize inputs and outputs, as well as tools for multimodal chat and structural image generation.

The platform covers a broad capability surface including request batching for throughput optimization, dynamic model loading, and integrat
- [zackriya-solutions/meeting-minutes](https://awesome-repositories.com/repository/zackriya-solutions-meeting-minutes.md) (12,757 ⭐) — This project is a self-hosted meeting transcription and summarization tool that converts audio recordings into text transcripts and structured notes using large language models. It functions as an enterprise meeting documentation manager, allowing for the organization and editing of timestamped records.

The system prioritizes data privacy through local-first processing and the ability to deploy on private infrastructure. It supports a provider-agnostic architecture, enabling users to connect to local AI engines, self-hosted servers, or cloud-based API endpoints for both transcription and summ
- [gokumohandas/made-with-ml](https://awesome-repositories.com/repository/gokumohandas-made-with-ml.md) (48,343 ⭐) — Made-With-ML is an automated documentation generator and developer experience platform designed to transform source code into structured, searchable reference websites. It functions as a codebase intelligence tool that parses implementation details to provide clear explanations of logic and data requirements.

The system distinguishes itself by leveraging language-level type annotations and structured code comments to generate interface specifications. By utilizing static analysis to extract metadata, it automates the transformation of docstrings into web-ready documentation, ensuring that tec
- [vercel/serve](https://awesome-repositories.com/repository/vercel-serve.md) (9,863 ⭐) — Serve is a Node.js static file server that delivers assets and single-page applications from a local directory over HTTP. It functions as both a command-line web server for hosting directories directly from the terminal and as HTTP middleware for integrating static asset delivery into existing servers.

The project includes a directory browser interface that provides a web-based file explorer for navigating and accessing files within a served folder. It supports single-page application fallback by redirecting unmatched request paths to a root file to enable client-side routing.

The server han
- [infiniflow/ragflow](https://awesome-repositories.com/repository/infiniflow-ragflow.md) (82,922 ⭐) — This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations.

The platform distinguishes itself through deep document understanding and sophisticated know
- [huggingface/llm-swarm](https://awesome-repositories.com/repository/huggingface-llm-swarm.md) (288 ⭐) — Manage scalable open LLM inference endpoints in Slurm clusters
- [open-llm-vtuber/open-llm-vtuber](https://awesome-repositories.com/repository/open-llm-vtuber-open-llm-vtuber.md) (5,946 ⭐)
- [facebook/infer](https://awesome-repositories.com/repository/facebook-infer.md) (15,646 ⭐) — Infer is a static analysis toolset for Java, C, C++, and Objective-C designed to detect memory leaks, null dereferences, and resource bugs. It functions as a multi-language bug finder that identifies race conditions, deadlocks, and memory safety issues by translating source code into a common intermediate representation for analysis.

The project distinguishes itself through an inter-procedural data flow analyzer that tracks movement between sources and sinks to detect tainted flows and generate data flow graphs. It also includes a framework for verifying temporal properties and reachability u
- [denoland/deno](https://awesome-repositories.com/repository/denoland-deno.md) (107,110 ⭐) — Deno is a high-performance runtime for JavaScript and TypeScript that prioritizes security and developer productivity. Built on the V8 engine, it provides a secure execution environment that enforces a default-deny security model, requiring explicit user authorization for access to system resources like the file system, network, and environment variables. The runtime natively supports modern web-standard APIs, ensuring consistent behavior and portability across different environments.

What distinguishes Deno is its integrated approach to the software development lifecycle. It bundles essentia
- [coleam00/local-ai-packaged](https://awesome-repositories.com/repository/coleam00-local-ai-packaged.md) (3,539 ⭐) — This project is a containerized local AI infrastructure stack designed to deploy large language models and vector databases on private hardware. It functions as an orchestration platform that combines AI runners, knowledge graphs, and a visual workflow builder for creating agentic chatflows and automating tasks via tool integration.

The platform distinguishes itself through a low-code approach to agent orchestration, utilizing a visual interface to design complex sequences and connect agents to external tools and search engines. It includes a dedicated local observability stack to track promp
- [openbmb/minicpm](https://awesome-repositories.com/repository/openbmb-minicpm.md) (9,464 ⭐) — MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks.

The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
- [fishaudio/fish-speech](https://awesome-repositories.com/repository/fishaudio-fish-speech.md) (24,928 ⭐) — This project is a generative speech synthesis engine that converts text into high-fidelity human speech. It utilizes a two-stage autoregressive transformer architecture that separates semantic token prediction from acoustic detail reconstruction to balance linguistic accuracy with audio quality. The system is designed to support multilingual output and conversational AI development, enabling the generation of context-aware speech that maintains flow across multiple dialogue turns.

The platform distinguishes itself through a production-ready inference server that employs continuous batching to
- [cheahjs/free-llm-api-resources](https://awesome-repositories.com/repository/cheahjs-free-llm-api-resources.md) (11,612 ⭐) — This project is a community-driven repository that serves as a directory for artificial intelligence providers offering free usage tiers and trial credits for large language model inference. It functions as a resource for developers to discover and integrate external AI services into applications while minimizing initial infrastructure costs.

The repository provides structured metadata that enables developers to track request constraints, token limits, and rate requirements across multiple providers. By utilizing standardized data structures and declarative configuration, it assists in managi
- [sjtu-ipads/powerinfer](https://awesome-repositories.com/repository/sjtu-ipads-powerinfer.md) (9,568 ⭐) — PowerInfer is an inference engine and serving framework designed to run large language models on local hardware. It combines a hybrid CPU-GPU offloader, a quantization tool, and a sparse model optimizer to enable the execution of high-parameter models on consumer-grade devices.

The system distinguishes itself through neuron-activation-based offloading, using a predictor model to preload frequent neurons into VRAM while keeping rare neurons in system memory. This hybrid execution model balances workloads between the GPU and CPU based on input patterns to optimize memory access and increase tok
- [zccyman/pytorch-inference](https://awesome-repositories.com/repository/zccyman-pytorch-inference.md) (89 ⭐) — PyTorch 1.0 inference in C++ on Windows 10 platforms
- [hannibal046/awesome-llm](https://awesome-repositories.com/repository/hannibal046-awesome-llm.md) (26,933 ⭐) — This project serves as a comprehensive, static directory of external resources dedicated to the study and application of large language models. It functions as a centralized discovery point for developers and researchers, aggregating foundational academic papers, technical documentation, and specialized tools within a structured, version-controlled knowledge base.

The repository distinguishes itself through a multi-level classification system that organizes diverse technical domains, ranging from model training frameworks and inference optimization to AI safety and hallucination detection. By
- [sindresorhus/electron-serve](https://awesome-repositories.com/repository/sindresorhus-electron-serve.md) (482 ⭐) — Static file serving for Electron apps
- [microsoft/deepspeedexamples](https://awesome-repositories.com/repository/microsoft-deepspeedexamples.md) (6,822 ⭐) — DeepSpeedExamples is a collection of reference implementations for training and deploying large scale AI models using the DeepSpeed optimization library. It provides Python code examples for training massive models across multiple GPUs through distributed optimization techniques.

The repository includes optimized patterns for deploying and running large language model predictions in production environments. It also serves as a guide for model compression to reduce memory footprints and as a source for performance benchmarks to measure execution speed and resource utilization.

The project cov
- [zeit/serve-handler](https://awesome-repositories.com/repository/zeit-serve-handler.md) (617 ⭐) — This package represents the core of serve. It can be plugged into any HTTP server and is responsible for routing requests and handling responses.
- [googlechrome/workbox](https://awesome-repositories.com/repository/googlechrome-workbox.md) (12,895 ⭐) — Workbox is a modular library and toolkit designed for managing service workers in progressive web applications. It provides a comprehensive framework for handling asset caching, request routing, and background script lifecycle management, enabling developers to build web applications that function reliably offline and load efficiently.

The project distinguishes itself through a declarative routing engine and a plugin-based architecture that allows for the injection of custom logic into the request and response processing pipeline. It supports advanced caching patterns, such as cache-first or
- [wdndev/llm_interview_note](https://awesome-repositories.com/repository/wdndev-llm-interview-note.md) (12,438 ⭐) — This project is a comprehensive technical reference and educational resource focused on the lifecycle of large language models. It provides structured learning materials that cover the foundational mechanics of transformer architectures, the mathematical principles of attention mechanisms, and the engineering practices required for modern generative artificial intelligence.

The repository serves as a guide for both technical skill development and professional preparation, offering a curriculum that spans from model training and inference optimization to advanced alignment techniques. It detai
- [heyputer/puter](https://awesome-repositories.com/repository/heyputer-puter.md) (42,318 ⭐) — Puter is a browser-based desktop environment and cloud-native development platform that provides a virtualized graphical workspace. It enables developers to build and deploy full-stack web applications by integrating cloud storage, authentication, and serverless backend logic directly into the browser, eliminating the need for traditional server infrastructure.

The platform distinguishes itself through a unified cloud storage layer and a distributed network runtime that facilitates peer-to-peer communication and cross-origin resource fetching. It features a sophisticated cross-window orchestr