# Speculative Decoding Frameworks

> Search results for `speculative decoding to make LLM generation faster` on awesome-repositories.com. 116 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/speculative-decoding-to-make-llm-generation-faster

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/speculative-decoding-to-make-llm-generation-faster).**

## Results

- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and performance, including techniques like quantization, speculative decoding, and paged memory management for key-value caches. It provides native integration for distributed training across multi-node clusters, as well as flexible APIs for serving models via compatible inference servers. Developers can also utilize built-in utilities for model patching, custom kernel execution, and automated documentation generation to streamline development workflows.
- [guillaumekln/faster-whisper](https://awesome-repositories.com/repository/guillaumekln-faster-whisper.md) (23,679 ⭐) — faster-whisper is an automatic speech recognition framework and an optimized implementation of the Whisper speech-to-text engine. It functions as a CTranslate2 inference engine designed to convert spoken audio into written text.

The project serves as a model quantization tool that transforms large audio model weights into lower precision formats. This process reduces memory usage and increases execution speed on hardware by utilizing integer quantized weights.

The framework covers a broad range of capabilities including batch audio transcription for parallel processing and voice activity detection to filter out non-speech audio segments. It also provides utilities for converting original or fine-tuned audio models into formats compatible with the CTranslate2 runtime.
- [llm-d/llm-d](https://awesome-repositories.com/repository/llm-d-llm-d.md) (2,514 ⭐) — llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.

The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, moving memory blocks between GPU memory, host RAM, and shared storage to support long-context workloads.

The framework covers comprehensive traffic management and scaling capabilities, including SLO-aware autoscaling, cache-affinity routing, and predictive latency scoring. It also provides mechanisms for offline batch processing and high-availability scheduler management to balance interactive traffic with asynchronous workloads.

The system exposes these capabilities via an OpenAI-compatible chat completion API.
- [facebookresearch/fairseq](https://awesome-repositories.com/repository/facebookresearch-fairseq.md) (32,228 ⭐) — Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning.

The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specialized tools for data engineering, such as parallel data mining for unsupervised learning and back-translation for expanding training corpora.

Its capability surface extends to comprehensive inference and generation tools, including beam search and lexical constraint enforcement, as well as model compression techniques like layer pruning and product quantization. The toolkit also provides utilities for feature extraction, model evaluation via metrics like perplexity and BLEU scores, and a registry-based system for extending models and tasks.

Training and evaluation workflows are managed through a command-line interface that orchestrates hyperparameter configuration and model execution.
- [intel/ipex-llm](https://awesome-repositories.com/repository/intel-ipex-llm.md) (8,836 ⭐) — Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats.

The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XPU backends, including the ability to execute large Mixture-of-Experts models on consumer-grade hardware and perform NPU-specific model conversion.

The library covers a broad range of capabilities, including inference optimization via speculative decoding and KV-cache compression, workload distribution through tensor and pipeline parallelism, and the deployment of local retrieval-augmented generation pipelines. It also supports multimodal execution for visual question answering and audio transcription, alongside OpenAI-compatible API serving.
- [vllm-project/speculators](https://awesome-repositories.com/repository/vllm-project-speculators.md) (518 ⭐) — A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
- [stas00/ml-engineering](https://awesome-repositories.com/repository/stas00-ml-engineering.md) (18,124 ⭐) — This project is a comprehensive engineering framework and technical reference for managing, scaling, and optimizing distributed machine learning infrastructure. It provides a suite of methodologies and diagnostic tools designed to support large-scale model training and inference on high-performance computing clusters.

The project distinguishes itself through a specialized diagnostic toolkit and infrastructure optimization suite that addresses the complexities of multi-node environments. It enables precise control over cluster resources, including hardware maintenance, network topology configuration, and the orchestration of containerized workloads. By integrating performance benchmarking, numerical stability validation, and automated fault detection, it allows engineers to identify and resolve bottlenecks or hardware failures within distributed systems.

Beyond core orchestration, the project covers a broad range of operational capabilities including distributed file system management, automated checkpointing, and storage lifecycle optimization. It provides utilities for training performance tuning, inference scaling, and the enforcement of structured outputs, ensuring that both training and deployment pipelines remain efficient and reliable.

The repository serves as a technical guide for distributed machine learning engineering, offering automation scripts and diagnostic procedures for GPU and TPU clusters.
- [gofiber/fiber](https://awesome-repositories.com/repository/gofiber-fiber.md) (39,849 ⭐) — Fiber is a high-performance web framework designed for building scalable HTTP services with minimal memory overhead. It provides a comprehensive runtime environment for managing the full request lifecycle, utilizing an optimized radix tree for high-speed route matching and an object pooling system to reduce garbage collection pressure during traffic processing.

The framework distinguishes itself through its multi-process architecture, which supports prefork socket reuse to distribute incoming traffic across all available CPU cores. It offers a modular approach to application development, featuring fluent route grouping, middleware chaining, and automated data binding that maps request payloads to structured objects using field tags. Developers can also leverage a built-in HTTP client for outgoing requests, complete with support for connection pooling, request hooks, and streaming responses.

Beyond core routing and request handling, the project includes extensive tools for server-side HTML rendering, centralized error management, and context-aware logging. It stays compatible with the broader ecosystem by providing adapter layers that allow for the integration of standard library handlers and middleware.

The framework is configured through a central application controller that manages lifecycle hooks, service registration, and dynamic route updates. It is designed to be installed and integrated into Go projects to facilitate the development of structured, high-throughput web interfaces.
- [damirsvrtan/fasterer](https://awesome-repositories.com/repository/damirsvrtan-fasterer.md) (1,821 ⭐) — ⚡ Don't make your Rubies go fast. Make them go fasterer ™. ⚡
- [hacker-dom/decode](https://awesome-repositories.com/repository/hacker-dom-decode.md) (6 ⭐) — Decode is a package to make it easier for you to develop on Ethereum. In particular, it parses tx's submitted to a local testrpc node to make them more readable.
- [ymcui/chinese-llama-alpaca-2](https://awesome-repositories.com/repository/ymcui-chinese-llama-alpaca-2.md) (7,136 ⭐) — This project provides a Chinese large language model based on the LLaMA architecture. It is an instruction-tuned model optimized for natural language processing and multi-turn conversations in Chinese.

The system includes a framework for parameter-efficient fine-tuning using low-rank adaptation and quantization to reduce memory requirements. It also implements retrieval augmented generation for local document question answering and supports long-context processing for sequences up to 64K tokens.

The project covers a broad set of capabilities including supervised instruction tuning, reinforcement learning from human feedback for safety alignment, and multi-GPU distributed training. It also provides tools for model weight quantization, speculative decoding for inference acceleration, and a web-based interface for model interaction.
- [openbmb/minicpm](https://awesome-repositories.com/repository/openbmb-minicpm.md) (9,464 ⭐) — MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks.

The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughput.

Capability areas cover the full model lifecycle, including supervised fine-tuning and preference optimization via parameter-efficient LoRA adapters. The system supports structured tool calling for external agent integration and provides various serving options, including OpenAI-compatible APIs, REST endpoints, and a command-line interface.

The implementation includes tools for converting model checkpoints between formats and distributing training workloads across multiple GPUs.
- [hexojs/hexo](https://awesome-repositories.com/repository/hexojs-hexo.md) (41,768 ⭐) — Hexo is a command-line static site generator designed for content-driven blogging and website creation. It functions as a structured framework that transforms plain text files and markdown into production-ready static websites, utilizing a template-based rendering engine to separate site content from visual presentation.

The project is distinguished by its event-driven build pipeline, which manages the entire site lifecycle through a series of hooks for file processing, asset generation, and deployment. Developers can extend the system’s core capabilities through a modular plugin architecture, allowing for custom rendering engines and specialized site-wide functionality. The platform also provides a local development server for real-time previewing and file change monitoring to ensure efficient build performance during the authoring process.

Beyond its core generation capabilities, the system includes comprehensive tools for managing site metadata, URL structures, and content organization through front-matter configuration. It supports complex asset management, including post-specific folders and automated path resolution, alongside a suite of tag plugins for injecting dynamic elements like code blocks and media directly into content. The platform also features built-in deployment automation, enabling direct synchronization of generated files to various remote hosting environments and cloud platforms.

Hexo is installed and managed via command-line utilities, with documentation and configuration centered around a project-based directory structure.
- [paritytech/scale-decode](https://awesome-repositories.com/repository/paritytech-scale-decode.md) (0 ⭐) — This crate makes it easy to decode SCALE encoded bytes into a custom data structure with the help of a TypeResolver (one of which is a scale_info::PortableRegistry). By using this type information to guide decoding (instead of just trying to decode bytes based on the shape of the target type),…
- [nccgroup/decoder-improved](https://awesome-repositories.com/repository/nccgroup-decoder-improved.md) (139 ⭐) — Improved decoder for Burp Suite
- [bytebytegohq/system-design-101](https://awesome-repositories.com/repository/bytebytegohq-system-design-101.md) (83,491 ⭐) — This project is a centralized engineering knowledge repository that provides a structured curriculum for mastering system design, architectural patterns, and fundamental software development workflows. It serves as a professional development resource for engineers, offering foundational knowledge and real-world case studies to support the design of scalable, secure, and efficient distributed systems.

The repository distinguishes itself through a visual-first approach to knowledge synthesis, distilling complex technical concepts into high-density graphical diagrams and succinct illustrations. By employing cross-domain concept mapping and modular topic decomposition, it connects disparate engineering disciplines—such as infrastructure, security, and application layers—into granular, self-contained modules that facilitate rapid mental modeling and targeted learning.

The content covers a broad spectrum of technical domains, including API and web development, database scaling strategies, networking protocols, and DevOps deployment pipelines. These educational assets are organized as a static, version-controlled repository, allowing users to consume technical insights asynchronously at their own pace.
- [openvinotoolkit/openvino](https://awesome-repositories.com/repository/openvinotoolkit-openvino.md) (10,414 ⭐) — OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models.

The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and a graph-based inference pipeline that orchestrates sequences of models and custom logic nodes.

The platform covers a broad range of capabilities, including comprehensive model preparation via framework conversion and precision quantization, high-performance model serving through REST and gRPC endpoints, and deep observability through performance profiling and hardware affinity visualization. It also provides extensive deployment options ranging from bare metal server binaries to Kubernetes orchestration.
- [predibase/lorax](https://awesome-repositories.com/repository/predibase-lorax.md) (3,724 ⭐) — LoRAX is a GPU-accelerated inference server and multi-adapter engine designed for serving large language models. It functions as a high-throughput system capable of deploying models via Kubernetes and managing the dynamic swapping of Low-Rank Adaptation adapters per request.

The server distinguishes itself through multi-adapter dynamic batching, which allows requests using different adapter weights to be processed in a single GPU forward pass. It employs just-in-time adapter loading and weighted adapter merging to maximize throughput and enable multi-tasking without sacrificing performance.

The project provides a standardized interface for chat and completions that is compatible with common API protocols, supporting structured outputs via JSON schema enforcement. Its performance surface includes tensor parallelism, speculative decoding, paged attention, and model weight quantization to reduce latency and memory overhead.

Infrastructure is managed through Helm charts for Kubernetes orchestration, with integrated telemetry exported via Prometheus and OpenTelemetry.
- [microsoft/faster](https://awesome-repositories.com/repository/microsoft-faster.md) (6,606 ⭐) — FASTER is a high-throughput key-value store that combines an in-memory data store with a hybrid memory-disk storage engine, enabling datasets larger than available RAM. It uses a latch-free, cache-optimized index for concurrent point lookups and heavy updates, and records all mutations to a persistent append-only log on disk with checksum validation and group-commit checkpointing for crash recovery.

The system supports multi-key transactional workloads through atomic multi-key locking, ensuring transactional consistency without coarse-grained contention. It exposes the key-value store to remote clients over a custom TCP protocol that scales linearly with the number of concurrent connections, and provides atomic value merging for read-modify-write operations.

FASTER includes a non-blocking checkpoint-recovery model that restores consistent state after crashes, and provides high-performance iterators for reading through the persistent log sequentially. The append-only log engine supports frequent low-latency commits and saturates disk bandwidth, while the hybrid storage layer seamlessly spills cold data to fast local or cloud storage.
- [unslothai/unsloth](https://awesome-repositories.com/repository/unslothai-unsloth.md) (66,628 ⭐) — Unsloth is a high-performance training and inference platform designed to optimize the lifecycle of large language and multimodal models. It provides a comprehensive engine for fine-tuning, executing, and managing models locally, with a focus on reducing memory consumption and increasing compute speed on consumer-grade hardware.

The platform distinguishes itself through hand-optimized kernels and automated computational graph techniques that maximize hardware throughput. It supports advanced training methodologies, including reinforcement learning for reasoning and efficient adapter-based fine-tuning, while offering a unified web-based interface for no-code model training, data preparation, and real-time performance monitoring.

Beyond its core training capabilities, the project includes a local inference runtime that supports API-based deployment, tool-calling, and automated output verification. It manages the entire model development process, from dataset generation and hyperparameter configuration to model exporting and performance benchmarking across diverse hardware configurations.

The software provides setup utilities for local development environments and includes diagnostic tools to assist with installation and hardware compatibility.
- [yandex/faster-rnnlm](https://awesome-repositories.com/repository/yandex-faster-rnnlm.md) (564 ⭐) — Faster Recurrent Neural Network Language Modeling Toolkit with Noise Contrastive Estimation and Hierarchical Softmax
- [clickhouse/clickhouse](https://awesome-repositories.com/repository/clickhouse-clickhouse.md) (48,229 ⭐) — ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring.

The platform distinguishes itself through advanced storage and execution techniques, including vectorized query processing and a merge tree storage engine that maintains performance during massive insertions. It features adaptive subcolumn mapping for semi-structured data and supports native vector search for machine learning and generative AI applications. To facilitate efficient data movement, the engine utilizes zero-copy shared memory buffers, minimizing overhead when interacting with external analytical tools or processing diverse file formats like Parquet, JSON, and Arrow.

Beyond its core storage and processing capabilities, the project provides a comprehensive suite of tools for observability, security, and data integration. It includes built-in support for natural language querying, automated workflow orchestration for AI agents, and extensive diagnostic features for query plan inspection. The platform also offers robust cloud infrastructure management, including support for private networking, compliant deployment strategies, and integrated billing consolidation.
- [dusty-nv/jetson-inference](https://awesome-repositories.com/repository/dusty-nv-jetson-inference.md) (8,734 ⭐) — jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput.

The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory.

The codebase covers a broad surface of capabilities, including real-time video analytics, object detection and tracking, and image segmentation. It also integrates hardware-accelerated decoding and TensorRT-based inference to optimize model execution on embedded platforms.

The project provides a TensorRT inference wrapper and an embedded vision SDK to facilitate the deployment of neural network primitives.
- [consensys/abi-decoder](https://awesome-repositories.com/repository/consensys-abi-decoder.md) (643 ⭐) — Node.js and JavaScript library for decoding data params and events from Ethereum transactions
- [chalarangelo/30-seconds-of-code](https://awesome-repositories.com/repository/chalarangelo-30-seconds-of-code.md) (128,121 ⭐) — 30-seconds-of-code is a comprehensive knowledge base and programming snippet library designed to support software engineering education and professional development. It provides a curated collection of reusable code units and technical guides that help developers master core language mechanics, design patterns, and architectural philosophies.

The project distinguishes itself by offering a wide-ranging library of algorithmic solutions and web development patterns that are organized into modular, independently testable units. It emphasizes functional programming paradigms and declarative logic, allowing developers to integrate standardized implementations of data structures and algorithms into their own projects while minimizing side effects.

Beyond core programming tasks, the repository covers a broad capability surface including frontend component engineering, data processing, and version control workflow optimization. It provides practical tools for managing complex object relationships, implementing search and sorting algorithms, and streamlining repository management through custom command aliases and history manipulation.

The project is maintained as a technical reference, offering educational content and code snippets that are accessible for browsing and integration into various JavaScript and web development environments.
- [ericlbuehler/mistral.rs](https://awesome-repositories.com/repository/ericlbuehler-mistral-rs.md) (6,597 ⭐) — mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware.

The project distinguishes itself through an agentic tool execution framework that runs server-side tools like code execution, shell commands, and web search in an automated loop during model generation, with session state persistence. It provides an in-process inference engine that can be embedded directly into Rust or Python applications without a separate server process, and includes an in-situ quantization engine that converts model weights to lower precision at load time with per-layer tuning. The system supports structured output constraints, forcing model output to conform to JSON Schema or grammar specifications during decoding, and offers automatic architecture detection that identifies model type, quantization format, and chat template from a Hugging Face model ID.

The platform includes capabilities for managing LoRA adapters, composing models as mixture-of-experts configurations, and running distributed inference across multiple GPUs or nodes using tensor parallelism and ring transport. It provides a built-in web chat interface, supports speculative decoding with a smaller assistant model, and offers benchmarking, logging, and Prometheus metrics for monitoring. The project can be run from a configuration file, with options for customizing build processes, tuning hardware settings automatically, and managing model caches.
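
Several entries in this listing mention speculative decoding with a smaller assistant (draft) model. As a rough illustration of the general idea only, and not of how mistral.rs or any other listed project implements it, here is a minimal greedy sketch in Python. The two "models" are hypothetical toy functions invented for this example: the draft proposes a few tokens cheaply, the target checks them, and the longest agreeing prefix is kept.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def target_model(prefix: List[Token]) -> Token:
    """Hypothetical large model: next token is the prefix sum modulo 7."""
    return sum(prefix) % 7


def draft_model(prefix: List[Token]) -> Token:
    """Hypothetical small model: agrees with the target unless len(prefix) % 4 == 0."""
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real engine would
        #    score all positions in one batched forward pass; this sketch loops.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # 3. First disagreement: keep the agreeing prefix and use the
                #    target's own token at this position.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token was accepted; if there is room, the target
            # also contributes one extra token.
            out.extend(proposal)
            if len(out) < goal:
                out.append(target(out))
    return out


prompt = [1, 2, 3]
fast = speculative_decode(target_model, draft_model, prompt, n_new=10)
slow = greedy_decode(target_model, prompt, n_new=10)
print(fast == slow)
```

In this greedy variant every kept token is one the target would have chosen itself, so the result matches plain greedy decoding with the target. Any speedup in a real system comes from verifying several draft tokens in one target pass. Sampling-based variants use a probabilistic accept/reject rule instead of exact token matching.
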
- [samypesse/how-to-make-a-computer-operating-system](https://awesome-repositories.com/repository/samypesse-how-to-make-a-computer-operating-system.md) (0 ⭐) — How to Make a Computer Operating System
- [jaykali/maskphish](https://awesome-repositories.com/repository/jaykali-maskphish.md) (3,020 ⭐) — Maskphish is a comprehensive security toolkit that integrates capabilities for digital forensics, network vulnerability scanning, open-source intelligence, penetration testing, and social engineering. It functions as a multi-purpose framework for automating reconnaissance and executing security audits across diverse network environments.

The project features a specialized phishing and social engineering toolkit used for cloning websites, masking URLs, and deploying deceptive pages to capture user credentials. It also includes a remote access Trojan builder for generating platform-specific executables and mobile application packages to establish remote command sessions.

The framework covers a broad surface of capabilities, including web application penetration testing, OSINT reconnaissance, memory and disk forensics, and wireless network auditing. It provides tools for payload generation, credential theft, and the automation of information gathering from public data sources.

This project is implemented primarily as a shell-based application.
- [sgl-project/sglang](https://awesome-repositories.com/repository/sgl-project-sglang.md) (29,079 ⭐) — SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems.

The system distinguishes itself through a disaggregated architecture that separates compute-intensive prompt processing from memory-intensive token generation across distinct hardware nodes. This approach, combined with a continuous batching engine and graph-captured kernel execution, maximizes hardware utilization and throughput. It also features dynamic adapter injection, allowing for the runtime switching of fine-tuning modules without requiring server restarts, and a hierarchical key-value cache management system that distributes state across GPU, host RAM, and external storage to support extended context windows.

Beyond core serving, the project includes comprehensive capabilities for structured output generation, enforcing machine-readable formats like JSON schemas and regular expressions during the inference process. It supports advanced performance techniques such as speculative decoding, multi-token prediction, and sparse attention mechanisms. The engine also provides robust tools for traffic management, reliability enforcement, and distributed observability, ensuring consistent performance across heterogeneous hardware clusters.
- [appwrite/appwrite](https://awesome-repositories.com/repository/appwrite-appwrite.md) (56,318 ⭐) — Appwrite is a backend-as-a-service platform that provides a unified development environment for building full-stack applications. It integrates essential infrastructure components—including authentication, databases, storage, and serverless functions—into a single, centralized interface to simplify application development and resource management.

The platform distinguishes itself through a container-based microservices architecture that ensures consistent execution across diverse infrastructure. It features a versatile connectivity layer that links frontend applications with third-party services, databases, and external APIs through standardized interfaces. Developers can manage and automate the configuration of these backend resources using infrastructure-as-code tools, while granular role-based access control enforces security policies across all platform resources and API endpoints.

Beyond its core services, the platform offers a broad capability surface that includes cross-platform data synchronization, event-driven webhooks, and comprehensive billing and usage monitoring. It supports extensive integrations for AI utilities, payment processing, messaging, and logging, allowing developers to extend application functionality through modular, event-driven workflows.

The platform is designed for both managed and self-hosted deployments, providing tools for production environment optimization, data migration, and custom domain configuration.
- [harishsg993010/llm-reasoner](https://awesome-repositories.com/repository/harishsg993010-llm-reasoner.md) (0 ⭐) — Make any LLM think deeper, like OpenAI o1 and DeepSeek R1!
- [longcw/faster_rcnn_pytorch](https://awesome-repositories.com/repository/longcw-faster-rcnn-pytorch.md) (1,777 ⭐) — Faster RCNN with PyTorch
- [ggerganov/llama.cpp](https://awesome-repositories.com/repository/ggerganov-llama-cpp.md) (116,912 ⭐) — llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search.

The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal grammars to force model outputs to adhere to specific JSON schemas or patterns, and it implements speculative decoding to increase inference speed.

Broad capabilities include hardware acceleration for GPUs, tools for converting models between different data formats, and utilities for measuring model quality via perplexity and divergence metrics. The engine can be wrapped in an HTTP server that provides an OpenAI-compatible API for integration with external tools.
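Several entries in this list mention speculative decoding without showing its shape. The following is a minimal sketch of the greedy draft-and-verify loop, not llama.cpp's implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions from a token context to the next token, and verification is done in a Python loop where a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify loop (toy version; assumes k >= 1)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each drafted position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # 3. First mismatch: keep the agreed prefix, take the
                #    target's token instead, and discard the rest.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched the target's choice.
            out.extend(proposal)
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the large model (arbitrary deterministic rule)."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[Token]) -> Token:
    """Stand-in for the small model: wrong whenever len(ctx) is a multiple of 3."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [1], max_new=10)
    slow = greedy_decode(toy_target, [1], max_new=10)
    print(fast == slow)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; the draft only affects how many tokens are settled per verification round. Sampling-based variants use a probabilistic accept/reject rule instead of the equality check shown here.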
- [clearml/clearml](https://awesome-repositories.com/repository/clearml-clearml.md) (6,740 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts.

The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and priority scheduling across hybrid cloud environments. Additionally, it includes a dedicated serving framework for hosting large language models and agentic workflows through secure APIs with integrated autoscaling.

The system covers a broad range of operational capabilities, including real-time infrastructure cost tracking, multi-tenant resource isolation, and automated execution environment reproduction. It also provides observability tools for monitoring inference endpoints, auditing AI workflows, and analyzing system-level hardware utilization.

The orchestration engine can be deployed via containerized or cloud-image based installations to host the platform's lifecycle infrastructure.
- [allegroai/clearml](https://awesome-repositories.com/repository/allegroai-clearml.md) (6,733 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving.

The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating raw files.

The system covers a broad range of capabilities including automated machine learning pipeline orchestration via task-graph dependencies, hyperparameter optimization, and distributed model training. It also provides an integrated AI workbench for remote development and a centralized control plane for tracking models from training through to production deployment.

Governance and observability are integrated through multi-tenant resource isolation, role-based access control, and real-time monitoring of compute resources and model performance.
- [make-open-data/make-open-data](https://awesome-repositories.com/repository/make-open-data-make-open-data.md) (0 ⭐) — Project overview, or contact us for a demo: https://make-open-data.fr/ - Data catalog: https://data.make-open-data.fr/
- [modeltc/lightllm](https://awesome-repositories.com/repository/modeltc-lightllm.md) (3,901 ⭐) — LightLLM is a high-performance serving framework for deploying and executing large language models. It functions as a multi-GPU inference engine and server capable of handling dense architectures, mixture-of-experts designs, and multimodal models that process both text and images.

The system is distinguished by its specialized support for Mixture-of-Experts models using expert parallelism and fused kernels. It implements structured text generation through deterministic state machines and pushdown automata to enforce precise output formats. To optimize throughput, the framework employs speculative decoding, paged key-value cache management, and a separated prefill and decode pipeline.

The platform covers a broad range of operational capabilities, including tensor and data parallelism for scaling across hardware, multi-tier cache offloading for long context windows, and tool use integration for executing external functions. It also provides a standard interface for chat completions and dedicated tools for measuring request throughput and latency under real-world workloads.

The project is implemented in Python and includes base classes for integrating custom model architectures.
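The state-machine approach to structured generation mentioned above can be illustrated with a toy example. This is a sketch under stated assumptions, not LightLLM's code: tokens are single characters, the hand-written automaton accepts the pattern `-?[0-9]+`, and `step`, `constrained_decode`, and the scoring function are hypothetical names. At each step, tokens the automaton cannot accept are masked out before the highest-scoring token is chosen.

```python
from typing import Callable, Dict, List, Optional

DIGITS = set("0123456789")


def step(state: str, tok: str) -> Optional[str]:
    """Return the next automaton state, or None if `tok` is not allowed."""
    if state == "start":
        if tok == "-":
            return "sign"
        if tok in DIGITS:
            return "digits"
    elif state in ("sign", "digits"):
        if tok in DIGITS:
            return "digits"
    return None


def constrained_decode(score_fn: Callable[[List[str]], Dict[str, float]],
                       vocab: List[str], max_len: int) -> str:
    """Greedy decoding restricted to tokens the automaton accepts."""
    state, out = "start", []
    for _ in range(max_len):
        scores = score_fn(out)
        # Mask: keep only tokens with a valid transition from the current state.
        allowed = [t for t in vocab if step(state, t) is not None]
        if not allowed:
            break
        best = max(allowed, key=lambda t: scores[t])
        out.append(best)
        state = step(state, best)
    return "".join(out)


if __name__ == "__main__":
    vocab = ["a", "-", "7", "3"]
    # A fixed scorer that prefers the disallowed token "a".
    scorer = lambda ctx: {"a": 9.0, "-": 5.0, "7": 3.0, "3": 1.0}
    print(constrained_decode(scorer, vocab, max_len=3))
```

Even though the scorer ranks `a` highest at every step, the mask keeps the output inside the pattern. Nested formats such as JSON need a stack in addition to the state, which is where pushdown automata come in.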
- [karpathy/llm.c](https://awesome-repositories.com/repository/karpathy-llm-c.md) (30,230 ⭐) — This project is a low-dependency engine designed for training large language models using native C and CUDA. It provides a bare-metal environment for tensor computation, allowing for the execution of neural network operations directly on hardware accelerators without the overhead of high-level software abstractions.

The framework distinguishes itself by implementing manual gradient backpropagation and custom hardware-specific kernels, providing granular control over memory mapping and computational precision. It supports distributed training across multiple graphics processors and compute nodes, utilizing collective communication primitives to scale workloads while maintaining numerical consistency through integrated validation tools.

The library includes a comprehensive suite of utilities for data preparation, model checkpoint management, and performance optimization. It covers essential operations such as attention acceleration, layer normalization, and memory-efficient checkpointing, while providing command-line tools for orchestrating training runs and conducting hyperparameter sweeps.
- [ai-dynamo/dynamo](https://awesome-repositories.com/repository/ai-dynamo-dynamo.md) (6,112 ⭐) — Dynamo is a distributed inference orchestration platform designed for large language models. It functions as a system to coordinate prefill and decode phases across GPU nodes, utilizing a multi-backend runtime adapter to connect engines like vLLM and TensorRT-LLM through a unified block-oriented memory interface. An OpenAI-compatible API server provides the frontend for integration with existing tools and clients.

The project is distinguished by its disaggregated serving architecture, which separates prompt processing and token generation onto independent GPU pools to optimize throughput and memory. It employs a key-value cache-aware request router that directs queries to workers holding relevant cache entries to reduce recomputation. High-speed data transfer mechanisms move cache blocks and weights directly between GPU VRAMs over RDMA or NVLink to minimize latency.

The platform includes comprehensive capabilities for distributed fault tolerance, allowing in-flight requests to migrate and resume from failure points via token-state continuation. It features SLA-based autoscaling and performance profiling to right-size GPU pools and a Kubernetes-native operator for topology-aware scheduling. Additional support covers multimodal inference for images, video, and audio, alongside dynamic swapping of LoRA adapters.

Installation is available via wheels, container images, charts, and crates, with support for major Linux distributions and NVIDIA GPU architectures from Ampere through Blackwell.
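The cache-aware routing idea described above can be sketched as a prefix-overlap score. This is an illustrative assumption rather than Dynamo's actual router: `block_hashes`, `pick_worker`, the block size, and the load tie-breaker are all hypothetical. Each worker advertises hashes of the cache blocks it holds, and a request goes to the worker whose cached blocks cover the longest prefix of the prompt.

```python
from typing import Dict, List, Sequence, Set

BLOCK = 4  # tokens per cache block (illustrative value)


def block_hashes(tokens: Sequence[int], block: int = BLOCK) -> List[int]:
    """Hash each full block chained with all blocks before it."""
    hashes: List[int] = []
    running = 0
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for start in range(0, full, block):
        running = hash((running, tuple(tokens[start:start + block])))
        hashes.append(running)
    return hashes


def pick_worker(tokens: Sequence[int], cached: Dict[str, Set[int]],
                load: Dict[str, int]) -> str:
    """Choose the worker with the longest cached prefix; break ties by lower load."""
    wanted = block_hashes(tokens)

    def overlap(worker: str) -> int:
        n = 0
        for h in wanted:
            if h not in cached[worker]:
                break  # a prefix is only reusable up to the first missing block
            n += 1
        return n

    return max(cached, key=lambda w: (overlap(w), -load[w]))


if __name__ == "__main__":
    cached = {
        "gpu-a": set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8])),
        "gpu-b": set(block_hashes([1, 2, 3, 4])),
        "gpu-c": set(),
    }
    load = {"gpu-a": 5, "gpu-b": 0, "gpu-c": 0}
    print(pick_worker([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], cached, load))
```

Chaining each block's hash with its predecessors means a match at block *n* implies the whole prefix up to *n* matches, so the overlap count corresponds to prefill work that can be skipped. A production router would also weigh queue depth and cache eviction, which this sketch reduces to a single load number.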
- [systran/faster-whisper](https://awesome-repositories.com/repository/systran-faster-whisper.md) (21,043 ⭐) — Faster-Whisper is a high-performance implementation of the Whisper speech-to-text model designed for efficient audio transcription. It provides an end-to-end processing pipeline that converts spoken audio into written text while maintaining lower memory consumption and faster execution speeds than standard implementations.

The project achieves its performance through a specialized inference engine that utilizes optimized kernels and weight quantization to reduce computational complexity. It supports large-scale operations by grouping audio segments into dynamic batches and filtering out non-speech content to improve overall throughput and accuracy.

Beyond core transcription, the framework includes utilities for converting external transformer models into optimized formats and extracting word-level timestamps. These capabilities facilitate automated subtitle generation and the processing of high-volume audio data on standard hardware.
- [lostruins/koboldcpp](https://awesome-repositories.com/repository/lostruins-koboldcpp.md) (9,511 ⭐) — KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models.

The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, images, and audio using vision projectors and speech synthesis.

The system includes broad support for hardware acceleration via GPU-layer offloading and multi-GPU tensor splitting to handle large models. It incorporates advanced output control through grammar constraints and phrase banning, as well as grounded retrieval capabilities that connect models to local documents and web search.

The core runtime is implemented in C++ for high-performance memory management and hardware-level optimization.
- [laracademy/commands.make-user](https://awesome-repositories.com/repository/laracademy-commands-make-user.md) (0 ⭐) — Laracademy make:user Command - provides a simple Artisan command for generating users from the console.
- [angular/angular](https://awesome-repositories.com/repository/angular-angular.md) (100,360 ⭐) — Angular is a platform for building web applications using a component-based architecture. It provides a comprehensive suite of tools for managing encapsulated UI units, including hierarchical dependency injection, a declarative template system, and fine-grained reactivity through signals. The framework supports complex application requirements such as client-side routing, form management, and internationalization.

The project includes a command-line interface for scaffolding and build automation, alongside a testing ecosystem for unit and integration verification. It offers multiple rendering strategies, including server-side rendering and static site generation, with support for hydration processes to optimize application delivery. Additionally, the framework features a built-in animation suite and security mechanisms to handle common web vulnerabilities.
- [aishwaryanr/awesome-generative-ai-guide](https://awesome-repositories.com/repository/aishwaryanr-awesome-generative-ai-guide.md) (24,755 ⭐) — This project is a community-driven knowledge repository and technical learning resource focused on the field of generative artificial intelligence. It serves as a centralized hub for developers and practitioners to access curated research, tutorials, and foundational concepts necessary for building and deploying modern artificial intelligence applications.

The platform distinguishes itself through a collaborative, distributed contribution model that aggregates diverse learning materials into a structured, searchable knowledge base. It covers a wide range of specialized topics, including retrieval-augmented generation, large language model training, fine-tuning techniques, and agentic workflows. Beyond technical skill development, the repository functions as a professional development hub, offering interview preparation resources and guidance for those pursuing careers in the artificial intelligence industry.

The content is organized through a hierarchical taxonomy, allowing users to navigate complex subjects such as system evaluation, multimodal models, and security tools. The repository provides access to comprehensive code notebooks and structured tutorials, all maintained as static documentation within a version control system to ensure accessibility and ease of discovery.
- [sindresorhus/make-synchronous](https://awesome-repositories.com/repository/sindresorhus-make-synchronous.md) (328 ⭐) — Make an asynchronous function synchronous
- [abetlen/llama-cpp-python](https://awesome-repositories.com/repository/abetlen-llama-cpp-python.md) (9,993 ⭐) — llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs.

The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory across system RAM and VRAM.

The library covers a broad range of AI capabilities, including text completion, embedding generation, and the enforcement of structured outputs via JSON schemas or formal grammars. It also provides infrastructure for tool use through external function calling and manages model extensions via LoRA adapter injection.

Users can fetch model files directly from Hugging Face and maintain model state persistence for resuming generation.
- [mwielgoszewski/burp-protobuf-decoder](https://awesome-repositories.com/repository/mwielgoszewski-burp-protobuf-decoder.md) (107 ⭐) — A simple Google Protobuf Decoder for Burp
- [huggingface/text-generation-inference](https://awesome-repositories.com/repository/huggingface-text-generation-inference.md) (10,775 ⭐) — Text Generation Inference is a production-ready engine designed for the deployment and serving of large language models. It functions as a containerized runtime environment that manages model execution, scales across distributed hardware, and provides high-performance inference capabilities for demanding production environments.

The project distinguishes itself through advanced optimization techniques, including continuous batching to maximize hardware utilization and tensor parallelism to shard large models across multiple accelerator cards. It supports efficient inference through custom compute kernels, weight quantization, and memory optimization strategies that reduce the computational footprint of complex models.

The platform covers a broad operational surface, including native support for streaming responses via server-sent events, multimodal model serving, and comprehensive telemetry for distributed request tracing. It also integrates security features such as token-based authentication and rate limiting to manage access to inference endpoints. The service is designed for containerized deployment and includes built-in tools for performance monitoring, benchmarking, and automated model weight management.
- [zhaochenyang20/awesome-ml-sys-tutorial](https://awesome-repositories.com/repository/zhaochenyang20-awesome-ml-sys-tutorial.md) (5,371 ⭐) — This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters.

The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static graph kernel capture. These capabilities are complemented by advanced inference optimizations, including speculative decoding, memory-efficient activation offloading, and tree-structured key-value cache prefix sharing, which collectively enable efficient model execution and resource management.

Beyond core training and inference, the project details a broad capability surface for managing agentic workflows and multimodal architectures. This includes automated reinforcement learning pipelines, structured grammar-based decoding for constrained output, and sophisticated traffic management for distributed request scheduling. The framework also provides extensive tooling for system observability, performance profiling, and hardware-aware resource allocation to ensure stability and efficiency in production environments.
- [hviana/faster](https://awesome-repositories.com/repository/hviana-faster.md) (58 ⭐) — A fast and optimized middleware server with an absurdly small amount of code (300 lines), built on top of native HTTP APIs with no dependencies. It also includes a collection of useful middlewares: log file, serve static, CORS, session, rate limit, token, body parsers, redirect, proxy, and upload handling. For Deno Deploy and other environments!