tiny-llm

tiny-llm is a large language model (LLM) inference engine and transformer model implementation. It combines a quantized model runtime with a paged key-value (KV) cache manager to form an inference stack optimized for Apple Silicon.

The system achieves high throughput through continuous batching and paged attention. A paged memory system stores the key-value cache in fixed-size blocks, which eliminates fragmentation during token generation. Compressed weights are dequantized on the fly during matrix multiplication, which reduces the memory footprint.
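The paged memory idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not tiny-llm's actual API: the class name PagePool, its methods, and the request ids are all hypothetical. Each request owns a block table that maps its logical token positions to physical pages drawn from a shared pool.

```python
class PagePool:
    """Hypothetical fixed-size page pool; names are illustrative only."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))      # physical page ids
        self.block_tables: dict[str, list[int]] = {}  # request id -> pages
        self.context_lens: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical page, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.context_lens.get(request_id, 0)
        if length % self.page_size == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("page pool exhausted")
            table.append(self.free_pages.pop())
        self.context_lens[request_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.context_lens.pop(request_id, None)


pool = PagePool(num_pages=4, page_size=2)
slots_a = [pool.append_token("a") for _ in range(3)]  # spans two pages
slots_b = [pool.append_token("b") for _ in range(2)]  # fills one page
pool.release("a")  # both pages of "a" become reusable immediately
```

Because every page has the same size, a released page can be handed to any other request, so the pool never accumulates unusable gaps between variable-length sequences.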

The project covers a broad range of model architecture and performance features, such as mixture-of-experts routing, grouped-query attention, and flash attention. Decoding supports both greedy selection and sampling with temperature, top-k, and top-p controls.
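The decoding options named above can be combined in one small function. The following is a sketch using only the Python standard library; the function name sample_token and its parameters are assumptions made for illustration and are not taken from the project. It treats a temperature of zero as greedy decoding and applies top-p to the probabilities computed before top-k truncation, which is one of several reasonable conventions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits (illustrative sketch, hypothetical API)."""
    if temperature == 0:  # zero temperature is treated as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    # (probability, token id) pairs, most likely first
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for prob, token in ranked:  # keep the smallest prefix with mass >= top_p
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= top_p:
            break
    probs, tokens = zip(*kept)
    # random.choices renormalizes the surviving weights
    return rng.choices(tokens, weights=probs, k=1)[0]


greedy = sample_token([1.0, 3.0, 2.0], temperature=0)
filtered = sample_token([1.0, 3.0, 2.0], top_k=2, top_p=0.9)
```

Lower temperatures sharpen the distribution toward the most likely token, while top-k and top-p remove the unlikely tail before the random draw.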

The implementation is written in Python and includes custom low-level GPU kernels to accelerate tensor processing.

Features

Model Inference and Serving - Implements a complete high-throughput model serving stack optimized for Apple Silicon hardware.
Model Serving and Inference - Provides a complete runtime for loading model weights and serving LLM inference on specialized hardware.
Apple Silicon Inference - Provides a specialized inference stack and memory management system optimized for Apple M-series hardware.
Paged KV Cache Management - Manages key-value caches using fixed-size blocks to eliminate memory fragmentation during token generation.
Gated Linear Units - Implements gated linear units, in which one projection gates another, inside a gated MLP block.
Gated Activation Computations - Implements the gated MLP activation pattern, multiplying one linear projection elementwise by an activated second projection.
Dequantizing Runtimes - Performs matrix multiplication by converting compressed low-bit weights back to full precision during the compute pass.
Feature Scale Normalization - Scales feature vectors using root mean square normalization to keep activation magnitudes stable.
Generative Text Decoding - Produces sequences of tokens from a prompt using a prefill stage and a decoding loop.
Generative Text Inference - Processes natural language prompts and generates text completions using transformer models.
Temperature Scaling - Scales next-token logits by a temperature before sampling to balance predictability and creativity.
Incremental Text Decoding - Generates tokens sequentially by processing an initial prompt followed by iterative single-token steps.
Inference Batching - Executes inference for multiple prompts in a single pass to maximize total throughput.
Inference Optimization Kernels - Implements custom low-level kernels to accelerate the token generation and decoding phases.
KV Cache Management - Manages key-value pairs for multiple concurrent requests to optimize transformer inference efficiency.
Inference Engines - Serves as a high-throughput execution engine for pre-trained large language models.
Continuous Batching Strategies - Employs continuous batching to dynamically insert new requests into active inference batches for higher throughput.
Paged Batching Managers - Allocates and frees pages from a shared pool as requests enter and exit the system to maximize throughput.
Model Inference - Combines embedding layers and transformer blocks to process sequences and output next-token probabilities.
Model Inference Engines - Provides a complete inference engine for loading pretrained models and generating text responses.
Quantized Linear Layers - Executes linear layers using compressed weights to significantly reduce their memory footprint.
Grouped-Query Attention - Shares key and value heads across multiple query heads to reduce memory overhead during inference.
Paged Cache Attention Kernels - Executes multi-head attention directly on paged key-value caches to minimize memory fragmentation.
Dequantization Matrix Multiplications - Performs on-the-fly dequantization of compressed weights to reduce memory footprint during matrix multiplication.
Combined Top-K and Top-P Filtering - Applies top-k and top-p filtering together to restrict sampling to the most likely tokens in the output probability distribution.
Token Embeddings - Converts integer tokens into high-dimensional vectors using a weight matrix and handles reverse mapping back to token space.
MoE Top-K Selection Kernels - Uses top-k selection kernels to route tokens to the most relevant experts in an MoE architecture.
Key-Value Cache Reuse - Caches each layer's key-value tensors across decoding steps to avoid redundant computations during decoding.
Paged Key-Value Cache Stores - Stores key-value pairs in non-contiguous memory pages to handle variable-length sequences efficiently.
Transformer Language Models - Implements causal language model architectures including grouped query attention and mixture of experts routing.
Transformer Models - Implements core transformer architectures, including attention and MLP layers, to create causal language models.
Token Embedding Layers - Implements token embedding layers that map discrete token IDs to dense vector representations.
Inference Batching - Groups multiple incoming requests into a single hardware execution pass to maximize throughput.
Model Serving - Runs model predictions via matrix manipulation APIs to generate human-readable text responses.
KV Cache Page Allocators - Assigns physical memory pages from a global pool to individual request caches during token generation.
Quantized Matrix Multiplication - Performs linear algebra by dequantizing compressed low-bit weights on the fly during matrix multiplication.
Root Mean Square Normalizations - Provides RMSNorm to standardize activations and stabilize inference.
Attention Computations - Computes scaled dot product attention to determine relationships between query, key, and value tensors.
Flash-Attention Implementations - Calculates attention using a tiled softmax algorithm to reduce memory overhead and increase throughput.
Chunked Prefill Mechanisms - Implements mechanisms to split long prompt processing into smaller segments to prevent memory spikes.
GPU Kernel Implementations - Provides custom low-level hardware kernels to accelerate tensor processing and math operations.
Inference Performance Optimization - Implements techniques like continuous batching and paged attention to balance per-request latency and overall throughput.
Mixture of Experts - Directs tokens to specialized subnetworks using a top-k router and grouped matrix multiplication.
Causal Masking - Prevents attention mechanisms from accessing future tokens by applying a lower triangular causal mask.
Sparse Routing Architectures - Implements sparse MoE layers that selectively activate a subset of experts per token.
Model Performance Optimization - Increases execution speed by utilizing custom low-level kernels for tensor processing.
Quantized Embedding Lookups - Retrieves token representations from a quantized table to reduce memory footprint.
Multi-Head Attention Mechanisms - Implements multi-head attention layers that project inputs into multiple heads and compute attention for all heads in parallel.
Tiled Online Softmax - Calculates attention scores in small blocks using a tiled online softmax to optimize memory usage.
Rotary Positional Embeddings - Injects sequence order information into query and key vectors using rotation matrices.
Sampling Controls - Provides parameters for adjusting randomness and creativity during inference, including temperature scaling and nucleus sampling.
Weight Quantization - Integrates compressed weight containers to replace standard linear layers, reducing the overall memory footprint.
Transformer Blocks - Constructs architectural blocks consisting of normalization, attention mechanisms, and feed-forward networks with residual connections.
Weight Dequantization - Performs on-the-fly recovery of original precision from compressed weights during model inference.
Subspace Mappings - Projects input embeddings into independent subspaces via weight matrices as part of multi-head attention.
Inference Runtime Metadata Management - Maintains block tables and context lengths to coordinate the request scheduler and attention kernels.
Batch Request Processing - Handles multiple independent prompt sequences simultaneously using batched masking and positional encoding.
Small Language Models - Provides a tiny language model implementation for educational purposes.
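The tiled online softmax listed among the features can be illustrated for a single query vector. This is a pure-Python sketch for clarity, not the project's kernel: the function names and the tile size are assumptions, and a real implementation would run on tensors in a GPU kernel. The tiled version keeps a running maximum, a running normalizer, and a running weighted sum, so it never materializes the full score vector, yet it matches the straightforward computation.

```python
import math


def attention_naive(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    """One query attends over keys/values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # rescale earlier partial results to the new running maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = attention_naive(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

The correction factor is what makes the computation "online": whenever a later tile raises the maximum, the earlier partial sums are rescaled instead of being recomputed.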

sgl-project/sglang

SGLang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr

flashinfer-ai/flashinfer

FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat

datawhalechina/tiny-universe

Tiny Universe is an educational monorepo that delivers multiple independent implementations of core AI subsystems as self-contained Jupyter notebooks. It provides from-scratch constructions of foundational architectures including a complete Transformer model built from the original paper specification, a denoising diffusion probabilistic model for image generation, and a ReAct-style autonomous agent framework that equips an LLM with tools for planning and multi-step task execution. The project distinguishes itself by covering the full lifecycle of modern AI systems through hands-on implementa

naklecha/llama3-from-scratch

This project is a manual reconstruction of the Llama 3 transformer architecture implemented as a PyTorch neural network. It serves as a reference for the internal mathematical structure and tensor flow of a transformer-based language model designed for next token prediction. The implementation focuses on building the model from scratch using basic matrix operations and tensor manipulations. It demonstrates the manual construction of core components, including rotary positional embeddings, multi-head self-attention, and root mean square normalization. The codebase covers the full inference pi
