Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
Llama is a large language model runtime and inference engine designed to load and execute autoregressive transformer models. It enables the generation of natural language text completions from prompts using pretrained weights. The system features multi-GPU model parallelism, which distributes model weights and workloads across multiple graphics processors to support larger parameter counts. It also incorporates a content safety filter that uses classifiers to intercept and block unsafe inputs or outputs during the inference process. The project covers broad capabilities in distributed model