This project is a collection of reference implementations and recipes for deploying, fine-tuning, and running inference with Llama large language models. It serves as a toolkit and implementation guide for adapting pre-trained models to specific tasks and domain-specific datasets. The repository provides frameworks for developing retrieval augmented generation pipelines to ground model responses in external data. It includes guides for executing quantized inference to reduce memory usage and increase processing speed. The toolkit covers a broad range of capabilities including parameter-effic
Tinker Cookbook is an open-source framework for fine-tuning large language models, supporting supervised learning, reinforcement learning, and parameter-efficient techniques like LoRA adapters. It provides a complete pipeline for aligning models with human preferences through multi-stage RLHF workflows, from supervised fine-tuning through preference optimization to reinforcement learning. The framework distinguishes itself through recipe-based training orchestration, where fine-tuning workflows are defined as composable recipe files that chain data loading, model configuration, and training l
This project provides a foundational framework and reference implementation for executing causal language modeling and multimodal reasoning on local systems. It includes a set of core components for managing model assets, a fine-tuning framework, and structural definitions required to instantiate transformer-based architectures. The system is distinguished by its ability to process combined text and image inputs through multimodal transformer models for visual reasoning and document analysis. It also supports the deployment of quantized models, reducing memory footprints through low-precision
BigDL is a PyTorch acceleration framework and distributed inference engine designed for large language models. It provides a toolkit for running models on Intel hardware, integrating quantization tools and libraries for parameter-efficient fine-tuning. The project distinguishes itself through the use of pipeline parallelism to distribute model workloads across multiple hardware accelerators. It utilizes low-bit integer quantization and speculative decoding to reduce memory footprints and decrease text generation latency. The system covers broad capabilities in model optimization, including w