30 open-source projects similar to timdettmers/bitsandbytes, ranked by the number of features they share with it. Compare stars, activity, and what each one does to find the best bitsandbytes alternative.
bitsandbytes is a deep learning quantization tool and library designed to reduce the memory footprint of large language models. It serves as a GPU memory optimizer and quantization framework, compressing model weights and features to 8-bit and 4-bit precision to enable inference and training on hardware with limited memory. The project provides a framework for low-rank adaptation, allowing the fine-tuning of quantized models by combining 4-bit weights with small trainable matrices. It further distinguishes itself through memory paging, which moves optimizer states between CPU and GPU memory t
Torchtune is a PyTorch-native library for fine-tuning, aligning, and quantizing large language models. It provides a configurable training pipeline orchestrated through YAML recipes, with CLI overrides and component swapping, distributed training via FSDP2, memory optimizations, and parameter-efficient fine-tuning methods like LoRA, DoRA, and QLoRA. The library distinguishes itself through its YAML-driven configuration system that defines all training parameters and instantiates components from config files, with full CLI override capability for any field or component at launch time. It suppo
Axolotl is a distributed training orchestrator and fine-tuning framework for large language models, multimodal systems, and quantized models. It provides a structured environment for specializing pre-trained models through full parameter updates or low-rank adaptation, as well as aligning model outputs with human expectations via preference tuning pipelines and reward modeling. The system distinguishes itself through a configuration-driven pipeline that manages preprocessing and training workflows via a single file for reproducibility. It implements high-throughput optimizations such as multi
DeepSpeed is a distributed deep learning optimization library and framework designed for the training and inference of massive AI models. It serves as a model parallelism orchestrator and a toolkit for scaling large language models across multiple GPUs and compute nodes. The project distinguishes itself through 3D parallelism orchestration, which combines data, pipeline, and tensor parallelism. It utilizes ZeRO-based memory partitioning to eliminate redundant storage and employs CPU-offload memory management to move weights and optimizer states to system RAM. Additionally, it provides special
llm-compressor is a quantization toolkit and post-training library designed to reduce the memory footprint and size of large language models. It provides a framework for compressing models using weight and activation quantization to enable more efficient deployment. The project distinguishes itself through a distributed quantization framework that utilizes data-parallel processing and disk-based weight offloading to handle massive model checkpoints that exceed available system memory. It includes specialized compressors for diverse architectures, including Mixture-of-Experts, Vision-Language,
This project is a quantized fine-tuning framework for large language models. It implements a low-rank adaptation library and a four-bit quantizer to reduce the GPU memory requirements needed to train large models. The framework utilizes four-bit quantization and low-rank adapters to enable model training on consumer-grade hardware. It further reduces the memory footprint through double quantization and a paged optimizer that offloads states to system RAM. The system supports distributed training across multiple GPUs to handle larger parameter scales and includes utilities for custom dataset
This project is a vision language model framework and vision-to-text pipeline for deploying and optimizing models that process both images and text. It provides an on-device inference engine that runs quantized models locally on mobile and desktop hardware accelerators. The framework features a model quantization toolkit that reduces weight precision, lowering the memory footprint and increasing execution speed on specialized silicon. It also includes an efficient vision encoder that uses a hybrid encoding system to compress image tokens, which reduces pro
This project provides a foundational framework and reference implementation for executing causal language modeling and multimodal reasoning on local systems. It includes a set of core components for managing model assets, a fine-tuning framework, and structural definitions required to instantiate transformer-based architectures. The system is distinguished by its ability to process combined text and image inputs through multimodal transformer models for visual reasoning and document analysis. It also supports the deployment of quantized models, reducing memory footprints through low-precision
ipex-llm is an acceleration library and inference engine designed to optimize the execution and finetuning of large language models on Intel GPUs and NPUs. It provides a HuggingFace compatible model backend and a dedicated quantization toolkit for converting model weights into low-bit precision formats. The project facilitates distributed inference by splitting large model workloads across multiple accelerators using pipeline and tensor parallelism. It enables the deployment of models on Intel Arc, Flex, and Max GPUs to increase throughput and reduce latency. The library covers a broad range
h2o-llmstudio is a language model training framework that provides a no-code graphical interface for fine-tuning large language models on custom datasets. It functions as a specialized tool for managing the training lifecycle, from configuring hyperparameters to monitoring performance metrics. The project distinguishes itself through a multi-GPU training orchestrator that distributes workloads via data parallel processing and a low-rank adaptation tool for memory-efficient fine-tuning. It also includes a model evaluation dashboard featuring an interactive chat interface to verify conversation
Neural Compressor is a deep learning model compression toolkit and AI inference acceleration engine. It functions as an automated model quantization tool and hardware-aware model compiler designed to reduce the memory footprint of neural networks and decrease execution latency. The project provides specialized frameworks for optimizing large language models, utilizing weight-only quantization and hardware-specific kernels to improve the operational efficiency of generative AI workloads. It maps neural network operators to specialized CPU and GPU vector instructions to accelerate model executi
Lit-llama is a PyTorch-based implementation framework for the LLaMA language model, providing a system for pre-training, fine-tuning, and high-performance inference. It includes a pre-training pipeline for creating foundational language models from scratch and tools for running pretrained weights to generate natural text and predict sequences. The project provides specialized toolkits for parameter-efficient fine-tuning using low-rank adaptation and lightweight adapters. It also includes a quantization library that reduces model memory footprints through four-bit and eight-bit precision to en
Metaseq is a transformer sequence modeling toolkit designed for training, fine-tuning, and deploying sequence-to-sequence models using open pre-trained weights. It provides a comprehensive framework for large language model training, including dedicated tools for sequence dataset processing and a standalone inference server for generating text via API requests. The project features specialized utilities for model quantization to reduce parameter precision to eight bits, which lowers memory usage and increases inference speed. It also includes a checkpoint conversion pipeline to transform mode
alpaca.cpp is a high-performance local inference engine implemented in C++ for executing instruction-tuned large language models. It serves as a quantized model runtime designed to load and run model tensors on local hardware with minimal dependencies, removing the requirement for a full Python environment. The project focuses on on-device text generation and the deployment of private AI chatbots. It utilizes model weight quantization to reduce memory requirements and increase inference speed on consumer-grade devices. The system covers hardware-optimized model execution through thread-pool
Sana is a framework for high-resolution image and video synthesis based on a linear diffusion transformer. It provides a toolkit for the training, fine-tuning, and execution of text-to-image and text-to-video models, as well as a video generative world model capable of simulating physical environments with precise spatial control. The project is distinguished by its use of linear complexity layers to handle high resolutions and its support for long-form, minute-length video generation in real time. It implements a two-stage inference paradigm that separates structural generation from visual t
Torchtune is a PyTorch-native library for fine-tuning, aligning, and quantizing large language models. It provides a config-driven system for instantiating components, orchestrating distributed training, and managing parameter-efficient fine-tuning with quantization support, all through YAML-based configurations and command-line overrides. The library distinguishes itself through its comprehensive post-training workflow orchestration, combining supervised fine-tuning, preference optimization (DPO, PPO, GRPO), knowledge distillation, and quantization-aware training in a single configurable pip
Chinese-Vicuna is a Chinese large language model and instruction-following AI based on the LLaMA architecture. It is specifically designed for natural language understanding and generation in the Chinese language, utilizing an instruction-tuned model to follow complex user prompts across conversations. The project provides a LoRA fine-tuning framework and quantization systems to enable model adaptation and inference on consumer hardware. It implements quantized inference to reduce memory usage on both CPUs and GPUs, supported by a low-level C++ implementation to minimize system resource requi
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal
BigDL is a PyTorch acceleration framework and distributed inference engine designed for large language models. It provides a toolkit for running models on Intel hardware, integrating quantization tools and libraries for parameter-efficient fine-tuning. The project distinguishes itself through the use of pipeline parallelism to distribute model workloads across multiple hardware accelerators. It utilizes low-bit integer quantization and speculative decoding to reduce memory footprints and decrease text generation latency. The system covers broad capabilities in model optimization, including w
gpt-fast is a PyTorch transformer inference engine designed for low-latency text generation. It functions as a distributed GPU inference library, a quantized model runner, and a speculative decoding framework. The system utilizes a speculative decoding workflow where a small draft model predicts token sequences for verification by a larger model to accelerate generation. It supports quantized model execution to reduce memory footprint and implements tensor parallelism to split computations across multiple GPUs. The project includes a standardized evaluation harness to measure the accuracy an
This repository is a collection of frameworks and guides for Llama models, functioning as a fine-tuning framework, an inference pipeline, and an AI workflow orchestrator. It provides tools for adapting large language models to specific datasets and domains. The project includes a parameter-efficient fine-tuning toolkit that utilizes techniques like low-rank adaptation to reduce memory and compute requirements. It also serves as an implementation guide for retrieval-augmented generation, combining model inference with external data retrieval to improve response accuracy. The capability surfac
llm-foundry is a training framework for large language models, providing a system for foundation model pre-training and supervised fine-tuning. It includes a distributed trainer for scaling workloads across multiple nodes and GPUs, a dataset streaming pipeline for loading data from cloud storage, and a parameter-efficient fine-tuning implementation. The framework distinguishes itself through its use of parameter sharding and high-throughput data streaming to maintain stability during large-scale training. It incorporates low-rank adaptation to reduce computational costs and uses eight-bit flo
Qwen is a comprehensive framework for large language model development, serving, and deployment. It provides a complete ecosystem for transformer-based sequence modeling, offering base models alongside specialized tools for instruction-tuned alignment, fine-tuning, and long-context inference. The project is designed to support both research and production environments, enabling users to train, optimize, and host generative models locally or across distributed hardware. The framework distinguishes itself through its focus on high-performance serving and extensibility. It features a high-perfor
AutoGPTQ is a model compression framework designed to reduce the memory footprint and increase the inference speed of large language models. It utilizes the GPTQ algorithm to compress model weights, allowing these models to run on hardware with limited VRAM. The toolkit provides an architecture quantization pipeline that supports the integration of custom model classes for various neural network architectures. It includes a mixed-precision inference engine with optimized kernels to accelerate matrix multiplication during deployment. The framework covers the full weight-compression workflow,
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
LitGPT is a training and deployment framework for large language models, providing a suite of tools for pretraining, finetuning, quantizing, evaluating, and serving models within a production environment. It includes a dedicated training pipeline for adapting pretrained models to specific tasks, a quantization tool for reducing weight precision, and an inference server for hosting models via web interfaces. The framework supports high-performance model development through custom architecture implementation and the use of predefined recipes to standardize pretraining and finetuning. It enables
Paddle-Lite is a deep learning inference engine and edge computing runtime designed to execute trained models on mobile and edge devices. It provides a hardware-accelerated inference framework and a decoupled runtime with a minimal binary footprint to operate in resource-constrained environments without third-party dependencies. The project includes a model quantization tool for reducing precision and size via static and dynamic quantization, as well as a computation graph optimizer. These tools reduce latency and memory usage by fusing operators and pruning the model intermediate representat
This project is a study guide and reference for a technical course on the architectures and training methods of Transformers and large language models. It serves as a technical overview of how neural networks process data and how to align model behavior with specific performance goals. The repository provides specialized guides on several key areas of model development, including detailed references for transformer architectures, implementation frameworks for retrieval-augmented generation and agentic workflows, and technical guides for model optimization
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
This repository provides a collection of reference implementations and code examples for training and deploying machine learning models using the MLX framework. It serves as a practical guide for executing distributed training, fine-tuning large language models, converting model weights, and implementing multimodal generative workflows. The project distinguishes itself through specialized examples for local hardware execution, featuring weight quantization to reduce memory usage and low-rank adaptation for parameter-efficient fine-tuning. It also includes scripts for transforming external mod
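Many of the projects listed above describe reducing memory use by storing weights at 8-bit precision, and some mention block-wise quantization. As a rough illustration of that general idea, here is a minimal sketch that assumes symmetric absmax scaling with one scale per block. The function names `quantize_blockwise` and `dequantize_blockwise` are hypothetical and are not taken from any of the listed libraries, whose actual schemes and kernels differ.

```python
from typing import List, Tuple


def quantize_blockwise(weights: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize floats to signed 8-bit codes, using one absmax scale per block.

    Illustrative only: an assumed symmetric absmax scheme, not the method
    of any specific project in the list.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 127; avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Recover approximate floats by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0, 10.0, -5.0, 2.5, 0.0]
    codes, scales = quantize_blockwise(weights)
    restored = dequantize_blockwise(codes, scales)
    for original, approx in zip(weights, restored):
        print(f"{original:8.4f} -> {approx:8.4f}")
```

Keeping a separate scale for each block means a single large weight only coarsens the resolution of its own block rather than the whole tensor, which is the usual motivation for block-wise schemes.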