30 open-source projects similar to intel-analytics/bigdl, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best BigDL alternative.
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
This repository is a collection of frameworks and guides for Llama models, functioning as a fine-tuning framework, an inference pipeline, and an AI workflow orchestrator. It provides tools for adapting large language models to specific datasets and domains. The project includes a parameter-efficient fine-tuning toolkit that utilizes techniques like low-rank adaptation to reduce memory and compute requirements. It also serves as an implementation guide for retrieval-augmented generation, combining model inference with external data retrieval to improve response accuracy. The capability surfac
Chinese-Vicuna is a Chinese large language model and instruction-following AI based on the LLaMA architecture. It is specifically designed for natural language understanding and generation in the Chinese language, utilizing an instruction-tuned model to follow complex user prompts across conversations. The project provides a LoRA fine-tuning framework and quantization systems to enable model adaptation and inference on consumer hardware. It implements quantized inference to reduce memory usage on both CPUs and GPUs, supported by a low-level C++ implementation to minimize system resource requi
ipex-llm is an acceleration library and inference engine designed to optimize the execution and finetuning of large language models on Intel GPUs and NPUs. It provides a HuggingFace compatible model backend and a dedicated quantization toolkit for converting model weights into low-bit precision formats. The project facilitates distributed inference by splitting large model workloads across multiple accelerators using pipeline and tensor parallelism. It enables the deployment of models on Intel Arc, Flex, and Max GPUs to increase throughput and reduce latency. The library covers a broad range
MiniCPM is a collection of small language models designed for local, on-device deployment in resource-constrained environments. The project focuses on running dense Transformer models on consumer hardware, including GPUs, CPUs, and Apple Silicon, without requiring custom code forks. The project distinguishes itself through heavy optimization for edge hardware, utilizing quantized weight compression in GGUF and MLX formats to reduce memory overhead. It implements advanced inference techniques such as speculative sampling and radix-tree prefix caching to accelerate generation speed and throughp
Qwen-7B is a pretrained causal language model designed for natural language generation, text processing, and complex reasoning tasks. It is available as an instruction-tuned model optimized for conversational interactions and a tool-use model capable of executing function calls and interacting with external APIs. The project provides a quantized version of the model to reduce GPU memory usage and supports the development of autonomous agents that can execute code and perform functions to complete complex goals. The system covers a wide range of capabilities including model fine-tuning throug
gpt-fast is a PyTorch transformer inference engine designed for low-latency text generation. It functions as a distributed GPU inference library, a quantized model runner, and a speculative decoding framework. The system utilizes a speculative decoding workflow where a small draft model predicts token sequences for verification by a larger model to accelerate generation. It supports quantized model execution to reduce memory footprint and implements tensor parallelism to split computations across multiple GPUs. The project includes a standardized evaluation harness to measure the accuracy an
This library provides a framework for parameter-efficient fine-tuning, enabling the adaptation of large pretrained models by training only a small subset of parameters. It functions as a distributed model training system and optimization toolkit, designed to reduce the computational and memory requirements typically associated with full model fine-tuning. The project distinguishes itself through a suite of methods for modular adapter composition, including low-rank matrix decomposition and activation-based scaling. It supports the integration of multiple task-specific adapter modules, allowin
Swift is a toolkit for the full-parameter and parameter-efficient fine-tuning of large language and multimodal models. It functions as a multimodal model trainer for text, image, video, and audio data, and includes specialized tools for model compression and reinforcement learning from human feedback. The framework provides an alignment toolkit for optimizing model behavior using preference learning algorithms and reinforcement learning. It integrates parameter-efficient fine-tuning methods to adapt models with minimal memory and compute requirements, alongside utilities for reducing hardware
OpenRLHF is a training framework and alignment library designed for reinforcement learning from human feedback across distributed GPU clusters. It provides tools for aligning large language models and multimodal vision-language models using algorithms such as PPO, GRPO, and DPO. The framework distinguishes itself through a distributed inference engine that overlaps sample rollout with training to increase throughput. It supports scaling to models exceeding 70 billion parameters via parameter sharding and handles long-context sequences through ring-attention sequence parallelism. The project
Serving is a high-performance framework designed for deploying and scaling machine learning models as production services. It functions as a distributed inference engine that enables the execution of complex data processing workflows by chaining multiple models into directed acyclic graphs. The platform distinguishes itself through its ability to manage the entire production model lifecycle, allowing for hot-swappable versioning that updates services without downtime. It supports horizontal scaling through distributed model sharding and optimizes high-dimensional data retrieval via specialize
This project is a fine-tuning framework and training pipeline designed to optimize and adapt large language and vision models. It provides a specialized toolkit for parameter-efficient tuning and supervised learning, serving as both a trainer for multimodal models and a deployment tool for serving fine-tuned models via high-performance inference engines. The framework focuses on reducing memory and compute requirements by updating a small subset of model parameters. It supports a wide range of adaptation strategies, including vision-language model training to align text, image, video, and aud
This project is a comprehensive learning resource and set of demonstrations focused on large language model integration, deployment, and fine-tuning. It provides educational content and practical guides for working with artificial intelligence models. The resource includes specific tutorials and courses on adapting pre-trained models to specialized datasets using parameter-efficient fine-tuning techniques. It also provides instructional content for running quantized models on consumer hardware and building retrieval augmented generation pipelines using vector databases and document indexing.
Baichuan2 is a collection of pre-trained large language models, including base and chat variants, designed for natural language generation and multi-turn conversational AI. It provides an inference engine and a fine-tuning framework to adapt these models to custom datasets and specialized domains. The project features a quantization toolkit and an inference engine that enable model execution across diverse hardware, including graphics processors, central processors, and specialized accelerators. These tools support low-bit weight quantization to reduce memory usage and increase inference spee
This project provides a comprehensive collection of educational resources and technical guides for training, fine-tuning, and deploying machine learning models using PyTorch and Hugging Face. It serves as a practical reference for scaling deep learning workflows, offering structured instructions for managing large-scale architectures across distributed hardware accelerators. The repository distinguishes itself by focusing on the end-to-end lifecycle of large language models, specifically emphasizing containerized deployment and performance optimization. It details workflows for parameter-effi
LMFlow is a comprehensive suite for large language model fine-tuning, context extension, multimodal processing, and inference execution. It provides a toolkit for updating model parameters through full tuning or memory-efficient adapter algorithms, alongside an inference engine for executing tuned models via command-line or web-based interfaces. The framework includes a dedicated alignment suite for supervised tuning and reward model training to refine model behavior. It features a context window extender to increase maximum input lengths and a multimodal framework for building chatbots that
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
This project provides a Chinese large language model based on the LLaMA architecture. It is an instruction-tuned model optimized for natural language processing and multi-turn conversations in Chinese. The system includes a framework for parameter-efficient fine-tuning using low-rank adaptation and quantization to reduce memory requirements. It also implements retrieval augmented generation for local document question answering and supports long-context processing for sequences up to 64K tokens. The project covers a broad set of capabilities including supervised instruction tuning, reinforce
PaddleNLP is a development library and toolkit for training, fine-tuning, and deploying large and small language models using the PaddlePaddle framework. It provides a comprehensive suite for the entire natural language processing lifecycle, from model development to high-performance inference. The project features a standardized model zoo for loading and managing pre-trained models and tokenizers through a unified interface. It distinguishes itself with a specialized model compression framework that reduces memory footprints via weight precision conversion and lossless size optimization, alo
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
SmolLM is a project dedicated to the development of small language models. It focuses on training and fine-tuning compact models that maintain high performance while utilizing fewer parameters. The project emphasizes efficient AI inference and on-device text generation, aiming to enable the deployment of lightweight models on edge devices with limited memory and processing power. It utilizes synthetic data generation to produce artificial datasets that improve the reasoning and training of these AI systems. The system supports a variety of optimization and training capabilities, including we
Mmlspark is a distributed framework for executing machine learning models, data transformations, and AI service integrations across Apache Spark clusters. It functions as a distributed machine learning library and pipeline orchestrator, allowing users to integrate pre-trained cognitive services and custom models into large-scale batch and streaming workflows. The project is distinguished by its ability to incorporate external AI services and web APIs directly into big data pipelines for text and vision analysis. It provides a scalable model training framework that coordinates gradient boostin
Metaseq is a transformer sequence modeling toolkit designed for training, fine-tuning, and deploying sequence-to-sequence models using open pre-trained weights. It provides a comprehensive framework for large language model training, including dedicated tools for sequence dataset processing and a standalone inference server for generating text via API requests. The project features specialized utilities for model quantization to reduce parameter precision to eight bits, which lowers memory usage and increases inference speed. It also includes a checkpoint conversion pipeline to transform mode
Grok-1 is an open-weights large language model implementation featuring a sparse mixture-of-experts architecture. It is designed for high-performance text generation and natural language processing by activating only a subset of specialized expert layers per token. The model utilizes 8-bit weight quantization to reduce memory overhead and accelerate loading. To manage its high parameter count, the implementation supports activation sharding, which distributes the memory load across multiple hardware devices during execution. The project covers large-scale model inference, including text comp
VisualGLM-6B is a bilingual multimodal large language model and vision-language model designed for conversational tasks and visual understanding. It functions as a bilingual AI model capable of processing and generating responses in both Chinese and English. The system is a quantized large language model supporting 4-bit and 8-bit precision to reduce memory usage and hardware requirements during local deployment. It is also a parameter-efficient fine-tuning model, allowing for weight adjustments to adapt the system to specific downstream tasks without full retraining. The project covers mult
Petals is a decentralized framework and inference engine for running large language models across a peer-to-peer network. It enables the execution of models that exceed the memory of any single machine by splitting computations and model layers across a collaborative swarm of GPUs. The system functions as a collaborative compute network where participants share local GPU resources and host model weights. It supports distributed prompt-tuning to adapt massive models to specific tasks and allows for the establishment of private compute swarms to process sensitive data within restricted, trusted
bitsandbytes is a quantization library for large language models that reduces memory footprints using k-bit quantization. It provides a framework for 4-bit low-rank adaptation, tools for 8-bit model compression, and memory-efficient optimizer extensions for PyTorch. The project enables the training of large models on limited hardware through 4-bit quantization and low-rank adaptation weights. It also facilitates faster inference by compressing models to 8-bit precision using vector-wise quantization. The library covers a range of memory optimization capabilities, including optimizer memory r
Sglang is a high-performance inference engine and serving system designed for large language and multimodal models. It provides a programmable interface for orchestrating complex generation workflows, enabling developers to coordinate multi-turn dialogues, tool invocations, and reasoning chains through a domain-specific language. The platform is built to support production-scale deployments, offering an OpenAI-compatible API that allows for integration with existing application ecosystems. The system distinguishes itself through a disaggregated architecture that separates compute-intensive pr
KoboldCPP is a local large language model inference engine and GGUF model runner designed to execute quantized models on personal hardware. It functions as a multimodal AI server and API gateway, providing OpenAI-compatible endpoints that allow third-party clients to interact with locally hosted models. The project distinguishes itself as an AI storytelling backend, featuring dedicated tools for long-form narrative management through persistent memory, world lore tracking, and character state management. It further extends its capabilities as a multimodal server capable of processing text, im
llama.cpp is a high-performance C++ inference engine and runtime for executing large language models locally across various hardware architectures. It provides the core components for local model execution, including a dedicated model quantizer for compressing weights into the GGUF format and a system for generating text embeddings for semantic search. The project distinguishes itself through specialized memory and execution optimizations, such as block-wise weight quantization to reduce memory footprints and memory-mapped model loading. It supports structured text generation by using formal