21 Repos
Configuration of Intel GPU acceleration for video decoding.
Distinguishing note: Specific to Intel hardware architecture.
Explore 21 GitHub repositories matching devops & infrastructure · Intel Hardware Acceleration.
Frigate is a self-hosted network video recorder that functions as a private, local AI-powered vision engine. It manages video streams by performing real-time object detection, tracking, and classification directly on local hardware, ensuring that security monitoring and activity recording remain independent of cloud services. The system distinguishes itself through a modular, hardware-accelerated video pipeline that offloads intensive decoding and machine learning inference to dedicated GPUs, NPUs, or specialized accelerators like Coral TPUs and Hailo modules. It utilizes state-based object t
Configures hardware acceleration presets for Intel GPUs to improve video decoding performance.
Facefusion is a modular framework designed for automated image and video manipulation, specializing in tasks such as face swapping, enhancement, and restoration. It functions as a computer vision processing pipeline that chains independent machine learning modules to perform complex transformations, including facial animation, age modification, and lip synchronization. The system is built to handle both real-time interactive feeds and large-scale batch processing tasks. The platform distinguishes itself through a highly extensible architecture that supports custom processing modules and inter
Utilizes compatible Intel graphics hardware to improve processing efficiency for complex tasks.
kohya_ss is a graphical user interface and workbench for fine-tuning diffusion models, specifically designed for Stable Diffusion. It provides a suite of tools for training generative AI models, including specialized interfaces for creating Low-Rank Adaptation weights and training ControlNet spatial control networks. The project distinguishes itself through integrated VRAM usage optimization and hardware acceleration, featuring specific support for Intel GPUs via XPU-accelerated libraries. It implements parameter-efficient training methods and memory-saving techniques like gradient checkpoint
Includes specific environment configurations and support for Intel GPUs via XPU-accelerated libraries.
FlexLLMGen is an inference engine and runtime designed to run large language models on a single GPU by combining weight compression with tensor offloading. It reduces model weight memory usage by approximately 70% through 4-bit quantization, and stores model parameters, attention cache, and hidden states across GPU, CPU, and disk to fit models larger than available GPU memory. The project distinguishes itself through a throughput-oriented batching approach that processes multiple generation requests together in large batches to maximize throughput on a single GPU. It also supports distributed
Reduces model weight memory by approximately 70% using 4-bit quantization with minimal accuracy loss.
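A figure of roughly 70% is consistent with simple arithmetic if the weights start as 16-bit floats and are quantized group-wise to 4 bits with a small amount of per-group metadata. The sketch below works through that arithmetic. The group size of 64 and the 32 bits of metadata per group (for example a 16-bit scale plus a 16-bit minimum) are illustrative assumptions, not values stated in this listing.

```python
def bits_per_weight(quant_bits: int, group_size: int, meta_bits_per_group: int) -> float:
    """Effective storage per weight: payload bits plus amortized group metadata."""
    return quant_bits + meta_bits_per_group / group_size


def memory_reduction(baseline_bits: int, quant_bits: int,
                     group_size: int, meta_bits_per_group: int) -> float:
    """Fraction of weight memory saved relative to the baseline precision."""
    effective = bits_per_weight(quant_bits, group_size, meta_bits_per_group)
    return 1.0 - effective / baseline_bits


# Assumed setup: FP16 baseline, 4-bit payload, groups of 64, 32 metadata bits per group.
reduction = memory_reduction(baseline_bits=16, quant_bits=4,
                             group_size=64, meta_bits_per_group=32)
print(f"effective bits per weight: {bits_per_weight(4, 64, 32)}")
print(f"weight memory saved: {reduction:.1%}")
```

Under these assumptions each weight costs 4.5 bits instead of 16, which is where a saving a little above 70% comes from; smaller groups raise the metadata overhead and lower the saving.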
BigDL is a PyTorch acceleration framework and distributed inference engine built for large language models (LLMs). It provides a toolkit for running models on Intel hardware and integrates quantization tools and libraries for parameter-efficient fine-tuning. The project distinguishes itself through its use of pipeline parallelism to distribute model workloads across multiple hardware accelerators. It uses low-bit integer quantization and speculative decoding to reduce memory requirements and lower text-generation latency. The system covers a broad set of model optimization features, including weight compression and loading of quantized models. It also supports hardware-accelerated training routines for adapting pretrained models to specific tasks.
Compresses LLM weights into low-bit precision formats to reduce memory usage and increase execution speed.
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
Converts model weights to low-bit precision formats like INT4 and FP8 to maximize performance on Intel hardware.
Yi is a bilingual language model and foundation model designed for natural language processing, reasoning, and reading comprehension in both English and Chinese. It is built as a transformer-based architecture capable of general purpose text generation and conversational tasks. The model is distinguished by its ability to function as a long context system, processing and analyzing extended input sequences up to 200k tokens. It also supports quantized versions that use low-bit precision to reduce memory footprints, enabling execution on consumer-grade hardware. The project covers a broad rang
Provides low-bit weight quantization to reduce memory footprint for execution on consumer-grade hardware.
itlwm is a macOS network driver and kernel extension that enables Intel wireless network adapters to function on macOS systems. It serves as a hardware-specific driver providing connectivity and stability for Intel Wi-Fi chips on non-native platforms. The project acts as a bridge between Intel hardware and native macOS wireless frameworks, allowing the system to recognize the adapter as a native device and utilize Apple Airport features and system settings. The driver manages hardware integration and network connectivity by interfacing with the operating system to support macOS hardware comp
Provides the necessary connectivity for Intel-based network chips to function on macOS systems.
mistral.rs is an inference engine for large language models that runs locally and exposes models behind OpenAI and Anthropic-compatible APIs. It serves as a multi-model serving platform, capable of loading several models in a single server process with per-request routing and on-demand loading and unloading. The engine supports multimodal inference, processing text alongside images, video, audio, and speech inputs, and includes a quantized model deployment runtime that reduces memory use and speeds up inference on consumer hardware. The project distinguishes itself through an agentic tool exe
Uses per-column weight data from calibration text to allocate more precision to high-impact weights during quantization.
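One common way to use such calibration data is to derive an importance value per weight column and then choose quantization parameters that minimize the importance-weighted error instead of the plain error. The sketch below illustrates that idea in pure Python. It is not mistral.rs code: the function names, the 4-bit symmetric grid, and the scale search are all hypothetical choices made for illustration.

```python
from typing import List


def column_importance(calibration: List[List[float]]) -> List[float]:
    """Mean squared activation per column over the calibration rows."""
    rows = len(calibration)
    cols = len(calibration[0])
    return [sum(row[j] ** 2 for row in calibration) / rows for j in range(cols)]


def weighted_error(weights: List[float], importance: List[float],
                   scale: float, levels: int = 7) -> float:
    """Importance-weighted squared error of symmetric round-to-nearest quantization.

    levels=7 corresponds to signed 4-bit codes in [-7, 7] (an assumption).
    """
    total = 0.0
    for w, imp in zip(weights, importance):
        q = max(-levels, min(levels, round(w / scale)))
        total += imp * (w - q * scale) ** 2
    return total


def best_scale(weights: List[float], importance: List[float],
               levels: int = 7, candidates: int = 64) -> float:
    """Search scales between 50% and 100% of the max-abs scale and keep the
    one with the lowest importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 1.0
    base = max_abs / levels
    grid = [base * (0.5 + 0.5 * k / (candidates - 1)) for k in range(candidates)]
    return min(grid, key=lambda s: weighted_error(weights, importance, s, levels))


# Toy data: the first column sees much larger activations than the others.
calibration = [[1.0, 0.1, 0.1], [2.0, 0.2, 0.1]]
weights = [0.31, -0.90, 0.05]
importance = column_importance(calibration)
scale = best_scale(weights, importance)
print(importance, scale)
```

Because the error term is multiplied by the column importance, the search tolerates clipping or coarse rounding on low-impact columns when that buys a more accurate value for a high-impact one.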
Lit-llama is a PyTorch-based implementation framework for the LLaMA language model, offering a system for pre-training, fine-tuning, and high-performance inference. It includes a pre-training pipeline for building foundation language models from scratch, as well as tools for running pretrained weights to generate natural text and predict sequences. The project offers specialized toolkits for parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA) and lightweight adapters. It also includes a quantization library that reduces model memory requirements through 4-bit and 8-bit precision, enabling execution on resource-constrained hardware. The framework integrates a simplified Transformer design and uses Flash Attention to optimize memory and speed. It also handles large datasets through streaming data formats to avoid loading entire corpora into memory.
Ships a quantization library that reduces memory footprints via GPTQ-based 4-bit and 8-bit precision.
ACE Step 1.5 is a local text-to-music generation and audio editing system that runs on consumer hardware. It transforms plain-language descriptions into full-length songs with lyrics, and can edit existing audio through cover generation, vocal removal, track separation, and selective repainting. The system supports multilingual prompts and lyrics in over 50 languages, and provides precise control over musical structure including duration, BPM, key, and time signature. The project distinguishes itself through a dual-stream diffusion architecture that processes separate latent streams for vocal
Generates complete songs in under ten seconds on a standard consumer GPU while using less than four gigabytes of video memory.
Chainer is an open-source deep learning framework built around define-by-run automatic differentiation, where computation graphs are constructed dynamically during forward execution. This imperative approach allows networks to be built using standard Python control flow, with gradients computed automatically through reverse-mode differentiation on the dynamically recorded graph. The framework supports GPU acceleration through a NumPy-compatible array backend with CUDA and cuDNN support, and provides a pluggable device abstraction that lets users switch between CPU and GPU computation without c
Improves performance on Intel CPUs for supported operations using Intel Deep Learning optimizations.
PakePlus-Android is a tool that converts any public webpage or static frontend project into a native desktop or mobile application. It wraps web content inside a configurable WebView shell, enabling the creation of cross-platform apps for Windows, Mac, Linux, Android, and iOS from a single source. The project distinguishes itself by automating the entire packaging and compilation pipeline through GitHub Actions, requiring no local development environment or dependencies. Users configure the app name, icon, window behavior, and platform-specific settings through a guided interface or configura
Provides a prebuilt binary for Intel-based Macs enabling immediate use on older hardware.
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Uses XPU accelerators and PyTorch Lightning strategies to run model training on Intel GPUs.
Baichuan-7B is an open-source 7 billion parameter bilingual Transformer model designed for text generation and few-shot learning across Chinese and English. It is built on a large Transformer architecture trained on a bilingual corpus, enabling it to produce coherent text in both languages from a single model. The model incorporates several optimization techniques that distinguish it from standard large language models. It uses rotary position embeddings that can extrapolate to longer sequences than seen during training, allowing context extension beyond the original 4096-token training lengt
Reduces model memory by approximately 70% using 4-bit weight quantization with minimal accuracy loss.
AutoGPTQ is a model compression framework designed to reduce memory requirements and increase the inference speed of large language models. It uses the GPTQ algorithm to compress model weights, allowing these models to run on hardware with limited VRAM. The toolkit offers an architecture quantization pipeline that supports integrating custom model classes for different neural network architectures. It includes a mixed-precision inference engine with optimized kernels to accelerate matrix multiplication during deployment. The framework covers the entire weight compression workflow, from calibration and quantization to accuracy evaluation on downstream tasks. These tools measure the performance loss by comparing the outputs of quantized models against the original weights on benchmark tasks.
Implements the GPTQ algorithm for post-training weight quantization to reduce model size.
AutoGPTQ is a model compression toolkit and post-training quantization framework designed to reduce the memory requirements of large language models. It uses the GPTQ algorithm to compress neural network weights, lowering hardware requirements and reducing VRAM usage. The project serves as an inference accelerator by providing optimized kernels that increase token generation speed. It offers model architecture extensibility, allowing quantization capabilities to be added to new model structures through configurable patterns. The framework covers a comprehensive quantization pipeline, including layer-wise weight compression, calibration-based scale estimation, and precision-specific memory mapping. It also includes systems for evaluating model performance to measure the impact of quantization on accuracy in language and summarization tasks.
Implements the GPTQ algorithm for high-efficiency post-training weight quantization of large language models.
SakuraLLM is a multi-format document translation system that hosts large language models for translating Japanese text into other languages. It functions as an inference server that exposes translation models through an OpenAI-compatible API, allowing any tool supporting the OpenAI client format to send translation requests. The system is designed as a glossary-aware translation engine that applies user-defined term dictionaries to ensure consistent translation of proper nouns and names across outputs. The project distinguishes itself by supporting multiple high-performance inference backends
Runs the translation model on NVIDIA and AMD GPUs with CPU-GPU hybrid inference for lower-memory setups.
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
Overrides automatic runtime detection to force the use or non-use of Intel MKL for CPU execution.
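In CTranslate2 this kind of override is exposed through an environment variable; the sketch below assumes it is `CT2_USE_MKL` and that it accepts `1` and `0`, which should be checked against the CTranslate2 documentation for the installed version. The helper `ct2_env` is a hypothetical name introduced here; it only builds an environment mapping and does not import or call CTranslate2.

```python
import os
from typing import Dict, Mapping, Optional


def ct2_env(use_mkl: Optional[bool],
            base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the MKL override applied.

    use_mkl=True  -> force Intel MKL (assumed value "1")
    use_mkl=False -> disable Intel MKL (assumed value "0")
    use_mkl=None  -> remove the override and keep automatic detection
    """
    env = dict(os.environ if base is None else base)
    if use_mkl is None:
        env.pop("CT2_USE_MKL", None)
    else:
        env["CT2_USE_MKL"] = "1" if use_mkl else "0"
    return env


# Example: environment for a child process that should not use MKL.
env = ct2_env(use_mkl=False)
print(env["CT2_USE_MKL"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` so the override applies only to the launched inference process and not to the parent shell.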
Baichuan2 is a collection of pre-trained large language models, including base and chat variants, designed for natural language generation and multi-turn conversational AI. It provides an inference engine and a fine-tuning framework to adapt these models to custom datasets and specialized domains. The project features a quantization toolkit and an inference engine that enable model execution across diverse hardware, including graphics processors, central processors, and specialized accelerators. These tools support low-bit weight quantization to reduce memory usage and increase inference spee
Implements weight quantization to four or eight bits to reduce memory overhead and increase inference speed.