30 open-source projects similar to answerdotai/gpu.cpp, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best gpu.cpp alternative.
This project provides a comprehensive toolset for WebGPU, serving as a graphics API wrapper, compute shader framework, resource manager, and shader toolchain. It enables browser-based GPU acceleration by offloading memory-intensive tasks and data processing from the CPU to the GPU. The framework manages the full lifecycle of GPU operations, from requesting physical hardware adapters and initializing logical devices to configuring programmable render and compute pipelines. It specifically supports the coordination of parallel workgroups and collective subgroup operations for general-purpose co
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
gfx is a hardware-agnostic graphics API abstraction that translates a unified set of graphics and compute commands into native instructions for multiple GPU drivers. It provides a common interface for cross-platform rendering and general-purpose GPU compute programming. The project features an intermediate-representation shader translation system that converts source code and SPIR-V into target-specific languages. It employs a data-driven reference test framework to verify that graphics output remains consistent across different hardware platforms. Capabilities include parallel command buffe
This project is a cross-platform graphics and compute framework that provides a unified, hardware-agnostic abstraction layer for rendering and parallel processing. It enables developers to build high-performance applications that execute consistently across diverse operating systems and hardware backends, including Vulkan, Metal, and DirectX. By mapping high-level graphics commands to native APIs, it serves as a portable foundation for both real-time 3D rendering and general-purpose GPU computing. The framework distinguishes itself through a robust architecture that supports both native deskt
rust-cuda is a GPU programming framework and device compiler that allows for the development and execution of high-performance kernels on NVIDIA hardware using Rust. It provides a driver wrapper to manage device memory allocation and kernel launching, effectively serving as a system for writing GPU compute logic without relying on C++. The project includes a compute library with hardware-optimized primitives for neural network acceleration and hardware-accelerated raytracing. It utilizes a compilation toolchain that translates source code into a low-level intermediate representation for execu
Orillusion is a WebGPU 3D rendering engine designed for high-fidelity scenes and visual effects in the browser. It functions as a GPU compute framework for parallel mathematical operations and a physically-based rendering graphics pipeline for realistic materials and surfaces. The system also includes a web-based 3D animation toolkit for driving skeletal animations and interpolating vertex positions. The engine is distinguished by its use of an entity component system for scene logic and a macro-based shader generation system that creates multiple shader variants. It optimizes performance thr
cuda-python provides low-level Python bindings for the CUDA Driver and Runtime APIs. It serves as a programmatic wrapper for controlling device memory, managing hardware toolchains, and orchestrating execution graphs on NVIDIA GPUs, allowing for the compilation and launching of parallel kernels directly from Python. The project enables the development of SIMT kernels and the execution of mathematical algorithms on device memory. It integrates pre-compiled bytecode as custom operators and interfaces with accelerated device libraries to access low-level hardware functions without leaving the la
The Forge is a low-level toolkit for building high-performance graphics engines and applications across desktop, mobile, and console platforms. It provides a cross-platform engine framework and a dedicated shader compiler that translates a single source into target-specific languages for various graphics APIs and hardware. The project includes a GPU memory and resource manager that utilizes unified root signatures for resource binding, alongside a ray tracing rendering pipeline that implements hardware-accelerated ray and path tracing queries. State management is handled through a high-perfor
TypeGPU is a tool for type-safe WebGPU development that enables writing shaders in TypeScript. It translates high-level TypeScript function definitions and structures into WebGPU Shading Language source code to automate shader generation and validate logic using a type system. The project provides a mechanism for cross-library GPU interoperability by sharing typed buffers without copying data to system memory. It also integrates the Model Context Protocol to allow AI agents to inspect generated shader code and diagnose runtime errors. The system manages WebGPU resource mapping through typed
This project is a collection of reference implementations and benchmarks demonstrating the use of the Vulkan graphics and compute API. It provides a set of cross-platform examples and GPU programming patterns designed for high-performance rendering and hardware-accelerated tasks. The repository includes a suite of performance benchmarks used to measure API behavior across different hardware environments. It features a modular architecture that organizes rendering examples into isolated units, along with command-line utilities for the batch execution of sample sequences. The project covers se
ArrayFire is a hardware-agnostic compute framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and parallel signal processing toolkit that abstracts hardware backends, allowing the same codebase to execute across various GPU architectures and CPUs. The project distinguishes itself through a JIT engine that uses expression compilation to fuse operations and minimize memory overhead. It employs a deferred execution graph to optimize computation chains and provides interoperability primitives to share data and e
PowerInfer is a high-performance local large language model inference engine and sparse inference framework. It provides a runtime for executing models on consumer-grade hardware, utilizing a GPU acceleration backend to optimize tensor operations for graphics processors. The system distinguishes itself through a sparse inference framework that increases generation speed by skipping computations based on activation sparsity in model weights. It includes a GGUF model converter for transforming weights and metadata into a unified binary format, as well as an OpenAI API compatible server for inte
MNN is a high-performance inference engine and framework designed for on-device machine learning. It provides a comprehensive environment for executing, optimizing, and deploying neural network models directly on mobile and resource-constrained edge devices. The framework distinguishes itself through a robust model optimization toolkit that supports quantization, compression, and structural graph manipulation to minimize memory footprint and maximize execution speed. It features a modular architecture that abstracts hardware-specific backends, allowing models to run efficiently across diverse
This repository provides a collection of practical demonstrations and implementation guides for machine learning tasks using TensorFlow.js. It serves as a resource for developers to explore model architectures, training workflows, and data manipulation techniques across domains such as computer vision, natural language processing, and reinforcement learning. The project covers the full lifecycle of machine learning development, including tensor-based mathematical operations, model construction via high-level layer APIs or low-level tensor logic, and model serialization for various storage med
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Warp is a Python framework that JIT-compiles Python functions into CUDA kernels for GPU-accelerated parallel computation, with built-in automatic differentiation and multi-framework array interoperability. At its core, it provides a GPU kernel compilation system that enables writing and executing custom GPU kernels directly from Python, while supporting automatic gradient computation through those kernels for integration with machine learning pipelines. The framework also includes tile-based cooperative computing, where thread blocks partition into tiles for shared-memory and tensor-core opera
ExecuTorch is a lightweight C++ runtime for deploying PyTorch models on mobile, embedded, and edge hardware. It provides an ahead-of-time compilation pipeline that exports, quantizes, and lowers model graphs into compact serialized programs, then executes them through a minimal runtime with hardware acceleration and on-device large language model inference capabilities. The project distinguishes itself through a hardware accelerator delegate system that partitions model subgraphs and offloads computation to specialized backends including NPUs, GPUs, and DSPs from Apple, Arm, Intel, MediaTek,
Vulkan-Hpp is a header-only C++ binding library for the Vulkan graphics and compute API. It provides a type-safe wrapper around the Vulkan C API, allowing developers to interface with GPU hardware through a C++ interface that introduces no runtime CPU overhead. The library utilizes Resource Acquisition Is Initialization patterns to manage the lifecycle of Vulkan handles and objects, automating the release of GPU resources. It replaces C-style enumerations and bit-fields with strong typing and static type checking to catch invalid API parameter assignments during compilation. The project cove
Zen-C is a multi-target systems language and source-to-source compiler that translates high-level logic into human-readable GNU C or C11 code. It functions as a JIT-enabled programming language with an in-process compiler for real-time interactive code evaluation and testing. The project serves as a CUDA GPU kernel generator, mapping specialized syntax to CUDA C++ using device attributes to target graphics hardware. It acts as an interoperability layer capable of emitting compatible code for C++, Objective-C, and Lisp to bridge native system frameworks and libraries. The language includes an
ogl is a WebGL graphics library and 3D scene graph engine designed for rendering three-dimensional scenes. It provides a lightweight framework for managing geometries and coordinating spatial transformations within a hierarchical system. The project includes a PBR shader system for creating realistic materials and a GPGPU computation framework for performing large-scale general-purpose calculations and particle simulations on the graphics processor. It also features a post-processing suite for applying visual filters to rendered scenes via frame buffers. The library covers broader capabiliti
Tixl is a node-based motion graphics engine and procedural generation tool used to create 3D geometry and shaders. It utilizes a directed acyclic graph of operators and GPU-accelerated compute kernels to generate complex 3D shapes, particularly through the use of signed distance functions and particle simulations. The engine is highly extensible via a C# development framework that supports hot code reloading, allowing custom operator logic to be injected into the active runtime without restarting. It further distinguishes itself as a lighting controller, capable of translating 3D spatial attr
Mooncake is a disaggregated large language model serving platform and distributed key-value store designed for high-performance inference infrastructure. It functions as a GPU memory orchestrator and KV cache management system that pools and transfers key-value caches across clusters to accelerate inference. The system distinguishes itself by separating the prefill and decode phases of inference into distinct hardware clusters to optimize resource utilization. It utilizes a high-performance RDMA distributed cache with zero-copy transfers to move data between compute nodes, bypassing the CPU t
Rust-GPU is a compiler and toolchain that translates Rust source code into SPIR-V bytecode for execution on graphics and compute hardware. It provides a development environment for writing parallel compute kernels and graphics shaders using a custom rustc codegen backend that maps high-level language constructs to GPU-compatible memory layouts and instructions. The project enables cross-platform shader development, allowing the same Rust code to run across different GPU hardware and graphics APIs via the SPIR-V intermediate representation. It specifically supports the creation of general-purpose compu
NCCL is a high-performance communication library and distributed GPU computing framework designed for executing collective and point-to-point data exchanges across multiple GPUs in single or multi-node systems. It serves as an RDMA GPU transport layer and memory orchestrator, facilitating high-bandwidth synchronization of data and model gradients for distributed GPU training and inference. The library is distinguished by its ability to execute communication primitives directly from GPU kernels, removing the host CPU from the critical path. It utilizes topology-aware path selection to optimize
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on automotive-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
LWJGL is a cross-platform library that provides Java bindings to native APIs for graphics, audio, compute, windowing, and input. It enables Java applications to access low-level hardware-accelerated capabilities such as OpenGL and Vulkan rendering, OpenAL 3D audio, OpenCL GPU compute, and GLFW windowing and input handling. Under the hood, LWJGL dynamically resolves native function pointers at runtime, loads platform-specific shared libraries, and uses generated JNI bindings to bridge Java and native code. It offers explicit memory management through direct buffer access and stack-allocated me
regl is a declarative WebGL library that manages graphics state and GPU resources through functional commands instead of manual binding and state tracking. It provides a command-based drawing abstraction where shaders, attributes, and render state are encapsulated into reusable, compiled functions that can be executed efficiently. What sets regl apart is its scoped state inheritance system, which allows nested drawing commands to inherit and override render state from parent scopes for organized rendering. The library automatically recovers from GPU context loss by restoring buffer and textur
bgfx is a cross-platform graphics rendering abstraction layer designed for high-performance applications. It provides a unified interface that maps high-level rendering commands to native graphics APIs, allowing developers to maintain a single codebase that executes consistently across diverse operating systems and hardware architectures. The library distinguishes itself through a multi-threaded command submission model that decouples rendering logic from the main application thread, reducing CPU bottlenecks. It utilizes a backend-agnostic command buffer and a deferred resource