30 open-source projects similar to rust-gpu/rust-cuda, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Rust CUDA alternative.
cuda-python provides low-level Python bindings for the CUDA Driver and Runtime APIs. It serves as a programmatic wrapper for controlling device memory, managing hardware toolchains, and orchestrating execution graphs on NVIDIA GPUs, allowing for the compilation and launching of parallel kernels directly from Python. The project enables the development of SIMT kernels and the execution of mathematical algorithms on device memory. It integrates pre-compiled bytecode as custom operators and interfaces with accelerated device libraries to access low-level hardware functions without leaving the la
Zen-C is a multi-target systems language and source-to-source compiler that translates high-level logic into human-readable GNU C or C11 code. It functions as a JIT-enabled programming language with an in-process compiler for real-time interactive code evaluation and testing. The project serves as a CUDA GPU kernel generator, mapping specialized syntax to CUDA C++ using device attributes to target graphics hardware. It acts as an interoperability layer capable of emitting compatible code for C++, Objective-C, and Lisp to bridge native system frameworks and libraries. The language includes an
gpu.cpp is a lightweight C++ library for executing low-level general-purpose GPU computation across different hardware vendors and operating systems. It functions as a portable GPU wrapper, kernel orchestrator, and tensor management system using the WebGPU specification to abstract device initialization, buffer transfers, and compute shader dispatching. The library provides a framework for defining compute kernels from shader code and managing their asynchronous dispatch and synchronization. It enables the execution of cross-platform compute shaders and the orchestration of GPU tasks through
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
This project is a collection of reference implementations and technical guides for building high-performance 3D applications and graphics experiments on Windows. It provides a library of samples covering the implementation of GPU compute frameworks, raytracing reference models, and shader optimization techniques. The repository includes specific demonstrations for modeling physical light behavior to create reflections and lighting effects, as well as tools for analyzing memory dumps and tracking real-time execution metrics on graphics hardware. It further provides guidance on managing shader
SD.Next is an all-in-one web interface and multi-backend inference engine for generating, editing, and processing images and videos using diffusion models. It functions as a comprehensive tool for diffusion model management and an automated image processing pipeline for bulk operations. The project is distinguished by its hardware-backend abstraction layer, which provides automatic detection and acceleration for NVIDIA CUDA, AMD ROCm, Intel OpenVINO, and DirectML. It features a headless generative API and a programmatic command interface, allowing users to trigger tasks via REST API or CLI wi
This project is a collection of reference implementations and benchmarks demonstrating the use of the Vulkan graphics and compute API. It provides a set of cross-platform examples and GPU programming patterns designed for high-performance rendering and hardware-accelerated tasks. The repository includes a suite of performance benchmarks used to measure API behavior across different hardware environments. It features a modular architecture that organizes rendering examples into isolated units, along with command-line utilities for the batch execution of sample sequences. The project covers se
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
regl is a declarative WebGL library that manages graphics state and GPU resources through functional commands instead of manual binding and state tracking. It provides a command-based drawing abstraction where shaders, attributes, and render state are encapsulated into reusable, compiled functions that can be executed efficiently. What sets regl apart is its scoped state inheritance system, which allows nested drawing commands to inherit and override render state from parent scopes for organized rendering. The library automatically recovers from GPU context loss by restoring buffer and textur
The Forge is a low-level toolkit for building high-performance graphics engines and applications across desktop, mobile, and console platforms. It provides a cross-platform engine framework and a dedicated shader compiler that translates a single source into target-specific languages for various graphics APIs and hardware. The project includes a GPU memory and resource manager that utilizes unified root signatures for resource binding, alongside a ray tracing rendering pipeline that implements hardware-accelerated ray and path tracing queries. State management is handled through a high-perfor
Rust-GPU is a compiler and toolchain that translates Rust source code into SPIR-V bytecode for execution on graphics and compute hardware. It provides a development environment for writing parallel compute kernels and graphics shaders using a custom rustc codegen backend that maps high-level language constructs to GPU-compatible memory layouts and instructions. The project enables cross-platform shader development, allowing the same Rust code to run across different GPU hardware and graphics APIs via the SPIR-V intermediate representation. It specifically supports the creation of general-purpose compu
AITemplate is an ahead-of-time deep learning compiler that translates PyTorch neural networks into standalone C++ source code. It functions as a PyTorch to C++ compiler and a GPU kernel fusion engine, producing self-contained executable binaries that run inference without requiring a Python interpreter or deep learning framework runtime. The project generates optimized CUDA and HIP C++ code specifically for NVIDIA TensorCores and AMD MatrixCores. It focuses on maximizing throughput for half-precision floating-point operations through a system that combines multiple neural network operators in
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, paged attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat
gfx is a hardware-agnostic graphics API abstraction that translates a unified set of graphics and compute commands into native instructions for multiple GPU drivers. It provides a common interface for cross-platform rendering and general-purpose GPU compute programming. The project features an intermediate-representation shader translation system that converts source code and SPIR-V into target-specific languages. It employs a data-driven reference test framework to verify that graphics output remains consistent across different hardware platforms. Capabilities include parallel command buffe
This project provides a comprehensive toolset for WebGPU, serving as a graphics API wrapper, compute shader framework, resource manager, and shader toolchain. It enables browser-based GPU acceleration by offloading memory-intensive tasks and data processing from the CPU to the GPU. The framework manages the full lifecycle of GPU operations, from requesting physical hardware adapters and initializing logical devices to configuring programmable render and compute pipelines. It specifically supports the coordination of parallel workgroups and collective subgroup operations for general-purpose co
This project is a CUDA programming course and technical guide focused on writing and optimizing GPU kernels for hardware acceleration. It provides structured learning resources for using the CUDA platform to execute operations on silicon architectures. The material covers the optimization of linear algebra kernels and the analysis of machine learning deployment. It includes guidance on identifying acceleration tools, mapping the deep learning ecosystem, and evaluating the frameworks used to move models from research to production environments. The scope extends to GPU performance optimizatio
Vulkan-Hpp is a header-only C++ binding library for the Vulkan graphics and compute API. It provides a type-safe wrapper around the Vulkan C API, allowing developers to interface with GPU hardware through a C++ interface that introduces no runtime CPU overhead. The library utilizes Resource Acquisition Is Initialization patterns to manage the lifecycle of Vulkan handles and objects, automating the release of GPU resources. It replaces C-style enumerations and bit-fields with strong typing and static type checking to catch invalid API parameter assignments during compilation. The project cove
LWJGL is a cross-platform library that provides Java bindings to native APIs for graphics, audio, compute, windowing, and input. It enables Java applications to access low-level hardware-accelerated capabilities such as OpenGL and Vulkan rendering, OpenAL 3D audio, OpenCL GPU compute, and GLFW windowing and input handling. Under the hood, LWJGL dynamically resolves native function pointers at runtime, loads platform-specific shared libraries, and uses generated JNI bindings to bridge Java and native code. It offers explicit memory management through direct buffer access and stack-allocated me
MMdnn is a deep learning model converter and migrator designed to translate neural network architectures and weights between different frameworks such as TensorFlow, PyTorch, and Keras. It utilizes a standardized intermediate representation to decouple network structures and weights from specific framework implementations, enabling the transformation of pre-trained models across different environments. The project distinguishes itself by generating native Python reconstruction code from its intermediate representations, allowing models to be rebuilt and fine-tuned in target environments. It a
HIP is a C++ GPU kernel language and cross-platform runtime designed for writing portable high-performance compute applications. It provides a programming interface that allows a single source codebase to execute on both AMD and NVIDIA GPU architectures. The project functions as a compatibility layer that enables the conversion and migration of existing CUDA source code to run on AMD hardware. This is achieved through an API and syntax that mirror CUDA, together with source-to-source translation tools that convert CUDA code to HIP ahead of compilation. The toolkit covers the broader surface of cross-platform GPGPU development, in
TypeGPU is a tool for type-safe WebGPU development that enables writing shaders in TypeScript. It translates high-level TypeScript function definitions and structures into WebGPU Shading Language source code to automate shader generation and validate logic using a type system. The project provides a mechanism for cross-library GPU interoperability by sharing typed buffers without copying data to system memory. It also integrates the Model Context Protocol to allow AI agents to inspect generated shader code and diagnose runtime errors. The system manages WebGPU resource mapping through typed
TileLang is a Python-embedded domain-specific language compiler that JIT-compiles and autotunes GPU kernels. It uses a tile-based DSL, automatic software pipelining, and parallel autotuning to generate optimized GPU kernels at runtime. It supports tensor core operations with Pythonic syntax, automatic memory management, and thread mapping. The compiler searches over tile sizes, thread counts, and scheduling policies, compiling and benchmarking candidates in parallel to find the fastest kernel. It also caches compiled binaries and tuning results to disk for reuse across sessions. TileLang inc
This project is a high-performance C++ and CUDA neural network library designed for fast training and inference of small networks on NVIDIA GPUs. It serves as a specialized backend for neural radiance fields and coordinate-based networks, providing a fused GPU kernel library and a hash grid encoder for transforming raw input dimensions into high-dimensional representations. The library distinguishes itself through the use of C++ template metaprogramming and fused-kernel execution, which merge neural network layers into single GPU device functions to eliminate memory bottlenecks. It leverages
Neural Enhance is a deep learning image upscaler and restoration tool designed to increase image resolution and remove blur. It functions as a neural image restoration utility for eliminating noise and JPEG artifacts, and includes a framework for training and tuning custom neural network models against image datasets. The system utilizes a containerized environment to offload tensor calculations to GPU cores, speeding up neural network inference. It features a batch processing pipeline that queues multiple image files in sequence to maximize hardware throughput. Capabilities include domain-s
Bosque is an experimental programming language and development platform designed for machine-assisted software construction. It combines functional programming semantics with imperative syntax to enforce logic correctness and runtime safety, providing a type-safe environment that utilizes structured data models to maintain information integrity throughout the application lifecycle. The platform distinguishes itself through deep integration with formal verification tools, including automated theorem provers and symbolic execution engines. By transforming source code into a regularized intermed
Kajiya is a physically based rendering engine and real-time global illumination renderer. It utilizes a GPU-accelerated path tracer to simulate real-world material properties, such as roughness and metalness, to achieve photorealistic visual results. The engine incorporates a temporal super-resolution upscaler to increase final render resolution by reconstructing images from lower-resolution internal frames. It also generates high-fidelity reference images through path-tracing to verify the visual accuracy of real-time lighting outputs. The system covers 3D scene visualization and asset mana
DirectXTK is a C++ library designed to simplify 2D and 3D graphics, audio, and input programming for DirectX applications. It serves as a comprehensive toolkit providing high-level wrappers for DirectX graphics, audio management, and input handling. The toolkit includes a graphics wrapper for loading textures and rendering 3D models and 2D sprites, alongside a dedicated audio manager for sound effects and 3D spatial audio. It also provides an input handler to track and process state updates from keyboards, mice, and gamepads. The library covers a broad capability surface including 3D math an
oneDNN is a library for deep learning acceleration that provides optimized building blocks for neural network training and inference. It manages tensor computation across CPU and GPU hardware, enabling the execution of high-performance primitives for model training and neural network inference optimization. The project distinguishes itself through hardware-specific kernel optimization and the use of just-in-time compilation to target specific processor instruction sets. It supports quantized neural network execution using both static and dynamic quantization to reduce memory usage and increas
PhysX is a physics engine SDK designed for calculating real-time rigid body dynamics, fluid simulations, and environmental interactions in virtual applications. It includes a GPU-accelerated physics solver for computing complex particle fluids and combustion models, a voxel fluid simulator for real-time gas, fire, and smoke, and a destruction simulation framework for modeling the fracture of meshes. The SDK features a specialized machine learning physics tensor interface that enables the exchange of simulation data with machine learning frameworks using a common tensor format. It also impleme