30 open-source projects similar to nvidia/cuda-python, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best CUDA Python alternative.
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
rust-cuda is a GPU programming framework and device compiler that allows for the development and execution of high-performance kernels on NVIDIA hardware using Rust. It provides a driver wrapper to manage device memory allocation and kernel launching, effectively serving as a system for writing GPU compute logic without relying on C++. The project includes a compute library with hardware-optimized primitives for neural network acceleration and hardware-accelerated raytracing. It utilizes a compilation toolchain that translates source code into a low-level intermediate representation for execu
TileLang is a Python-embedded domain-specific language compiler that JIT-compiles and autotunes GPU kernels. It uses a tile-based DSL, automatic software pipelining, and parallel autotuning to generate optimized GPU kernels at runtime. It supports tensor core operations with Pythonic syntax, automatic memory management, and thread mapping. The compiler searches over tile sizes, thread counts, and scheduling policies, compiling and benchmarking candidates in parallel to find the fastest kernel. It also caches compiled binaries and tuning results to disk for reuse across sessions. TileLang inc
This project provides a framework for binding Rust and Python, enabling the creation of native extension modules and the embedding of the Python interpreter within host applications. It functions as a cross-language interoperability library that facilitates the execution of scripts, the definition of classes, and the sharing of data structures across the boundary of the two runtimes. The framework distinguishes itself through the use of procedural macros to automate the generation of boilerplate code, simplifying the process of exposing native functions and data types. It employs type-level m
gpu.cpp is a lightweight C++ library for executing low-level general-purpose GPU computation across different hardware vendors and operating systems. It functions as a portable GPU wrapper, kernel orchestrator, and tensor management system using the WebGPU specification to abstract device initialization, buffer transfers, and compute shader dispatching. The library provides a framework for defining compute kernels from shader code and managing their asynchronous dispatch and synchronization. It enables the execution of cross-platform compute shaders and the orchestration of GPU tasks through
Cpp-taskflow is a C++ task-parallelism framework and task graph scheduler designed to manage and execute complex dependency graphs of parallel tasks across CPU and GPU hardware. It provides a parallel algorithm library for high-performance implementations of reductions, sorts, pipelines, and iterations. The framework distinguishes itself through its ability to offload heavy computational workloads from a task graph to graphics processors for acceleration. It also includes a task profiling tool and a performance analysis interface for visualizing task execution flow and dependency structures t
RustPython is a Python 3 compatible interpreter implemented in Rust. It functions as a scripting engine that can be embedded directly into host applications, allowing for the execution of dynamic scripts and the customization of software behavior within a memory-safe environment. The project distinguishes itself through its ability to bridge Python and JavaScript runtimes, enabling data exchange and function invocation across language boundaries. It also provides a portable execution environment by compiling Python code into WebAssembly, which allows for the execution of logic directly within
TypeGPU is a tool for type-safe WebGPU development that enables writing shaders in TypeScript. It translates high-level TypeScript function definitions and structures into WebGPU Shading Language source code to automate shader generation and validate logic using a type system. The project provides a mechanism for cross-library GPU interoperability by sharing typed buffers without copying data to system memory. It also integrates the Model Context Protocol to allow AI agents to inspect generated shader code and diagnose runtime errors. The system manages WebGPU resource mapping through typed
CuPy is a CUDA array computing library that implements a NumPy-compatible interface for executing array operations and numerical computing on NVIDIA GPUs. It serves as a GPU-accelerated numerical library and a CUDA-based SciPy implementation, offloading heavy calculations to graphics hardware to increase processing speed for scientific and engineering workloads. The library enables multi-framework tensor exchange, allowing data buffers to be shared between different deep learning frameworks using standardized memory layouts to avoid memory copies. It also supports custom GPU kernel integratio
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
FlashInfer is a library of high-performance GPU kernels purpose-built for accelerating large language model inference. It provides optimized implementations for attention operations (including flash attention, page attention, multi-head latent attention, and cascade attention) using paged key-value caches, fused kernel composition, and just-in-time compilation. The library also includes specialized kernels for mixture-of-experts layers, block-scaled low-precision quantization (FP8, FP4), and distributed collective communication. What distinguishes FlashInfer is its fused all-reduce communicat
This project is an OpenWrt firmware builder and specialized Linux router distribution designed to repurpose Amlogic S9xxx series hardware into functional routers. It provides a hardware adaptation layer consisting of kernel modifications and drivers that enable the operating system to run on Amlogic ARM SoC devices. The project features an automated firmware pipeline for scheduling, building, and distributing custom images. It includes a device recovery toolkit for bootstrapping, flashing, and restoring factory images, and supports the conversion of devices previously running different mobile
Neon is a framework for writing high-performance native Node.js modules using the Rust programming language. It serves as a foreign function interface bridge and a toolchain for bootstrapping, compiling, and managing Rust-based extensions. The project provides a cross-language memory manager that handles buffers and object borrowing to ensure safe memory access between Rust and JavaScript. It enables the mapping of data types and function calls across the language boundary, allowing Rust functions to be exported to the script environment and JavaScript functions to be called from Rust. The f
node-ffi is a foreign function interface library for Node.js that enables calling functions from native C dynamic libraries without writing manual C++ bindings. It serves as a system for loading shared objects and DLLs into process memory, translating JavaScript values into binary representations, and executing external binaries at runtime. The project utilizes a wrapper around the libffi library to construct call frames and execute native functions with dynamic arguments. It distinguishes itself by providing a native memory manager for allocating raw pointers and a mapping system that connec
miniaudio is a single-file C audio library used for audio playback, capture, and hardware interfacing across multiple operating systems. It functions as an audio hardware abstraction layer, an audio processing engine, an audio synthesis engine, and a codec and resampler. The project implements a node-graph based system for routing digital audio signals, mixing, and 3D spatialization. It also includes a programmatic generator for noise patterns and basic waveforms used for sound creation and signal testing. The library covers digital signal processing, including audio format conversion and sa
libffi is a foreign function interface library that enables calling functions written in other languages at runtime. It serves as a multi-architecture ABI wrapper and dynamic call frame generator, allowing the execution of external functions based on runtime descriptions of argument types and return values. The project provides a portable interface to handle diverse calling conventions across different hardware architectures and operating systems. It includes capabilities for executable closure allocation, which allows foreign code to trigger callbacks within a host language via jump tables s
JAX is a hardware-accelerated array library and automatic differentiation system for numerical computing. It provides a framework compatible with NumPy that extends array operations with a just-in-time compiler to transform Python functions into optimized kernels for execution on GPU and TPU accelerators. The system differentiates itself through the use of an XLA-based compiler and a single program multiple data sharding model. These capabilities allow the library to distribute large-scale computations across multiple hardware accelerators using both automatic parallelization and manual shard
This project is a comprehensive research platform designed for the end-to-end lifecycle of robotic learning. It provides a modular framework for training neural network policies—specifically through imitation and reinforcement learning—and deploying them onto physical robotic hardware. By offering a unified interface for hardware abstraction, the platform decouples high-level control logic from the specific sensors and actuators of diverse robotic systems. The framework distinguishes itself through a standardized approach to data and policy management. It utilizes a consistent schema for reco
CPython is the primary, community-maintained reference implementation of the Python programming language. It functions as a high-level, interpreted execution environment that compiles source code into platform-independent bytecode for processing by a stack-based virtual machine. The runtime manages memory through a combination of reference counting and generational cyclic garbage collection, while dynamic type dispatching determines object behavior at runtime based on metadata stored within object headers. The project is distinguished by its C-based architecture, which provides a stable forei
This project provides a full Python interpreter compiled to WebAssembly, enabling the execution of Python code and scientific libraries directly within web browsers and server-side environments. By bridging the gap between language runtimes, it allows developers to run computational tasks, manage packages, and perform data analysis in client-side environments without requiring a backend server. The platform distinguishes itself through a comprehensive foreign function interface that enables bidirectional data exchange, object proxying, and function calling between Python and JavaScript. It in
This project is a professional live video production suite designed for capturing, encoding, and broadcasting high-quality media. At its core, it features a real-time media processing engine that utilizes hardware acceleration to composite multiple audio and video sources with minimal latency. The application provides a centralized studio interface for managing complex scene transitions, layering visual sources through a hierarchical scene-graph engine, and streaming content to multiple platforms simultaneously. The software is built on a cross-platform abstraction layer that ensures consiste
SDL is a cross-platform development library that provides low-level access to audio, keyboard, mouse, joystick, and graphics hardware. It functions as a hardware abstraction layer, mapping diverse operating system interfaces into a unified set of functions to ensure consistent performance across different computing environments. The library serves as a foundation for multimedia and interactive application development by providing an integrated audio processing engine and a graphics rendering framework. It manages the complexities of hardware communication by normalizing raw input events and p
This project is a high-performance, lightweight C graphics library designed for creating interactive user interfaces on resource-constrained embedded hardware. It functions as a comprehensive framework that provides a widget toolkit, a rendering engine, and hardware-agnostic drivers to support the development of graphical displays on microcontrollers and embedded systems. The framework distinguishes itself through a flexible, object-oriented widget hierarchy and a declarative layout engine that supports responsive design patterns like flexbox and grid systems. It enables developers to synchro
gdext provides a set of language bindings for writing high-performance native game logic in Rust for the Godot 4 engine. It serves as a framework for creating native engine extensions and custom classes via the GDExtension library, allowing developers to extend core engine functionality without recompiling the engine source code. The project includes a dedicated Rust WebAssembly toolchain to compile native logic into modules for execution in web browsers. This system supports WebAssembly-compatible compilation with specific configurations for web threading and module debugging. The toolkit c
NuttX is a POSIX-compliant real-time operating system designed for microcontrollers ranging from 8-bit to 64-bit architectures. It provides a deterministic execution environment with a real-time task scheduler and a POSIX embedded kernel to ensure portable code execution across diverse hardware targets. The project distinguishes itself through a comprehensive hardware abstraction layer that provides standardized drivers for I2C, SPI, CAN, and USB across various semiconductor chipsets. It also features an embedded networking stack supporting TCP, UDP, IPv4, and IPv6, alongside industrial proto
flutter-webrtc is a real-time communication SDK and plugin for the Flutter framework. It provides a set of tools for establishing peer-to-peer media connections and low-latency data exchange across mobile, desktop, and web environments. The project enables the creation of applications with live audio and video calling, real-time media streaming, and peer-to-peer data channels for sending encrypted arbitrary data packets without a central server. It supports secure media communication through end-to-end encryption for audio, video, and data streams. The SDK covers broad capabilities including
DeepEP is a distributed model accelerator and expert-parallel communication library designed to optimize the training and inference of large-scale neural networks. It provides specialized GPU communication kernels and a remote GPU memory interface to facilitate high-throughput data exchange between hardware nodes. The system utilizes dynamic kernel generation to compile optimized GPU kernels during execution, removing the need for separate installation compilation steps. It implements virtual-lane traffic isolation to prevent interference between different data streams and employs routing met
flutter-go is a cross-platform UI framework and kit designed for building mobile and desktop applications using a Go backend and a Flutter frontend. It provides a communication bridge that enables Go functions to be executed from Dart code via a C-ABI foreign function interface. The project includes a Flutter UI component library and a frontend design gallery. These resources provide pre-designed interface patterns, reusable widgets, and interaction demos to assist with rapid application prototyping and consistent interface design across different operating systems. The framework covers nati
DeepGEMM is a suite of specialized GPU kernels and a just-in-time compiler designed for low-precision matrix operations, Mixture-of-Experts models, and attention processing. It provides a library of high-performance matrix multiplication kernels using FP8 precision to increase compute throughput and reduce memory usage. The project features a JIT CUDA kernel compiler that generates and loads optimized compute kernels at runtime to eliminate the need for manual compilation during installation. It includes specialized implementations for grouped matrix multiplication that process multiple group