# nvidia/cuda-python

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/nvidia-cuda-python).**

3,170 stars · 248 forks · Cython · other

## Links

- GitHub: https://github.com/NVIDIA/cuda-python
- Homepage: https://nvidia.github.io/cuda-python/
- awesome-repositories: https://awesome-repositories.com/repository/nvidia-cuda-python.md

## Description

cuda-python provides low-level Python bindings for the CUDA Driver and Runtime APIs. It wraps device memory control, the CUDA compiler tools, and execution graphs on NVIDIA GPUs, so parallel kernels can be compiled and launched directly from Python.
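
As an illustration of what "low-level" means here: the driver bindings return a tuple whose first element is a status code instead of raising exceptions. The sketch below shows one way to handle that convention. It assumes the `cuda.bindings.driver` module path used by recent releases, and `unwrap` and `list_device_names` are hypothetical helper names, not part of the library.

```python
def unwrap(result):
    """Split a (status, *values) tuple and raise if the status is non-zero.

    Assumption: status codes are IntEnum-like and 0 means success,
    as with CUresult.CUDA_SUCCESS.
    """
    status, *values = result
    if int(status) != 0:
        raise RuntimeError(f"CUDA call failed with status {status!r}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


def list_device_names():
    """Return GPU names, or an empty list if the bindings or driver are missing."""
    try:
        from cuda.bindings import driver  # assumed module path (recent releases)
    except ImportError:
        return []
    try:
        unwrap(driver.cuInit(0))
        count = unwrap(driver.cuDeviceGetCount())
        names = []
        for ordinal in range(count):
            device = unwrap(driver.cuDeviceGet(ordinal))
            raw = unwrap(driver.cuDeviceGetName(128, device))
            # The name buffer may be NUL-padded; keep only the text before it.
            names.append(raw.split(b"\x00", 1)[0].decode())
        return names
    except (RuntimeError, OSError):
        return []
```

Calling `list_device_names()` on a machine without a GPU or without the package installed should simply yield an empty list under these assumptions.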

The project supports writing SIMT kernels and running algorithms such as sorting and transforming directly on device memory. It can import pre-compiled bytecode as custom operators, and it interfaces with accelerated device libraries to reach low-level hardware functions without leaving Python.

The toolkit covers hardware management through host and runtime API integration, as well as device memory management for structured data layouts and temporary buffers. It also supports system hardware inspection, GPU execution debugging, and performance optimizations such as disk-backed kernel caching and cooperative reductions.
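
To make the idea of disk-backed kernel caching concrete, here is a minimal sketch of a content-addressed cache. It is not the project's implementation: `cache_key`, `load_or_compile`, and the `compile_fn` callable are hypothetical names, and a real `compile_fn` would wrap an actual compiler.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_key(source, options, arch):
    """Hash every input that affects the compiled output."""
    digest = hashlib.sha256()
    for part in (source, "\0".join(options), arch):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x1f")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


def load_or_compile(source, options, arch, compile_fn, cache_dir):
    """Return compiled bytes, reading them from cache_dir when present.

    compile_fn is a hypothetical callable (source, options, arch) -> bytes.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key(source, tuple(options), arch) + ".bin")
    if path.exists():
        return path.read_bytes()
    binary = compile_fn(source, tuple(options), arch)
    # Write to a temporary file and rename it, so readers never see a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(binary)
    os.replace(tmp_name, path)
    return binary
```

Keying on the source, the options, and the target architecture together means a change to any of them produces a fresh compile rather than a stale hit.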

## Tags

### Part of an Awesome List

- [Python GPU Development](https://awesome-repositories.com/f/awesome-lists/devtools/gpu-acceleration/python-gpu-development.md) — Enables developers to write and execute high-performance GPU-accelerated applications directly within Python.
- [GPU Kernel Performance Tuning](https://awesome-repositories.com/f/awesome-lists/devtools/performance-and-optimization/gpu-kernel-performance-tuning.md) — Optimizes GPU execution speed using techniques like cooperative reductions and bytecode caching.

### Operating Systems & Systems Programming

- [CUDA Driver Wrappers](https://awesome-repositories.com/f/operating-systems-systems-programming/cuda-driver-wrappers.md) — Provides a comprehensive programmatic wrapper for controlling device memory and orchestrating execution graphs on NVIDIA GPUs.
- [CUDA Driver API Integrations](https://awesome-repositories.com/f/operating-systems-systems-programming/cuda-driver-api-integrations.md) — Provides a programmatic interface to CUDA driver and compiler tools for coordinating hardware processing and memory. ([source](https://cdn.jsdelivr.net/gh/nvidia/cuda-python@main/README.md))
- [Driver Runtime Integrations](https://awesome-repositories.com/f/operating-systems-systems-programming/driver-runtime-integrations.md) — Provides direct access to CUDA driver and runtime functions for managing hardware resources.
- [Hardware Abstraction Layers](https://awesome-repositories.com/f/operating-systems-systems-programming/hardware-interfacing-drivers/hardware-abstraction-layers.md) — Provides an object-oriented hardware abstraction layer to manage GPU device properties and memory allocation.
- [Device-Local Memory Layouts](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/dynamic-memory-allocation/custom-memory-allocators/explicit-memory-allocators/device-local-memory-layouts.md) — Handles the allocation and control of hardware memory buffers and structured data types for efficient GPU processing. ([source](https://nvidia.github.io/cuda-python/cuda-core/latest))
- [Device Buffer Managers](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/dynamic-memory-allocation/gpu-memory-allocators/unified-memory-managers/device-buffer-managers.md) — Provides a tool for allocating and controlling hardware memory buffers using standard array interfaces.
- [GPU Memory Layout Definitions](https://awesome-repositories.com/f/operating-systems-systems-programming/gpu-memory-layout-definitions.md) — Creates structured data types that handle complex memory layouts for arrays processed on graphics hardware. ([source](https://nvidia.github.io/cccl/unstable/python/compute/index.html))
- [GPU Memory Pools](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/dynamic-memory-allocation/custom-memory-allocators/gpu-memory-pools.md) — Implements device memory pools to reuse buffers and reduce allocation overhead and fragmentation. ([source](https://nvidia.github.io/cccl/unstable/python/compute/index.html))
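
The pooling idea in the last entry can be sketched in plain Python. This is an illustration of buffer reuse only, not the library's allocator: `BufferPool` and its methods are hypothetical, and the `allocate` and `release` callables stand in for real device allocation calls.

```python
from collections import defaultdict


class BufferPool:
    """Reuse freed buffers of the same rounded size instead of reallocating."""

    def __init__(self, allocate, release, granularity=256):
        self._allocate = allocate        # hypothetical: size in bytes -> buffer
        self._release = release          # hypothetical: buffer -> None
        self._granularity = granularity  # round sizes up to limit fragmentation
        self._free = defaultdict(list)   # rounded size -> idle buffers

    def _round_up(self, nbytes):
        g = self._granularity
        return max(g, (nbytes + g - 1) // g * g)

    def acquire(self, nbytes):
        """Return (rounded_size, buffer), reusing an idle buffer if one fits."""
        size = self._round_up(nbytes)
        idle = self._free[size]
        buffer = idle.pop() if idle else self._allocate(size)
        return size, buffer

    def give_back(self, size, buffer):
        """Mark a buffer as idle so a later acquire can reuse it."""
        self._free[size].append(buffer)

    def trim(self):
        """Release every idle buffer back to the underlying allocator."""
        for buffers in self._free.values():
            while buffers:
                self._release(buffers.pop())
```

Rounding requests up to a fixed granularity is one simple design choice: it trades a little wasted space for a much higher chance that a freed buffer matches the next request.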

### Hardware & IoT

- [Hardware Driver API Mappings](https://awesome-repositories.com/f/hardware-iot/driver-to-device-mapping/hardware-driver-api-mappings.md) — Maps high-level Python function calls directly to the underlying CUDA driver and runtime APIs.

### Programming Languages & Runtimes

- [JIT Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers.md) — Transforms high-level functions into machine code at runtime for parallel execution on NVIDIA GPUs.
- [GPU Kernel Orchestration](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/gpu-kernel-orchestration.md) — Manages the compilation and launching of parallel kernels and execution graphs to optimize processing speed.
- [Dynamic CUDA Runtime Invocations](https://awesome-repositories.com/f/programming-languages-runtimes/dynamic-cuda-runtime-invocations.md) — Implements dynamic calls to CUDA runtime functions for device management and memory allocation. ([source](https://nvidia.github.io/cuda-python/cuda-bindings/latest))
- [GPU Kernel Programming](https://awesome-repositories.com/f/programming-languages-runtimes/gpu-kernel-programming.md) — Supports the development and compilation of SIMT kernels for high-performance workloads on NVIDIA hardware. ([source](https://nvidia.github.io/numba-cuda-mlir/))
- [Foreign Function Interfaces](https://awesome-repositories.com/f/programming-languages-runtimes/language-interoperability/foreign-function-interfaces.md) — Implements a foreign function interface to bind Python calls to low-level CUDA C headers and binary libraries.
- [Python Bindings](https://awesome-repositories.com/f/programming-languages-runtimes/language-interoperability/interoperability/python-bindings.md) — Offers a set of low-level Python bindings for the CUDA Driver and Runtime APIs.
- [GPU-Accelerated Reductions](https://awesome-repositories.com/f/programming-languages-runtimes/array-reductions/gpu-accelerated-reductions.md) — Uses custom operators to perform efficient data aggregation and reduction algorithms across GPU blocks and warps. ([source](https://nvidia.github.io/cccl/unstable/python/coop.html))
- [Pre-compiled Bytecode Integration](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/compiler-optimizations/just-in-time-compilation/pre-compiled-bytecode-integration.md) — Allows importing pre-compiled bytecode as custom operators to skip the standard JIT compilation process. ([source](https://nvidia.github.io/cccl/unstable/python/compute/index.html))
- [Kernel Artifact Caches](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/kernel-artifact-caches.md) — Caches compiled GPU kernel bytecode on the local filesystem to accelerate subsequent program startup.
- [Compiled Kernel Executions](https://awesome-repositories.com/f/programming-languages-runtimes/compiler-interpreter-internals/compiler-infrastructure/jit-kernel-compilers/runtime-gpu-kernel-compilation-libraries/compiled-kernel-executions.md) — Executes compiled GPU kernels on active streams and devices using custom launch configurations. ([source](https://nvidia.github.io/cuda-python/cuda-core/latest))
- [Custom Kernel Compilers](https://awesome-repositories.com/f/programming-languages-runtimes/custom-kernel-compilers.md) — Transforms custom Python functions into optimized machine code to increase execution speed on the GPU. ([source](https://nvidia.github.io/numba-cuda-mlir/))
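
Several entries above mention launching kernels with custom launch configurations. As a rough sketch of what that involves for a one-dimensional problem, the helper below computes grid and block dimensions. `launch_config_1d` is a hypothetical name, and the 1024-thread limit is an assumption that holds for current NVIDIA GPUs but should be checked against the device's reported properties.

```python
def launch_config_1d(n_elements, threads_per_block=256):
    """Return (grid_dim, block_dim) tuples that cover n_elements threads.

    Assumption: at most 1024 threads per block, the limit on current GPUs.
    """
    if n_elements < 0:
        raise ValueError("n_elements must be non-negative")
    if not 1 <= threads_per_block <= 1024:
        raise ValueError("threads_per_block must be between 1 and 1024")
    # Ceiling division; launch at least one block even for empty input.
    blocks = max(1, -(-n_elements // threads_per_block))
    return (blocks, 1, 1), (threads_per_block, 1, 1)


# The grid usually overshoots n_elements, so the kernel needs a bounds guard.
KERNEL_SOURCE = r"""
extern "C" __global__ void scale(float *data, float factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}
"""
```

For 1000 elements and 256 threads per block this yields 4 blocks, so 24 threads fall outside the data and rely on the `i < n` check in the kernel.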

### DevOps & Infrastructure

- [GPU Acceleration Libraries](https://awesome-repositories.com/f/devops-infrastructure/gpu-acceleration-libraries.md) — Interfaces with accelerated device libraries to access low-level hardware functions from Python. ([source](https://nvidia.github.io/numba-cuda-mlir/))

### Networking & Communication

- [GPU Buffer Sharing Allocations](https://awesome-repositories.com/f/networking-communication/communication-protocols-architectures/inter-process-communication/shared-memory-ipc/gpu-buffer-sharing-allocations.md) — Provides mechanisms to share GPU buffers across processes and external libraries using a standardized memory interface.

### Scientific & Mathematical Computing

- [Device-Side Algorithm Execution](https://awesome-repositories.com/f/scientific-mathematical-computing/device-side-algorithm-execution.md) — Executes fast mathematical operations such as sorting and transforming directly on device memory. ([source](https://nvidia.github.io/cccl/unstable/python/compute/index.html))

### Software Engineering & Architecture

- [Execution Graphs](https://awesome-repositories.com/f/software-engineering-architecture/execution-graphs.md) — Orchestrates sequences of GPU operations as execution graphs to minimize CPU-to-GPU launch overhead.
- [Graph Execution Compilers](https://awesome-repositories.com/f/software-engineering-architecture/execution-graphs/graph-execution-compilers.md) — Compiles and executes captured operation graphs into executable objects for optimized GPU processing. ([source](https://nvidia.github.io/cuda-python/cuda-core/latest))