Warp is a Python framework that JIT-compiles Python functions into CUDA kernels for GPU-accelerated parallel computation, with built-in automatic differentiation and multi-framework array interoperability. At its core, it provides a GPU kernel compilation system that enables writing and executing custom GPU kernels directly from Python, while supporting automatic gradient computation through those kernels for integration with machine learning pipelines. The framework also includes tile-based cooperative computing, where thread blocks partition into tiles for shared-memory and tensor-core operations, and a block-sparse matrix engine for GPU-accelerated linear algebra.
What distinguishes Warp is its deep interoperability with major deep learning frameworks. It exchanges GPU arrays with PyTorch, JAX, Paddle, and NumPy via DLPack and CUDA array interfaces without copying data, and can wrap Warp kernels as callable primitives inside JAX JIT-compiled functions or PyTorch autograd graphs. The framework supports CUDA graph capture and replay for low-overhead repeated execution, multi-device orchestration across multiple GPUs, and spatial acceleration structures like BVH and hash grids for collision detection and ray casting. It also provides debugging and diagnostics tools including kernel code stepping, assertion checking, and integration with CUDA Compute Sanitizer.
The framework covers GPU-accelerated physics simulation, finite element method on GPU, geometry querying on meshes and bounding volume hierarchies, and hardware-accelerated texture sampling. It includes mathematical operations on scalars, vectors, matrices, quaternions, and spatial transforms, as well as random number generation and volume data handling. Warp can be installed via conda, built from source, or integrated as an Omniverse extension, and supports Docker container execution with the NVIDIA Container Toolkit.