AITemplate is an ahead-of-time deep learning compiler that translates PyTorch neural networks into standalone C++ source code. It functions as a PyTorch-to-C++ compiler and a GPU kernel fusion engine, producing self-contained executable binaries that run inference without a Python interpreter or deep learning framework runtime. The project generates optimized CUDA and HIP C++ code specifically for NVIDIA TensorCores and AMD MatrixCores. It focuses on maximizing throughput for half-precision floating-point operations through a system that combines multiple neural network operators in
ZLUDA is a middleware and translation engine designed to run unmodified proprietary compute binaries on non-native graphics hardware. It functions as a compatibility layer that bridges vendor-specific compute interfaces with open standards, so that software originally restricted to a single hardware ecosystem can run on alternative graphics processing units. The project achieves this through a combination of dynamic library interception and runtime instruction translation. By replacing standard system libraries and mapping proprietary compute calls to open standards, t
Taskflow is a C++ task-parallel framework for building high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators. The framework distinguishes itself through its ability to orchestrate heterogeneous systems, integrating hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow