30 open-source projects similar to packtpublishing/learn-cuda-programming, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Learn CUDA Programming alternative.
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
GPU-Puzzles is an interactive learning environment and tutorial designed for mastering CUDA GPU kernel development. It serves as an educational tool and lab where users solve coding puzzles to understand how to map high-level logic to low-level GPU hardware instructions. The platform focuses on teaching parallel computing concepts and GPU architecture. Users practice developing parallel algorithms and managing GPU memory through a series of hands-on challenges. The environment utilizes a bridge between Python and CUDA to execute kernels and provide real-time feedback by validating outputs ag
This project provides Rust bindings for the TensorFlow C API, serving as a tensor computation interface and machine learning library. It enables the construction and execution of machine learning models and neural networks by bridging a systems language to high-performance backends. The framework supports GPU-accelerated computing to increase the speed of model training and inference by offloading mathematical operations to graphics processing units. It offers both graph-based computation for defining static network architectures and an eager execution mode for immediate operation calls durin
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
xtensor is a C++ multidimensional array library for numerical computing that provides N-dimensional containers with an interface mirroring the NumPy API. It utilizes a lazy evaluation expression engine to defer numerical computations until assignment, which minimizes memory allocations and intermediate copies. The library features a foreign memory array adaptor that allows it to wrap external buffers, such as NumPy arrays, to perform numerical operations in-place without duplicating data. It further optimizes performance through lazy broadcasting and a system that manages the lifetime of temp
Triton is a parallel computing framework and high-level programming language designed for writing custom compute kernels. It functions as a deep learning compiler, translating complex mathematical operations into high-throughput instructions that maximize hardware utilization and memory efficiency on graphics processing units. The framework distinguishes itself through a hardware-agnostic compute abstraction that allows developers to define kernels without manual low-level tuning. It employs just-in-time compilation to generate optimized binary instructions at runtime, utilizing static data f
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
This project is a high-performance numerical computing library designed for large-scale scientific and machine learning workloads. It functions as an automatic differentiation framework and a just-in-time compilation engine, transforming high-level Python code into optimized machine instructions. By enforcing pure functional programming patterns and immutable array semantics, the library ensures that mathematical functions remain compatible with automated graph transformations and symbolic differentiation. The platform distinguishes itself through its distributed array computing capabilities,
Codon is an LLVM-based Python compiler and statically typed implementation that translates source code into optimized machine instructions. It functions as a high-performance numerical backend and a GPU computing framework designed to remove runtime overhead. The project implements a compiled alternative to NumPy, translating array logic directly into machine code. It differentiates itself by generating specialized hardware kernels for graphics processors and utilizing static type inference to enable aggressive machine-code optimization. The system provides capabilities for parallel workload
Cpp-taskflow is a C++ task-parallelism framework and task graph scheduler designed to manage and execute complex dependency graphs of parallel tasks across CPU and GPU hardware. It provides a parallel algorithm library for high-performance implementations of reductions, sorts, pipelines, and iterations. The framework distinguishes itself through its ability to offload heavy computational workloads from a task graph to graphics processors for acceleration. It also includes a task profiling tool and a performance analysis interface for visualizing task execution flow and dependency structures t
HowToBeAProgrammer is a comprehensive software engineering career guide and professional development framework. It serves as a curated-knowledge repository and handbook designed to help programmers acquire technical habits and social competencies necessary for professional advancement. The project distinguishes itself by integrating technical craftsmanship with a detailed manual for technical leadership and organizational navigation. It provides specific strategies for career progression, such as compensation negotiation, promotion readiness, and the management of professional boundaries to p
TileLang is a Python-embedded domain-specific language compiler that JIT-compiles and autotunes GPU kernels. It uses a tile-based DSL, automatic software pipelining, and parallel autotuning to generate optimized GPU kernels at runtime. It supports tensor core operations with Pythonic syntax, automatic memory management, and thread mapping. The compiler searches over tile sizes, thread counts, and scheduling policies, compiling and benchmarking candidates in parallel to find the fastest kernel. It also caches compiled binaries and tuning results to disk for reuse across sessions. TileLang inc
ISPC is a vectorizing compiler and SIMD parallel programming language that implements a single program multiple data model. It serves as a toolchain for translating C-based code with parallel extensions into optimized machine code for various CPU and GPU architectures using an LLVM backend. The compiler is designed for cross-platform SIMD toolchain support, generating specialized instruction sets for x86 SSE/AVX, ARM NEON, and Intel GPU from a single source. It features a runtime dispatch mechanism that selects the most efficient hardware-specific implementation for the current system during
HVM2 is a high-performance execution environment for pure functional programs, implemented as a systems-level runtime in Rust. It functions as a massively parallel functional runtime that uses interaction combinators to achieve automatic parallelism across multi-core CPUs and GPUs. The project distinguishes itself by using a graph-rewriting computational model to execute programs via local reduction rules, which eliminates the need for manual locks or atomic operations. It employs beta-optimal reduction and lazy evaluation to optimize higher-order functions and eliminate redundant computation
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
ArrayFire is a hardware-agnostic compute framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and parallel signal processing toolkit that abstracts hardware backends, allowing the same codebase to execute across various GPU architectures and CPUs. The project distinguishes itself through a JIT engine that uses expression compilation to fuse operations and minimize memory overhead. It employs a deferred execution graph to optimize computation chains and provides interoperability primitives to share data and e
Surge is a Swift library for high-performance numerical analysis, linear algebra, digital signal processing, and accelerated image manipulation. It utilizes the Accelerate framework to provide hardware-accelerated tools for matrix mathematics and signal processing. The library provides specialized capabilities for digital signal processing, including convolution, signal similarity analysis through cross-correlation, and domain transformations using fast Fourier transforms. It also includes a suite of tools for the rapid transformation and analysis of pixel buffers and image data. Beyond sign
Boost is a collection of portable, high-performance source libraries that extend the C++ standard library. It provides a wide range of reusable components, data structures, and algorithms designed to add capabilities to the base language across different platforms. The project is distinguished by its extensive focus on compile-time template metaprogramming and generic programming. It implements advanced architectural patterns such as policy-based design, concept-based type validation, and the use of SFINAE for conditional template resolution to minimize runtime overhead. The library covers a
mctx is a framework for executing high-performance tree search and state simulations to generate policy targets for neural networks. It functions as a compiled search engine and neural dynamics simulator that predicts state transitions and rewards using learned representations. The project implements a vectorised tree search capable of running parallel search operations across input batches. It utilizes a policy target generator to convert search results into action weights used for training and refining neural network policies. The system covers reinforcement learning workflows by integrati
Napajs is an embeddable JavaScript engine and multi-threaded runtime designed to be integrated directly into other software applications as a component. It serves as a parallel computation framework that allows JavaScript code to execute across multiple threads, bypassing the standard single-threaded event loop limitation to handle CPU-intensive tasks. The runtime is distinguished by its ability to load and execute modules from the NPM ecosystem and its pluggable execution environment. This architecture allows for custom implementations of memory allocation, system logging, and performance me
Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators. The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow
Numba is a just-in-time compiler that translates high-level Python functions into optimized machine code at runtime. By leveraging the LLVM compiler infrastructure, it provides a framework for accelerating numerical data processing and mathematical computations, enabling performance levels comparable to statically compiled languages. The project distinguishes itself through its ability to perform type-inference-based specialization, which generates machine instructions tailored to the specific data types used during execution. It employs a lazy compilation pipeline that defers translation unt
This project serves as an educational resource for learning and implementing low-level assembly language optimizations. It provides a structured guide for developers to master hardware-specific instructions and manual performance tuning, focusing on the translation of high-level code into efficient machine-level operations for resource-constrained environments. The materials emphasize techniques for maximizing computational throughput in multimedia processing. By covering instruction-level parallelism, register management, and data parallelism, the project enables the development of software
This library is a JavaScript framework for general-purpose computing on graphics processing units. It enables the execution of parallel mathematical operations directly within the browser by offloading data-heavy calculations to graphics hardware. The project functions as a web-based math accelerator that converts standard JavaScript functions into shader code for execution on the graphics processor. It provides a unified interface that detects available graphics APIs and manages data transfer between system and graphics memory. To ensure compatibility across diverse environments, the library
Bend is a high-level parallel programming language and compiler designed to execute code across multi-core CPUs and GPUs automatically. By translating functional source code into a graph-based intermediate representation, it enables massive parallel execution without requiring manual management of threads, locks, or atomic operations. The runtime operates as an interaction net engine, where computations are represented as networks of nodes that reduce through local rewriting rules. This model utilizes a work-stealing scheduler to distribute tasks across thousands of hardware threads, ensuring
Mistral Inference is a library for running Mistral large language models on a GPU, generating text from prompts with token streaming. It loads pretrained model weights from local disk or a remote registry into GPU memory, then produces output tokens one by one for real-time display in interactive applications. The library supports multimodal prompts that accept image URLs alongside text, enabling visual description and reasoning. It includes content safety guardrails that scan generated text against predefined policies to block or flag policy violations. For structured interactions, it provid
This project is a Ruby performance optimization guide and refactoring resource. It provides a collection of benchmarked coding patterns and idiomatic comparisons designed to increase execution speed and reduce memory allocations in Ruby applications. The resource focuses on mapping common language constructs to their most computationally efficient equivalents. It uses comparative timing analysis and allocation-count profiling to identify high-performance idioms that replace object-heavy expressions. The project covers application runtime tuning and memory management by identifying patterns t
Gorgonia is a Go library that provides an automatic differentiation engine and a computation graph framework for building and training neural networks. It functions as a CUDA-accelerated tensor library and a SIMD-optimized math library, enabling machine learning workflows entirely within the Go ecosystem. The library distinguishes itself through a dual-backend architecture that dispatches neural network operations to either a GPU or CPU depending on CUDA availability at runtime. It constructs differentiable directed acyclic graphs of tensor operations, supports reverse-mode automatic gradient
Chainer is an open-source deep learning framework built around define-by-run automatic differentiation, where computation graphs are constructed dynamically during forward execution. This imperative approach allows networks to be built using standard Python control flow, with gradients computed automatically through reverse-mode differentiation on the dynamically recorded graph. The framework supports GPU acceleration through a NumPy-compatible array backend with CUDA and cuDNN support, and provides a pluggable device abstraction that lets users switch between CPU and GPU computation without c