30 open-source projects similar to nvidia/thrust, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Thrust alternative.
Thrust is a C++ template library for heterogeneous computing that provides a collection of high-level templates for executing data-parallel operations. It functions as a parallel algorithms library designed to work across different hardware backends, including multicore CPUs and NVIDIA GPUs. The framework uses a header-only implementation and a generic-programming policy interface to abstract the differences between CPU and GPU memory and execution models. It employs an iterator-based data abstraction to provide a uniform interface for accessing elements across host RAM an
ArrayFire is a hardware-agnostic compute framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and parallel signal processing toolkit that abstracts hardware backends, allowing the same codebase to execute across various GPU architectures and CPUs. The project distinguishes itself through a JIT engine that uses expression compilation to fuse operations and minimize memory overhead. It employs a deferred execution graph to optimize computation chains and provides interoperability primitives to share data and e
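The deferred-execution and fusion idea described for ArrayFire can be illustrated with a minimal sketch. This is not ArrayFire's API: the `Lazy` class and its methods are hypothetical names, and plain Python closures stand in for what a real JIT would compile into a single kernel. The point is only that `a * b + a` builds an expression first and is evaluated in one pass, without materializing the intermediate `a * b`.

```python
class Lazy:
    """Hypothetical lazy array: stores how to compute element i, not the data."""

    def __init__(self, element_fn, length):
        self.element_fn = element_fn
        self.length = length

    @staticmethod
    def from_list(data):
        # Leaf node: reading element i just indexes the source list.
        return Lazy(lambda i: data[i], len(data))

    def __add__(self, other):
        # No work happens here; the addition is only recorded.
        return Lazy(lambda i: self.element_fn(i) + other.element_fn(i), self.length)

    def __mul__(self, other):
        return Lazy(lambda i: self.element_fn(i) * other.element_fn(i), self.length)

    def eval(self):
        # Single fused pass: each output element is computed end to end,
        # so no intermediate array is ever allocated.
        return [self.element_fn(i) for i in range(self.length)]


a = Lazy.from_list([1, 2, 3])
b = Lazy.from_list([4, 5, 6])
expr = a * b + a          # builds the deferred expression only
result = expr.eval()      # evaluation happens here
```

A real engine would additionally generate and cache device code for the fused expression; the sketch only shows the deferral and the single evaluation pass.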
Boost is a collection of portable, high-performance source libraries that extend the C++ standard library. It provides a wide range of reusable components, data structures, and algorithms designed to add capabilities to the base language across different platforms. The project is distinguished by its extensive focus on compile-time template metaprogramming and generic programming. It implements advanced architectural patterns such as policy-based design, concept-based type validation, and the use of SFINAE for conditional template resolution to minimize runtime overhead. The library covers a
Taskflow is a C++ task-parallel framework designed to build high-performance parallel workflows and complex dependency graphs. It provides a programming model that organizes computational work into directed acyclic graphs, enabling developers to manage concurrency, resource scheduling, and task dependencies across multi-core CPUs and GPU accelerators. The framework distinguishes itself through its ability to orchestrate heterogeneous systems, allowing for the integration of hardware-accelerated kernels and memory operations into unified execution pipelines. It supports dynamic runtime subflow
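The task-graph model described for Taskflow can be sketched in a few lines. This is an illustration in Python using only the standard library (`graphlib` and `concurrent.futures`), not Taskflow's C++ API; the function name `run_task_graph` and its parameters are assumptions made for the example. Tasks whose predecessors have finished are submitted to a thread pool, so independent branches of the graph may run concurrently.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_task_graph(tasks, deps, max_workers=4):
    """Run callables in dependency order.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of names that must finish first
    Returns the names in the order the tasks completed.
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    completed = []
    pending = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose predecessors are all done.
            for name in sorter.get_ready():
                pending[pool.submit(tasks[name])] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the task
                name = pending.pop(future)
                completed.append(name)
                sorter.done(name)  # may unlock successor tasks
    return completed


# Diamond-shaped graph: A runs first, B and C may overlap, D runs last.
log = []
example_tasks = {name: (lambda n=name: log.append(n)) for name in "ABCD"}
example_deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
order = run_task_graph(example_tasks, example_deps)
```

The completion order of `B` and `C` is not fixed, which is why the sketch returns the observed order rather than promising one.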
oneTBB is a C++ parallelism library and framework designed to add multi-core parallelism to applications. It provides a task-based parallelism model that maps logical computational tasks to available hardware cores, eliminating the need for manual thread management. The library uses generic templates to scale data-parallel operations across processors for portable performance. The project covers shared memory parallelism, multi-core task
RIOT is a real-time operating system designed for resource-constrained microcontrollers. It provides a kernel for managing hardware peripherals, memory, and multitasking on embedded devices, featuring a microcontroller hardware abstraction layer to unify hardware access across different chipsets. The system employs a preemptive tickless task scheduler with priority-based execution to maximize energy efficiency in battery-powered hardware. It also includes an embedded security framework consisting of cryptographic APIs and secure transport protocols to facilitate authenticated over-the-air fir
Deep Java Library is a Java deep learning framework and JVM model inference engine. It provides a high-level API for building and deploying deep learning models within the Java ecosystem, acting as a cross-platform runtime for executing models across CPUs, GPUs, and mobile devices. The library is engine-agnostic, allowing users to switch between different deep learning engines such as PyTorch, TensorFlow, and MXNet while maintaining a single unified API. This enables the deployment of the same model across different backends without changing the application code. The framework supports the f
This project is a technical curriculum and set of educational resources focused on parallel programming, high-performance computing, and systems programming. It provides a structured course covering the implementation of parallel algorithms and multithreading techniques for processing large datasets. The project includes a systems programming guide for modern language features, a framework for lock-free concurrency patterns, and a manual for optimizing CPU and GPU performance through assembly analysis and cache management. The material covers hardware performance tuning, the implementation o
oneAPI Threading Building Blocks (oneTBB)
Cpp-taskflow is a C++ task-parallelism framework and task graph scheduler designed to manage and execute complex dependency graphs of parallel tasks across CPU and GPU hardware. It provides a parallel algorithm library for high-performance implementations of reductions, sorts, pipelines, and iterations. The framework distinguishes itself through its ability to offload heavy computational workloads from a task graph to graphics processors for acceleration. It also includes a task profiling tool and a performance analysis interface for visualizing task execution flow and dependency structures t
This project is a C++ Standard Library implementation that provides the foundational classes and functions required by the ISO C++ standard. It serves as a template-based generic programming library, providing the Standard Template Library's set of containers, algorithms, and iterators for data manipulation. The library is a core component of the MSVC toolchain, designed specifically for integration with the Microsoft Visual C++ compiler and build tools. The implementation covers memory management through optimized allocators and buffer strategies, as well as tools for performance benchmarki
This project is a comprehensive collection of reference materials, including a language cheatsheet, a standard library reference, and a concurrency reference. It serves as a guide to modern C++ development, focusing on language syntax, standard library utilities, and template metaprogramming patterns. The repository provides specific guidance on template metaprogramming through a dedicated guide covering compile-time evaluation, type deduction, and variadic template execution. The materials cover a broad range of capabilities, including asynchronous programming, memory management, and system
cuml is a GPU-accelerated machine learning library and framework that uses CUDA to accelerate tabular data preprocessing and model execution. It provides a suite of tools for training and deploying classification, regression, and clustering models on NVIDIA GPUs and GPU clusters. The library is designed for scalability, offering a distributed GPU machine learning environment that can spread computation and data across multiple hardware accelerators and nodes to handle datasets exceeding single-device memory. It mirrors standard estimator interfaces to allow the replacement of CPU-based models
Nim is a statically typed, compiled systems programming language designed for high performance and cross-platform development. It translates high-level source code into C, C++, or JavaScript, allowing developers to produce efficient native binaries or web-compatible scripts from a single codebase. The language emphasizes a clean, indentation-based syntax that simplifies code hierarchy while maintaining the power of a full-featured systems language. What distinguishes Nim is its robust metaprogramming framework, which allows developers to inspect, modify, and generate code structures during th
HIP is a C++ GPU kernel language and cross-platform runtime designed for writing portable high-performance compute applications. It provides a programming interface that allows a single source codebase to execute on both AMD and NVIDIA GPU architectures. The project functions as a compatibility layer that enables the conversion and migration of existing CUDA source code to run on AMD hardware. This is achieved through a syntax mapping that mirrors CUDA and a source-to-source translation process during compilation. The toolkit covers the broader surface of cross-platform GPGPU development, in
CppGuide is a curated collection of educational resources and practical guides focused on C++ server development, Linux kernel internals, concurrent programming, network protocols, and security exploitation. It provides structured learning paths for backend developers, covering everything from interview preparation to building high-performance network servers and understanding operating system fundamentals. The guide distinguishes itself by offering in-depth, hands-on tutorials that walk through real-world implementations, including building a Redis-like server from scratch, designing custom
c3c is the compiler for the C3 programming language, transforming source code into executable binaries, static libraries, or dynamic libraries using an LLVM backend. It implements a system based on result-based error handling, scoped memory pooling, and a semantic macro system. The compiler provides first-class support for hardware-backed SIMD vectors that map directly to processor instructions and enables runtime polymorphism through interface-based dynamic dispatch. The project covers a broad set of low-level capabilities, including manual and pooled memory management, inline assembly inte
Carp is a statically typed Lisp compiler that compiles Lisp-like syntax directly to C source code, enabling seamless integration with existing C libraries and low-level system programming. It manages memory deterministically at compile time using ownership tracking and linear types, eliminating garbage collection pauses and runtime overhead while ensuring type safety through an inferred static type system. The language distinguishes itself through compile-time macro expansion and metaprogramming capabilities, allowing code generation and transformation before final binary output. It enforces
Chainer is an open-source deep learning framework built around define-by-run automatic differentiation, where computation graphs are constructed dynamically during forward execution. This imperative approach allows networks to be built using standard Python control flow, with gradients computed automatically through reverse-mode differentiation on the dynamically recorded graph. The framework supports GPU acceleration through a NumPy-compatible array backend with CUDA and cuDNN support, and provides a pluggable device abstraction that lets users switch between CPU and GPU computation without c
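Define-by-run differentiation, as described for Chainer, can be shown with a small scalar sketch. This is not Chainer's API: the `Var` class and the `power_by_loop` helper are hypothetical names. The graph is recorded as a side effect of running ordinary Python code (here a `for` loop), and `backward` then walks the recorded graph in reverse to accumulate gradients.

```python
class Var:
    """Hypothetical scalar variable that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_var, d(self)/d(parent)) captured at forward time.
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order nodes so that every node comes after all of its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Reverse sweep: a node's gradient is final before it is propagated.
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad


def power_by_loop(x, extra_factors):
    # Plain Python control flow decides the shape of the graph at run time.
    out = x
    for _ in range(extra_factors):
        out = out * x
    return out


x = Var(3.0)
y = power_by_loop(x, 2)   # computes x * x * x
y.backward()              # fills in x.grad with dy/dx
```

Because the graph is rebuilt on every forward run, changing `extra_factors` changes the recorded graph with no separate graph-definition step.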
Crossbeam is a concurrency toolkit for Rust providing low-level primitives for writing multi-threaded programs. It focuses on lock-free data structures and memory management primitives designed for shared-memory concurrent environments. The project includes work-stealing double-ended queues, which schedulers can use to distribute tasks, balance workloads across multiple processor cores, and prevent bottlenecks. The toolkit covers broader capabilities for parallel algorithm development, multi-threaded task scheduling, and general co
CuPy is a CUDA array computing library that implements a NumPy-compatible interface for executing array operations and numerical computing on NVIDIA GPUs. It serves as a GPU-accelerated numerical library and a CUDA-based SciPy implementation, offloading heavy calculations to graphics hardware to increase processing speed for scientific and engineering workloads. The library enables multi-framework tensor exchange, allowing data buffers to be shared between different deep learning frameworks using standardized memory layouts to avoid memory copies. It also supports custom GPU kernel integratio
EASTL is a C++ Standard Template Library implementation consisting of containers, iterators, and algorithms. It provides cross-platform data structures and a template-based algorithm library designed for use in resource-constrained game engine environments. The library focuses on game engine memory management, providing specialized utilities that ensure predictable memory allocation and high-performance access for real-time applications. These containers maintain consistent behavior across different operating systems and hardware platforms. The project covers high-performance C++ development
Flash Linear Attention is a training framework and inference engine for sequence models that use linear attention and state space mechanisms, designed to process long contexts with reduced memory and compute overhead. It provides hardware-optimized token mixing layers and fused CUDA kernels that minimize memory bandwidth and launch overhead across different GPU architectures, and includes a causal inference engine that generates text token-by-token using cached hidden states for efficient autoregressive decoding. The project supports building hybrid sequence models that interleave standard at
Gorgonia is a Go library that provides an automatic differentiation engine and a computation graph framework for building and training neural networks. It functions as a CUDA-accelerated tensor library and a SIMD-optimized math library, enabling machine learning workflows entirely within the Go ecosystem. The library distinguishes itself through a dual-backend architecture that dispatches neural network operations to either a GPU or CPU depending on CUDA availability at runtime. It constructs differentiable directed acyclic graphs of tensor operations, supports reverse-mode automatic gradient
cppfront is a C++ language extension frontend and source-to-source translator. It functions as a syntax transformer that converts experimental language extensions into standard compliant C++ code, allowing for the prototyping of new language features within existing build systems. The project provides a translation layer that adds support for pattern matching, contracts, and string interpolation. It includes a metaprogramming tool for compile-time reflection and automated code generation using specialized metafunctions. The system automates several development tasks, including the resolution
Janet is a Lisp-based dynamic programming language featuring a register-based bytecode virtual machine and an embeddable scripting engine. It functions as a fiber-based concurrency runtime and includes a parsing engine based on Parsing Expression Grammars. The project is distinguished by its ability to be integrated into C or C++ applications via a minimal header interface. It utilizes a Lisp-style macro system for compile-time code transformation and employs prototype-based table inheritance for object-oriented behavior. The runtime covers a broad set of capabilities, including asynchronous
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
Cutlass is a collection of C++ templates and Python interfaces for implementing high-performance linear algebra operations on NVIDIA GPUs. It provides a kernel composition framework for designing custom GPU kernels and a mixed-precision tensor library capable of executing operations across diverse data formats, ranging from 64-bit floating point to 4-bit integers. The project features a toolkit for operator fusion that integrates activation functions and bias calculations directly into matrix multiplication kernels to reduce memory passes. It also includes a Python-based domain-specific langu
The Book of Shaders is an interactive educational guide and curriculum for learning GLSL fragment shader programming to create procedural graphics and visual effects. It provides a structured learning path and a categorized reference guide for data types, built-in functions, and mathematical operations used in shader development. The project features a web-based shader sandbox and interactive editor that allows for real-time iteration and visualization of GLSL code. Users can experiment with procedural art and share their results via unique URLs. The curriculum covers a wide range of graphic