18 repositories
Compiles captured operation graphs into executable objects for optimized GPU execution.
Distinct from general Execution Graphs: focuses on the compilation and instantiation of static graphs for hardware acceleration.
Explore 18 awesome GitHub repositories matching software engineering & architecture · Graph Execution Compilers.
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Translates relational queries into optimized physical execution graphs of streaming or batch operators.
Graal is a compiler and runtime architecture designed for high-performance execution and polyglot interoperability. It utilizes a graph-based representation of program logic to perform global optimizations and JIT compilation. The project features a meta-circular interpretation framework and a specialized partial evaluation mechanism, which allow for the creation of new programming languages and the automatic optimization of their semantics into machine code. It enables multiple diverse programming languages to share memory and communicate through a standardized cross-language protocol within
Implements a graph-based representation of program logic to perform global optimizations before emitting final machine code.
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Transforms compute graphs through operator fusion and layout conversion to maximize hardware utilization.
ai-edu is a comprehensive AI education curriculum and machine learning courseware collection. It provides theoretical tutorials, deep learning lab exercises, and project blueprints designed to teach artificial intelligence fundamentals through a combination of study and practical implementation. The project focuses on a learning-by-doing approach, guiding users from Python programming and neural network basics to advanced topics. It includes specialized instructional content on distributed AI training, MLOps educational guides for model quantization and pruning, and detailed frameworks for im
Provides instructional content on compiling neural network graphs to optimize GPU execution and reduce latency.
Theano is a Python mathematical expression compiler and symbolic math library used as a deep learning backend. It functions as a tensor computation framework that translates mathematical formulas into optimized C or CUDA code for high-performance computing. The system manages the definition and evaluation of complex math formulas using multi-dimensional arrays. It employs a symbolic expression graph and a lazy evaluation engine to optimize mathematical expressions before they are compiled into executable code. The framework provides automatic differentiation for calculating gradients of mat
Uses internal graph-based models of program logic to perform algebraic rewrites and global optimization.
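The idea of rewriting a symbolic expression graph before compiling it can be illustrated with a small sketch. This is not Theano's API: the `Var`, `Const`, and `Op` node types and the `simplify` and `compile_expr` helpers are hypothetical names chosen for illustration, and the "compiled" form here is a Python closure rather than C or CUDA code.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical node types for a tiny symbolic expression graph.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Op:
    kind: str          # "add" or "mul"
    left: "Expr"
    right: "Expr"

Expr = Union[Var, Const, Op]

def simplify(e):
    """Apply algebraic rewrites bottom-up: x*1 -> x, x*0 -> 0, x+0 -> x."""
    if not isinstance(e, Op):
        return e
    l, r = simplify(e.left), simplify(e.right)
    # Constant folding when both operands are known.
    if isinstance(l, Const) and isinstance(r, Const):
        return Const(l.value + r.value if e.kind == "add" else l.value * r.value)
    if e.kind == "add":
        if l == Const(0.0):
            return r
        if r == Const(0.0):
            return l
    if e.kind == "mul":
        if Const(0.0) in (l, r):
            return Const(0.0)
        if l == Const(1.0):
            return r
        if r == Const(1.0):
            return l
    return Op(e.kind, l, r)

def compile_expr(e):
    """Turn a graph into a callable; nothing is evaluated until it is called."""
    if isinstance(e, Var):
        return lambda env: env[e.name]
    if isinstance(e, Const):
        return lambda env: e.value
    lf, rf = compile_expr(e.left), compile_expr(e.right)
    if e.kind == "add":
        return lambda env: lf(env) + rf(env)
    return lambda env: lf(env) * rf(env)

# x*1 + 0*y collapses to x before any code is generated.
x = Var("x")
graph = Op("add", Op("mul", x, Const(1.0)), Op("mul", Const(0.0), Var("y")))
optimized = simplify(graph)
f = compile_expr(optimized)
```

After the rewrite, the compiled function only needs a value for `x`, since the `y` branch was eliminated from the graph.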
OneFlow is a deep learning framework and distributed execution engine designed for building, training, and deploying neural network architectures. It functions as a scalable neural network library that allows for the development of deep learning models and their execution across distributed hardware. The project includes a machine learning graph compiler used to optimize neural network execution graphs. This allows for the acceleration of model performance and the reduction of latency during both training and inference. The framework covers broad capability areas including large-scale model
Implements a graph compiler that optimizes neural network execution graphs for improved performance.
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
Compiles and executes captured operation graphs as executable objects to optimize GPU task scheduling.
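The capture, instantiate, and launch workflow that this category describes can be sketched in plain Python as an analogy. Nothing below calls CUDA: `CapturedGraph`, `ExecutableGraph`, and their methods are hypothetical names, and the only real library used is the standard `graphlib` module. The point is that dependency ordering is resolved once at instantiation, so each later launch simply replays a fixed schedule.

```python
from graphlib import TopologicalSorter

class CapturedGraph:
    """Records operations and their dependencies without running them."""
    def __init__(self):
        self.nodes = {}  # name -> (callable, tuple of dependency names)

    def add(self, name, fn, deps=()):
        self.nodes[name] = (fn, tuple(deps))

    def instantiate(self):
        # Resolve the execution order once, up front.
        sorter = TopologicalSorter({n: d for n, (_, d) in self.nodes.items()})
        # Graph inputs appear in the order too; keep only recorded operations.
        order = [n for n in sorter.static_order() if n in self.nodes]
        return ExecutableGraph([(n, *self.nodes[n]) for n in order])

class ExecutableGraph:
    """A fixed schedule that can be replayed with different inputs."""
    def __init__(self, steps):
        self._steps = steps

    def launch(self, inputs):
        values = dict(inputs)
        for name, fn, deps in self._steps:
            values[name] = fn(*(values[d] for d in deps))
        return values

# Capture three dependent operations, then replay the schedule twice.
g = CapturedGraph()
g.add("out", lambda a, b: a + b, deps=["scaled", "shifted"])
g.add("scaled", lambda x: x * 2, deps=["x"])
g.add("shifted", lambda s: s + 1, deps=["scaled"])

exe = g.instantiate()
first = exe.launch({"x": 3})["out"]
second = exe.launch({"x": 10})["out"]
```

Operations can be recorded in any order; the instantiated object runs them in dependency order on every launch.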
This project is an educational blog and learning resource dedicated to the Rust programming language. It provides a collection of curated guides, technical articles, and structured learning paths designed to teach language fundamentals, concurrency, and systems programming. The repository distinguishes itself by offering practical implementation tutorials for complex systems. This includes detailed guides on compiler development—specifically translating source code into targets such as ARM64, x86_64, LLVM IR, and WebAssembly—as well as networking examples for building multithreaded chat serve
Demonstrates translating source code into LLVM Intermediate Representation (IR) using SSA form.
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Compiles computation graphs using specialized backends to reduce latency and increase throughput.
Swift for TensorFlow is a custom toolchain that extends the Swift language with first-class automatic differentiation and differentiable types, enabling gradient-based computation directly within the compiler. It integrates the Swift compiler with TensorFlow runtime and XLA backends, allowing tensor operations to be compiled and executed on accelerated hardware for high-performance machine learning. The project distinguishes itself through compiler-integrated automatic differentiation that computes gradients of user-defined functions and types during compilation, eliminating the need
Converts Swift functions into static computational graphs at compile time for optimized execution.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Compiles runtime-conditional pipeline branches into optimized execution plans that adapt to data characteristics.
go-ast-book is a collection of educational and technical resources focused on abstract syntax tree (AST) analysis, compiler development, and static code verification. It provides guides and manuals for parsing, traversing, and analyzing Go source code to extract semantic meaning. The project serves as a reference for building compiler frontends, covering the translation of high-level code into intermediate representations and static single assignment (SSA) form. It also provides instructions for applying these techniques to language tooling development and static code analysis. The resources cover a wide range of static analysis capabilities, including lexical tokenization, structural parsing of expressions and declarations, and position tracking for source files. They also detail semantic analysis processes such as identifier resolution, type checking, and control-flow analysis for concurrency and deferred execution.
Translates abstract syntax trees into standardized intermediate representations to enable the generation of executable programs.
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Compiles captured graphs into executable objects for optimized GPU execution.
This project is a comprehensive educational resource and curriculum focused on the design and implementation of the full machine learning software and hardware stack. It serves as a technical reference for architecting machine learning systems, from low-level programming interfaces to large-scale deployment infrastructure. The project offers instructional guidance across several specialized areas, including AI compiler development through intermediate representations and graph optimizations. It covers the architectural patterns needed for distributed training on GPU clusters and for programming hardware accelerators to optimize workloads on specialized chips. The resource also details the implementation of model serving frameworks for production environments and the design of reinforcement learning pipelines. Its scope extends to core components of ML systems, such as automatic differentiation, tensor abstractions, and GPU resource orchestration.
Utilizes internal graph-based models of program logic to enable structural analysis and compiler-driven optimizations.
Triton is a dynamic binary analysis framework designed to automate reverse engineering. It functions as a multi-architecture CPU emulator, an SMT-based symbolic execution engine, and a dynamic taint analysis tool. The framework translates raw machine instructions into abstract syntax trees, allowing it to represent binary program logic as a structured intermediate representation. This allows the system to map multiple hardware instruction sets to a single analysis framework and translate machine instructions into mathematical formulas for solving constraints. Its capabilities cover the simul
Transforms raw machine instructions into a structured intermediate representation to organize code into analyzable blocks.
This project is a comprehensive educational resource and tutorial handbook for building, training, and deploying machine learning models using TensorFlow 2. It serves as a structured learning guide covering fundamental deep learning concepts, including neural network architectures, automatic differentiation, and tensor operations. The handbook provides technical guidance for optimizing execution efficiency through GPU memory management, distributed training, and model quantization. It also includes detailed guides for building high-performance data pipelines and exporting models to production servers, mobile devices, and web browsers. The material covers a wide range of capabilities, including developing models with convolutional and recurrent networks, implementing custom loss functions and layers, and using pre-trained models for transfer learning. It also addresses deployment strategies for edge devices and the use of cloud-based runtimes for hardware acceleration. The resource is implemented as a collection of Jupyter Notebooks.
Covers the compilation of operation graphs into executable objects for optimized hardware acceleration.
cuda-python provides low-level Python bindings for the CUDA Driver and Runtime APIs. It serves as a programmatic wrapper for controlling device memory, managing hardware toolchains, and orchestrating execution graphs on NVIDIA GPUs, allowing for the compilation and launching of parallel kernels directly from Python. The project enables the development of SIMT kernels and the execution of mathematical algorithms on device memory. It integrates pre-compiled bytecode as custom operators and interfaces with accelerated device libraries to access low-level hardware functions without leaving the la
Compiles captured operation graphs into executable objects and executes them for optimized GPU processing.
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Accelerates training execution using graph capture and compiled code to optimize GPU processing speed.