18 repositories
Compiles captured operation graphs into executable objects for optimized GPU execution.
Distinct from general Execution Graphs: focuses on the compilation and instantiation of static graphs for hardware acceleration.
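The capture-then-instantiate pattern described above can be sketched in plain Python. This is a toy illustration only: `GraphRecorder`, `add`, and `instantiate` are hypothetical names invented for this sketch and are not part of any GPU API. The idea it shows is that operations are recorded once into a static graph, the graph is frozen into an executable object, and that object is replayed with new inputs without re-recording.

```python
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, Callable[..., float], Tuple[str, ...]]


class GraphRecorder:
    """Hypothetical recorder: stores operations instead of running them."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def add(self, out: str, fn: Callable[..., float], *inputs: str) -> None:
        # Capture phase: remember the operation and its named inputs.
        self.nodes.append((out, fn, inputs))

    def instantiate(self) -> Callable[[Dict[str, float]], Dict[str, float]]:
        # Instantiate phase: freeze the node list into a replayable object.
        frozen = tuple(self.nodes)

        def run(env: Dict[str, float]) -> Dict[str, float]:
            values = dict(env)
            for out, fn, inputs in frozen:
                values[out] = fn(*(values[name] for name in inputs))
            return values

        return run


rec = GraphRecorder()
rec.add("scaled", lambda x: x * 2.0, "x")
rec.add("shifted", lambda s, b: s + b, "scaled", "b")

graph_exec = rec.instantiate()                 # built once
result = graph_exec({"x": 3.0, "b": 1.0})      # replayed many times
```

Real systems pay the capture and instantiation cost up front so that each replay avoids per-operation launch overhead; the sketch mirrors only that structure, not the performance characteristics.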
Explore 18 awesome GitHub repositories matching software engineering & architecture · Graph Execution Compilers.
Apache Flink is a distributed processing engine designed for both high-throughput, low-latency data streams and finite batch workloads. It functions as a stateful stream processor and a SQL stream processing engine, providing a unified runtime to execute relational queries and event-based transformations. The system is distinguished by its ability to manage persistent operator state to ensure exactly-once processing guarantees and consistency during failures. It features specialized capabilities for complex event processing to detect temporal patterns and handles out-of-order events using eve
Translates relational queries into optimized physical execution graphs of streaming or batch operators.
Graal is a compiler and runtime architecture designed for high-performance execution and polyglot interoperability. It utilizes a graph-based representation of program logic to perform global optimizations and JIT compilation. The project features a meta-circular interpretation framework and a specialized partial evaluation mechanism, which allow for the creation of new programming languages and the automatic optimization of their semantics into machine code. It enables multiple diverse programming languages to share memory and communicate through a standardized cross-language protocol within
Implements a graph-based representation of program logic to perform global optimizations before emitting final machine code.
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Transforms compute graphs through operator fusion and layout conversion to maximize hardware utilization.
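Operator fusion, as mentioned above, can be illustrated with a minimal sketch. It assumes a linear chain of elementwise operations over a Python list; the function names `run_unfused`, `fuse`, and `run_fused` are made up for this example and do not come from the repository. The unfused version makes one full pass (and one temporary list) per operator, while the fused version applies the whole chain to each element in a single pass.

```python
from typing import Callable, List, Sequence

ElementwiseOp = Callable[[float], float]


def run_unfused(ops: Sequence[ElementwiseOp], data: List[float]) -> List[float]:
    # One pass over the data and one intermediate buffer per operator.
    for op in ops:
        data = [op(v) for v in data]
    return data


def fuse(ops: Sequence[ElementwiseOp]) -> ElementwiseOp:
    # Combine the chain into a single per-element function.
    def fused(v: float) -> float:
        for op in ops:
            v = op(v)
        return v

    return fused


def run_fused(ops: Sequence[ElementwiseOp], data: List[float]) -> List[float]:
    kernel = fuse(ops)
    # Single pass over the data, no intermediate buffers.
    return [kernel(v) for v in data]


ops = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: max(v, 0.0)]
data = [-2.0, 0.5, 3.0]
fused_out = run_fused(ops, data)
unfused_out = run_unfused(ops, data)
```

Both paths compute the same values; the difference a real compiler exploits is memory traffic, since the fused form never materializes the intermediate tensors.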
ai-edu is a comprehensive AI education curriculum and machine learning courseware collection. It provides theoretical tutorials, deep learning lab exercises, and project blueprints designed to teach artificial intelligence fundamentals through a combination of study and practical implementation. The project focuses on a learning-by-doing approach, guiding users from Python programming and neural network basics to advanced topics. It includes specialized instructional content on distributed AI training, MLOps educational guides for model quantization and pruning, and detailed frameworks for im
Provides instructional content on compiling neural network graphs to optimize GPU execution and reduce latency.
Theano is a Python mathematical expression compiler and symbolic math library used as a deep learning backend. It functions as a tensor computation framework that translates mathematical formulas into optimized C or CUDA code for high-performance computing. The system manages the definition and evaluation of complex math formulas using multi-dimensional arrays. It employs a symbolic expression graph and a lazy evaluation engine to optimize mathematical expressions before they are compiled into executable code. The framework provides automatic differentiation for calculating gradients of mat
Uses internal graph-based models of program logic to perform algebraic rewrites and global optimization.
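A small sketch of what algebraic rewriting on a symbolic expression graph can look like is shown below. It does not use Theano's API: expressions are plain nested tuples, and `simplify`, `to_source`, and `compile_expr` are hypothetical helpers written for this example. The graph is built lazily, simplified with two identity rules (`x * 1 -> x`, `x + 0 -> x`), and only then compiled into a callable.

```python
from typing import Callable, Sequence, Tuple, Union

Expr = Union[str, float, tuple]  # variable name, constant, or (op, lhs, rhs)


def simplify(e: Expr) -> Expr:
    """Apply identity rewrites bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "mul":
        if a == 1.0:
            return b
        if b == 1.0:
            return a
    if op == "add":
        if a == 0.0:
            return b
        if b == 0.0:
            return a
    return (op, a, b)


def to_source(e: Expr) -> str:
    """Emit Python source for an expression tree."""
    if isinstance(e, tuple):
        op, a, b = e
        symbol = {"add": "+", "mul": "*"}[op]
        return f"({to_source(a)} {symbol} {to_source(b)})"
    return e if isinstance(e, str) else repr(e)


def compile_expr(e: Expr, args: Sequence[str]) -> Tuple[Callable[..., float], str]:
    # Optimize first, then compile the simplified graph exactly once.
    src = f"lambda {', '.join(args)}: {to_source(simplify(e))}"
    return eval(src), src


expr = ("add", ("add", ("mul", "x", 1.0), 0.0), ("mul", "y", 2.0))
simplified = simplify(expr)
f, src = compile_expr(expr, ["x", "y"])
value = f(3.0, 4.0)
```

The sketch targets Python source rather than C or CUDA, but the ordering is the point: rewrites happen on the graph before any code is generated.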
OneFlow is a deep learning framework and distributed execution engine designed for building, training, and deploying neural network architectures. It functions as a scalable neural network library that allows for the development of deep learning models and their execution across distributed hardware. The project includes a machine learning graph compiler used to optimize neural network execution graphs. This allows for the acceleration of model performance and the reduction of latency during both training and inference. The framework covers broad capability areas including large-scale model
Implements a graph compiler that optimizes neural network execution graphs for improved performance.
This repository is a collection of reference implementations and programming examples for the CUDA Toolkit. It serves as a GPGPU implementation guide and a parallel computing reference, providing code for using graphics hardware to perform general-purpose calculations and high-performance parallel processing. The project provides specific samples for GPU kernel development and resource management. These include demonstrations of multi-GPU communication, peer-to-peer memory access, and system hardware inspection to coordinate distributed GPU resources. The codebase covers a wide range of capa
Compiles and executes captured operation graphs as executable objects to optimize GPU task scheduling.
This project is an educational blog and learning resource dedicated to the Rust programming language. It provides a collection of curated guides, technical articles, and structured learning paths designed to teach language fundamentals, concurrency, and systems programming. The repository distinguishes itself by offering practical implementation tutorials for complex systems. This includes detailed guides on compiler development—specifically translating source code into targets such as ARM64, x86_64, LLVM IR, and WebAssembly—as well as networking examples for building multithreaded chat serve
Demonstrates translating source code into LLVM Intermediate Representation (IR) using SSA form.
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Compiles computation graphs using specialized backends to reduce latency and increase throughput.
Swift for TensorFlow is a custom toolchain that extends the Swift language with first-class automatic differentiation and differentiable types, enabling gradient-based computation directly within the compiler. It integrates the Swift compiler with TensorFlow runtime and XLA backends, allowing tensor operations to be compiled and executed on accelerated hardware for high-performance machine learning. The project distinguishes itself through compiler-integrated automatic differentiation that computes gradients of user-defined functions and types during compilation, eliminating the need
Converts Swift functions into static computational graphs at compile time for optimized execution.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Compiles runtime-conditional pipeline branches into optimized execution plans that adapt to data characteristics.
go-ast-book is a collection of technical and educational resources focused on abstract syntax tree (AST) analysis, compiler development, and static code verification. It provides guides and manuals for parsing, traversing, and examining Go source code in order to extract its semantic meaning. The project serves as a reference for building compiler frontends, covering the translation of high-level code into intermediate representations and static single assignment (SSA) form. It also provides instructions for applying these techniques to the development of language tooling and static code analysis. The resources cover a wide range of static analysis capabilities, including lexical tokenization, structural analysis of expressions and declarations, and position tracking for source files. They also detail semantic analysis processes such as identifier resolution, type checking, and control-flow analysis for concurrency and deferred execution.
Translates abstract syntax trees into standardized intermediate representations to enable the generation of executable programs.
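The AST-to-IR step can be sketched with Python's standard `ast` module, used here as a stand-in for Go's parser packages. The function `lower_to_ssa` and the textual instruction format are assumptions made for illustration; the output is LLVM-like in spirit but is not valid LLVM IR. Each arithmetic node is lowered to one instruction whose result is assigned to a fresh temporary exactly once, which is the defining property of SSA form.

```python
import ast
from typing import List


def lower_to_ssa(source: str) -> List[str]:
    """Lower a single arithmetic expression to SSA-style instructions."""
    tree = ast.parse(source, mode="eval")
    lines: List[str] = []
    counter = 0
    ops = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul"}

    def visit(node: ast.AST) -> str:
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            left = visit(node.left)
            right = visit(node.right)
            counter += 1
            name = f"%t{counter}"  # fresh temporary, assigned exactly once
            lines.append(f"{name} = {ops[type(node.op)]} {left}, {right}")
            return name
        raise NotImplementedError(type(node).__name__)

    result = visit(tree.body)
    lines.append(f"ret {result}")
    return lines


ir = lower_to_ssa("a * b + 2")
```

For the input `a * b + 2`, the list `ir` holds the instructions `%t1 = mul a, b`, `%t2 = add %t1, 2`, and `ret %t2`. Control flow, and therefore phi nodes, is out of scope for this sketch.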
This project provides a comprehensive technical guide and framework for engineering large-scale machine learning systems. It covers the full lifecycle of model development, focusing on the infrastructure and computational principles required to build, train, and serve generative AI models across distributed GPU clusters. The repository distinguishes itself by offering deep-dive tutorials and implementation strategies for complex system challenges. It emphasizes high-performance architectural primitives, such as collective communication orchestration, distributed tensor sharding, and static gr
Compiles captured graphs into executable objects for optimized GPU execution.
This project is a comprehensive educational resource and curriculum focused on the design and implementation of the full machine learning software and hardware stack. It serves as a technical reference for machine learning systems architecture, spanning low-level programming interfaces to large-scale deployment infrastructure. The project provides instructional guidance on several specialized domains, including AI compiler development through intermediate representations and graph optimizations. It covers the architectural patterns needed for distributed training across GPU clusters and hardware accelerator programming to optimize workloads on specialized chips. The resource also details the implementation of model serving frameworks for production environments and the design of reinforcement learning pipelines. Its scope extends to the core components of ML systems, such as automatic differentiation, tensor abstractions, and GPU resource orchestration.
Utilizes internal graph-based models of program logic to enable structural analysis and compiler-driven optimizations.
Triton is a dynamic binary analysis framework designed to automate reverse engineering. It functions as a multi-architecture CPU emulator, an SMT-based symbolic execution engine, and a dynamic taint analysis tool. The framework translates raw machine instructions into abstract syntax trees, allowing it to represent binary program logic as a structured intermediate representation. This allows the system to map multiple hardware instruction sets to a single analysis framework and translate machine instructions into mathematical formulas for solving constraints. Its capabilities cover the simul
Transforms raw machine instructions into a structured intermediate representation to organize code into analyzable blocks.
This project is a comprehensive educational resource and tutorial handbook for building, training, and deploying machine learning models using TensorFlow 2. It serves as a structured learning guide covering fundamental deep learning concepts, including neural network architectures, automatic differentiation, and tensor operations. The handbook provides technical guidance on optimizing execution efficiency through GPU memory management, distributed training, and model quantization. It also includes detailed guides for building high-performance data pipelines and exporting models for production servers, mobile devices, and web browsers. The material spans a wide range of capabilities, including model development with convolutional and recurrent networks, the implementation of custom loss functions and layers, and the use of pretrained models for transfer learning. It also addresses deployment strategies for edge devices and the use of cloud runtimes for hardware acceleration. The resource is implemented as a collection of Jupyter Notebooks.
Covers the compilation of operation graphs into executable objects for optimized hardware acceleration.
cuda-python provides low-level Python bindings for the CUDA Driver and Runtime APIs. It serves as a programmatic wrapper for controlling device memory, managing hardware toolchains, and orchestrating execution graphs on NVIDIA GPUs, allowing for the compilation and launching of parallel kernels directly from Python. The project enables the development of SIMT kernels and the execution of mathematical algorithms on device memory. It integrates pre-compiled bytecode as custom operators and interfaces with accelerated device libraries to access low-level hardware functions without leaving the la
Compiles captured operation graphs into executable objects and launches them for optimized GPU processing.
RLinf is a distributed reinforcement learning orchestrator and embodied AI training framework. It provides the infrastructure to train vision-language-action models and robotic policies using a combination of reinforcement learning and supervised fine-tuning. The system is designed for scaling workloads across GPU clusters, managing the placement of actors, rollout workers, and environment components. It features a specialized robotics data collection pipeline for gathering teleoperated demonstrations and simulation trajectories into standardized replay buffers, alongside a hardware interface
Accelerates training execution using graph capture and compiled code to optimize GPU processing speed.