AISystem is a full-stack AI infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Flashlight is a C++ machine learning library and deep learning framework designed for building and training neural networks. It functions as a tensor manipulation library and an automatic differentiation engine that tracks operations to calculate gradients via backpropagation for model optimization. The project also serves as a distributed training framework, using all-reduce gradient synchronization to scale machine learning workloads across multiple nodes and devices in distributed environments. It features a backend-agnostic memory interface and RAII-based management
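The all-reduce gradient synchronization named in the Flashlight entry can be illustrated with a small single-process simulation. This is a minimal sketch in Python rather than Flashlight's C++ API: the function names `all_reduce_mean` and `sgd_step` are hypothetical, and each "worker" is just a list of floats standing in for a gradient tensor.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging all-reduce over per-worker gradient vectors.

    Hypothetical helper for illustration; a real framework would do this
    with collective communication across devices or nodes.
    """
    if not worker_grads:
        raise ValueError("need at least one worker")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")

    n = len(worker_grads)
    # Reduce: element-wise sum of the gradients across workers.
    summed = [sum(g[i] for g in worker_grads) for i in range(length)]
    # Average, then "broadcast": every worker receives its own copy.
    mean = [s / n for s in summed]
    return [list(mean) for _ in range(n)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied with the synchronized gradient."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two workers computed different gradients on different data shards.
synced = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])
# Every replica applies the same averaged gradient, so parameters stay in sync.
new_params = [sgd_step([1.0, 1.0], g, lr=0.5) for g in synced]
```

Because every worker ends up with the identical averaged gradient, the model replicas remain identical after each step, which is the property data-parallel training relies on.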
This repository is a comprehensive educational program and deep learning framework designed to teach practical deep learning using PyTorch through notebooks and code examples. It serves as a high-level library for building, training, and deploying neural networks, acting as a model training orchestrator that coordinates PyTorch models, optimizers, and loss functions. The project provides specialized toolkits for computer vision, natural language processing, and tabular data preprocessing. It distinguishes itself through advanced training controls such as discriminative learning rates, a two-w
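Discriminative learning rates, mentioned in the entry above, assign smaller learning rates to early layer groups and larger ones to later groups. The sketch below shows one common scheme, geometric spacing between a minimum and maximum rate. It is an assumption for illustration, not the library's confirmed implementation, and the function name `discriminative_lrs` is hypothetical.

```python
from typing import List


def discriminative_lrs(lr_min: float, lr_max: float, n_groups: int) -> List[float]:
    """Return one learning rate per layer group, geometrically spaced.

    The first group (earliest layers) gets lr_min and the last group
    (the head) gets lr_max. Hypothetical helper for illustration.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if lr_min <= 0 or lr_max <= 0:
        raise ValueError("learning rates must be positive")
    if n_groups == 1:
        return [lr_max]
    # Constant multiplicative step between consecutive groups.
    ratio = (lr_max / lr_min) ** (1.0 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]


# Three layer groups: pretrained body (early), pretrained body (late), new head.
rates = discriminative_lrs(1e-5, 1e-3, 3)
```

With three groups and a 100x range, each group trains roughly ten times faster than the one before it, so pretrained early layers change slowly while the new head adapts quickly.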
Torchtitan is a reference implementation for distributed deep learning built within the PyTorch ecosystem. It provides a framework for training large neural network models across multiple GPUs and nodes by combining several parallelism techniques, including fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism, making it possible to train models that exceed the memory capacity of a single device. The system distinguishes itself through asynchronous checkpointing, which saves model and optimizer state to persistent storage without pausing the training loop, enabli
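The asynchronous checkpointing idea described in the Torchtitan entry can be sketched with the Python standard library alone. This is a toy illustration under stated assumptions, not Torchtitan's API: `async_checkpoint` and `train_with_checkpoints` are hypothetical names, the "model state" is a plain dict, and JSON on local disk stands in for persistent storage.

```python
import copy
import json
import os
import tempfile
import threading


def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot the state now, then write it to disk in a background thread."""
    # The copy is taken synchronously so later training steps cannot
    # mutate what is being saved.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        # Rename only after the write completes, so readers never see
        # a partially written checkpoint.
        os.replace(tmp, path)

    thread = threading.Thread(target=_write)
    thread.start()
    return thread


def train_with_checkpoints(steps: int, ckpt_every: int, ckpt_dir: str):
    """Toy training loop that keeps running while checkpoints are written."""
    state = {"step": 0, "weights": [0.0, 0.0]}
    pending = None
    last_path = None
    for step in range(1, steps + 1):
        state["step"] = step
        state["weights"] = [w + 0.5 for w in state["weights"]]  # stand-in update
        if step % ckpt_every == 0:
            if pending is not None:
                pending.join()  # allow at most one write in flight
            last_path = os.path.join(ckpt_dir, f"step_{step}.json")
            pending = async_checkpoint(state, last_path)
    if pending is not None:
        pending.join()  # make sure the final checkpoint is on disk
    return state, last_path


with tempfile.TemporaryDirectory() as ckpt_dir:
    final_state, path = train_with_checkpoints(steps=5, ckpt_every=2, ckpt_dir=ckpt_dir)
    with open(path) as f:
        saved = json.load(f)
```

The key design choice is separating the fast in-memory snapshot, which briefly blocks the loop, from the slow write to storage, which overlaps with subsequent training steps.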