Flashlight

Flashlight is a standalone C++ machine learning library for building and training neural networks. It combines a tensor library, a neural network framework, and an automatic differentiation engine that constructs computation graphs and calculates gradients via backpropagation.
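
To make the idea of graph-based backpropagation concrete, here is a minimal scalar sketch. It is not Flashlight's API: the names Node, constant, add, mul, and backward are hypothetical, and real tensors are replaced by single doubles. Each operation records its inputs and a gradient function, and backward() visits the nodes in reverse topological order.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar graph node (an illustration, not Flashlight's Variable type).
struct Node {
  double value = 0.0;
  double grad = 0.0;
  std::vector<std::shared_ptr<Node>> inputs;
  // Adds this node's gradient contribution to each of its inputs.
  std::function<void(Node&)> backwardFn;
};
using NodePtr = std::shared_ptr<Node>;

// Leaf node with no inputs and no gradient function.
NodePtr constant(double v) {
  auto n = std::make_shared<Node>();
  n->value = v;
  return n;
}

NodePtr add(const NodePtr& a, const NodePtr& b) {
  auto out = constant(a->value + b->value);
  out->inputs = {a, b};
  out->backwardFn = [](Node& self) {
    self.inputs[0]->grad += self.grad;
    self.inputs[1]->grad += self.grad;
  };
  return out;
}

NodePtr mul(const NodePtr& a, const NodePtr& b) {
  auto out = constant(a->value * b->value);
  out->inputs = {a, b};
  out->backwardFn = [](Node& self) {
    self.inputs[0]->grad += self.grad * self.inputs[1]->value;
    self.inputs[1]->grad += self.grad * self.inputs[0]->value;
  };
  return out;
}

// Depth-first post-order: every input is appended before the node that uses it.
void topoSort(const NodePtr& n, std::unordered_set<Node*>& seen,
              std::vector<NodePtr>& order) {
  if (!seen.insert(n.get()).second) return;
  for (const auto& in : n->inputs) topoSort(in, seen, order);
  order.push_back(n);
}

// Seeds the root gradient with 1 and walks the graph from output to inputs.
void backward(const NodePtr& root) {
  assert(root != nullptr);
  std::unordered_set<Node*> seen;
  std::vector<NodePtr> order;
  topoSort(root, seen, order);
  root->grad = 1.0;
  for (auto it = order.rbegin(); it != order.rend(); ++it) {
    if ((*it)->backwardFn) (*it)->backwardFn(**it);
  }
}
```

For z = x * y + x, calling backward(z) accumulates dz/dx = y + 1 into x->grad and dz/dy = x into y->grad. Gradients are accumulated with += so that a value used more than once receives the sum of its contributions.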

The project also supports distributed training, using all-reduce operations to synchronize gradients and parameters across multiple compute nodes and devices. It distinguishes itself through tight integration of high-performance tensor manipulation, native device memory interoperability, and weight synchronization across distributed workers to accelerate large-scale model training.
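
The effect of an all-reduce on gradients can be sketched without any networking. The function below is a hypothetical, single-process stand-in: each inner vector plays the role of one worker's local gradient buffer, and after the call every worker holds the same averaged values. A real distributed backend would exchange these buffers between processes; that part is assumed away here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulated all-reduce (sum) followed by averaging across in-process "workers".
// workerGrads[w][i] is the i-th gradient element held by worker w.
void allReduceAverage(std::vector<std::vector<double>>& workerGrads) {
  if (workerGrads.empty()) return;
  const std::size_t n = workerGrads.front().size();

  // Reduce step: element-wise sum over all workers.
  std::vector<double> sum(n, 0.0);
  for (const auto& g : workerGrads) {
    assert(g.size() == n);  // all buffers must have the same length
    for (std::size_t i = 0; i < n; ++i) sum[i] += g[i];
  }

  // Broadcast step: every worker receives the same averaged result.
  const double scale = 1.0 / static_cast<double>(workerGrads.size());
  for (auto& g : workerGrads) {
    for (std::size_t i = 0; i < n; ++i) g[i] = sum[i] * scale;
  }
}
```

Averaging rather than summing keeps the effective step size independent of the number of workers, which is a common choice in data-parallel training; whether a given setup sums or averages is a configuration decision.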

The framework covers a wide range of deep learning capabilities, including modular layer composition for building architectures from components such as residual blocks and recurrent cells. It provides data management utilities for ingestion and prefetching, alongside serialization for persisting model state. It also includes monitoring tools for tracking training metrics and measuring sequence errors.
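
Modular layer composition can be illustrated with a toy module hierarchy. The sketch below is an assumption-laden illustration rather than Flashlight's classes: ToyModule, Scale, Relu, ToySequential, and ToyResidual are hypothetical names, and plain std::vector<double> stands in for tensors. A sequential container feeds each layer's output into the next, and a residual wrapper adds the input back onto the wrapped module's output.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical module interface: one forward pass from input to output.
struct ToyModule {
  virtual ~ToyModule() = default;
  virtual Vec forward(const Vec& x) const = 0;
};

// Multiplies every element by a fixed factor.
struct Scale : ToyModule {
  double factor;
  explicit Scale(double f) : factor(f) {}
  Vec forward(const Vec& x) const override {
    Vec y(x);
    for (auto& v : y) v *= factor;
    return y;
  }
};

// Element-wise ReLU: negative values become zero.
struct Relu : ToyModule {
  Vec forward(const Vec& x) const override {
    Vec y(x);
    for (auto& v : y) {
      if (v < 0.0) v = 0.0;
    }
    return y;
  }
};

// Ordered container: the output of each layer is the input of the next.
struct ToySequential : ToyModule {
  std::vector<std::unique_ptr<ToyModule>> layers;
  void add(std::unique_ptr<ToyModule> m) { layers.push_back(std::move(m)); }
  Vec forward(const Vec& x) const override {
    Vec y(x);
    for (const auto& layer : layers) y = layer->forward(y);
    return y;
  }
};

// Skip connection: returns inner(x) + x.
struct ToyResidual : ToyModule {
  std::unique_ptr<ToyModule> inner;
  explicit ToyResidual(std::unique_ptr<ToyModule> m) : inner(std::move(m)) {}
  Vec forward(const Vec& x) const override {
    Vec y = inner->forward(x);
    assert(y.size() == x.size());  // the skip connection needs matching shapes
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += x[i];
    return y;
  }
};
```

Because the container and the residual wrapper are themselves modules, they nest freely: a ToyResidual can wrap a ToySequential, which can in turn hold further residual blocks.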

The library is implemented in C++.

Features

Automatic Differentiation - Provides a comprehensive automatic differentiation engine that calculates gradients via backpropagation through a computation graph.
C++ Machine Learning Libraries - Serves as a standalone C++ machine learning library for implementing deep learning operations and training neural networks.
Distributed Training - Synchronizes gradients and parameters across multiple compute nodes and devices to accelerate large-scale model training.
Automatic Differentiation Engines - Features a built-in automatic differentiation engine that constructs computation graphs to calculate gradients via backpropagation.
C++ Machine Learning Development - Provides a comprehensive library for building and training high-performance neural networks using native C++.
Deep Learning Architectures - Implements a modular framework for constructing complex deep learning architectures, including residual blocks and recurrent cells.
Distributed Tensor Synchronization - Performs all-reduce operations on tensors to aggregate values from all nodes into a synchronized result.
Distributed Training Frameworks - Provides a distributed training framework that synchronizes gradients and parameters across compute nodes using all-reduce operations.
Dynamic Tensor Shapes - Changes dimensions and permutes axes of tensors during model execution.
Gradient Computation - Calculates gradients from output to input by traversing the computation graph in topologically sorted order.
Distributed Gradient Synchronization - Implements all-reduce operations to synchronize gradients across distributed compute nodes during large-scale model training.
Weight Optimizers - Implements gradient descent algorithms to update network parameters and minimize loss functions.
Tensor Libraries - Implements a multi-dimensional array library supporting various data types and device memory management.
Modular Layer Compositions - Supports constructing neural network architectures by stacking modular computation units into sequential containers.
Model Performance Evaluators - Measures model accuracy and reliability on test data by disabling gradient tracking and training components.
Neural Network Frameworks - Offers a modular collection of layers, activation functions, and optimizers for constructing complex deep learning models.
Neural Network Modules - Provides modular neural network modules that encapsulate mutable parameters and define forward pass calculations.
Loss Function Selections - Calculates errors between predictions and targets using standard loss functions like Mean Squared Error and Cross Entropy.
Tensor Indexing - Retrieves subtensors using literal values, ranges, and advanced indexing.
Tensor Initialization - Initializes multi-dimensional arrays with specific shapes, data types, and sparse representations.
Tensor Reshaping - Modifies tensor dimensions without changing the order of underlying elements.
Tensor Type Conversion - Converts tensor elements between different numerical data types for compatibility.
Sequence Tensor Generation - Generates tensors containing identity matrices, sequential ranges, and evenly-spaced values.
High-Performance Tensor Libraries - Provides high-performance multi-dimensional array operations and custom memory management for hardware accelerators.
Parameter Synchronization - Broadcasts or reduces parameter values across the network to ensure all processes start with identical weights.
Communicator-Based Process Groupings - Allows the configuration of process groups, cluster ranks, and sizes to organize distributed workers.
Graph Construction Engines - Records inputs and gradient functions during operations to build a symbolic graph for automatic differentiation.
Mean Squared Error Scorers - Computes the average squared difference between prediction and target tensors to evaluate regression performance.
Tensor Arithmetic - Provides fundamental mathematical operations including addition, subtraction, multiplication, and division on multi-dimensional arrays.
Graph-Based Backpropagation - Traverses the computation graph in reverse topological order to calculate gradients from the loss back to the inputs.
Transcendental Function Implementations - Computes element-wise transcendental functions such as exponentials, natural logarithms, and reciprocals.
Distributed Cluster Coordination - Coordinates multiple processes and devices across a cluster using shared filesystems for parallel computation.
Tensor Comparison Operators - Performs element-wise logical comparisons and boolean operations between tensors or scalars.
Custom Kernel Accelerators - Integrates hand-optimized GPU kernels by providing direct access to raw tensor memory pointers.
Tensor Rearrangements - Rearranges tensor axes to change shape while maintaining data contiguity.
Training Metric Monitors - Tracks machine learning performance indicators, such as running averages of loss, during the training process.
Activation Functions - Provides various non-linear activation functions including ReLU, Sigmoid, Tanh, and Gated Linear Units.
Batch Normalization - Implements batch normalization to rescale input tensors using mean and variance to accelerate training.
Custom Neural Network Layers - Allows extending base module classes and defining custom forward pass logic to create specialized neural network layers.
Dataset Batch Loading - Packs individual training samples into fixed or dynamic batch sizes using custom batching functions.
Dropout Regularization - Provides dropout regularization to prevent feature co-adaptation by randomly zeroing out input values.
Linear Transformation Layers - Implements linear transformation layers that use matrix multiplication and optional bias to transform input tensor sizes.
Convolution Layers - Implements 2D convolutional layers with configurable stride, padding, and dilation for 4D input tensors.
Residual Block Composers - Provides utilities to construct residual blocks with skip connections and scaling factors.
Normalization Layers - Provides normalization layers that rescale inputs along a feature axis using learnable affine transformation parameters.
Recurrent Layers - Implements standard recurrent layers including RNNs, LSTMs, and GRUs for sequential data processing.
Sequential Containers - Implements sequential containers that wrap layers and activation functions for streamlined model definition.
Domain-Specific Processing Pipelines - Handles specialized data pipelines tailored for speech, vision, and text application modalities.
Mixed-Precision Computing - Adjusts computation precision across operators and normalization layers to balance performance and stability.
Sequential Model Builders - Supports stacking convolution, pooling, and linear layers in a linear sequence to build model architectures.
Tensor Debuggers - Provides tools to output tensor values and gradients to a stream for manual numerical verification.
Training Data Ingestion - Provides built-in utilities to ingest and preprocess data for efficient delivery to neural network models.
Training Data Prefetchers - Uses background worker threads to prefetch and transform training samples, preventing data starvation during training.
Training Dataset Management - Wraps input and target tensors into datasets and iterators to simplify training loop iterations.
Embedding Lookup Layers - Implements embedding lookups to retrieve vectors from learnable dictionaries using index lists.
Tensor-Based - Maps dataset indices to samples of tensor vectors, supporting splitting and resampling of training data.
Tensor Serialization Utilities - Provides utilities for saving and loading tensors, shapes, and model modules to binary files or streams.
Dataset Pipeline Management - Includes extensive utilities for data ingestion, prefetching, and batching of speech, vision, and text datasets.
Dataset Partitioning Strategies - Distributes sample IDs across multiple worker partitions using round-robin or token-based strategies.
Device Memory Interoperability - Interfaces directly with backend device memory and pressure functions for native hardware interoperability.
Direct-Pointer Memory Access - Enables custom GPU kernels to operate on raw tensor memory addresses for high-performance mathematical operations.
Custom Memory Allocators - Allows for the definition of custom memory allocation and management logic to override default device behaviors.
Model State Serialization - Saves and loads neural network weights, modules, and optimizer states to disk for checkpointing.
Memory-Efficient Graph Lifecycles - Minimizes peak memory usage by controlling the lifecycle of intermediate variables during the backward pass.
Device Memory RAII Wrappers - Wraps raw hardware pointers in RAII objects to automate memory release and prevent leaks on accelerator devices.
Tensor Memory RAII Wrappers - Uses RAII wrappers to automate the acquisition and release of device pointers for tensor arrays to prevent memory leaks.
Kernel Call Fusion - Reduces memory allocations and improves performance by fusing multiple function calls into a single kernel call.
GPU Memory Monitors - Reports memory manager statistics and device information to identify leaks and troubleshoot GPU memory pressure.
Sequential Computation Flow - Arranges multiple computation units into an ordered sequence where output flows directly into the next input.
AI & Machine Learning - Standalone machine learning library.
Artificial Intelligence - Fast, flexible machine learning library built for C++.
Machine Learning and AI - Fast and flexible machine learning library.
Computation and Optimization - Fast, flexible machine learning library written in C++.
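
The round-robin strategy named under Dataset Partitioning Strategies above can be sketched in a few lines. The helper roundRobinPartition is hypothetical and assumes sample IDs are the integers 0 to numSamples - 1; how the library itself numbers samples or groups them into batches is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: assigns sample IDs 0..numSamples-1 to partitions in turn,
// so partition p receives IDs p, p + numPartitions, p + 2 * numPartitions, ...
std::vector<std::vector<std::size_t>> roundRobinPartition(
    std::size_t numSamples, std::size_t numPartitions) {
  assert(numPartitions > 0);
  std::vector<std::vector<std::size_t>> parts(numPartitions);
  for (std::size_t id = 0; id < numSamples; ++id) {
    parts[id % numPartitions].push_back(id);
  }
  return parts;
}
```

With this scheme partition sizes differ by at most one sample, which keeps per-worker load roughly balanced when samples are of similar cost.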

facebookresearch/flashlight

5,443 | View on GitHub

Flashlight is a C++ machine learning library and deep learning framework designed for building and training neural networks. It functions as a tensor manipulation library and an automatic differentiation engine that tracks operations to calculate gradients via backpropagation for model optimization. The project is distinguished by its role as a distributed training framework, utilizing all-reduce gradient synchronization and distributed environments to scale machine learning workloads across multiple nodes and devices. It features a backend-agnostic memory interface and RAII-based management
