DINO

This project is a PyTorch framework for self-supervised learning with vision transformers. It learns visual representations through self-distillation with a momentum teacher, without labeled data.
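The momentum teacher mentioned above is typically maintained as an exponential moving average of the student weights. The following is a minimal sketch in plain Python rather than PyTorch, so the arithmetic stays visible. The function name `ema_update`, the flattened weight lists, and the default momentum value are illustrative assumptions, not the project's API or settings.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights: momentum * teacher + (1 - momentum) * student.

    `teacher` and `student` are flat lists of floats standing in for model
    parameters (an assumption made for illustration only).
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical weights: the teacher moves a small step toward the student.
teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, which is what makes it a stable target for the student.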

The library functions as an image feature extractor and visual attention visualizer, allowing for the generation of high-dimensional vectors and the rendering of self-attention maps as heatmaps or videos to analyze model focus.
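To turn self-attention into a heatmap, one common approach is to take the class token's attention over the patch tokens and reshape it into the image's patch grid. The sketch below assumes a token order of class token first, then patches in row-major order. The helper name `cls_attention_grid` is hypothetical and the code uses plain lists instead of tensors.

```python
def cls_attention_grid(attn_row, image_size, patch_size):
    """Reshape one head's class-token attention row into a 2-D patch grid.

    attn_row: attention weights from the class token to every token,
              ordered [class token, patch 0, patch 1, ...] (assumed layout).
    image_size: (height, width) in pixels, both divisible by patch_size.
    """
    height, width = image_size
    rows, cols = height // patch_size, width // patch_size
    patch_attn = attn_row[1:]  # drop the class token's attention to itself
    if len(patch_attn) != rows * cols:
        raise ValueError("attention row does not match the patch grid")
    return [patch_attn[r * cols:(r + 1) * cols] for r in range(rows)]


# A 32x48 image with 16x16 patches yields a 2x3 grid (6 patch tokens).
row = [0.4, 0.05, 0.10, 0.15, 0.05, 0.10, 0.15]
grid = cls_attention_grid(row, image_size=(32, 48), patch_size=16)
```

Each grid cell can then be upsampled to the patch's pixel area and drawn as a heatmap, once per attention head.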

It provides comprehensive tools for downstream vision evaluation, including linear probe classification, k-nearest neighbor categorization, and visual similarity search. The system also supports semi-supervised video object segmentation and image copy detection.
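As a rough illustration of k-nearest neighbor categorization on frozen features, the sketch below ranks stored feature vectors by cosine similarity to a query and takes a majority vote. The function names, the toy two-dimensional features, and the unweighted vote are assumptions for illustration and may differ from the voting scheme the project uses.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, features, labels, k=3):
    """Label a query by majority vote among its k most similar stored features."""
    ranked = sorted(
        range(len(features)),
        key=lambda i: cosine_similarity(query, features[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy gallery of hypothetical 2-D features with class labels.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([0.8, 0.2], features, labels, k=3)
```

The same similarity ranking, without the vote, is the core of visual similarity search: the top-ranked gallery entries are the retrieved matches.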

The framework includes infrastructure for multi-node distributed training and utilities for importing pretrained model weights to accelerate convergence and deployment.

Features

Self-Supervised Vision Representation Trainers - Implements a self-supervised learning method using a momentum teacher and temperature warmup to train vision architectures.

Vision Transformers - Implements a vision transformer that processes images as sequences of fixed-size patches.

Attention Visualizations - Generates heatmaps and videos to visualize which image regions the transformer focuses on.

Self-Distillation Pipelines - Trains a student network to predict the output of a momentum-updated teacher without using labeled data.

Vision Transformer Training - Processes images by dividing them into patches and embedding them into a latent space using a transformer architecture.

Exponential Moving Average Weight Updates - Stabilizes training using an exponential moving average to update teacher weights based on student weights.

PyTorch Vision Transformer Frameworks - Provides a comprehensive PyTorch implementation for training Vision Transformers via self-supervised learning.

Neural Feature Extractors - Generates high-dimensional vectors used for k-NN classification and image retrieval.

Image Feature Extraction - Converts images into high-dimensional latent vectors for similarity search and image retrieval.

Augmentation Pipelines - Ships sequential processing pipelines for stochastic image augmentations including Gaussian blur and solarization.

Distributed Training - Supports scaling model training across multiple GPUs and compute nodes for large-scale workloads.

Downstream Vision Evaluation - Evaluates pretrained weight quality using linear probes and k-nearest neighbor classification on standard datasets.

Multi-Node Inference Scaling - Distributes inference workloads such as feature extraction and evaluation across multiple GPUs and compute nodes.

Image-to-Image Retrieval - Matches query images to target galleries by calculating similarity between learned feature vectors.

K-Nearest Neighbor Classifiers - Provides k-nearest neighbor classification to categorize images based on latent feature similarity.

Linear Classifiers - Uses linear classifiers as probes to evaluate the quality of learned representations on frozen weights.

Output Centering & Sharpening - Implements output centering and sharpening to prevent collapse during self-supervised distillation.

Image Augmentations - Utilizes random image transformations to create multiple views of the same image for invariant feature learning.

Vector Similarity Search - Performs visual similarity searches across datasets using high-dimensional vector embeddings.

Self-Attention Implementations - Extracts and renders the self-attention of the class token across different heads to determine model focus.

Attention Map Visualizations - Produces video files by extracting frames from source media and rendering the model's attention maps for each frame.
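The output centering and sharpening listed among the features can be sketched as follows: the teacher's outputs are shifted by a running center and passed through a low-temperature softmax, and the student is trained with a cross-entropy loss against that target. This is a minimal plain-Python sketch. The function names, the temperature values, and the center momentum are illustrative assumptions, not the project's actual code or defaults.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    # Centering: subtract a running mean so no single output dimension dominates.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    # Sharpening: a low teacher temperature makes the target distribution peaked.
    targets = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(targets, student_log_probs))


def update_center(center, teacher_logits_batch, momentum=0.9):
    """Move the center toward the mean teacher output of the current batch."""
    batch_mean = [sum(col) / len(teacher_logits_batch)
                  for col in zip(*teacher_logits_batch)]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]


# Hypothetical two-dimensional outputs: agreement gives a lower loss.
agree = distillation_loss([2.0, 0.0], [2.0, 0.0], center=[0.0, 0.0])
disagree = distillation_loss([0.0, 2.0], [2.0, 0.0], center=[0.0, 0.0])
```

Centering alone pushes the teacher toward a uniform output, and sharpening alone pushes it toward a single dominant dimension; using the two together is what counteracts collapse.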

facebookresearch/dino (archived)
