This project is a PyTorch vision transformer framework for self-supervised learning. It learns visual representations without labeled data through self-distillation, in which a student network is trained to match the output of a momentum teacher.
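
To make the momentum-teacher idea concrete, here is a minimal framework-free sketch of the two core steps: a cross-entropy loss that pulls the student's output distribution toward the teacher's, and an exponential-moving-average update of the teacher from the student. The function names, the temperature values, and the momentum value are illustrative assumptions, not taken from this project, and any further stabilisation the real training loop applies is left out.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher targets and student predictions.

    The temperatures are assumed example values. In a real training loop
    the teacher targets would be treated as constants (no gradient).
    """
    targets = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    log_preds = log_softmax(student_logits, student_temp)
    return -sum(t * lp for t, lp in zip(targets, log_preds))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student's value."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    agree = distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
    disagree = distillation_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0])
    print(agree < disagree)  # the loss is lower when student and teacher agree
    print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

Only the student would be updated by gradient descent on this loss; the teacher changes solely through the moving-average step, which is what makes it a "momentum" teacher.
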
The library also works as an image feature extractor and an attention visualizer: it produces high-dimensional feature vectors for images and renders self-attention maps as heatmaps or videos that show where the model focuses.
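
As a rough illustration of how an attention map becomes a heatmap, the sketch below takes one attention row (the weights from a class token to every token), drops the class token's own entry, rescales the patch weights to the range 0 to 1, and reshapes them into the patch grid. It assumes the class token sits at index 0 and that the patch grid dimensions are known; both are assumptions for this example, and `cls_attention_to_grid` is a hypothetical helper rather than a function of this library.

```python
def cls_attention_to_grid(attn_row, grid_h, grid_w):
    """Turn one class-token attention row into a normalized 2-D heatmap.

    attn_row: list of attention weights, class token first (assumed layout).
    Returns a grid_h x grid_w nested list with values in [0, 1].
    """
    patch_weights = attn_row[1:]  # skip the class token's attention to itself
    if len(patch_weights) != grid_h * grid_w:
        raise ValueError("attention row does not match the patch grid size")

    lo, hi = min(patch_weights), max(patch_weights)
    span = hi - lo
    # Min-max normalization; a constant map becomes all zeros.
    normalized = [(w - lo) / span if span > 0 else 0.0 for w in patch_weights]

    # Rows of the grid are consecutive slices of the flattened patch sequence.
    return [normalized[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


if __name__ == "__main__":
    row = [0.5, 0.1, 0.3, 0.2, 0.4]  # class token followed by a 2x2 patch grid
    for line in cls_attention_to_grid(row, 2, 2):
        print(line)
```

The resulting grid has one value per patch, so it would normally be upsampled to the image resolution before being overlaid as a heatmap.
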
It provides tools for evaluating the learned features on downstream vision tasks: linear-probe classification, k-nearest-neighbor (k-NN) classification, and visual similarity search. It also supports semi-supervised video object segmentation and image copy detection.
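
The k-NN evaluation can be pictured with a small sketch: features from labeled images form a bank, a query feature is compared with every bank entry by cosine similarity, and the most common label among the k closest entries is the prediction. This version uses a plain majority vote and pure-Python lists for readability; the actual evaluation code may weight votes or use a different similarity, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Predict a label by majority vote over the k most similar bank features."""
    q = l2_normalize(query)
    scored = []
    for feature, label in zip(bank_features, bank_labels):
        f = l2_normalize(feature)
        similarity = sum(a * b for a, b in zip(q, f))  # cosine similarity
        scored.append((similarity, label))

    # Keep the k highest-similarity neighbors and count their labels.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels = ["cat", "cat", "dog", "dog"]
    print(knn_predict([1.0, 0.05], bank, labels, k=3))
```

Because no classifier is trained, the accuracy of this procedure reflects the quality of the features themselves, which is why it is a common check for self-supervised models.
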
The framework includes infrastructure for multi-node distributed training and utilities for importing pretrained model weights, which shorten training and simplify deployment.
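
One practical detail when importing weights saved during distributed training: PyTorch's `DistributedDataParallel` wrapper prefixes every parameter name with `module.`, so the names may need cleaning before they match a plain model. The sketch below shows that renaming on an ordinary dictionary. The `backbone.` prefix and the `strip_prefixes` helper are assumptions made for illustration; this project's own checkpoint layout may differ.

```python
def strip_prefixes(state_dict, prefixes=("module.", "backbone.")):
    """Return a copy of state_dict with leading wrapper prefixes removed.

    Prefixes are removed in the given order, so "module.backbone.x"
    becomes "x". Keys without a matching prefix are kept unchanged.
    """
    cleaned = {}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    checkpoint = {
        "module.backbone.patch_embed.weight": "tensor A",
        "module.backbone.norm.bias": "tensor B",
        "head.bias": "tensor C",
    }
    for name in strip_prefixes(checkpoint):
        print(name)
```

With real tensors the cleaned dictionary would typically be passed to the model's `load_state_dict`, where `strict=False` tolerates keys that do not match.
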