# facebookresearch/dinov3

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/facebookresearch-dinov3).**

9,613 stars · 742 forks · Jupyter Notebook · other

## Links

- GitHub: https://github.com/facebookresearch/dinov3
- awesome-repositories: https://awesome-repositories.com/repository/facebookresearch-dinov3.md

## Description

This project is a self-supervised vision foundation model based on a vision transformer architecture. It is designed to learn dense visual representations from unlabeled images, serving as a general-purpose backbone for a wide variety of downstream vision tasks.

The system combines self-distillation with masked image modeling to extract semantic and geometric features. It also includes an image-text alignment model that maps visual embeddings to textual descriptions, enabling zero-shot image recognition, zero-shot segmentation, and cross-modal retrieval.
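To make the zero-shot recognition idea concrete, here is a minimal sketch of how a text-aligned embedding space can be used for classification: embed the image, embed one text prompt per class, and pick the prompt with the highest cosine similarity. The function names, prompts, and embedding values are all hypothetical and hand-written for illustration. They are not outputs of the model and not part of the repository's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt closest to the image embedding, plus all scores."""
    scores = {
        prompt: cosine_similarity(image_embedding, vec)
        for prompt, vec in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy embeddings (assumed values; a real pipeline would obtain these
# from the vision backbone and the aligned text encoder).
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No class-specific training happens here: adding a new category only requires adding a new prompt and its text embedding.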

The project covers a broad range of computer vision capabilities, including dense feature extraction, monocular depth estimation, and semantic image segmentation. It supports object detection and classification via linear-head task adaptation, as well as image similarity retrieval and object tracking across video frames.
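Linear-head adaptation can be illustrated with a small sketch: the backbone stays frozen, its output features are treated as fixed inputs, and only a linear layer is trained on top. The example below fits a binary logistic-regression head with plain gradient descent in standard-library Python. The feature vectors, labels, learning rate, and epoch count are assumptions chosen for illustration, not values from the repository.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_head(features, labels, lr=0.5, epochs=200):
    """Fit the weights and bias of a binary logistic-regression head.

    `features` are treated as frozen: they are never modified, only
    the head parameters are updated.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(logit) >= 0.5 else 0


# Hypothetical frozen features standing in for backbone outputs.
frozen_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

weights, bias = train_linear_head(frozen_features, labels)
predictions = [predict(weights, bias, x) for x in frozen_features]
print(predictions)
```

In practice the same pattern would be expressed with a deep learning framework and a multi-class head, but the division of labor is the same: fixed features in, a small trainable linear map out.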

The repository includes tools for distributed vision pretraining on GPU clusters and methods for high-resolution or metadata-guided model adaptation.

## Tags

### Artificial Intelligence & ML

- [Self-Supervised Vision Representation Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/word-embeddings/self-supervised-embedding-trainers/self-supervised-vision-representation-trainers.md) — Implements large-scale self-supervised vision representation training using self-distillation and masked image modeling.
- [Vision Transformers](https://awesome-repositories.com/f/artificial-intelligence-ml/vision-transformers.md) — Employs a vision transformer architecture that processes image patches as tokens using attention layers.
- [Image Segmentation](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation.md) — Provides high-quality semantic image segmentation and foreground isolation using pretrained vision transformer heads.
- [Visual-Textual Alignments](https://awesome-repositories.com/f/artificial-intelligence-ml/cross-modal-representations/visual-textual-alignments.md) — Maps visual embeddings to textual descriptions to support cross-modal retrieval and zero-shot vision tasks.
- [Feature Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-extraction.md) — Produces high-quality dense image representations and similarity maps for various vision tasks. ([source](https://github.com/facebookresearch/dinov3/tree/FINO))
- [Feature Extractors](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-extractors.md) — Extracts high-resolution dense visual features and similarity maps from images without task-specific fine-tuning.
- [Vision-Text Alignments](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/speech-synthesis/cross-modal-alignment-models/vision-text-alignments.md) — Implements an image-text alignment model that maps visual embeddings to textual descriptions for zero-shot recognition.
- [Image Encoder Embedding Extractions](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/domain-specific-processing-pipelines/image-encoder-embedding-extractions.md) — Generates vector representations of images using pretrained backbones via standard model loaders. ([source](https://github.com/facebookresearch/dinov3/tree/FINO))
- [Masked Image Modeling](https://awesome-repositories.com/f/artificial-intelligence-ml/masked-language-modeling/masked-image-modeling.md) — Learns visual features by predicting missing image patches through masked image modeling.
- [Self-Distillation Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/masked-language-modeling/self-distillation-pipelines.md) — Optimizes vision models using self-distillation to refine semantic and geometric feature representations. ([source](https://github.com/facebookresearch/dinov3/blob/main/MODEL_CARD.md))
- [Teacher-Student Distillation](https://awesome-repositories.com/f/artificial-intelligence-ml/model-distillation-methods/teacher-student-distillation.md) — Utilizes self-distillation by training a student model to predict the output of a teacher model.
- [Dense Embeddings](https://awesome-repositories.com/f/artificial-intelligence-ml/vector-embeddings/dense-embeddings.md) — Generates high-resolution dense image embeddings and similarity maps to find correspondences.
- [Zero-Shot Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference.md) — Performs vision tasks without training on the target classes by using text-aligned model weights. ([source](https://github.com/facebookresearch/dinov3/tree/FINO))
- [Zero-Shot Segmentations](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference/zero-shot-segmentations.md) — Isolates specific objects within an image without requiring training on those particular categories. ([source](https://github.com/facebookresearch/dinov3/blob/main/notebooks/dinotxt_segmentation_inference.ipynb))
- [Adapter Layers](https://awesome-repositories.com/f/artificial-intelligence-ml/backbone-integrations/adapter-layers.md) — Uses linear-head adapter layers to map frozen high-dimensional embeddings to specific labels for downstream tasks.
- [Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-detection-tracking/object-detection.md) — Identifies and locates specific objects within images using pretrained detector heads. ([source](https://github.com/facebookresearch/dinov3/blob/main/README.md))
- [Monocular Depth Estimators](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/computer-vision/object-pose-estimations/monocular-depth-estimators.md) — Predicts depth maps from single images by mapping pixels to distance values.
- [Binary Segmentations](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/binary-segmentations.md) — Isolates the primary subject of an image from its background to create binary masks. ([source](https://github.com/facebookresearch/dinov3/tree/main/notebooks))
- [Distributed Vision Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-vision-pre-training.md) — Provides tools for distributed vision pre-training on GPU clusters to process massive unlabeled image sets.
- [Image Classification](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification.md) — Categorizes images into predefined classes using pretrained classifier heads or linear evaluation methods. ([source](https://github.com/facebookresearch/dinov3#readme))
- [Vision Transformer Pre-training](https://awesome-repositories.com/f/artificial-intelligence-ml/large-scale-model-training/vision-transformer-pre-training.md) — Executes large-scale pre-training of vision transformers using self-supervised masked image modeling. ([source](https://github.com/facebookresearch/dinov3#readme))
- [Linear Classifiers](https://awesome-repositories.com/f/artificial-intelligence-ml/linear-regression/linear-classifiers.md) — Categorizes images using extracted tokens and linear layers without extensive fine-tuning. ([source](https://github.com/facebookresearch/dinov3/blob/main/MODEL_CARD.md))
- [Vision Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-frameworks/vision-model-training.md) — Provides tools for distributed vision pretraining of self-supervised representations on GPU clusters. ([source](https://github.com/facebookresearch/dinov3/blob/main/README.md))
- [Semantic Segmentation](https://awesome-repositories.com/f/artificial-intelligence-ml/semantic-segmentation.md) — Maps image pixels to depth values or semantic labels using linear layers. ([source](https://github.com/facebookresearch/dinov3/blob/main/MODEL_CARD.md))
- [Zero-Shot Segmentors](https://awesome-repositories.com/f/artificial-intelligence-ml/zero-shot-inference/zero-shot-segmentors.md) — Isolates specific objects within images without requiring category-specific training.
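
The teacher-student self-distillation mentioned in the tags above can be sketched as a cross-entropy loss in which the student's output distribution is pushed toward the teacher's. The sketch below is a generic illustration, not the repository's training code. The temperature values are illustrative assumptions, and the exponential-moving-average teacher update is an assumption borrowed from the common recipe for this family of methods rather than something this page states.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy of the student distribution against the teacher's.

    The teacher uses a lower temperature (assumed values), which
    sharpens its distribution; no gradient would flow through it.
    """
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move teacher parameters slowly toward the student's (assumed EMA rule)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


teacher_logits = [0.2, 0.1, -0.1]
aligned_loss = distillation_loss([0.2, 0.1, -0.1], teacher_logits)
misaligned_loss = distillation_loss([-0.1, 0.1, 0.2], teacher_logits)
print(aligned_loss < misaligned_loss)

new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

A student whose logits agree with the teacher's gets a lower loss than one that ranks the outputs in the opposite order, which is the signal that drives training.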

### Part of an Awesome List

- [Cross-Modal Retrieval Training](https://awesome-repositories.com/f/awesome-lists/ai/cross-modal-models/cross-modal-retrieval-training.md) — Trains the model to align image representations with text embeddings for cross-modal retrieval. ([source](https://github.com/facebookresearch/dinov3/tree/FINO))

### Data & Databases

- [Vector Similarity Search](https://awesome-repositories.com/f/data-databases/vector-similarity-search.md) — Identifies visually similar images by calculating nearest neighbors between representation tokens. ([source](https://github.com/facebookresearch/dinov3/blob/main/MODEL_CARD.md))
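
As an illustration of nearest-neighbor retrieval over image representations, the sketch below L2-normalizes every vector and ranks a small gallery by dot product with the query. The file names and embedding values are hypothetical placeholders. A real index would hold embeddings produced by the backbone and would typically use an approximate nearest-neighbor library rather than a linear scan.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_similar(query, gallery, k=2):
    """Return the names of the k gallery items most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for name, vec in gallery.items():
        v = l2_normalize(vec)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, v)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical gallery of precomputed image embeddings.
gallery = {
    "beach_1.jpg": [0.9, 0.1, 0.2],
    "beach_2.jpg": [0.8, 0.2, 0.1],
    "forest_1.jpg": [0.1, 0.9, 0.3],
    "city_1.jpg": [0.2, 0.1, 0.9],
}
query = [0.85, 0.15, 0.15]

results = top_k_similar(query, gallery, k=2)
print(results)
```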
