Depth Anything

Depth-Anything is a monocular depth estimation foundation model that produces dense per-pixel depth maps from a single RGB image. It is built on a DINOv2 Vision Transformer encoder backbone and trained on 62 million unlabeled images using a teacher-student pseudo-labeling framework, which lets it generalize across diverse scenes without task-specific training. The pretrained model outputs relative depth maps, which capture the ordering of scene points; after fine-tuning on metric datasets such as NYUv2 or KITTI, it outputs metric depth maps in real-world units.
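
The teacher-student pseudo-labeling idea can be pictured with a toy sketch. This is not the project's training code: a one-dimensional least-squares line stands in for both depth networks, and the names `fit_line` and `pseudo_label_training` are hypothetical. It only shows the order of operations: fit a teacher on labeled data, let the teacher label the unlabeled pool, then fit a student on the union.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y ~ a * x + b (toy stand-in for a depth network)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("inputs must not all be identical")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def pseudo_label_training(
    labeled: Sequence[Tuple[float, float]],
    unlabeled: Sequence[float],
) -> Tuple[float, float]:
    """Return student parameters (a, b) trained on labeled plus pseudo-labeled data."""
    # Step 1: train the teacher on the labeled pairs only.
    xs, ys = zip(*labeled)
    teacher_a, teacher_b = fit_line(xs, ys)

    # Step 2: the teacher assigns pseudo labels to the unlabeled inputs.
    pseudo: List[Tuple[float, float]] = [
        (x, teacher_a * x + teacher_b) for x in unlabeled
    ]

    # Step 3: train the student on the union of real and pseudo labels.
    student_xs, student_ys = zip(*(list(labeled) + pseudo))
    return fit_line(student_xs, student_ys)
```

In this toy setting the student simply reproduces the teacher; the sketch shows only the data flow, not how a real student model might improve on its teacher.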

The project can process video frame by frame to produce consistent depth estimates across a clip. It also integrates with ControlNet pipelines for depth-conditioned image generation, where it replaces the default depth estimator to provide more precise conditioning signals. In addition, it offers a fine-tuning framework for adapting the pretrained model to custom datasets or to downstream tasks such as semantic segmentation, with demonstrated performance on benchmarks like Cityscapes and ADE20K.
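
To make the ControlNet conditioning step concrete, here is a minimal sketch of turning a raw depth map into an image-like conditioning signal. It assumes the pipeline expects an 8-bit, three-channel image and that min-max normalization is acceptable; neither detail comes from the project itself. The function name `depth_to_conditioning_image` is hypothetical, and plain Python lists are used so the example runs without dependencies (real code would more likely use NumPy arrays).

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def depth_to_conditioning_image(depth: Sequence[Sequence[float]]) -> List[List[Pixel]]:
    """Min-max normalize a depth map to 0..255 and replicate it across three channels."""
    flat = [value for row in depth for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest

    image: List[List[Pixel]] = []
    for row in depth:
        out_row: List[Pixel] = []
        for value in row:
            # A constant depth map has no contrast, so map it to black.
            gray = 0 if span == 0 else round(255 * (value - lowest) / span)
            out_row.append((gray, gray, gray))
        image.append(out_row)
    return image
```

Because the normalization is computed per image, applying it independently to each frame of a video can change the brightness scale from frame to frame.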

Depth-Anything provides a command-line interface for batch processing images and videos, with options for grayscale output or side-by-side visualization. The model can be loaded via Hugging Face Transformers pipelines for minimal-code inference, or loaded from disk for direct tensor-based inference.
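
The minimal-code path through Hugging Face Transformers might look like the sketch below. It assumes the `transformers` and `Pillow` packages are installed and that a checkpoint is published under the model ID shown; the ID and the helper name `estimate_depth` are assumptions for illustration, not taken from this page.

```python
from typing import Any, Dict

# Assumed checkpoint name on the Hugging Face Hub; substitute the one you actually use.
DEFAULT_MODEL_ID = "LiheYoung/depth-anything-small-hf"


def estimate_depth(image_path: str, model_id: str = DEFAULT_MODEL_ID) -> Dict[str, Any]:
    """Run the Transformers depth-estimation pipeline on one image file."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    from PIL import Image
    from transformers import pipeline

    estimator = pipeline(task="depth-estimation", model=model_id)
    image = Image.open(image_path).convert("RGB")

    # The pipeline returns a dict holding the raw prediction tensor under
    # "predicted_depth" and a rendered PIL image under "depth".
    return estimator(image)
```

A call such as `estimate_depth("room.jpg")["depth"].save("room_depth.png")` would write the rendered depth map to disk, where `room.jpg` is a placeholder file name. The first call downloads the model weights, so it needs network access.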

Features

Depth - Provides a large-scale pretrained depth estimation model that generalizes across diverse scenes without task-specific training.

Monocular Depth Estimators - Estimates dense per-pixel depth maps from single RGB images using a DINOv2 encoder backbone.

Metric Depth Estimators - Fine-tunes the model on metric datasets to output depth values in real-world units from a single image.

Depth Estimation - Processes a single RGB image through a fully convolutional decoder to produce a per-pixel depth map.

Pretrained Depth Models - Provides a pretrained monocular depth estimation model that outputs relative depth maps out of the box and metric depth maps after fine-tuning.

Relative Depth Estimators - Produces depth maps that capture the relative ordering of scene points from a single image without domain-specific training.

Teacher-Student Distillation - Generates pseudo depth labels from a teacher model on unlabeled data and trains a student model to predict them.

Teacher-Student Pseudo-Label Training - Trains the depth model on 62 million unlabeled images using a teacher-student pseudo-labeling framework.

Self-Supervised - Uses a DINOv2 Vision Transformer encoder pre-trained with self-supervised learning as the backbone for depth estimation.

Single-Image Metric Depth Mappers - Produces depth maps with real-world units from one image, enabling direct measurement of scene geometry.

Depth Estimation CLI Tools - Provides a command-line interface for batch processing images to generate depth maps with grayscale or side-by-side output.

Relative Depth Map Generators - Outputs depth values that indicate which parts of a scene are closer or farther without providing absolute scale.

Video Depth Frameworks - Processes video frames sequentially to generate consistent depth maps for each frame in a clip.

Depth Estimation Fine-Tunings - Provides a framework for fine-tuning the pretrained depth model on custom datasets and downstream tasks for improved accuracy.

Depth Map Conditioning - Integrates with ControlNet pipelines to provide precise depth maps as conditioning signals for image synthesis.

Relative-to-Metric Depth Scaling - Fine-tunes the relative depth model on metric datasets like NYUv2 or KITTI to output depth in real-world units.

Depth Map Batch Processors - Provides a command-line interface for batch processing images and videos to generate depth maps.

Depth Estimation Pipelines - Ships a Hugging Face pipeline wrapper for running depth estimation on images with minimal code.

Metric Depth Mapping - Outputs depth values in real-world units when a metric model is used, enabling direct measurement of scene geometry.

Depth Frame Processors - Processes video frames sequentially to generate consistent depth maps for each frame in a clip.
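
The relative and metric entries in the list above differ in whether depth values carry absolute scale. The sketch below illustrates that gap with a swapped-in technique, least-squares scale-and-shift alignment, which is not the fine-tuning approach the project describes. It assumes the relative values relate to a handful of reference measurements through an unknown scale and shift; the names `fit_scale_shift` and `apply_scale_shift` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(
    relative: Sequence[float],
    metric: Sequence[float],
) -> Tuple[float, float]:
    """Least-squares scale s and shift t so that s * relative + t ~ metric."""
    if len(relative) != len(metric) or len(relative) < 2:
        raise ValueError("need at least two paired samples of equal length")
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be identical")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def apply_scale_shift(relative: Sequence[float], scale: float, shift: float) -> List[float]:
    """Map relative depth values into the units of the reference measurements."""
    return [scale * r + shift for r in relative]
```

With relative values `[0, 1, 2, 3]` and reference values `[2, 4, 6, 8]` the fit recovers a scale of 2 and a shift of 2. An alignment like this holds only for the image it was fitted on, which is one reason a fine-tuned metric model is the more general route.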

LiheYoung/Depth-Anything
