Depth-Anything is a foundation model for monocular depth estimation: it produces a dense per-pixel depth map from a single RGB image. It is built on a DINOv2 Vision Transformer encoder and trained on 62 million unlabeled images using a teacher-student pseudo-labeling framework, which lets it generalize across diverse scenes without task-specific training. The pretrained model outputs relative depth maps, which capture the ordering of scene points rather than absolute distances; after fine-tuning on datasets such as NYUv2 or KITTI, it outputs metric depth maps in real-world units.
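The gap between relative and metric depth can be made concrete with a small sketch. It assumes, as is common for relative depth predictors, that a relative map differs from a metric reference by an unknown scale and shift, and it recovers both by least squares. This is an illustrative comparison technique, not the project's fine-tuning procedure; the function name `align_scale_shift` and its parameters are hypothetical.

```python
import numpy as np


def align_scale_shift(relative, reference, mask=None):
    """Fit reference ~= s * relative + t by least squares and return (s, t).

    Assumption: the relative map is related to the reference by a single
    global scale and shift. `mask` optionally selects the valid pixels.
    """
    rel = np.asarray(relative, dtype=np.float64).ravel()
    ref = np.asarray(reference, dtype=np.float64).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        rel, ref = rel[keep], ref[keep]
    # Design matrix with columns [relative value, 1] for the model s * rel + t.
    design = np.stack([rel, np.ones_like(rel)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, ref, rcond=None)
    return float(s), float(t)
```

Given a fitted pair, `s * relative + t` maps the relative prediction into the units of the reference. A single global fit like this only rescales the map; it does not add metric information the model never learned, which is why fine-tuning on metric datasets is a separate step.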
The project distinguishes itself in two ways. It can process video frame by frame for consistent depth estimation across clips, and it integrates with ControlNet pipelines for depth-conditioned image generation, where it replaces the default depth estimator to provide more precise conditioning signals. It also offers a fine-tuning framework for adapting the pretrained model to custom datasets or to downstream tasks such as semantic segmentation, with demonstrated performance on benchmarks like Cityscapes and ADE20K.
Depth-Anything provides a command-line interface for batch processing images and videos, with options for grayscale output or side-by-side visualization. The model can be loaded via Hugging Face Transformers pipelines for minimal-code inference, or loaded from disk for direct tensor-based inference.
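As a rough illustration of the Transformers route, the sketch below runs the `depth-estimation` pipeline on one image and converts the raw prediction to an 8-bit grayscale array. The checkpoint name `LiheYoung/depth-anything-small-hf` is an assumption (any compatible checkpoint should work), and the helper names `depth_to_uint8` and `estimate_depth` are hypothetical, not part of the project. The heavy imports live inside the function so the normalization helper can be used on its own.

```python
import numpy as np


def depth_to_uint8(depth):
    """Min-max normalize a depth array to the 0..255 range for grayscale output."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # A constant map carries no contrast; return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    return ((depth - lo) / (hi - lo) * 255.0).astype(np.uint8)


def estimate_depth(image_path, checkpoint="LiheYoung/depth-anything-small-hf"):
    """Run depth estimation on one image and return an 8-bit grayscale array.

    Assumption: `checkpoint` names a depth-estimation model on the Hugging
    Face Hub; requires the `transformers`, `torch`, and `Pillow` packages.
    """
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline(task="depth-estimation", model=checkpoint)
    result = pipe(Image.open(image_path))
    # `predicted_depth` is the raw tensor; squeeze drops any batch dimension.
    raw = result["predicted_depth"].squeeze().cpu().numpy()
    return depth_to_uint8(raw)
```

Calling `estimate_depth("photo.jpg")` with a local image path downloads the checkpoint on first use. The pipeline result also contains a ready-made PIL image under the `depth` key, which is enough when only a visualization is needed.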