DUSt3R is a transformer-based geometric vision model that predicts dense 3D pointmaps directly from one or more uncalibrated images, with no need for known camera intrinsics or extrinsics. A pointmap assigns a 3D coordinate to every pixel of an image. The approach is end to end: it bypasses the traditional pipeline of separate camera calibration and depth estimation and regresses 3D coordinates straight from RGB inputs. In the original model these coordinates are defined only up to an unknown scale factor, not in metric units.
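To make the term "pointmap" concrete, the following minimal sketch builds one from a synthetic depth map and assumed pinhole intrinsics. It only illustrates the data structure, an (H, W, 3) array holding one 3D point per pixel; it is not DUSt3R code, and the function name `depth_to_pointmap` and all numeric values are hypothetical. DUSt3R itself predicts such an array with a network instead of computing it from known depth and intrinsics.

```python
import numpy as np

def depth_to_pointmap(depth, focal, cx, cy):
    """Back-project a depth map into an (H, W, 3) pointmap in the camera frame.

    Assumes a pinhole camera with a single focal length (in pixels) and
    principal point (cx, cy). All names here are illustrative.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return np.stack([x, y, depth], axis=-1)

# Synthetic example: a fronto-parallel plane 2 units from the camera.
depth = np.full((4, 6), 2.0)
pointmap = depth_to_pointmap(depth, focal=100.0, cx=3.0, cy=2.0)
```

Here `pointmap[v, u]` is the 3D point seen at pixel column `u`, row `v`, so the array keeps the one-to-one link between pixels and scene points.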
The model processes image pairs. A Siamese encoder with shared weights encodes both images, and two decoders exchange information through cross-attention, so that the two output pointmaps are expressed in a common coordinate frame, that of the first camera. This transformer-based approach regresses dense 3D pointmaps directly, without an explicit correspondence search, and camera parameters can then be recovered from the structure of the predicted pointmaps. For multi-view scenarios, the pairwise pointmaps are aligned into a consistent global coordinate frame by an iterative, gradient-based optimization over all image pairs.
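As an illustration of how a camera parameter can be read off a pointmap, the sketch below estimates a focal length from a pointmap expressed in its own camera frame. It assumes a pinhole camera with the principal point at the image center and uses an ordinary least-squares fit; the published method is described as using a robust iterative solver, so this is a simplified stand-in and not the reference implementation. The function name `estimate_focal` and the synthetic data are hypothetical.

```python
import numpy as np

def estimate_focal(pointmap):
    """Least-squares focal length (in pixels) from an (H, W, 3) pointmap.

    Assumption: pinhole model with the principal point at the image center,
    so that (u - cx, v - cy) ~= focal * (x / z, y / z) for every pixel.
    """
    h, w, _ = pointmap.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u - cx, v - cy], axis=-1).reshape(-1, 2)
    pts = pointmap.reshape(-1, 3)
    valid = pts[:, 2] > 1e-8            # keep points in front of the camera
    proj = pts[valid, :2] / pts[valid, 2:3]
    # Closed-form minimizer of sum ||pix - focal * proj||^2 over focal.
    return float(np.sum(pix[valid] * proj) / np.sum(proj * proj))

# Synthetic pointmap generated with a known focal length of 120 pixels.
h, w, true_focal = 6, 8, 120.0
u, v = np.meshgrid(np.arange(w), np.arange(h))
depth = 2.0 + 0.1 * u + 0.05 * v
cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
pointmap = np.stack(
    [(u - cx) * depth / true_focal, (v - cy) * depth / true_focal, depth],
    axis=-1,
)
focal = estimate_focal(pointmap)
focal_rescaled = estimate_focal(3.0 * pointmap)
```

Because only the ratios x/z and y/z enter the fit, multiplying the whole pointmap by a constant leaves the estimate unchanged, which is why `focal` and `focal_rescaled` agree.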
The system supports uncalibrated multi-view fusion: it reconstructs 3D scenes from arbitrary unordered image collections without known camera poses or calibration data. It also supports camera parameter recovery, since pixel correspondences, relative and absolute camera poses, and camera intrinsics can all be derived from the predicted 3D pointmaps.
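One way correspondences can be derived from two pointmaps that share a coordinate frame is mutual nearest-neighbor search in 3D: a pair of pixels is kept only if each point is the other's closest point. The sketch below shows the idea with brute-force distances on flattened point sets. It is a minimal illustration under that assumption, not DUSt3R's implementation; the function name `mutual_nearest_neighbors` and the synthetic data are hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(points_a, points_b):
    """Return index pairs (i, j) where a[i] and b[j] are each other's nearest point.

    points_a: (N, 3) array, points_b: (M, 3) array, both in the same frame.
    Brute force, O(N * M) memory; fine for a sketch, not for full images.
    """
    diff = points_a[:, None, :] - points_b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)      # (N, M) squared distances
    nn_ab = np.argmin(d2, axis=1)        # nearest b for every a
    nn_ba = np.argmin(d2, axis=0)        # nearest a for every b
    idx_a = np.arange(len(points_a))
    keep = nn_ba[nn_ab] == idx_a         # reciprocity check
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

# Synthetic data: the same 24 grid points, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
grid = np.meshgrid(np.arange(3), np.arange(4), np.arange(2), indexing="ij")
points_a = np.stack(grid, axis=-1).reshape(-1, 3).astype(float)
perm = rng.permutation(len(points_a))
points_b = points_a[perm] + rng.uniform(-0.01, 0.01, size=points_a.shape)
matches = mutual_nearest_neighbors(points_a, points_b)
```

Each row of `matches` pairs an index into `points_a` with an index into `points_b`. For real pointmaps of shape (H, W, 3), the flat indices can be converted back to pixel coordinates with `np.unravel_index`.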