DUSt3R (naver/dust3r)

DUSt3R is a transformer-based geometric vision model that predicts dense 3D pointmaps directly from one or more uncalibrated images, with no prior knowledge of camera intrinsics or extrinsics (camera poses). It treats 3D reconstruction as an end-to-end regression problem: rather than chaining camera calibration, pose estimation, and depth estimation, it regresses a 3D coordinate for every pixel of its RGB inputs. The original model's predictions are generally defined up to an unknown scale factor, so they should not be assumed to be in metric units.

The model processes image pairs with a Siamese encoder: both images pass through the same weight-sharing transformer encoder. The decoder then exchanges information between the two views through cross-attention, so that the pointmaps predicted for both images are expressed in a common coordinate frame (that of the first camera). This stereo-style approach regresses dense 3D pointmaps without an explicit correspondence search, and camera parameters can be recovered afterwards from the structure of the predicted pointmaps. For multi-view scenarios, the pairwise pointmaps are aligned into a consistent global coordinate frame by an optimization over all pairs; in the original method this global alignment is an iterative gradient-descent procedure rather than a closed-form solve.
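As a small illustration of what aligning one pointmap to another involves, the sketch below fits a single similarity transform (scale, rotation, translation) between two pointmaps using the closed-form Umeyama method. This is a stand-in for illustration only, not DUSt3R's own multi-view alignment code; the function name `umeyama_alignment` and the synthetic pointmaps are assumptions made for the example.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Return (s, R, t) minimizing sum ||s * R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Hypothetical helper for illustration, not part of the DUSt3R codebase.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered target and source points.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic example: pointmap_b is pointmap_a under a known similarity transform.
rng = np.random.default_rng(0)
pointmap_a = rng.normal(size=(4, 5, 3))          # H x W x 3 pointmap
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
s_true, t_true = 2.0, np.array([0.5, -1.0, 3.0])
pointmap_b = s_true * pointmap_a @ R_true.T + t_true

s, R, t = umeyama_alignment(pointmap_a.reshape(-1, 3), pointmap_b.reshape(-1, 3))
aligned = s * pointmap_a @ R.T + t               # pointmap_a mapped into b's frame
```

With many images, one such transform per pair is not enough on its own, because the per-pair estimates must also agree with each other; that is what a joint optimization over all pairs addresses.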

Because it needs neither known camera poses nor calibration data, the system can reconstruct 3D scenes from arbitrary, unordered image collections. The predicted pointmaps also support camera parameter recovery: pixel correspondences, relative camera poses, and absolute camera parameters can all be derived directly from them.
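To make the idea of recovering a camera parameter from a pointmap concrete, here is a minimal sketch that estimates a focal length from a pointmap expressed in its own camera frame. It assumes a pinhole camera with square pixels and a principal point at the image center, and it uses plain least squares; the original method may use a more robust solver. The helper names `make_pointmap` and `estimate_focal` are hypothetical.

```python
import numpy as np

def make_pointmap(depth, focal):
    """Back-project a depth map into an H x W x 3 pointmap (pinhole, centered principal point)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)   # v: row index, u: column index
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    x = (u - cx) / focal * depth
    y = (v - cy) / focal * depth
    return np.stack([x, y, depth], axis=-1)

def estimate_focal(pointmap):
    """Least-squares focal length from a camera-frame pointmap.

    Under the pinhole model, u - cx = f * x / z and v - cy = f * y / z,
    so f is the slope of a line through the origin.
    """
    h, w, _ = pointmap.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    a = np.concatenate([(u - cx).ravel(), (v - cy).ravel()])   # pixel offsets
    b = np.concatenate([(x / z).ravel(), (y / z).ravel()])     # normalized coordinates
    return float(a @ b / (b @ b))

# Synthetic check: build a pointmap with a known focal length, then recover it.
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 5.0, size=(48, 64))
true_focal = 300.0
pointmap = make_pointmap(depth, true_focal)
focal_est = estimate_focal(pointmap)
```

On real predictions the pointmap is noisy, so a robust variant (for example, iteratively reweighted least squares) would be a reasonable substitute for the plain fit shown here.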

Features

Uncalibrated Reconstructions - Predicts dense 3D pointmaps from uncalibrated images without requiring camera intrinsics or extrinsics.
Uncalibrated Multi-View Fusions - Reconstructs 3D scenes from arbitrary unordered image collections without requiring known camera poses.
End-to-End Depth Predictions - Directly outputs 3D coordinates from RGB images, bypassing traditional depth estimation pipelines; the original model's outputs are generally defined up to scale rather than guaranteed metric.
Transformer-Based Pointmap Predictors - A transformer-based architecture that directly outputs 3D pointmaps from image pairs for geometric understanding.
Transformer-Based Stereo Matchers - Uses a shared transformer encoder to process image pairs and directly regress dense 3D pointmaps.
Pointmap-Based Camera Recoveries - Derives camera parameters analytically from predicted 3D pointmaps, enabling uncalibrated reconstruction.
Pointmap Registrations - Aligns pairwise pointmaps into a consistent world coordinate system by optimizing per-pair poses and scale factors.
Pairwise Pointmap Alignments - Aligns pairwise pointmaps from multiple images into a consistent global coordinate frame via registration.
Uncalibrated Pointmap Predictions - Predicts dense 3D pointmaps from uncalibrated images without requiring camera intrinsics or extrinsics.
Geometric Vision Pipelines - Builds end-to-end 3D vision workflows combining reconstruction, alignment, and parameter extraction from raw images.
Cross-Attention Fusion Layers - Uses cross-attention layers in the decoder to merge features from two images into a unified pointmap.
Shared Encoder Image Pair Processors - Processes two images through a shared encoder before fusing them in a decoder for joint pointmap prediction.
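Once two pointmaps live in the same coordinate frame, pixel correspondences can be read off by matching pixels whose 3D points are mutual nearest neighbors. The sketch below shows that idea with a brute-force NumPy search on synthetic points; the function name `mutual_nearest_neighbors` is hypothetical, and a real pipeline would likely use a spatial index rather than a full distance matrix.

```python
import numpy as np

def mutual_nearest_neighbors(points_a, points_b):
    """Return an (M, 2) array of index pairs (i, j) such that points_b[j] is the
    nearest neighbor of points_a[i] and vice versa.

    points_a: (N, 3), points_b: (K, 3), both in the same coordinate frame.
    """
    # Squared distances between every point in A and every point in B.
    d2 = ((points_a[:, None, :] - points_b[None, :, :]) ** 2).sum(axis=-1)
    nn_ab = d2.argmin(axis=1)            # for each point in A, its nearest point in B
    nn_ba = d2.argmin(axis=0)            # for each point in B, its nearest point in A
    idx_a = np.arange(len(points_a))
    keep = nn_ba[nn_ab] == idx_a         # keep only reciprocal matches
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

# Synthetic example: B holds the same 3D points as A, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
points_a = rng.normal(size=(50, 3))      # flattened pointmap of image A
perm = rng.permutation(50)
points_b = points_a[perm] + 1e-6 * rng.normal(size=(50, 3))

matches = mutual_nearest_neighbors(points_a, points_b)
```

Flattened indices can be converted back to pixel coordinates with `np.unravel_index(index, (H, W))` for an H x W pointmap.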

Open-source alternatives to DUSt3R

NVIDIA/cosmos


Cosmos is an open platform of world models, datasets, and tools for building physical AI systems such as robots and autonomous vehicles. It provides video generation and video understanding models that can generate synthetic videos and world simulations from text, image, video, or action inputs, and analyze videos to produce captions, event timestamps, spatial bounding boxes, and next-action predictions. The platform includes a world simulation generator that produces images, videos, synchronized audio, and action-conditioned rollouts for synthetic data, alongside a visual content analyzer th
