Map-free Visual Relocalization
The Challenge: Pose Without a Map
Classical visual localization assumes access to a pre-built 3D map of the environment—a point cloud or mesh reconstructed from many images. Given a new query image, the system matches it against this map to determine the camera’s position and orientation. This works well when you can afford to map the environment in advance, but it fails in scenarios where no prior map exists: disaster response in damaged areas, navigation in newly constructed environments, or localization from a single reference image sent by another user.
The Map-free Visual Relocalization Challenge 2024 pushes the field toward this harder setting. The task: estimate the relative camera pose between images without any pre-built 3D map. Two tracks test different aspects of this capability:
- Track 1: Given a single query image and a single reference image, estimate the relative 6-DoF pose (rotation and translation) between them.
- Track 2: Given a sequence of query images and a single reference image, estimate the relative pose. The sequence provides temporal context—consecutive frames from a moving camera—that can help resolve ambiguities present in single-image matching.
Why This Matters
Map-free relocalization has immediate practical applications:
- Augmented reality: When a user shares a photo of a location, another user should be able to hold up their phone and see AR content anchored to the same scene, without either user having pre-mapped the area.
- Robot navigation: A robot given a single image of a goal location should be able to navigate there by estimating relative poses along the way.
- Emergency response: After a natural disaster, pre-existing maps may be invalid. Responders need to localize using whatever reference imagery is available.
The core technical challenge is that relative pose estimation from images alone requires solving correspondence (which parts of two images show the same 3D point) and geometry (what 3D arrangement of points is consistent with the observed correspondences) simultaneously, without the luxury of a dense reconstruction to anchor the solution.
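For concreteness, here is what that coupled problem looks like in the classical, calibrated two-view setting, sketched with OpenCV: local features supply the correspondences, and the essential matrix supplies the geometry. The image paths and the intrinsics matrix K are placeholders, and the recovered translation is only defined up to scale, which is precisely the gap that map-free methods must close.

```python
# Minimal classical two-view relative pose, sketched with OpenCV.
# The image paths and the intrinsics K are placeholders.
import cv2
import numpy as np

img0 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])  # assumed pinhole intrinsics

# Correspondence: detect and match local features.
sift = cv2.SIFT_create()
kp0, des0 = sift.detectAndCompute(img0, None)
kp1, des1 = sift.detectAndCompute(img1, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des0, des1)
pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])

# Geometry: essential matrix with RANSAC, decomposed into rotation and translation.
E, inliers = cv2.findEssentialMat(pts0, pts1, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=inliers)
# t is recovered only up to scale: metric translation is exactly what the
# map-free setting still has to supply.
```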
Approach 1: MASt3R
Our primary approach builds on MASt3R (Matching and Stereo 3D Reconstruction), a recent architecture that directly predicts 3D point maps from pairs of images. MASt3R operates on the insight that matching 2D images is fundamentally a 3D problem—two images of the same scene depict the same 3D structure from different viewpoints, and recovering that structure provides the correspondence and geometry needed for pose estimation.
How MASt3R Works
MASt3R takes two images as input and outputs, for each image, a dense 3D point map in a shared coordinate frame. Each pixel in each image is assigned a 3D position, effectively performing stereo reconstruction without requiring calibrated cameras or known baselines.
The architecture uses a Vision Transformer backbone (346M parameters) trained on large-scale stereo and multi-view datasets. The key innovation is that it directly regresses 3D coordinates rather than intermediate representations like depth maps or feature descriptors. This end-to-end approach avoids the error accumulation that plagues traditional pipelines (feature extraction, matching, triangulation, bundle adjustment).
From Point Maps to Pose
Given MASt3R’s 3D point maps for both images, relative pose estimation becomes a rigid alignment problem: find the rotation and translation that best aligns the two point clouds. This can be solved efficiently with Procrustes analysis or RANSAC-based methods, leveraging the dense correspondences that MASt3R provides.
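A minimal sketch of the alignment step, assuming matched 3D points have already been extracted from the two point maps; variable names are illustrative, and in practice this would be wrapped in RANSAC and weighted by MASt3R's per-pixel confidences.

```python
import numpy as np

def kabsch_umeyama(P, Q, weights=None):
    """Rigid alignment so that Q ~ R @ P.T + t, for corresponding 3D points P, Q of shape (N, 3)."""
    if weights is None:
        weights = np.ones(len(P))
    w = weights / weights.sum()
    mu_p = (w[:, None] * P).sum(axis=0)
    mu_q = (w[:, None] * Q).sum(axis=0)
    H = (w[:, None] * (P - mu_p)).T @ (Q - mu_q)   # weighted cross-covariance (3, 3)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t
```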
The advantage over traditional feature-matching pipelines (e.g., SuperPoint + SuperGlue) is robustness to wide baselines and significant viewpoint changes. Traditional descriptors struggle when the viewpoint difference exceeds 30–40 degrees because local image patches look too different. MASt3R’s 3D reasoning handles this by implicitly modeling the 3D structure that connects disparate viewpoints.
Approach 2: 4M (Massively Multimodal Masked Modeling)
We also explored the 4M architecture, a foundation model trained to process and generate across many modalities simultaneously—RGB images, depth maps, surface normals, semantic segmentations, and more.
Architecture and Scale
4M is a large-scale model trained on 64–128 A100 GPUs for 1.5 to 8 days depending on model size. It uses a masked modeling objective across all modalities: given partial observations (e.g., an RGB image with some modalities masked), the model predicts the missing modalities. This cross-modal training produces representations that encode rich 3D and semantic information.
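To make the objective concrete, here is a deliberately tiny sketch of cross-modal masked modeling, assuming each modality has already been tokenized into discrete ids. The vocabulary size, token counts, and the small encoder are illustrative stand-ins, not 4M's actual encoder-decoder architecture or training configuration.

```python
import torch
import torch.nn as nn

VOCAB, DIM, N_TOKENS = 1024, 256, 3 * 196   # 3 tokenized modalities, 196 tokens each (toy sizes)
MASK_ID = VOCAB                              # extra id reserved for [MASK]

class MaskedMultimodalToy(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB + 1, DIM)   # +1 for [MASK]
        self.modality_embed = nn.Embedding(3, DIM)        # e.g. rgb / depth / normals
        self.pos_embed = nn.Parameter(torch.zeros(1, N_TOKENS, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, modality_ids):
        x = self.token_embed(tokens) + self.modality_embed(modality_ids) + self.pos_embed
        return self.head(self.encoder(x))                 # logits over the token vocabulary

def masked_modeling_step(model, tokens, modality_ids, mask_ratio=0.5):
    # Hide a random subset of tokens across all modalities and predict them.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    logits = model(corrupted, modality_ids)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])
```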
The scale of 4M’s training is both its strength and its limitation. The representations are exceptionally rich—encoding geometry, semantics, and appearance in a unified feature space—but the model is expensive to fine-tune and slow to run at inference time compared to purpose-built architectures like MASt3R.
Adaptation for Relocalization
Using 4M for relocalization requires extracting features from both query and reference images, then estimating the relative pose from the resulting correspondences. We explored using 4M’s pretrained weights as a feature backbone, replacing its output heads with pose regression heads inspired by the Mickey architecture (discussed below).
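A sketch of this adaptation, assuming a frozen pretrained encoder that returns pooled per-image features; the feature dimension, head design, and rotation parameterization are illustrative assumptions rather than the exact heads we used.

```python
import torch
import torch.nn as nn

class RelativePoseHead(nn.Module):
    """Regresses a relative rotation (6D continuous parameterization) and a
    3D translation from concatenated query/reference features."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9),               # 6 rotation params + 3 translation params
        )

    def forward(self, feat_query, feat_ref):
        out = self.mlp(torch.cat([feat_query, feat_ref], dim=-1))
        return out[..., :6], out[..., 6:]    # (rot6d, translation)

class PoseEstimator(nn.Module):
    def __init__(self, backbone, feat_dim=768):
        super().__init__()
        self.backbone = backbone             # stand-in for a pretrained 4M/ViT encoder
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze the backbone, fine-tune only the head
        self.head = RelativePoseHead(feat_dim)

    def forward(self, img_query, img_ref):
        f_q = self.backbone(img_query)       # assumed to return (B, feat_dim) pooled features
        f_r = self.backbone(img_ref)
        return self.head(f_q, f_r)
```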
The hypothesis was that 4M’s multi-modal pretraining would provide features that are more geometrically informative than those from a standard ImageNet-pretrained backbone. Preliminary results supported this: 4M features produced better correspondences in textureless regions and under lighting changes, likely because the depth and normal prediction objectives during pretraining forced the model to learn geometry-aware representations.
Depth-Augmented Matching with Mickey Variants
Mickey (Metric Keypoints) is a learning-based relative pose estimation method that detects and matches keypoints between images and recovers the metric relative pose from them. We explored two ways to augment Mickey with depth information, motivated by the availability of strong monocular depth estimators like Depth Anything.
Variant 1: Depth as Additional Input
The simplest approach: run Depth Anything on each input image to produce a dense monocular depth map, then concatenate the depth map as a fourth channel alongside RGB. The matching network receives 4-channel input (RGBD) and can use depth cues to disambiguate correspondences that are ambiguous in RGB alone.
This approach is straightforward to implement—just change the input convolution from 3 to 4 channels and fine-tune—but relies on the accuracy of monocular depth estimation. Depth Anything produces metrically inconsistent but relatively accurate ordinal depth maps (it gets the depth ordering right even if absolute scale is off), which may or may not help depending on whether the matching network can learn to exploit ordinal depth relationships.
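The weight surgery itself is only a few lines. The sketch below assumes a PyTorch model whose stem is a standard nn.Conv2d (the attribute path in the usage comment is a placeholder); it copies the pretrained RGB filters and initializes the new depth channel near zero so fine-tuning starts from the original RGB behaviour.

```python
import torch
import torch.nn as nn

def inflate_input_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    """Replace a 3-channel input convolution with a 4-channel one, reusing the
    pretrained RGB filters and initializing the new depth channel near zero."""
    conv4 = nn.Conv2d(
        in_channels=4,
        out_channels=conv3.out_channels,
        kernel_size=conv3.kernel_size,
        stride=conv3.stride,
        padding=conv3.padding,
        bias=conv3.bias is not None,
    )
    with torch.no_grad():
        conv4.weight[:, :3] = conv3.weight
        conv4.weight[:, 3:] = 0.01 * torch.randn_like(conv4.weight[:, 3:])
        if conv3.bias is not None:
            conv4.bias.copy_(conv3.bias)
    return conv4

# Usage (attribute path is a placeholder for wherever the matcher's stem lives):
# matcher.backbone.conv1 = inflate_input_conv(matcher.backbone.conv1)
```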
Variant 2: Depth as Supervision
Rather than feeding depth as input, use it as an auxiliary training signal. The matching network still receives RGB input only, but during training, an additional loss term penalizes inconsistency between the predicted correspondences and the depth maps. Specifically, if two pixels are matched as corresponding points, their depth values (from Depth Anything) should be consistent with the predicted relative pose.
This approach has the advantage of not requiring depth at inference time—the model learns to internalize depth reasoning during training and applies it implicitly at test time. The disadvantage is that errors in the monocular depth estimates introduce noise into the supervision signal, potentially degrading training if the depth errors are systematic.
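A sketch of such a consistency term, assuming matched pixel coordinates, monocular depths sampled at those pixels, intrinsics, and a predicted pose are available as tensors; the handling of monocular depth's scale ambiguity is deliberately simplified here.

```python
import torch

def depth_consistency_loss(uv0, uv1, depth0, depth1, K0, K1, R, t):
    """uv0, uv1: (N, 2) matched pixel coordinates in images 0 and 1.
    depth0, depth1: (N,) monocular depths sampled at those pixels.
    K0, K1: (3, 3) intrinsics. R: (3, 3) and t: (3,) predicted pose from camera 0 to camera 1.
    """
    # Back-project the matches from image 0 into 3D using their monocular depth.
    ones = torch.ones(uv0.shape[0], 1, device=uv0.device)
    rays0 = torch.cat([uv0, ones], dim=1) @ torch.inverse(K0).T   # (N, 3), z = 1
    X0 = rays0 * depth0[:, None]

    # Transform into camera 1 and project back to pixels.
    X1 = X0 @ R.T + t
    proj = X1 @ K1.T
    uv1_pred = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # Matched pixels should reproject onto each other, and the transformed depth
    # should agree with the monocular depth sampled in image 1.
    reproj_term = (uv1_pred - uv1).norm(dim=1).mean()
    depth_term = (X1[:, 2] - depth1).abs().mean()
    return reproj_term + depth_term
```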
Research Directions Beyond the Challenge
The challenge motivated several longer-term research directions:
Text-to-3D for Data Augmentation
One bottleneck for relocalization is the limited diversity of training pairs—collecting images of the same scene from many viewpoints is expensive. Text-to-3D generation could provide unlimited synthetic training data: describe a scene in text, generate a 3D model, and render image pairs with known relative poses. The challenge is ensuring that the synthetic data is realistic enough to transfer to real-world relocalization.
Synthetic Data for Novel View Synthesis
A related direction: given a single reference image, synthesize realistic views from nearby viewpoints to create synthetic query-reference pairs. Recent advances in diffusion-based novel view synthesis (Zero-1-to-3, Zero123++) make this increasingly feasible. If the synthesized views are photorealistic and geometrically consistent, they could dramatically expand the training data for map-free relocalization.
Depth-Aware Feature Matching
More broadly, the integration of depth estimation with feature matching is an active research frontier. Current approaches treat depth and matching as separate problems, but they are fundamentally coupled: better depth helps matching (by providing geometric constraints), and better matching helps depth (by providing multi-view consistency). A unified architecture that jointly estimates depth and correspondences could outperform both in isolation.
Reflections
- 3D reasoning is the key differentiator. Methods that explicitly reason about 3D structure (MASt3R) outperform those that treat relocalization as a 2D matching problem, especially under wide baselines and significant viewpoint changes.
- Foundation models provide strong features but need adaptation. 4M’s pretrained representations are geometrically rich, but extracting pose-relevant information requires careful architectural design. Off-the-shelf features are not sufficient.
- Monocular depth is a powerful but imperfect signal. Depth Anything and similar models provide useful geometric priors, but their metric inconsistency limits their utility as direct input or supervision for pose estimation. Future work should focus on scale-aware depth estimation.
- The single-image setting is fundamentally harder. Track 2 (sequence-to-image) consistently outperformed Track 1 (image-to-image) across all methods, confirming that temporal context provides crucial disambiguation. Closing this gap for the single-image case remains an open challenge.
- Compute cost matters. MASt3R at 346M parameters offers a practical balance of accuracy and efficiency. 4M, while producing richer representations, requires 64+ GPUs for training—a reminder that the best research models are not always the best engineering solutions.