3D Scene Understanding with GeoLLMs
The Problem: Language Meets 3D Geography
Large language models understand text. Vision-language models understand images. But neither natively understands the three-dimensional structure of the physical world—the spatial relationships between buildings, the height of structures, the topology of city blocks. For geospatial applications like urban planning, infrastructure monitoring, and autonomous navigation, this 3D understanding is essential.
The goal of this work is to bridge the gap between 3D geospatial data and language models, creating systems that can reason about and describe urban environments in three dimensions. We call this direction GeoLLMs—language models augmented with geospatial 3D awareness.
3D Data: Four Cities, Google 3D Tiles
Our experiments are grounded in real 3D geospatial data from four cities: Saclay, Belmont, Stratford, and Kamisu. The 3D geometry comes from Google 3D Tiles, which provide textured mesh representations of urban environments reconstructed from aerial and satellite photogrammetry.
This data gives us ground-truth 3D structure—building heights, footprints, spatial arrangements—against which we can evaluate whether language models truly understand the scenes they describe. Each city offers different urban morphologies: dense European blocks, suburban American layouts, and Japanese coastal towns, providing diversity in building styles, density, and terrain.
Three Prompting Approaches
We explored three fundamentally different strategies for getting language models to reason about 3D urban scenes. Each represents a different philosophy about how to inject spatial information into the language modeling process.
Approach 1: Boxes-Demonstration-Instruction
The most direct approach: provide the language model with explicit 3D information in text form, then instruct it to reason about the scene.
The pipeline works as follows:
- Extract 3D bounding boxes: For every 3D object in the scene (buildings, trees, infrastructure), extract an axis-aligned bounding box described by its center coordinates, dimensions (length, width, height), and semantic label.
- Format as structured text: Convert the bounding box data into a textual scene description that a language model can parse—a list of objects with their positions and sizes.
- Provide few-shot demonstrations: Show the model examples of correct scene descriptions paired with their bounding box inputs, establishing the expected reasoning pattern.
- Prompt with diverse instructions: Ask varied questions—“Describe the spatial layout of buildings on this block,” “Which building is tallest?”, “What is the approximate area of the open space?”—to evaluate different aspects of 3D understanding.
This approach treats the language model as a reasoning engine over structured data, bypassing the visual perception problem entirely. The advantage is that the 3D information is precise and unambiguous. The disadvantage is that real-world applications rarely have pre-computed 3D annotations—the whole point is to derive 3D understanding from imagery.
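To make the box-to-text step concrete, here is a minimal sketch of how bounding boxes might be serialized and wrapped in a few-shot prompt. The field layout and prompt wording are illustrative, not the exact format used in our experiments:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    label: str       # semantic class, e.g. "building"
    center: tuple    # (x, y, z) in meters, scene-local frame
    size: tuple      # (length, width, height) in meters

def box_to_text(box: Box3D) -> str:
    """Render one axis-aligned box as a line of structured text."""
    cx, cy, cz = box.center
    l, w, h = box.size
    return (f"{box.label}: center=({cx:.1f}, {cy:.1f}, {cz:.1f}) m, "
            f"size={l:.1f}x{w:.1f}x{h:.1f} m")

def build_prompt(boxes: list, demonstrations: list, question: str) -> str:
    """Assemble a few-shot prompt: demonstrations first, then the new scene and question."""
    scene = "\n".join(box_to_text(b) for b in boxes)
    demo_text = "\n\n".join(
        f"Scene:\n{d['scene']}\nQuestion: {d['question']}\nAnswer: {d['answer']}"
        for d in demonstrations
    )
    header = (demo_text + "\n\n") if demonstrations else ""
    return f"{header}Scene:\n{scene}\nQuestion: {question}\nAnswer:"

# Example: two buildings, one question about height.
boxes = [Box3D("building", (10.0, 5.0, 12.5), (20.0, 15.0, 25.0)),
         Box3D("building", (40.0, 8.0, 6.0), (18.0, 12.0, 12.0))]
print(build_prompt(boxes, demonstrations=[], question="Which building is tallest?"))
```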
Approach 2: ChatCaptioner-Based Multi-View Fusion
This approach works entirely from 2D images, using multiple viewpoints to reconstruct a 3D understanding:
- Render multi-view snapshots: For each city block, generate snapshots from multiple camera angles—north, south, east, west, and overhead views.
- Per-view captioning: Use BLIP-2 for visual question answering on each individual snapshot, extracting descriptions of visible buildings, their relative positions, and apparent heights.
- Structured annotation: Apply an object detection and description framework to identify and label specific elements in each view, adding precision to the free-text captions.
- Global fusion: Feed all per-view descriptions into a language model with the instruction to synthesize them into a single, coherent 3D scene description. The model must resolve inconsistencies (an object visible from the east but occluded from the west), merge complementary information (height from an oblique view, footprint from overhead), and produce a unified spatial narrative.
This approach mirrors how humans build 3D mental models—by integrating multiple 2D observations. The language model acts as the integration engine, combining partial views into a whole. The challenge is that errors in individual captions propagate and compound during fusion. A misidentified building in one view can distort the entire scene description.
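A minimal sketch of the global fusion step, assuming the per-view captions have already been produced and `ask_llm` is a stand-in for whatever chat-completion call is available:

```python
def fuse_views(captions: dict, ask_llm) -> str:
    """Fuse per-view captions into one 3D scene description.

    captions: mapping from view name ("north", "overhead", ...) to the
              caption produced by the per-view captioning stage.
    ask_llm:  any callable that sends a prompt to a language model and
              returns its text reply (stand-in for the actual API).
    """
    per_view = "\n".join(f"[{view} view] {text}" for view, text in captions.items())
    prompt = (
        "Below are descriptions of the same city block seen from different "
        "camera angles. Merge them into a single coherent 3D description. "
        "Resolve contradictions, note occlusions, and state relative "
        "positions and heights of buildings.\n\n"
        f"{per_view}\n\nFused 3D description:"
    )
    return ask_llm(prompt)

# Usage (with any LLM client wrapped as ask_llm):
# description = fuse_views(
#     {"north": "A tall glass tower behind a row of low shops.",
#      "overhead": "An L-shaped block with a central courtyard."},
#     ask_llm=my_llm_call)
```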
Approach 3: Revision-Based 3D Transformation
The most ambitious approach: train models to transform between 3D data representations, using language as an intermediate or output modality.
The core idea is that 3D understanding can be demonstrated through the ability to convert between representations:
- 3D mesh to text description: Given a textured mesh, produce a natural language description of the scene.
- Text to 3D layout: Given a textual description, generate a plausible 3D arrangement of objects.
- Point cloud to mesh: Reconstruct surface geometry from sparse 3D points.
- Multi-view images to 3D: The classic structure-from-motion problem, but with language-guided priors.
Each transformation tests a different aspect of 3D understanding. A model that can accurately describe a mesh demonstrates perceptual understanding. A model that can generate a 3D layout from text demonstrates generative spatial reasoning. The revision-based framework trains models through iterative refinement—generating an initial output, evaluating it against ground truth, and revising.
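The refinement loop can be summarized schematically; `generate`, `score`, and `revise` are placeholders for the model calls and the evaluation metric, not the actual training code:

```python
def revision_loop(scene, ground_truth, generate, score, revise,
                  max_rounds: int = 3, target: float = 0.9):
    """Iteratively refine an output until it scores well against ground truth."""
    output = generate(scene)                             # initial attempt (e.g., mesh -> text)
    for _ in range(max_rounds):
        quality, feedback = score(output, ground_truth)  # e.g., a spatial-accuracy metric
        if quality >= target:
            break
        output = revise(scene, output, feedback)         # conditioned on critique of the draft
    return output
```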
3D Feature Extraction Methods
Underlying all three approaches is the question of how to extract meaningful features from 3D data that can interface with language models. We investigated four methods:
Pixel-Aligned Dense Features
Extract features from 2D image backbones (e.g., CLIP, DINOv2) and project them into 3D space using known camera parameters. Each 3D point inherits the feature vector of its corresponding pixel. This approach leverages the rich semantic features of pretrained 2D models but assumes accurate camera calibration and depth estimation.
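A sketch of this projection in PyTorch, assuming a pinhole camera, a precomputed feature map, and known world-to-camera extrinsics; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def lift_features_to_points(points_world, feat_map, K, T_world_to_cam):
    """Assign each 3D point the 2D feature of the pixel it projects to.

    points_world:    (N, 3) 3D points in world coordinates.
    feat_map:        (C, H, W) dense feature map from a 2D backbone (CLIP, DINOv2, ...).
    K:               (3, 3) camera intrinsics.
    T_world_to_cam:  (4, 4) world-to-camera extrinsics.
    Returns (N, C) per-point features; points behind the camera or outside
    the image receive zero features.
    """
    N = points_world.shape[0]
    homog = torch.cat([points_world, torch.ones(N, 1)], dim=1)        # (N, 4)
    cam = (T_world_to_cam @ homog.T).T[:, :3]                         # (N, 3) camera coords
    valid = cam[:, 2] > 1e-6                                          # in front of camera
    pix = (K @ cam.T).T                                               # (N, 3)
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                    # (N, 2) pixel coords

    C, H, W = feat_map.shape
    # Normalize pixel coords to [-1, 1] for grid_sample; out-of-bounds samples are zero-padded.
    grid = torch.stack([2 * pix[:, 0] / (W - 1) - 1,
                        2 * pix[:, 1] / (H - 1) - 1], dim=-1).view(1, N, 1, 2)
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)  # (1, C, N, 1)
    feats = sampled[0, :, :, 0].T.clone()                             # (N, C)
    feats[~valid] = 0.0
    return feats
```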
Direct Reconstruction from RGB-D
When depth information is available alongside color imagery, features can be extracted directly in 3D. RGB-D encoders process the combined color-depth input, producing feature volumes that encode both appearance and geometry. This is the most straightforward approach when depth sensors (LiDAR, structured light) are available, but it does not generalize to settings with only RGB imagery.
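For reference, the unprojection at the heart of this approach is only a few lines under a pinhole camera model (a sketch, with illustrative names):

```python
import torch

def unproject_rgbd(depth, rgb, K):
    """Back-project an RGB-D frame into a colored 3D point cloud.

    depth: (H, W) metric depth in meters (0 where invalid).
    rgb:   (3, H, W) color image.
    K:     (3, 3) camera intrinsics.
    Returns (M, 3) points in camera coordinates and (M, 3) colors.
    """
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)
    colors = rgb.permute(1, 2, 0).reshape(-1, 3)
    mask = points[:, 2] > 0                    # drop pixels with invalid depth
    return points[mask], colors[mask]
```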
Feature Fusion with GradSLAM
GradSLAM provides a differentiable simultaneous localization and mapping framework. By running GradSLAM on a sequence of RGB-D frames, we obtain a dense 3D feature map that fuses information across viewpoints. The differentiability of GradSLAM means that the entire pipeline—from raw frames to 3D features to language output—can be trained end-to-end, allowing the feature extraction to adapt to the downstream language task.
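The sketch below illustrates the fusion idea in simplified form rather than the gradslam API: per-frame point features are scatter-averaged into a shared grid using only differentiable operations, so a loss computed on the fused map (or on downstream language output) can backpropagate into the per-frame features:

```python
import torch

def fuse_into_voxel_grid(points_world, feats, grid_origin, voxel_size, grid_dims):
    """Differentiably accumulate per-frame point features into a shared voxel grid.

    points_world: (N, 3) points already transformed into the world frame.
    feats:        (N, C) per-point features from any 2D or RGB-D encoder.
    grid_origin:  (3,) world coordinate of the grid corner; voxel_size in meters.
    """
    D, H, W = grid_dims
    C = feats.shape[1]
    idx = ((points_world - grid_origin) / voxel_size).long()          # (N, 3) voxel indices
    inside = ((idx >= 0) & (idx < torch.tensor([D, H, W]))).all(dim=1)
    idx, feats = idx[inside], feats[inside]
    flat = idx[:, 0] * (H * W) + idx[:, 1] * W + idx[:, 2]            # flattened voxel index

    grid = torch.zeros(D * H * W, C).index_add(0, flat, feats)        # sum features per voxel
    count = torch.zeros(D * H * W, 1).index_add(0, flat, torch.ones(len(flat), 1))
    return (grid / count.clamp(min=1)).view(D, H, W, C)               # average occupied voxels
```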
Neural Voxel Fields Without Depth
The most flexible approach: learn a 3D feature volume from RGB images alone, without requiring depth input. A neural voxel field represents the scene as a grid of learned feature vectors in 3D space, supervised only by the consistency of features projected back to the input images. This approach is the most broadly applicable (requiring only posed RGB images) but also the most computationally expensive and the most prone to geometric errors in textureless or repetitive regions.
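A minimal sketch of the core primitive, a learnable feature volume queried by trilinear interpolation; the grid size, channel count, and coordinate normalization are assumptions for illustration. Training would project these features back into the posed input images and penalize inconsistency across views:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralVoxelField(nn.Module):
    """A learnable grid of feature vectors queried by trilinear interpolation."""

    def __init__(self, dims=(64, 64, 64), channels=32):
        super().__init__()
        D, H, W = dims
        # Feature volume laid out as (1, C, D, H, W) for grid_sample.
        self.volume = nn.Parameter(0.01 * torch.randn(1, channels, D, H, W))

    def forward(self, points):
        """points: (N, 3) coordinates normalized to [-1, 1]^3 -> (N, C) features."""
        # grid_sample expects the last dimension ordered (x, y, z) over (W, H, D).
        grid = points.view(1, 1, 1, -1, 3)
        feats = F.grid_sample(self.volume, grid, align_corners=True)  # (1, C, 1, 1, N)
        return feats.view(self.volume.shape[1], -1).T                 # (N, C)
```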
Challenges and Observations
Several challenges emerged across all approaches:
The Grounding Problem
Language models can produce fluent descriptions of 3D scenes that are spatially incoherent. A model might correctly list the buildings in a scene but place them in impossible spatial arrangements. Grounding language in precise 3D geometry—ensuring that “the tall building to the north” actually corresponds to the tallest structure in the northern portion of the scene—remains an open problem.
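One way to quantify this failure mode is to check generated claims against the ground-truth boxes. A toy sketch, assuming claims have already been parsed into a (direction, label) form and that +y points north (a convention adopted here for illustration):

```python
def check_tallest_in_direction(boxes, claimed_label, direction="north"):
    """Verify a claim like 'the tall building to the north is X' against ground truth.

    boxes: list of dicts with 'label', 'center' (x, y, z), and 'size' (l, w, h).
    """
    northern = [b for b in boxes if b["center"][1] > 0]      # northern half of the scene
    if not northern:
        return False
    tallest = max(northern, key=lambda b: b["size"][2])      # height is the third size component
    return tallest["label"] == claimed_label
```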
Scale and Distance Estimation
Even when models correctly identify objects and their relative positions, absolute scale estimation is poor. Heights might be off by a factor of two, distances between buildings might be wildly inaccurate, and area estimates rarely match ground truth. This suggests that current VLMs lack a calibrated sense of physical scale, likely because their training data does not consistently pair visual observations with metric measurements.
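One simple way to report such errors is a symmetric log-scale metric, sketched below; the function name and threshold are ours, not a standard benchmark:

```python
import math

def log_scale_error(predicted_m: float, ground_truth_m: float) -> float:
    """Absolute log-ratio error: 0 means exact, log(2) ~ 0.69 means off by a factor of two."""
    return abs(math.log(predicted_m / ground_truth_m))

# A height predicted as 50 m for a 25 m building:
# log_scale_error(50, 25) == math.log(2)
```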
View Synthesis vs. Understanding
It is tempting to equate 3D understanding with the ability to synthesize novel views. A model that can render a scene from an unseen angle must, in some sense, understand the 3D structure. However, we found that models capable of plausible view synthesis did not necessarily produce accurate 3D descriptions. The features useful for photorealistic rendering (texture, lighting, fine geometry) differ from those useful for spatial reasoning (object boundaries, heights, topology).
Takeaways
- Explicit 3D data produces the best language descriptions. The Boxes-Demonstration-Instruction approach, which feeds structured 3D information directly to the language model, outperforms image-based approaches on description accuracy. The bottleneck for 3D GeoLLMs is perception, not reasoning.
- Multi-view fusion is promising but error-prone. The ChatCaptioner approach shows that language models can integrate partial 2D observations into 3D understanding, but the quality depends heavily on the accuracy of per-view captions. Better vision-language models will directly improve this pipeline.
- Depth-free 3D feature extraction is feasible but expensive. Neural voxel fields can recover 3D structure from posed RGB images alone, opening the door to GeoLLMs that work with standard camera imagery rather than requiring specialized depth sensors.
- Urban morphology matters. Performance varied significantly across cities: models described dense, regular grid layouts far more accurately than organic, irregular street patterns. GeoLLM evaluation must account for this variation.