From Point Regression to Roof Slope Detection
Part 2 of 2 in the Roof Face Count series. Previous: Counting Roof Faces: A Dataset of 890K Buildings.
The Modeling Challenge
Given an overhead image of a building, predict the (x, y) center of each roof face. This sounds straightforward, but several properties make it tricky:
- Variable output size: A simple gable roof has 2 faces; a complex hip-and-valley structure might have 10+. The model must predict a variable number of points.
- Unordered outputs: There’s no natural ordering among roof faces, so the model can’t simply regress a fixed-length vector of coordinates.
- Dense and overlapping: Roof face centers can be close together, especially on complex roofs, making simple heatmap-based approaches prone to merging nearby predictions.
Architecture: Vision Transformer Backbone
After evaluating several backbones, we settled on ViT-L-32 (Vision Transformer, Large, with 32x32 patch size) as the primary feature extractor. The choice was motivated by:
- Global receptive field: Transformers process all patches simultaneously, which is important for reasoning about roof structure—the position of one face constrains where others can be.
- Scale: ViT-L provides sufficient capacity to handle the visual complexity across 835K training images.
- Transfer learning: Pre-trained ViT weights provide strong initialization for overhead imagery.
The model predicts a fixed maximum number of points (up to 11 roof slopes in our experiments) plus an end-of-sequence (EOS) token that signals “no more faces.” During inference, predictions after the EOS token are discarded.
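Concretely, the inference-time truncation looks something like the following minimal sketch (function and tensor names are illustrative, not the production code):

```python
import torch

def decode_predictions(points, eos_logits):
    """Keep predicted points up to (not including) the first EOS slot.

    points:     (num_slots, 2) tensor of predicted (x, y) face centers
    eos_logits: (num_slots,) tensor; > 0 means "past the end of sequence"
    """
    is_eos = eos_logits > 0
    if is_eos.any():
        first_eos = int(torch.nonzero(is_eos)[0])  # first slot flagged as EOS
        return points[:first_eos]
    return points  # no EOS fired: every slot counts as a real face

# Example: 11 slots, EOS first fires at slot 4 -> keep the first 4 points
points = torch.rand(11, 2)
eos_logits = torch.tensor([-2.0, -1.0, -3.0, -1.5, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0])
kept = decode_predictions(points, eos_logits)  # shape (4, 2)
```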
The Loss Function: Hungarian Matching
The unordered nature of the output creates a fundamental challenge: how do you compute loss between a predicted set of points and a ground truth set when you don’t know which prediction should correspond to which target?
We use Hungarian matching (an exact solution to the linear sum assignment problem), the same approach used in DETR for object detection:
- Compute a cost matrix of distances between all predicted and all ground truth points.
- Find the optimal one-to-one assignment that minimizes total distance.
- Compute the loss only on the matched pairs.
This approach is combined with the EOS token loss: the model must learn both where the points are and how many there are.
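Concretely, the matching step can be written with SciPy's assignment solver. A minimal sketch, assuming plain Euclidean distance as the cost (the production cost terms may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(pred, gt):
    """Optimal one-to-one pairing between predicted and ground-truth points.

    pred: (P, 2) array of predicted (x, y) face centers
    gt:   (G, 2) array of ground-truth centers, with G <= P
    Returns (pred_idx, gt_idx) index arrays for the matched pairs.
    """
    # Cost matrix: Euclidean distance between every predicted/ground-truth pair
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Hungarian matching: the assignment that minimizes total distance
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx

pred = np.array([[0.2, 0.3], [0.8, 0.7], [0.5, 0.5]])
gt = np.array([[0.79, 0.71], [0.21, 0.29]])
pred_idx, gt_idx = match_points(pred, gt)  # pred[0]<->gt[1], pred[1]<->gt[0]
```

The prediction slots left unmatched (here pred[2]) are exactly the ones the EOS loss pushes toward "past the end of sequence".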
Loss Components
The final loss combines several terms (a code sketch of the combination follows the list):
- Matched point distance: Huber loss between each predicted point and its matched ground truth point (more robust to outliers than L2).
- EOS classification loss: Binary cross-entropy for predicting whether each output slot is a real face or past the end of sequence.
- Cosine similarity: An additional geometric constraint encouraging predicted point configurations to match the spatial arrangement of ground truth points.
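Putting the three terms together, a hedged sketch of the combined loss (the weights and the exact cosine formulation are assumptions, not the tuned values):

```python
import torch
import torch.nn.functional as F

def roof_point_loss(pred_pts, eos_logits, gt_pts, pred_idx, gt_idx,
                    w_pt=1.0, w_eos=1.0, w_cos=0.1):
    """Combine the loss terms on Hungarian-matched pairs.

    pred_pts: (P, 2) predictions; eos_logits: (P,) per-slot logits
    gt_pts:   (G, 2) ground truth; pred_idx/gt_idx: long tensors from the matcher
    The w_* weights are illustrative placeholders.
    """
    matched_pred = pred_pts[pred_idx]
    matched_gt = gt_pts[gt_idx]

    # 1) Matched point distance: Huber (smooth L1), robust to outliers
    pt_loss = F.smooth_l1_loss(matched_pred, matched_gt)

    # 2) EOS classification: matched slots are real faces, the rest are past-end
    eos_target = torch.ones_like(eos_logits)  # 1 = past end of sequence
    eos_target[pred_idx] = 0.0                # 0 = real face slot
    eos_loss = F.binary_cross_entropy_with_logits(eos_logits, eos_target)

    # 3) Cosine similarity between the centered point configurations,
    #    rewarding agreement in spatial arrangement
    cos = F.cosine_similarity(
        (matched_pred - matched_pred.mean(dim=0)).flatten(),
        (matched_gt - matched_gt.mean(dim=0)).flatten(), dim=0)
    cos_loss = 1.0 - cos

    return w_pt * pt_loss + w_eos * eos_loss + w_cos * cos_loss
```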
Training Details
- Dataset: 200K AIRS samples (standardized aerial imagery) for initial training
- Architecture: ViT-L-32 backbone
- Training duration: 500 epochs
- Output: Up to 11 roof slope points per image
- Optimizer: AdamW with cosine learning rate schedule
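A sketch of the optimizer setup, assuming typical AdamW hyperparameters (the learning rate and weight decay below are placeholders; only the 500 epochs and the cosine schedule come from the list above):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in module; the real model is the ViT-L-32 point-regression network
model = torch.nn.Linear(768, 2)

# lr and weight_decay are assumed placeholders, not the tuned values
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# Anneal the learning rate along a cosine curve over the 500-epoch run
scheduler = CosineAnnealingLR(optimizer, T_max=500)

for epoch in range(500):
    # ... forward/backward passes and optimizer.step() go here ...
    scheduler.step()  # advance the cosine schedule once per epoch
```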
What Worked and What Didn’t
Iterative Improvements
The path to our best model involved several rounds of ablation:
Gradient issues with cosine similarity loss: Early experiments combining L2 distance with cosine similarity suffered from vanishing gradients. Reducing the learning rate and adding gradient clipping stabilized training.
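The stabilizing change amounts to one line in the training step (model and optimizer as in the sketch above; the clipping threshold is an assumed placeholder):

```python
import torch

# After loss.backward() and before optimizer.step(): cap the global
# gradient norm so a bad batch can't produce a destabilizing update.
# max_norm=1.0 is an assumed value, not the tuned threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```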
Sorting predictions before matching: We experimented with pre-sorting predicted points (e.g., left-to-right) before matching, but Hungarian matching without pre-sorting performed better—it allows the model to output points in whatever order is most natural.
Bigger models help: Moving from ResNet-50 to ViT-L-32 provided a significant accuracy boost, justifying the additional compute cost.
SAM-filtered training data matters: Using SAM to isolate single buildings in each training image (rather than feeding in multi-building scenes) substantially improved convergence and final accuracy.
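As an illustration of the filtering idea, a hypothetical sketch using the segment-anything package: pick the SAM segment nearest the image center and black out everything else. The nearest-to-center heuristic assumes the target building is centered in each crop; it is our assumption for illustration, not necessarily the exact production rule.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Model variant and checkpoint path are placeholders for illustration
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def isolate_center_building(image):
    """Keep only the SAM segment nearest the image center; zero out the rest.

    image: (H, W, 3) uint8 RGB array. Assumes the target building sits near
    the center of the crop -- a heuristic for illustration only.
    """
    masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox'
    h, w = image.shape[:2]
    center = np.array([w / 2, h / 2])

    def dist_to_center(m):
        x, y, bw, bh = m["bbox"]  # XYWH box of the segment
        return np.linalg.norm(np.array([x + bw / 2, y + bh / 2]) - center)

    best = min(masks, key=dist_to_center)
    out = image.copy()
    out[~best["segmentation"]] = 0  # black out everything but the building
    return out
```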
What Underperformed
Heatmap-based counting: Grad-CAM-based hotspot counting as an auxiliary loss added complexity without consistent improvement—the direct point regression approach was cleaner and more effective.
Contrastive learning with negative samples: Adding images without buildings as negative examples didn’t improve the model, likely because the ViT backbone already handles this well from pre-training.
Dataset Analysis
The distribution of roof face counts reveals the challenge: 2-face gable roofs and 4-face hip roofs dominate, but the model must also handle rare complex structures with 8–11+ faces. This long-tailed distribution means the model sees far fewer training examples for complex roofs.
Results and Next Steps
The best model achieves strong accuracy on simple roofs (2–4 faces) and degrades gracefully on complex structures. The ViT-L-32 model trained on 200K AIRS images generalizes surprisingly well to drone imagery, despite the significant domain gap between standardized aerial photos and variable drone captures.
Current directions include:
- Scaling to the full 835K dataset: More diverse training data should improve generalization, especially on complex roofs.
- Ablation studies: Systematically measuring the contribution of each loss component (EOS token, Hungarian matching, cosine similarity).
- From points to edges: Using predicted face centers as seeds for a roof vectorization model that recovers the full wireframe structure.
The ultimate goal is automated roof measurement from a single overhead image—a capability that would transform the roofing, insurance, and solar industries.