Shingle Damage Detection: From V1 to V2

Where V1 Fell Short

The first version of our shingle damage detector showed that automated hail damage detection from drone imagery was feasible—but deploying it on real-world inspection data exposed several critical failure modes.

False Positive Sources

The V1 model was trained primarily on clean, tightly-cropped images of asphalt shingles with visible hail damage. In production, it encountered a much wider range of visual conditions:

  • Debris: Leaves, twigs, and granule accumulation in valleys and gutters triggered false detections. The model had learned to associate dark spots with damage, and debris creates similar dark patches.
  • Color aberrations: Variations in shingle color across a single roof—from manufacturing lots, weathering patterns, or algae growth—produced false positives along color boundaries.
  • Shadows: Cast shadows from chimneys, vents, and nearby trees created dark regions that the model frequently flagged as damage.

Spatial Boundary Issues

V1 had no concept of where the roof ended. Detections appeared on trees, gutters, driveways, and neighboring properties visible in the image. Without a roof segmentation mask to constrain predictions, every dark patch anywhere in the frame was a potential false positive.
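One straightforward way to enforce such a constraint is a post-filter that keeps only detections lying mostly on a roof segmentation mask. The sketch below illustrates the idea under that assumption (the helper name and overlap threshold are ours for illustration, not necessarily how V2 implements it):

    import numpy as np

    def filter_to_roof(boxes: np.ndarray, roof_mask: np.ndarray, min_overlap: float = 0.5) -> np.ndarray:
        """Keep only boxes whose area mostly lies on the roof surface.

        boxes: (N, 4) array of [x1, y1, x2, y2] in pixel coordinates.
        roof_mask: (H, W) binary array, 1 where the roof surface is.
        min_overlap: illustrative threshold on the fraction of the box covered by roof.
        """
        keep = []
        for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
            patch = roof_mask[max(y1, 0):y2, max(x1, 0):x2]
            if patch.size == 0:
                continue  # box lies entirely outside the image/mask
            # Fraction of the box that overlaps the roof surface.
            if patch.mean() >= min_overlap:
                keep.append(i)
        return boxes[keep]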

Material Limitations

The V1 model was trained exclusively on asphalt shingles. Deployed on properties with metal roofing, clay tiles, or wood shakes, it produced unpredictable results—sometimes hallucinating damage on perfectly intact surfaces, sometimes missing obvious damage that presented differently on non-asphalt materials.

V2 Design Goals

The V2 system was designed to address each of these shortcomings:

  1. Multi-material support: Handle damage detection on asphalt, metal, clay, and wood roofing.
  2. Zero false positives from debris: Distinguish between surface contaminants and actual material damage.
  3. Roof-constrained detections: Only report damage within the boundaries of the roof surface, eliminating off-roof false positives.

Dataset Evolution

Improving the model required improving the data. The dataset went through three major iterations:

Dataset V0: The Starting Point

The original dataset contained images from a single source, focused exclusively on asphalt shingle damage. While sufficient for proving the concept, it lacked the diversity needed for robust production deployment.

Dataset V1: Expanding Sources

V1 expanded to multiple imagery sources, introducing variation in camera sensors, lighting conditions, and viewing angles. This reduced overfitting to the specific characteristics of the original data source but did not yet address the material diversity problem.

Dataset V2: Production-Representative Data

The critical breakthrough came with V2, which was assembled from live inspection orders spanning all damage types and roofing materials. This dataset captured the true distribution of conditions the model would face in production:

  • Multiple roofing materials (asphalt, metal, clay, wood)
  • Various damage types (hail, wind, wear, mechanical)
  • Diverse lighting and weather conditions
  • Images with and without debris, shadows, and color variation
  • Negative examples (undamaged roofs) for each material type

By sourcing from actual inspection orders rather than curated collections, dataset V2 naturally included the challenging edge cases—partially damaged roofs, ambiguous wear patterns, and mixed-material structures—that caused V1 to fail.

Architecture Exploration

We conducted an extensive search across object detection architectures, spanning both the YOLO family and transformer-based alternatives.

YOLO Variants

We evaluated every major YOLO version from v5 through v9, plus specialized variants designed for small object detection:

  • YOLOv5 (S, M, L, X): The workhorse architecture. Simple, well-understood, and consistently strong.
  • YOLOv6–v9: Each version introduced architectural refinements, but none showed clear advantages over YOLOv5 on our specific task.
  • YOLO-World: An open-vocabulary variant that accepts text prompts. Promising for zero-shot detection but underperformed supervised models on our established damage categories.
  • HIC-YOLOv5: A variant incorporating hierarchical context for small object detection. Showed modest improvements on the smallest damage marks.
  • TPH-YOLOv5: Transformer Prediction Head YOLOv5, adding self-attention to the detection head. Marginal gains did not justify the increased inference cost.

Transformer-Based Detectors

  • DETR: End-to-end detection without anchors or NMS. Struggled with the high density of small damage marks—a known weakness of DETR on small, numerous objects.
  • ViTDet: Vision Transformer backbone for detection. Strong on large objects but did not improve small-damage recall.
  • Co-DETR: Collaborative DETR with improved training efficiency. Better than vanilla DETR but still behind YOLO on our task.

The Winner: YOLOv5-L with Conservative Augmentation

After evaluating all candidates, YOLOv5-L with “low” augmentation emerged as the best model. The key finding was that heavy augmentation hurt performance. Aggressive color jittering, rotation, and scaling disrupted the subtle visual patterns that distinguish real damage from noise. A conservative augmentation strategy—modest flipping, small-scale translation, and minimal color perturbation—preserved these patterns while still providing sufficient regularization.
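The knobs in question are standard YOLOv5 hyperparameters. The values below illustrate the shape of a "low" augmentation configuration; they are assumptions for illustration, not our exact production settings:

    # Illustrative "low" augmentation overrides (keys mirror YOLOv5's hyp.*.yaml files);
    # the specific values are assumptions, not our tuned configuration.
    conservative_hyp = {
        "hsv_h": 0.005,    # minimal hue jitter: damage is distinguished by subtle tone shifts
        "hsv_s": 0.2,      # light saturation jitter
        "hsv_v": 0.2,      # light brightness jitter
        "degrees": 0.0,    # no rotation
        "translate": 0.05, # small-scale translation only
        "scale": 0.1,      # small scale jitter
        "shear": 0.0,
        "flipud": 0.0,
        "fliplr": 0.5,     # horizontal flips are safe for roof imagery
        "mosaic": 0.0,     # mosaic mixes contexts and can blur the damage-versus-noise boundary
        "mixup": 0.0,
    }

A dictionary like this can be written out as a hyperparameter YAML file and passed to YOLOv5's train.py through its --hyp flag.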

Inference Strategy: SAHI for High-Resolution Images

Drone images are typically captured at very high resolution (4000x3000 pixels or more), but the damage marks we need to detect can be as small as 10–20 pixels across. Running a standard detector on the full image requires either a massive input resolution (slow and memory-intensive) or downsampling (which loses the small damage marks entirely).

We adopted SAHI (Slicing Aided Hyper Inference), which processes the image as a grid of overlapping 512x512 patches. Each patch is processed independently by the detector, and the results are merged with NMS to eliminate duplicates at patch boundaries.
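With recent versions of the sahi package, the whole sliced-inference pipeline is only a few lines. The checkpoint and image paths below are placeholders, and the 20% patch overlap is an assumption rather than our exact setting:

    from sahi import AutoDetectionModel
    from sahi.predict import get_sliced_prediction

    # Wrap the trained YOLOv5 weights (path is illustrative).
    detection_model = AutoDetectionModel.from_pretrained(
        model_type="yolov5",
        model_path="weights/shingle_damage_yolov5l.pt",  # hypothetical checkpoint
        confidence_threshold=0.25,
        device="cuda:0",
    )

    # Slice the full-resolution drone image into overlapping 512x512 patches,
    # run the detector on each, and merge duplicate detections at patch boundaries.
    result = get_sliced_prediction(
        "inspection/DJI_0042.JPG",   # hypothetical image path
        detection_model,
        slice_height=512,
        slice_width=512,
        overlap_height_ratio=0.2,    # overlap ratio is an assumption
        overlap_width_ratio=0.2,
    )

    for pred in result.object_prediction_list:
        print(pred.category.name, pred.score.value, pred.bbox.to_xyxy())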

This approach provides several advantages:

  • Consistent input resolution: Every patch is 512x512, matching the resolution the model was trained on.
  • Small object preservation: Damage marks that would be invisible in a downsampled full image are clearly visible in individual patches.
  • Scalable inference: Patches can be processed in parallel, and memory usage is constant regardless of the original image size.

The tradeoff is increased total inference time (proportional to the number of patches), but for inspection workflows where accuracy matters more than speed, this is an acceptable cost.

Unsupervised Analysis: Autoencoder Embeddings

Beyond supervised detection, we explored using autoencoder embeddings to understand the latent structure of roof imagery. By training an autoencoder on cropped roof patches and clustering the learned embeddings, we hoped to discover natural groupings that could inform data collection or model design.
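A minimal sketch of that pipeline is shown below, assuming 64x64 RGB patches already cropped from roof imagery. The architecture, patch size, and cluster count are illustrative, and the reconstruction-loss training loop is omitted for brevity:

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    class PatchAutoencoder(nn.Module):
        """Tiny convolutional autoencoder for 64x64 RGB roof patches (sizes are illustrative)."""
        def __init__(self, latent_dim: int = 128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
                nn.Flatten(),
                nn.Linear(128 * 8 * 8, latent_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(),
                nn.Unflatten(1, (128, 8, 8)),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    def cluster_patches(model: PatchAutoencoder, patches: torch.Tensor, n_clusters: int = 10):
        """Embed patches with the trained encoder and cluster the embeddings."""
        model.eval()
        with torch.no_grad():
            _, z = model(patches)
        return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(z.cpu().numpy())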

The clustering results were informative but not directly useful for damage detection: clusters formed primarily along axes of shingle color, surface texture, and lighting conditions rather than damage presence or severity. A dark gray roof in shadow and a light brown roof in sunlight landed in very different clusters regardless of their damage state.

This suggests that an unsupervised approach would need explicit disentanglement of appearance and condition—perhaps through contrastive learning that pairs damaged and undamaged patches of the same roof—to be useful for damage analysis.
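As a sketch of what such an objective might look like (a hypothesis, not something we validated), one option is a triplet loss whose pairing scheme forces condition features to ignore appearance:

    import torch.nn.functional as F

    def condition_triplet_loss(anchor_z, positive_z, negative_z, margin: float = 0.3):
        """Triplet loss over embedding vectors (pairing scheme is a hypothesis).

        anchor/positive: patches with the same damage state but from *different* roofs,
        so matching them requires ignoring shingle color and lighting.
        negative: a patch from the *same roof* as the anchor with the opposite damage
        state, so separating them requires attending to condition rather than appearance.
        """
        d_pos = 1 - F.cosine_similarity(anchor_z, positive_z)
        d_neg = 1 - F.cosine_similarity(anchor_z, negative_z)
        return F.relu(d_pos - d_neg + margin).mean()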

Future Directions

Two promising research directions emerged from this work:

Synthetic Damage Generation

Real hail damage data is expensive to collect and inherently imbalanced—most roofs are undamaged. Generating realistic synthetic damage marks on images of intact roofs could dramatically expand the training set. The challenge is photorealism: synthetic marks must match the visual characteristics of real hail damage (granule loss patterns, circular morphology, scale-appropriate size) without introducing artifacts that the model could learn to distinguish from real damage.
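A deliberately naive baseline illustrates the compositing step, and why photorealism is the hard part. All pixel ranges and darkening factors here are assumptions, and a production generator would need granule-loss texture and scale calibrated to ground sample distance:

    import numpy as np
    import cv2

    def add_synthetic_hail_mark(roof_patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Composite a single crude hail-like mark onto an intact roof patch (uint8 HWC)."""
        out = roof_patch.copy()
        h, w = out.shape[:2]
        # Random center and radius (pixel ranges are illustrative).
        cx, cy = rng.integers(10, w - 10), rng.integers(10, h - 10)
        radius = int(rng.integers(4, 12))
        # Build a soft circular mask so the mark blends with the shingle texture.
        mask = np.zeros((h, w), dtype=np.float32)
        cv2.circle(mask, (int(cx), int(cy)), radius, 1.0, thickness=-1)
        mask = cv2.GaussianBlur(mask, (0, 0), sigmaX=radius * 0.4)
        # Darken the masked region to mimic granule loss exposing the substrate.
        darkened = (out.astype(np.float32) * 0.55).astype(np.uint8)
        mask3 = mask[..., None]
        blended = out.astype(np.float32) * (1 - mask3) + darkened.astype(np.float32) * mask3
        return blended.astype(np.uint8)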

Shingle-Level Segmentation

Current detection operates at the bounding-box level, but insurance adjusters need to know which individual shingles are damaged. Using SAM (Segment Anything Model) to segment individual shingles, then classifying each as damaged or undamaged, would provide the granularity needed for automated damage reports. The challenge is that SAM does not natively understand shingle boundaries—fine-tuning or prompting strategies would be needed to achieve reliable shingle-level segmentation.
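A sketch of how this pipeline could be wired together with SAM's automatic mask generator is shown below. The checkpoint path and the damage_classifier callable are placeholders, and as noted above SAM's generic masks will not align perfectly with shingle boundaries:

    import numpy as np
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

    # Load SAM (checkpoint path is illustrative; vit_h is the largest released model).
    sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
    mask_generator = SamAutomaticMaskGenerator(sam)

    def classify_shingles(image: np.ndarray, damage_classifier) -> list[dict]:
        """Segment candidate shingle regions with SAM, then label each crop.

        damage_classifier is a hypothetical callable mapping a crop to a damage probability.
        """
        results = []
        for m in mask_generator.generate(image):   # SAM returns a list of mask dicts
            x, y, w, h = map(int, m["bbox"])       # bbox is in xywh pixel coordinates
            crop = image[y:y + h, x:x + w]
            results.append({
                "bbox": (x, y, w, h),
                "area": m["area"],
                "damage_prob": float(damage_classifier(crop)),
            })
        return results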

Lessons Learned

  1. Data diversity beats architecture complexity. The jump from dataset V1 to V2 produced larger accuracy gains than any architecture change.
  2. Less augmentation can be more. For tasks involving subtle visual patterns, aggressive augmentation destroys the signal. Match augmentation intensity to the difficulty of the visual discrimination.
  3. YOLO remains hard to beat for small object detection. Despite the appeal of transformer-based detectors, YOLOv5 with appropriate configuration outperformed every alternative on this task.
  4. Patch-based inference is essential for high-resolution imagery. SAHI’s sliding-window approach is a simple but effective solution to the resolution mismatch between training and deployment.
  5. Unsupervised methods need task-aware design. Naive autoencoder clustering captures appearance variation, not damage variation. Useful for data understanding, but not directly applicable to detection.