Benchmarking Vision-Language Models for Property Intelligence
Why Benchmark VLMs on Property Imagery?
Vision-language models (VLMs) have shown remarkable capability on general visual question answering—identifying objects, describing scenes, and reasoning about spatial relationships. But how well do they perform on domain-specific visual analysis tasks, particularly the kind of structured property inspection that insurance companies and roofing contractors rely on?
Property analysis from drone and aerial imagery involves a suite of tasks that range from straightforward classification (what shape is this roof?) to fine-grained counting (how many chimneys are visible?) to subtle material identification (what type of shingle is installed?). These tasks require both visual acuity and domain knowledge, making them an interesting stress test for general-purpose VLMs.
We conducted a comprehensive benchmark across multiple state-of-the-art VLMs to understand where these models excel, where they struggle, and how far they are from replacing human inspectors.
The Task Suite
Our benchmark covers a diverse set of property analysis tasks, each demanding different visual capabilities:
Classification Tasks
- Roof shape identification: Gable, hip, flat, mansard, gambrel, and other common roof geometries.
- Shingle type: Asphalt, metal, clay/tile, wood shake, slate, and composite materials.
- Shingle design: Architectural (dimensional), 3-tab, designer, and specialty patterns.
- Shingle color: A wide palette including charcoal, brown, gray, red, green, and multi-tone blends.
- Scene classification: Residential, commercial, industrial, and mixed-use contexts.
- Viewpoint classification: Nadir (top-down), oblique, and close-up perspectives.
Counting Tasks
- Roof faces: The number of distinct slopes visible on a building.
- Chimneys: Count of chimney structures.
- Vents: Roof penetrations including plumbing vents, ridge vents, and turbine vents.
- Dormers: Window structures projecting from the roof plane.
- Skylights: Glazed roof openings.
Detection Tasks
- Solar panel detection: Presence and extent of photovoltaic installations.
This mix of categorical, numerical, and binary tasks provides a well-rounded assessment of each model’s visual understanding.
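To make the suite concrete, here is one way such tasks could be registered for an evaluation harness. This is an illustrative sketch only: the Task dataclass, field names, and abbreviated label sets are hypothetical simplifications of the categories listed above, not our actual configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kind: str                       # "classification" | "counting" | "detection"
    labels: list[str] = field(default_factory=list)  # empty for counting tasks

# Hypothetical registry mirroring a subset of the suite above.
TASKS = [
    Task("roof_shape", "classification",
         ["gable", "hip", "flat", "mansard", "gambrel"]),
    Task("shingle_type", "classification",
         ["asphalt", "metal", "clay/tile", "wood shake", "slate", "composite"]),
    Task("chimney_count", "counting"),
    Task("solar_panels", "detection", ["present", "absent"]),
]
```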
Post-Processing: Handling Noisy Outputs
One practical challenge with evaluating VLMs is that their free-text outputs are inherently noisy. A model might respond “gabel” instead of “gable,” or “three” instead of “3.” Rejecting these as incorrect would unfairly penalize models that understood the visual content but produced minor textual errors.
We implemented two key post-processing steps:
- Levenshtein edit distance matching: For categorical outputs, we matched model responses to the nearest valid label using edit distance, with a threshold to prevent spurious matches. This catches common spelling variations and typos.
- Number word conversion: We normalized written numbers (“three,” “seven”) to their numeric equivalents before comparison, ensuring counting tasks are evaluated on visual accuracy rather than output format.
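A minimal sketch of both steps follows, assuming a small edit-distance threshold and a ten-entry number-word table; the function names, threshold, and table size are illustrative, not the exact values we used.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def snap_to_label(response: str, labels: list[str], max_dist: int = 2) -> str | None:
    """Map a noisy response to the nearest valid label; None if too far."""
    cleaned = response.lower().strip()
    best = min(labels, key=lambda label: levenshtein(cleaned, label))
    return best if levenshtein(cleaned, best) <= max_dist else None

NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_count(response: str) -> int | None:
    """Convert 'three' or '3' to an int; None if unparseable."""
    token = response.lower().strip().rstrip(".")
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token)

print(snap_to_label("gabel", ["gable", "hip", "flat"]))  # -> "gable"
print(normalize_count("three"))                          # -> 3
```

The threshold matters: too loose and "slate" could snap to "plate"-like junk answers; too tight and legitimate typos get rejected. We tuned it to catch spelling variations without allowing matches across distinct labels.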
Evaluation Metrics
Different tasks demand different metrics:
- F1-macro: For multi-class classification tasks (roof shape, shingle type), providing a balanced measure across all classes.
- Accuracy: For binary and low-cardinality tasks (solar panel detection, viewpoint classification).
- Hamming loss: For multi-label scenarios where multiple attributes may apply simultaneously.
- Cosine similarity: For comparing embedding-level representations of model outputs.
- Mean Absolute Error (MAE): For counting tasks, measuring how far off numerical predictions are from ground truth.
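As a concrete reference, the sketch below computes each metric with scikit-learn and NumPy on toy arrays; the variable names and sample values are illustrative only.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss,
                             mean_absolute_error)

# Toy predictions; a real evaluation runs over the full labeled set.
shape_true = ["gable", "hip", "gable", "flat"]
shape_pred = ["gable", "gable", "gable", "flat"]
print("F1-macro:", f1_score(shape_true, shape_pred, average="macro"))

solar_true, solar_pred = [1, 0, 1, 1], [1, 0, 0, 1]
print("Accuracy:", accuracy_score(solar_true, solar_pred))

# Multi-label: one row per image, one column per attribute.
multi_true = np.array([[1, 0, 1], [0, 1, 0]])
multi_pred = np.array([[1, 1, 1], [0, 1, 0]])
print("Hamming loss:", hamming_loss(multi_true, multi_pred))

count_true, count_pred = [3, 1, 0, 2], [2, 1, 1, 2]
print("MAE:", mean_absolute_error(count_true, count_pred))

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("Cosine:", cosine_sim([1.0, 0.0], [0.7, 0.7]))
```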
Benchmark Results
Newer Model Comparison
Our updated benchmark evaluated the latest generation of open-source and proprietary VLMs:
| Model | Overall Score | Strengths | Weaknesses |
|---|---|---|---|
| LLaVa-v1.6-34B | 0.71 | Strong across all task types | Slower inference |
| Yi-VL-34B | 0.66 | Competitive counting accuracy | Weaker on fine-grained material ID |
| LLaVa-v1.5-13B | 0.61 | Good balance of speed and accuracy | Struggles with rare categories |
| Qwen-VL-Chat | 0.52 | Fast inference | Significant gaps on counting and detection |
LLaVa-v1.6-34B emerged as the clear leader at 0.71 overall, suggesting that scale helps in this domain. The jump from the 13B to the 34B LLaVa variant produced a 10-point improvement, driven primarily by better performance on counting tasks and fine-grained material classification.
Earlier Benchmark (Proprietary Models Included)
An earlier round of evaluation included proprietary models. Note that the evaluation setup differed between rounds (LLaVa-v1.5-13B scores 0.74 here versus 0.61 above), so scores should not be compared across the two tables:
| Model | Overall Score |
|---|---|
| LLaVa-v1.5-13B | 0.74 |
| GPT-4V | 0.72 |
Interestingly, LLaVa-v1.5-13B slightly outperformed GPT-4V (0.74 vs. 0.72) in this earlier round. This suggests that open-source models can be competitive with, or even exceed, proprietary alternatives on domain-specific visual tasks, particularly when the evaluation focuses on structured property attributes rather than open-ended visual reasoning.
Data Collection: Building the Evaluation Set
Assembling a high-quality evaluation dataset for property analysis required creative sourcing beyond standard computer vision benchmarks:
Web-Scale Sources
- LAION: We mined the LAION dataset for roof and property imagery, filtering by relevant captions and metadata. While large in scale, the imagery tends toward stock-photo aesthetics rather than realistic inspection conditions.
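Caption filtering over LAION metadata can be done with a simple keyword pass. The sketch below is hypothetical: the parquet file name is a placeholder, the keyword list is illustrative, and the "TEXT"/"URL" column names follow the public LAION-400M metadata release and should be verified against the shard actually downloaded.

```python
import pandas as pd

KEYWORDS = ("roof", "shingle", "rooftop", "aerial view of house")

df = pd.read_parquet("laion_shard_00000.parquet")   # placeholder shard name
mask = df["TEXT"].str.contains("|".join(KEYWORDS), case=False, na=False)
candidates = df.loc[mask, ["URL", "TEXT"]]
candidates.to_csv("roof_candidates.csv", index=False)
print(f"kept {mask.sum()} of {len(df)} rows")
```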
Video Sources
- YouTube roofing videos: Roofing contractors frequently upload walkthrough and inspection videos. Frame extraction from these videos provided close-up views of various shingle types, damage patterns, and installation details that are rare in static image datasets.
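Frame extraction itself is straightforward with OpenCV. The sketch below samples one frame every couple of seconds from an already-downloaded video; the sampling interval and file name are placeholders, not our pipeline's actual settings.

```python
import cv2

def extract_frames(video_path: str, every_n_sec: float = 2.0) -> list:
    """Sample one frame every `every_n_sec` seconds from a local video file."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unreported
    step = max(1, int(round(fps * every_n_sec)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# Usage (assumes the video was already downloaded, e.g. with yt-dlp):
# frames = extract_frames("roof_walkthrough.mp4")
```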
Community Forums
We discovered a rich vein of annotated property imagery in online roofing communities:
- RoofingTalk: A professional roofer forum with 731 image-containing posts, often accompanied by detailed descriptions of materials, conditions, and problems. The text surrounding these images provided natural language annotations.
- DIYChatroom: A homeowner forum with 3,251 image-containing posts, offering a complementary perspective—images tend to be lower quality but capture the range of conditions a real-world system would encounter.
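Pairing each image with its surrounding post text can be scripted. The sketch below shows the general shape using requests and BeautifulSoup; the `.post` CSS selector is hypothetical, and each forum's actual markup differs.

```python
import requests
from bs4 import BeautifulSoup

def image_posts(thread_url: str) -> list[dict]:
    """Pair each embedded image with the text of its containing post."""
    html = requests.get(thread_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.select(".post"):   # hypothetical per-post CSS class
        imgs = [img["src"] for img in post.select("img[src]")]
        if imgs:
            posts.append({"images": imgs,
                          "caption": post.get_text(" ", strip=True)})
    return posts
```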
Curated Galleries
Web-based roofing galleries and manufacturer catalogs provided clean, labeled examples of specific shingle products, colors, and styles—useful for establishing ground truth on material classification tasks.
Related Work: The VLM Landscape
Our benchmark builds on rapid progress in the VLM field. Several architectures informed our model selection:
- LLaVa: The visual instruction tuning paradigm, connecting a vision encoder (CLIP ViT) with a language model through a simple projection layer; a sketch of this connector appears after this list. LLaVa’s open-source nature and strong performance made it a natural anchor for our evaluation.
- ShareGPT4V: Extends LLaVa with higher-quality visual instruction data generated by GPT-4V itself, improving fine-grained description capabilities.
- CogVLM: A 17B parameter model using a visual expert module that provides dedicated visual processing capacity within the language model’s transformer layers.
- Monkey: Focuses on high-resolution image understanding by processing images at up to 1344x896 resolution, relevant for our drone imagery where fine details (individual shingle textures, small roof features) matter.
- Med-LLaVa: Although built for medical imaging, this model’s approach to domain-specific visual instruction tuning informed our thinking about how to adapt general VLMs for property analysis.
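To ground the LLaVa item above: the connector between the vision encoder and the language model is small. The sketch below shows a two-layer MLP variant of the kind used by later LLaVa releases; the dimensions (CLIP ViT-L/14 features at 1024, a 13B-class LLM at 5120, 576 patch tokens) are typical values, not the exact configuration of any one checkpoint.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

tokens = VisionProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 5120])
```

The projected patch tokens are then prepended to the text tokens, so the language model attends over image and prompt jointly; this simplicity is part of why the paradigm scales well with model size.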
Takeaways
- Scale helps, but is not everything. The 34B LLaVa variant leads our benchmark, but the 13B version can match GPT-4V, suggesting that architectural choices and training data matter as much as raw parameter count.
- Counting remains the hardest task category. All models struggled more with counting roof features (vents, faces, dormers) than with classification tasks. This aligns with known limitations of VLMs on precise numerical reasoning.
- Post-processing is essential for fair evaluation. Without edit distance matching and number normalization, model scores drop significantly due to formatting inconsistencies rather than genuine visual errors.
- Open-source models are competitive. For structured, domain-specific visual analysis, open-source VLMs can match or exceed proprietary alternatives, which has important implications for deployment in cost-sensitive industries like insurance and roofing.
- Data sourcing matters. Community forums and contractor videos provided more realistic and diverse property imagery than web-scraped datasets, underscoring the value of domain-aware data collection.