Detecting Trees in Urban Spaces from Aerial Imagery
Why Urban Tree Detection Matters
Trees in urban environments serve critical roles: they reduce heat island effects, improve air quality, absorb stormwater, and increase property values. City planners, insurance companies, and environmental agencies all need accurate inventories of urban tree cover. Traditionally, these inventories are built through expensive ground surveys or manual interpretation of aerial photos—methods that scale poorly across entire metropolitan areas.
Automated tree detection from aerial and UAV imagery offers a path to large-scale, repeatable urban canopy assessment. But trees present unique challenges as detection targets: they vary enormously in species, size, crown shape, and seasonal appearance. Dense canopies merge into continuous masses that resist individual tree delineation, while isolated ornamental trees may be tiny relative to image resolution.
This post summarizes our experiments comparing three detection approaches on a UAV imagery dataset, evaluating both supervised and zero-shot methods.
The Dataset
We worked with 33 high-resolution UAV images sourced from EagleView. These are oblique and near-nadir aerial captures of residential and suburban neighborhoods, with sufficient resolution to distinguish individual tree crowns, rooftops, driveways, and vehicles.
The images were manually annotated with bounding boxes around individual trees. The annotations capture a range of tree sizes—from large mature oaks and elms with expansive canopies to small ornamental trees and newly planted saplings. The dataset also includes challenging cases: trees partially occluded by buildings, trees casting shadows that overlap neighboring structures, and clusters of trees whose canopies merge when viewed from above.
For training the supervised models, we applied standard augmentations (random flips, rotations, brightness/contrast adjustments) and split the data into training, validation, and test sets.
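For illustration, a minimal version of that pipeline might look like the sketch below, using Albumentations and scikit-learn; the probabilities, limits, and split ratios are placeholders rather than our actual settings, and the image file names are stand-ins for the 33 annotated images.

```python
import albumentations as A
from sklearn.model_selection import train_test_split

# Placeholder augmentations; probabilities and limits are illustrative.
train_transforms = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    ],
    # Keep the tree bounding boxes in sync with the image transforms.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Illustrative 70/15/15 split at the image level.
image_ids = [f"img_{i:02d}.tif" for i in range(33)]  # placeholder names
train_ids, rest = train_test_split(image_ids, test_size=0.3, random_state=42)
val_ids, test_ids = train_test_split(rest, test_size=0.5, random_state=42)
```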
Approach 1: YOLOv3
YOLOv3 was our baseline detector. The YOLO (You Only Look Once) family treats detection as a single-pass regression problem, predicting bounding boxes and class probabilities directly from the full image in one forward pass. YOLOv3 introduced multi-scale detection using feature pyramid outputs at three different resolutions, which is particularly relevant for tree detection since tree crowns vary dramatically in apparent size.
We fine-tuned a YOLOv3 model pretrained on COCO, adapting the final detection heads for our single-class (tree) task. Training ran for several hundred epochs with early stopping based on validation mAP.
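Adapting the heads is mostly channel arithmetic: each YOLOv3 detection scale predicts three anchors, and each anchor carries four box coordinates, one objectness score, and the class scores. The snippet below works this out for our one-class setup.

```python
# YOLOv3 head output channels per detection scale:
# 3 anchors x (4 box coords + 1 objectness + num_classes) values each.
num_classes = 1          # just "tree"
anchors_per_scale = 3
head_channels = anchors_per_scale * (4 + 1 + num_classes)
print(head_channels)     # 18, versus 255 for COCO's 80 classes
```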
Results: YOLOv3 produced reasonable detections on isolated trees with well-defined crowns, but struggled in two systematic ways. First, it frequently merged adjacent tree crowns into single large bounding boxes, undercounting trees in dense canopy areas. Second, it generated false positives on other green objects—hedges, lawn patches, and even green-roofed structures occasionally triggered tree detections.
Approach 2: YOLOv5
YOLOv5 represented a significant architectural evolution. Built with a CSPDarknet53 backbone and PANet neck, it offers better feature fusion across scales and more efficient training dynamics. The framework also provides built-in support for mosaic augmentation, which proved valuable for our relatively small dataset by stitching four randomly cropped and scaled training images into a single composite.
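As a rough sketch of the idea (the real implementation in YOLOv5 also rescales the source images and remaps their box labels), a toy mosaic over four sufficiently large images could look like this:

```python
import numpy as np

def toy_mosaic(imgs, out_size=640, seed=None):
    """Toy mosaic: place four images around a random center point.

    Assumes each input image is at least out_size x out_size; real
    mosaic augmentation also rescales images and remaps their labels.
    """
    rng = np.random.default_rng(seed)
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(imgs, regions):
        canvas[y0:y1, x0:x1] = img[:y1 - y0, :x1 - x0]  # crop to fit region
    return canvas
```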
We trained YOLOv5 using transfer learning from COCO pretrained weights, following the same data splits as YOLOv3 for fair comparison. The model converged faster and achieved notably higher validation metrics during training.
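As a sketch of the setup, assuming a clone of the ultralytics/yolov5 repository and a hypothetical single-class dataset config named trees.yaml (hyperparameters here are illustrative, not our exact values):

```python
# Run from inside a clone of ultralytics/yolov5.
import train  # yolov5's train.py exposes a run() entry point

train.run(
    data="trees.yaml",     # hypothetical config: nc: 1, names: ["tree"]
    weights="yolov5s.pt",  # COCO-pretrained checkpoint for transfer learning
    imgsz=640,             # illustrative input resolution
    epochs=300,
    batch_size=16,
)
```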
Results: YOLOv5 was the best-performing model in our comparison. It showed marked improvements in three areas:
- Individual tree separation: Where YOLOv3 merged adjacent crowns, YOLOv5 more reliably placed separate bounding boxes on individual trees, even when canopies overlapped.
- Reduced false positives: The model learned to better discriminate trees from other green objects, producing cleaner detection maps.
- Small tree detection: YOLOv5’s improved multi-scale feature handling translated to better recall on smaller or newly planted trees that YOLOv3 tended to miss.
The improvement was consistent across suburban scenes with mixed vegetation densities, confirming that the architectural advances in YOLOv5 translate to meaningful gains on domain-specific aerial detection tasks.
Approach 3: SAM-GEO (Zero-Shot)
Our third approach took a fundamentally different direction. SAM-GEO combines Meta’s Segment Anything Model (SAM) with geospatial utilities, enabling zero-shot segmentation of objects in aerial imagery without any task-specific training. The idea is compelling: if a foundation model has learned a sufficiently general notion of “objects” from its massive pretraining corpus, it should be able to segment tree crowns without ever seeing a labeled tree.
We applied SAM-GEO to our UAV images using text prompts and automatic mask generation. The model segments the image into candidate regions, which can then be filtered by area, shape, and other geometric properties.
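A minimal sketch of the text-prompted variant with the segment-geospatial (samgeo) package follows; method names reflect the samgeo docs as of our experiments, the file names are placeholders, and the thresholds typically need per-dataset tuning.

```python
from samgeo.text_sam import LangSAM

# Zero-shot, text-prompted segmentation: no tree labels involved.
sam = LangSAM()
sam.predict(
    "uav_scene.tif",       # placeholder input image
    "tree",                # the text prompt
    box_threshold=0.24,    # illustrative thresholds; tune per dataset
    text_threshold=0.24,
)
sam.show_anns(output="tree_masks.tif")  # save the resulting mask raster
```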
Results: SAM-GEO produced surprisingly good segmentation masks on individual tree crowns, especially for isolated trees with clear boundaries. The zero-shot nature of the approach—requiring no training data whatsoever—makes it attractive for rapid deployment scenarios.
However, two significant issues emerged:
- Patch boundary artifacts: When processing large UAV images, we tiled them into patches for inference. SAM-GEO produced inconsistent segmentations at patch boundaries, splitting trees that straddled two patches into separate partial segments or missing them entirely. This is a known challenge with tiled inference for segmentation models, but it was particularly pronounced with SAM.
- Over-segmentation in dense canopy: In areas with continuous tree cover, SAM tended either to segment the entire canopy as one object (missing individual trees) or to fragment it into irregular sub-regions that did not correspond to individual trees. The model lacks the domain knowledge to understand that a large green mass may contain dozens of individual trees.
- Class ambiguity: Since SAM segments “things” without class labels, post-processing was needed to classify segments as trees versus other objects. Simple heuristics (color, shape, area thresholds) worked for obvious cases but struggled with edge cases like large hedges or green rooftops; a sketch of such a filter follows this list.
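For reference, the kind of heuristic we mean is a crude filter like the one below. The area bounds and the green-dominance test are placeholders, and, as noted above, hedges and green roofs pass it too.

```python
import numpy as np

def looks_like_tree(mask: np.ndarray, rgb: np.ndarray,
                    min_area: int = 400, max_area: int = 250_000) -> bool:
    """Crude tree/not-tree heuristic for one SAM segment.

    mask is a boolean HxW array, rgb the HxWx3 image it came from;
    the area bounds are placeholder pixel counts.
    """
    area = int(mask.sum())
    if not (min_area <= area <= max_area):
        return False
    mean_r, mean_g, mean_b = rgb[mask].mean(axis=0)
    # Green-dominant segments pass; hedges and green roofs do too,
    # which is exactly the ambiguity described above.
    return mean_g > mean_r and mean_g > mean_b
```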
Comparative Summary
| Aspect | YOLOv3 | YOLOv5 | SAM-GEO |
|---|---|---|---|
| Training data required | Yes | Yes | None |
| Individual tree separation | Moderate | Strong | Variable |
| False positive rate | High | Low | Moderate |
| Small tree detection | Weak | Good | Moderate |
| Patch boundary handling | N/A (box-based) | N/A (box-based) | Problematic |
| Deployment complexity | Low | Low | Higher |
Lessons Learned
YOLOv5 is the practical choice for production tree detection when labeled training data is available. Its combination of accuracy, speed, and robustness to tree size variation made it the clear winner in our evaluation. The investment in annotation pays off with reliable, consistent detections.
SAM-GEO offers a viable zero-shot alternative for scenarios where annotation budgets are zero or where rapid initial assessment is needed before committing to supervised model development. The patch boundary issue is solvable with overlapping tiles and non-maximum suppression, though this adds pipeline complexity.
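One way to realize that fix, treating each segment's bounding box as a scored detection, is sketched below; detect_fn is a stand-in for whatever per-tile detector is used, and the tile, overlap, and IoU values are illustrative.

```python
import torch
from torchvision.ops import nms

def tiled_detect(image, detect_fn, tile=1024, overlap=256, iou_thresh=0.5):
    """Run detect_fn on overlapping tiles, then merge with NMS.

    detect_fn is assumed to return (boxes, scores) for one tile: boxes as
    a float [N, 4] tensor in xyxy pixel coords, scores as a [N] tensor.
    """
    h, w = image.shape[:2]
    stride = tile - overlap
    all_boxes, all_scores = [], []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            boxes, scores = detect_fn(image[y:y + tile, x:x + tile])
            if len(boxes):
                # Shift tile-local boxes back into full-image coordinates.
                all_boxes.append(boxes + torch.tensor([x, y, x, y], dtype=boxes.dtype))
                all_scores.append(scores)
    if not all_boxes:
        return torch.empty((0, 4)), torch.empty(0)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thresh)  # drops duplicates from overlap zones
    return boxes[keep], scores[keep]
```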
Domain-specific challenges remain: Urban tree detection is harder than it appears. The diversity of tree species, seasonal states, viewing angles, and canopy densities means that models trained or evaluated on one city may not generalize to another without adaptation. Future work should focus on cross-domain transfer and methods that handle dense canopy delineation more robustly.
The broader takeaway is that for applied aerial imagery tasks, modern supervised detectors like YOLOv5 still outperform zero-shot foundation models when labeled data is available—but the gap is narrowing, and the zero-shot option is increasingly practical for bootstrapping new detection workflows.