How Well Do Zero-Shot Models Work on Satellite Data?
The Promise of Zero-Shot for Remote Sensing
Foundation models like CLIP, BLIP, and OpenCLIP have demonstrated remarkable zero-shot capabilities on natural images. Given a text description and an image, these models can classify images into arbitrary categories without any task-specific training. The appeal for remote sensing is obvious: labeled satellite and aerial datasets are expensive to create, domain-specific, and often restricted. If foundation models could classify satellite imagery zero-shot, it would dramatically lower the barrier to geospatial analysis.
But how well do these models actually work on satellite data? Their pretraining corpora are dominated by natural photographs—people, animals, objects, indoor scenes. Satellite imagery looks fundamentally different: overhead perspective, unusual scales, specialized spectral bands, and domain-specific class vocabularies. We set out to quantify just how large the domain gap is, and to test whether clever prompt engineering could close it.
Experimental Setup
Models Tested
We evaluated over 40 model/backbone combinations spanning three foundation model families:
- CLIP (OpenAI): Models with ViT-B/16, ViT-B/32, ViT-L/14, ViT-L/14@336px, RN50, RN101, RN50x4, RN50x16, and RN50x64 backbones.
- OpenCLIP (LAION): Models with ViT-B/16, ViT-B/32, ViT-H/14, ViT-L/14, and ViT-G/14 backbones, trained on LAION-400M and LAION-2B datasets.
- BLIP (Salesforce): The BLIP architecture with ViT-B and ViT-L backbones.
This breadth of models allowed us to separate the effects of architecture (ViT vs. ResNet), scale (B vs. L vs. H vs. G), and pretraining data (OpenAI’s curated dataset vs. LAION’s web-scale data).
Datasets
We tested on five satellite and aerial datasets spanning different tasks and resolutions:
- EuroSAT: 27,000 Sentinel-2 satellite images across 10 land-use classes (residential, industrial, forest, river, etc.). A standard benchmark for satellite image classification.
- BigEarthNet-S2: A large-scale Sentinel-2 dataset with multi-label land-cover annotations. We evaluated on the 19-class simplified label set.
- SyntheticWakeSAR: Synthetic Aperture Radar images of ship wakes, classified by wake pattern type. Particularly challenging due to the non-optical modality.
- xView: Large-scale overhead imagery dataset with 60 fine-grained object categories (vehicles, buildings, infrastructure).
- DOTA: Aerial image dataset for oriented object detection, with 15 categories including planes, ships, bridges, and sports fields.
Prompt Strategies
We tested three prompting strategies, sketched in code after the list:
- Standard prompts: Simple class name insertion into templates like “a satellite image of {class}.”
- Geospatial context prompts: Adding geographic and sensor context, e.g., “a Sentinel-2 satellite image showing {class} land use in Europe.”
- Disambiguated prompts: For datasets with ambiguous or similar class names, providing explicit textual descriptions that distinguish between classes.
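To make the three strategies concrete, here is a minimal sketch of each as a prompt template. The strings mirror the examples quoted above; the disambiguated descriptions are illustrative placeholders rather than the exact text used in the benchmark.

```python
# Minimal sketch of the three prompting strategies. Templates are illustrative,
# not the exact strings used for every dataset in the benchmark.

def standard_prompt(class_name: str) -> str:
    # Strategy 1: simple class-name insertion.
    return f"a satellite image of {class_name}"

def geospatial_context_prompt(class_name: str) -> str:
    # Strategy 2: add sensor and geographic context (here: Sentinel-2 over Europe).
    return f"a Sentinel-2 satellite image showing {class_name} land use in Europe"

# Strategy 3: replace terse labels with descriptive, disambiguating text.
# These example descriptions are hypothetical placeholders.
DISAMBIGUATED = {
    "residential": "a satellite image of a residential area with dense housing and streets",
    "industrial": "a satellite image of an industrial area with large flat-roofed buildings",
    "river": "a satellite image of a river winding through the landscape",
}

if __name__ == "__main__":
    print(standard_prompt("forest"))
    print(geospatial_context_prompt("forest"))
    print(DISAMBIGUATED["river"])
```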
Key Finding 1: The Domain Gap Is Real
The headline result is sobering. On EuroSAT, the best zero-shot model—CLIP ViT-L/14—achieved 52.43% accuracy. The supervised state-of-the-art on EuroSAT is approximately 86.1%. That is a 34-point gap, far larger than typical zero-shot deficits on natural image benchmarks where CLIP often approaches supervised performance.
The following table shows zero-shot accuracy on EuroSAT for selected models with standard prompts:
| Model | Backbone | EuroSAT Accuracy (%) |
|---|---|---|
| CLIP | RN50 | 37.62 |
| CLIP | RN101 | 34.48 |
| CLIP | ViT-B/32 | 43.87 |
| CLIP | ViT-B/16 | 49.00 |
| CLIP | ViT-L/14 | 52.43 |
| CLIP | ViT-L/14@336px | 51.63 |
| CLIP | RN50x64 | 47.74 |
| OpenCLIP | ViT-B/32 (LAION-2B) | 40.10 |
| OpenCLIP | ViT-B/16 (LAION-2B) | 43.09 |
| OpenCLIP | ViT-L/14 (LAION-2B) | 48.70 |
| OpenCLIP | ViT-H/14 (LAION-2B) | 18.50 |
| OpenCLIP | ViT-G/14 (LAION-2B) | 49.24 |
| BLIP | ViT-B | 30.74 |
| BLIP | ViT-L | 28.09 |
Several patterns emerge. Within CLIP, ViT backbones consistently outperform ResNet backbones, and larger models generally do better—but the gains plateau quickly. BLIP models perform surprisingly poorly, suggesting their architecture (optimized for image-text matching and generation) does not transfer as well to zero-shot classification on satellite data. Among OpenCLIP models, there is high variance: ViT-H/14 achieves only 18.50%, dramatically worse than the smaller ViT-L/14 at 48.70%.
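For reference, the zero-shot evaluation loop behind numbers like these is short. The sketch below uses OpenAI's `clip` package and torchvision's RGB EuroSAT loader with the standard prompt template; it is a minimal reconstruction under those assumptions, not our exact benchmark harness.

```python
# Minimal zero-shot EuroSAT evaluation sketch (assumes the `clip` and
# `torchvision` packages and the RGB EuroSAT release).
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import EuroSAT

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Class names come from the dataset's folder structure (CamelCase, e.g. "SeaLake");
# mapping them to natural-language phrases is a further choice not shown here.
dataset = EuroSAT(root="data", download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Standard prompt template: one text embedding per class.
prompts = [f"a satellite image of {c}" for c in dataset.classes]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        # Cosine similarity against every class prompt; argmax is the prediction.
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"zero-shot accuracy: {100 * correct / total:.2f}%")
```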
Key Finding 2: Geospatial Context in Prompts Can Help Enormously
The most striking result in our benchmark came from adding geospatial context to text prompts. For OpenCLIP ViT-H/14 on EuroSAT, switching from a standard prompt (“a satellite image of {class}”) to a context-enriched prompt (“a centered satellite photo of {class}”) caused accuracy to jump from 18.50% to 56.57%—a threefold improvement from prompt engineering alone.
| Model | Backbone | Standard Prompt (%) | Context Prompt (%) | Delta |
|---|---|---|---|---|
| CLIP | ViT-L/14 | 52.43 | 55.68 | +3.25 |
| CLIP | ViT-L/14@336px | 51.63 | 56.02 | +4.39 |
| CLIP | RN50x64 | 47.74 | 49.35 | +1.61 |
| OpenCLIP | ViT-H/14 | 18.50 | 56.57 | +38.07 |
| OpenCLIP | ViT-G/14 | 49.24 | 52.96 | +3.72 |
| OpenCLIP | ViT-L/14 | 48.70 | 52.11 | +3.41 |
The massive gain for OpenCLIP ViT-H/14 suggests this model’s text encoder has learned a strong association between geospatial vocabulary and overhead imagery, but the default prompts fail to activate it. The model “knows” about satellite imagery but needs the right linguistic cue. For most other models, the gains from context prompts were more modest (1–5 points), but consistently positive.
This has a practical implication: when deploying zero-shot models on satellite data, prompt engineering is not optional—it is essential, and the optimal prompt varies significantly by model.
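As a practical follow-up, the sketch below sweeps a handful of templates for a single model and keeps the best-scoring one on a held-out split. It assumes image features and labels have already been computed and normalized, as in the earlier EuroSAT sketch, and the template list is illustrative.

```python
# Sketch of a per-model prompt sweep: score several templates and keep the best.
# Assumes `model`, `classes`, and normalized `image_features` / `labels` tensors
# produced by a loop like the one in the previous sketch.
import torch
import clip

TEMPLATES = [
    "a satellite image of {}",
    "a centered satellite photo of {}",
    "a Sentinel-2 satellite image showing {} land use in Europe",
]

def template_accuracy(template, model, classes, image_features, labels, device):
    prompts = [template.format(c) for c in classes]
    with torch.no_grad():
        text = model.encode_text(clip.tokenize(prompts).to(device))
        text = text / text.norm(dim=-1, keepdim=True)
        preds = (image_features @ text.T).argmax(dim=-1).cpu()
    return (preds == labels).float().mean().item()

# Example usage (on a held-out split, so the chosen prompt is not tuned on test data):
# scores = {t: template_accuracy(t, model, classes, image_features, labels, device)
#           for t in TEMPLATES}
# best_template = max(scores, key=scores.get)
```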
Key Finding 3: Label Disambiguation Doubles Accuracy on Challenging Datasets
SyntheticWakeSAR is an especially difficult dataset for zero-shot models because its classes describe subtle variations in ship wake patterns (Kelvin wake, turbulent wake, narrow-V wake, etc.). These are expert-domain labels with no natural language grounding in typical pretraining data.
With standard prompts using raw class names, the best zero-shot accuracy was 8.81%—barely above random chance for a multi-class problem. But when we replaced the terse class labels with descriptive disambiguating text that explained what each wake type looks like, accuracy jumped to 18.24%, more than doubling performance.
| Prompt Strategy | Best Accuracy (%) | Best Model |
|---|---|---|
| Standard (raw class names) | 8.81 | CLIP ViT-L/14 |
| Disambiguated descriptions | 18.24 | CLIP ViT-L/14 |
While 18.24% is still far from useful for production applications, the doubling of accuracy through text engineering alone reveals how much performance is left on the table when zero-shot models encounter unfamiliar vocabularies. The models have visual capacity that the default text interface fails to unlock.
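To illustrate what disambiguated descriptions look like in practice, here is a hypothetical mapping for the wake classes named above. The exact descriptions used in our prompts are not reproduced here; the wording below is an illustrative placeholder for the general pattern of replacing expert jargon with visual descriptions.

```python
# Illustrative disambiguated prompts for wake-pattern classes. The descriptions
# are placeholders, not the exact text used in the benchmark.
WAKE_DESCRIPTIONS = {
    "Kelvin wake": "a V-shaped pattern of diverging and transverse waves spreading behind a moving ship",
    "turbulent wake": "a dark, smooth turbulent streak trailing directly behind a ship",
    "narrow-V wake": "a narrow bright V-shaped streak close to the ship's track",
}

prompts = [
    f"a synthetic aperture radar image of a ship wake: {description}"
    for description in WAKE_DESCRIPTIONS.values()
]
```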
Key Finding 4: Scale Does Not Guarantee Improvement
A counterintuitive finding was that larger models do not always perform better on satellite data. OpenCLIP ViT-H/14 underperformed ViT-L/14 on EuroSAT with standard prompts (18.50% vs. 48.70%), and on several other benchmarks the largest models showed no advantage over medium-scale ones.
This suggests that the relationship between model scale and domain transfer is non-linear. Larger models may overfit to the distribution of their pretraining data, making them more sensitive to domain shift when encountering satellite imagery. Alternatively, the text encoders of larger models may develop narrower semantic associations that are harder to redirect through prompting.
Practical Recommendations
Based on our extensive benchmarking, we offer these guidelines for practitioners considering zero-shot models for satellite and aerial data:
- Start with CLIP ViT-L/14: It offered the most consistent performance across datasets and prompt strategies. It is not always the best, but it is rarely the worst.
- Invest in prompt engineering: The difference between a naive prompt and a well-crafted one can exceed the difference between model architectures. Test multiple prompt templates and include domain-specific context (sensor type, geographic region, viewing angle).
- Disambiguate class labels: If your classification labels are domain-specific jargon, write natural language descriptions for each class. This is cheap and can dramatically improve results.
- Do not assume larger is better: Test multiple model scales. The best model for your specific dataset and task may not be the largest one available.
- Use zero-shot as a starting point, not an endpoint: For production applications, use zero-shot predictions to bootstrap a labeled dataset, then fine-tune (see the sketch after this list). The zero-shot models are best understood as expensive label generators, not as final classifiers.
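To make the last recommendation concrete, the sketch below keeps only high-confidence zero-shot predictions as pseudo-labels for a bootstrap training set. It assumes precomputed, normalized CLIP-style image and text features; the 0.9 threshold is an illustrative assumption, not a tuned value.

```python
# Minimal pseudo-labeling sketch: keep confident zero-shot predictions as a
# bootstrap training set. Assumes normalized `image_features`, `text_features`,
# and the model's logit scale; the 0.9 threshold is illustrative.
import torch

def bootstrap_labels(image_features, text_features, logit_scale, threshold=0.9):
    with torch.no_grad():
        logits = logit_scale * image_features @ text_features.T
        probs = logits.softmax(dim=-1)
        confidence, preds = probs.max(dim=-1)
    keep = confidence >= threshold
    # Return pseudo-labels for the confident subset and a mask over the pool,
    # so the kept (image, pseudo-label) pairs can seed supervised fine-tuning.
    return preds[keep], keep
```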
Looking Forward
The 34-point gap between zero-shot and supervised performance on EuroSAT is both a challenge and an opportunity. It tells us that current foundation models have not adequately learned the visual language of overhead imagery—unsurprising given the composition of their training data. As geospatial data becomes more prevalent in foundation model training sets, and as domain-adapted models like SatCLIP and GeoCLIP emerge, this gap will narrow. But for now, zero-shot models on satellite data require careful engineering and realistic expectations.