Building a Satellite Image-Text Foundation Model with GeoCLIP

Why Satellite Imagery Needs Its Own CLIP

CLIP — OpenAI’s contrastive language-image pretraining model — transformed how the vision community thinks about zero-shot recognition. Train on hundreds of millions of image-text pairs scraped from the web, and you get an embedding space where images and natural language descriptions live side by side. But point that same model at a satellite image, and the magic fades quickly.
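For readers who have not used CLIP directly, here is a minimal zero-shot classification sketch using the open_clip library; the checkpoint tag, image path, and prompt strings are placeholders rather than anything GeoCLIP-specific.

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained OpenCLIP model; the checkpoint tag is illustrative.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate captions for zero-shot classification.
labels = [
    "a photo of a dog in a park",
    "an aerial image of an airport",
    "a screenshot of a spreadsheet",
]
text = tokenizer(labels)
image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # placeholder image path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then compare in the shared embedding space.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

The highest-probability caption is the zero-shot prediction; the overhead-imagery failure mode described above shows up as confidently wrong similarities when the image is a satellite scene.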

Satellite imagery is visually unlike anything in CLIP’s training distribution. Natural photographs are taken at eye level, with perspective distortion, bokeh, and familiar objects at familiar scales. Satellite images are captured from hundreds of kilometers above, looking straight down. Buildings appear as small rectangles. Vehicles are a few pixels wide. The text descriptions that accompany satellite data are also different — they reference spatial resolutions in centimeters, name specific sensor platforms, and describe land-use categories rather than everyday objects.

This domain gap motivated GeoCLIP — an effort to adapt the CLIP framework specifically for satellite and aerial imagery, building a foundation model that understands both overhead visual patterns and the specialized language used to describe them.

The OpenCLIP Training Landscape

Before building a domain-specific model, it helps to understand how the general-purpose versions are trained. OpenCLIP, the open-source reproduction of CLIP, provides detailed training logs across multiple dataset and model configurations. The numbers reveal just how resource-intensive contrastive pretraining can be.

| Dataset | Model | GPUs | GPU Type | Training Time | Estimated Cost |
|---|---|---|---|---|---|
| LAION-400M | ViT-B/32 (224px) | 128 | A100 40GB | 36 hours | ~$19K |
| LAION-400M | ViT-B/16 (224px) | 176 | A100 40GB | 61 hours | ~$44K |
| LAION-400M | ViT-B/16+ (240px) | 224 | A100 40GB | 61 hours | ~$56K |
| LAION-400M | ViT-L/14 (224px) | 400 | A100 40GB | 127 hours | ~$208K |
| LAION-2B | ViT-B/32 (224px) | 112 | A100 40GB | 51 hours | n/a |
| LAION-2B | ViT-H/14 | 824 | A100 40GB | 279 hours | n/a |
| LAION-2B | ViT-G/14 | 800 | A100 40GB | 137 hours | n/a |

Two things stand out. First, compute cost scales aggressively with model size and image resolution — moving from ViT-B/32 to ViT-L/14 on LAION-400M increases cost by roughly 10x. Second, training the largest models requires 800+ GPUs for hundreds of hours. These are not experiments you iterate on casually.

The underlying datasets matter just as much. LAION-400M contains 413 million image-text pairs scraped from the web; LAION-2B scales to 2.2 billion. Neither contains meaningful quantities of satellite imagery — the pairs are overwhelmingly natural photographs, memes, product images, and screenshots.

Training Instabilities at Scale

One of the less-discussed aspects of large-scale CLIP training is how frequently things go wrong. The OpenCLIP logs document recurring NaN values and loss spikes across multiple runs. Attempts to fix these through extra normalization layers, scaled cosine attention, and architecture tweaks were largely unsuccessful. What did work was increasing numerical precision — switching from float16 to bfloat16, or using float32 with TensorFloat-32.
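In practice the precision fix amounts to a few lines in a PyTorch training loop. The sketch below assumes an OpenCLIP-style model, data loader, optimizer, and contrastive loss already exist; only the precision-related lines are the point.

```python
import torch

# TensorFloat-32 for any float32 matmuls on Ampere-class GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

for images, texts in loader:  # model, loader, optimizer, clip_loss assumed to exist
    optimizer.zero_grad()
    # bfloat16 keeps float32's exponent range, avoiding the overflow-driven
    # NaNs that float16 produces at large batch sizes; no GradScaler needed.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        image_features, text_features, logit_scale = model(images.cuda(), texts.cuda())
        loss = clip_loss(image_features, text_features, logit_scale)
    loss.backward()
    optimizer.step()
```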

The ViT-H/14 training on LAION-2B experienced a loss spike at epoch 122 that required aggressive learning rate reduction over eight epochs to recover. The ViT-G/14 training exploded entirely at epoch 59. The general pattern: instability frequency is a function of both model scale and global batch size — a relationship that anyone training satellite-specific models at scale will need to navigate carefully.

The Remote Sensing CLIP Landscape

Several research groups have tackled the satellite CLIP problem from different angles. RemoteCLIP fine-tunes CLIP on curated remote sensing image-caption datasets, focusing on existing benchmarks like RSICD (roughly 10,000 images with five captions each) and RSITMD (around 5,000 images). GeoRSCLIP follows a comparable approach but emphasizes geographic diversity in the training data. SkyScript takes a different path, constructing a large-scale dataset by pairing satellite imagery with metadata-derived text descriptions, reaching into the millions of image-text pairs.

These efforts share a common challenge: where do you get enough high-quality satellite image-text pairs to train a competitive model? Natural image CLIP models benefit from billions of web-scraped pairs, but satellite imagery with meaningful textual descriptions is far scarcer. The existing captioned remote sensing datasets are small by CLIP standards.

| Dataset | Images | Text Source | Scale |
|---|---|---|---|
| RSICD | ~10K | Human captions (5 per image) | Small |
| RSITMD | ~5K | Human captions | Small |
| SkyScript | ~2.6M | Metadata-derived descriptions | Large |
| ChatEarthNet | ~163K | LLM-generated captions | Medium |
| RS5M | ~5M | Filtered web + metadata pairs | Large |

The larger datasets (SkyScript, RS5M) gain scale by sacrificing caption quality — metadata-derived or automatically generated descriptions are less rich than human annotations. This creates a tension between dataset size and alignment quality that is central to the satellite CLIP problem.

Building a Training Dataset from SatlasPretrain and fMoW

For GeoCLIP, we investigated two complementary sources. SatlasPretrain provides 855,000 satellite images annotated with 137 semantic labels spanning land use, infrastructure, and natural features across multiple sensor platforms. Functional Map of the World (fMoW) contributes geographic breadth — over 363,000 images captured across 62 countries over multiple years. The combination addresses two failure modes of naive satellite CLIP approaches: label sparsity and geographic bias.

We also experimented with generating richer captions using language models. By feeding dataset metadata — sensor type, spatial resolution, and semantic labels — into LLaMA-2-7B with few-shot prompting, we produced descriptive captions that encode satellite-specific information. For example, given WorldView-2 metadata at 50 cm resolution with labels for commercial zones and sparse forest, the model generates contextual descriptions referencing both the sensor characteristics and the scene content. Caption quality varied across model quantization levels, with q5_1 consistently scoring highest in human evaluation.
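A sketch of what that metadata-to-caption step can look like, here using the llama-cpp-python bindings against a quantized LLaMA-2-7B checkpoint. The file path, prompt wording, and helper function are illustrative, not the exact pipeline we ran.

```python
from llama_cpp import Llama

# Quantized LLaMA-2-7B checkpoint; path and quantization level are illustrative.
llm = Llama(model_path="llama-2-7b.Q5_1.gguf", n_ctx=2048)

FEW_SHOT = (
    "Write a one-sentence caption for a satellite image from its metadata.\n"
    "Metadata: sensor=Sentinel-2, resolution=10 m, labels=pastures, coniferous forest\n"
    "Caption: A 10 m resolution Sentinel-2 scene showing pastures bordered by coniferous forest.\n"
)

def caption_from_metadata(sensor: str, resolution: str, labels: list[str]) -> str:
    # Append one more metadata block and let the model complete the caption.
    prompt = (
        FEW_SHOT
        + f"Metadata: sensor={sensor}, resolution={resolution}, labels={', '.join(labels)}\n"
        + "Caption:"
    )
    out = llm(prompt, max_tokens=64, stop=["\n"], temperature=0.7)
    return out["choices"][0]["text"].strip()

print(caption_from_metadata("WorldView-2", "50 cm", ["commercial zone", "sparse forest"]))
```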

| Dataset | Platform | Resolution | Example Labels |
|---|---|---|---|
| BigEarthNet-S2 | Sentinel-2 | 10 m | Mineral extraction, pastures, coniferous forest |
| xView | WorldView-3 | 30 cm | Buildings, cargo vehicles |
| QFabric | WorldView-2 | 50 cm | Lakes, urban zones, sparse forest |
| SeaDroneSee V2 | UAV | 6 cm | Jetski, boat |
| Agriculture 2017 | UAV | 6 cm | Weed clusters, double plant |
| Houston NOAA | Aerial | 20 cm | Affected, major damage, destroyed |
| MAFAT | Aerial | 40 cm | Medium vehicles, buses |

This range of resolutions — from 10-meter Sentinel pixels to 6-centimeter UAV captures — is itself a challenge. A model trained on one resolution may fail entirely at another, because the visual features that distinguish a “building” at 30 cm are completely different from those at 10 m.

What We Learned Training Satellite CLIP

Our experiments with training OpenCLIP on satellite datasets, starting with a ResNet-50 backbone, produced several practical insights.

First, encoding satellite-specific metadata into the text encoder matters, but the effects are not uniformly positive. On xView, the baseline model achieved 57% R@1 by predominantly predicting “building” — the most common class. After embedding sensor and resolution metadata into text prompts, R@1 dropped to 1.8%, but the model began predicting a more diverse set of labels (“tank,” “car shed”), and R@10 improved in several categories. The metadata encoding redistributed the model’s attention across classes rather than simply improving top-1 accuracy.
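The exact wording matters less than the fact that sensor and resolution tokens appear in the text side of every pair; a hypothetical version of such a prompt template might look like this:

```python
def metadata_prompt(label: str, sensor: str, resolution: str) -> str:
    # Hypothetical template: the point is that sensor and resolution metadata
    # sit alongside the semantic label in the text half of each pair.
    return f"a {resolution} resolution {sensor} satellite image containing {label}"

print(metadata_prompt("cargo vehicles", "WorldView-3", "30 cm"))
# -> a 30 cm resolution WorldView-3 satellite image containing cargo vehicles
```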

Second, resolution homogeneity within training batches significantly affects convergence. Training on a mixture of 6 cm UAV imagery and 10 m Sentinel data creates conflicting gradients — the visual features useful at one scale are noise at another. Our best results came from resolution-stratified training, where batches are drawn from datasets at similar spatial resolutions.
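A minimal sketch of resolution-stratified batching, assuming each sample carries a spatial-resolution tag; bucketing by exact resolution is a simplification of what a production sampler would do.

```python
import random
from collections import defaultdict

def resolution_stratified_batches(samples, batch_size):
    """samples: list of (index, resolution_in_m) tuples.
    Returns batches in which every sample shares one resolution bucket."""
    buckets = defaultdict(list)
    for idx, res in samples:
        buckets[res].append(idx)

    batches = []
    for res, indices in buckets.items():
        random.shuffle(indices)
        # Drop the ragged remainder of each bucket for simplicity.
        for i in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append(indices[i : i + batch_size])

    random.shuffle(batches)  # interleave buckets across the epoch
    return batches

# Example: 6 cm UAV and 10 m Sentinel-2 samples never share a batch.
samples = [(i, 0.06) for i in range(10_000)] + [(i + 10_000, 10.0) for i in range(10_000)]
for batch in resolution_stratified_batches(samples, batch_size=256):
    ...  # fetch these indices and run one training step
```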

Third, the compute requirements for satellite CLIP are more manageable than for general-purpose CLIP if you are strategic about dataset size. Our estimates for a ViT-B/32 model showed that training on 5 million satellite image-text pairs takes under an hour on 128 A100 GPUs, at a cost of around $300. Scaling to 100 million pairs pushes this to about 12 hours and $6,000 — still well within reach for a research lab, and a fraction of the cost of training on LAION-400M.
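The arithmetic behind those figures is easy to reproduce; the per-GPU-hour price and the ~0.6-hour wall-clock time for 5 million pairs below are assumptions chosen to match the estimates above.

```python
# Back-of-the-envelope scaling for ViT-B/32 (rates below are assumptions).
A100_HOURLY_USD = 3.90            # assumed effective price per A100-hour
GPUS = 128
HOURS_FOR_5M = 0.6                # "under an hour" for 5M pairs, assumed ~0.6 h
PAIRS_PER_GPU_HOUR = 5_000_000 / (GPUS * HOURS_FOR_5M)

def estimate(pairs: int) -> tuple[float, float]:
    """Return (wall-clock hours on 128 GPUs, total cost in USD)."""
    gpu_hours = pairs / PAIRS_PER_GPU_HOUR
    return gpu_hours / GPUS, gpu_hours * A100_HOURLY_USD

print(estimate(5_000_000))     # ~(0.6 h, $300)
print(estimate(100_000_000))   # ~(12 h, $6,000)
```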

Takeaways

Building a satellite-specific CLIP model is not simply a matter of fine-tuning on a different dataset. The domain gap between natural photographs and overhead imagery runs deep — different perspectives, different scales, different textures, and fundamentally different text distributions. The existing remote sensing CLIP efforts have made encouraging progress, but the field is still searching for the right combination of dataset scale, caption quality, and training strategy.

The most promising direction is combining curated satellite datasets (SatlasPretrain for label richness, fMoW for geographic diversity) with LLM-generated captions that encode sensor and resolution metadata. This sidesteps the scarcity of human-authored satellite captions while providing enough domain-specific signal for the contrastive objective to learn meaningful alignments. On the practical side, bfloat16 precision, conservative learning rate schedules, and resolution-aware batching are not optional at scale. The good news is that satellite datasets remain orders of magnitude smaller than LAION-2B, keeping training costs accessible — the bottleneck is dataset construction and curation rather than raw compute.