Building GeoEngine: A Geospatial MLOps Platform
The Problem: Geospatial AI Is Too Hard to Access
Training machine learning models on satellite and aerial imagery should not require a PhD in remote sensing. Yet in practice, the geospatial ML workflow is fragmented across dozens of tools, libraries, and manual processes. A typical project might involve sourcing imagery from one provider, annotating it in a separate tool, writing custom preprocessing scripts, training models in yet another framework, and deploying results through ad-hoc pipelines. Each step requires specialized knowledge, and the connections between steps are brittle.
This fragmentation creates a high barrier to entry. Domain experts—urban planners, environmental scientists, insurance analysts—who understand the problems worth solving often lack the ML engineering skills to build solutions. Meanwhile, ML engineers who could build the models often lack the geospatial domain knowledge to build them correctly.
GeoEngine was designed to bridge this gap: a unified MLOps platform purpose-built for geospatial AI, enabling users with minimal machine learning background to train, evaluate, and deploy models on satellite and aerial data. We presented this work at CVPR 2022.
Platform Architecture
GeoEngine consists of six integrated components, each addressing a specific stage of the geospatial ML lifecycle. The components are designed to work together seamlessly while remaining modular enough to use independently.
Europa: Data Annotation
Every supervised ML project begins with labeled data, and geospatial annotation has unique requirements. Europa is a purpose-built annotation tool for geospatial datasets that understands the spatial context of the data.
Key capabilities include:
- Georeferenced annotation: Labels are stored with their geographic coordinates, not just pixel positions. This means annotations remain valid even when imagery is reprojected, resampled, or mosaicked (a sketch of such a label follows this list).
- Multi-resolution support: Annotators can zoom smoothly between overview and detail levels, maintaining context while labeling fine-grained features.
- Annotation primitives: Support for points, bounding boxes, polygons, and semantic masks—covering the full range of geospatial labeling tasks from object detection to pixel-level segmentation.
- Quality control workflows: Built-in review stages, inter-annotator agreement metrics, and consensus resolution tools to ensure label quality at scale.
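To make the georeferenced-annotation idea concrete, a label can be represented as a GeoJSON-style feature whose geometry lives in geographic coordinates rather than pixel space. The snippet below is an illustrative sketch of that representation, not Europa's actual storage schema.

```python
import json

# Illustrative only: a georeferenced annotation expressed as a GeoJSON-style
# feature. Coordinates are lon/lat (EPSG:4326), so the label stays valid even
# if the underlying raster is reprojected, resampled, or mosaicked.
annotation = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-122.4015, 37.7890],
            [-122.4002, 37.7890],
            [-122.4002, 37.7901],
            [-122.4015, 37.7901],
            [-122.4015, 37.7890],
        ]],
    },
    "properties": {
        "label": "building",
        "crs": "EPSG:4326",          # coordinate reference system of the geometry
        "annotator": "reviewer-1",   # hypothetical quality-control metadata
        "review_status": "approved",
    },
}

print(json.dumps(annotation, indent=2))
```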
Neso: Data Sourcing
Before annotation comes data acquisition, which in geospatial ML means navigating a complex landscape of imagery providers, formats, projections, and licensing terms. Neso abstracts this complexity.
Neso provides a unified interface for:
- Multi-source imagery access: Connect to commercial satellite providers, open data repositories (Sentinel, Landsat), drone imagery pipelines, and aerial survey archives through a single API.
- Automated preprocessing: Handle format conversion, reprojection to standard coordinate systems, radiometric normalization, and cloud masking without manual scripting.
- Spatial and temporal queries: Request imagery by geographic region and date range, with automatic filtering for cloud cover, resolution, and other quality criteria (sketched after this list).
- Dataset versioning: Track which imagery was used for which training run, ensuring reproducibility.
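Neso's own interface is not reproduced here, but the shape of a spatial and temporal query can be sketched with the open-source pystac-client library against a public STAC catalog; the catalog URL and collection name below are just one example of such a source, not how Neso is implemented.

```python
from pystac_client import Client

# Rough analogue of Neso's spatial/temporal query, using pystac-client
# against a public Sentinel-2 STAC catalog.
catalog = Client.open("https://earth-search.aws.element84.com/v1")

search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-122.52, 37.70, -122.35, 37.83],   # lon/lat bounding box
    datetime="2021-06-01/2021-09-30",        # temporal range
    query={"eo:cloud_cover": {"lt": 10}},    # drop scenes with >10% clouds
    max_items=20,
)

for item in search.items():
    print(item.id, item.properties.get("eo:cloud_cover"))
```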
Arche: Experiment Management
Geospatial ML experiments generate enormous amounts of metadata: model architectures, hyperparameters, training curves, evaluation metrics across multiple geographic regions, and comparisons across different imagery sources. Arche provides structured experiment tracking designed for the geospatial domain.
Features include:
- Experiment logging: Automatic capture of model configurations, training parameters, hardware utilization, and performance metrics.
- Geographic performance analysis: Evaluate model accuracy broken down by region, terrain type, land cover class, or any spatial attribute—critical for understanding where models succeed and fail (see the sketch after this list).
- Comparison dashboards: Side-by-side visualization of experiments to identify which changes improved performance and which did not.
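The geographic breakdown is the piece that general-purpose experiment trackers usually lack. A minimal pandas sketch of the idea (not Arche's API) looks like this:

```python
import pandas as pd

# Minimal sketch of geographic performance analysis: per-sample evaluation
# records carry spatial attributes, and metrics are aggregated over those
# attributes instead of (or in addition to) a single global mean.
records = pd.DataFrame({
    "region":     ["city_a", "city_a", "city_b", "city_b", "rural_c", "rural_c"],
    "land_cover": ["urban",  "urban",  "urban",  "water",  "cropland", "forest"],
    "correct":    [1,        1,        0,        1,        1,          0],
})

print("Overall accuracy:", records["correct"].mean())
print(records.groupby("region")["correct"].mean())       # accuracy by region
print(records.groupby("land_cover")["correct"].mean())    # accuracy by land cover
```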
Phobos: Geospatial AI Library
Phobos is the computational core—a Python library that implements the actual model training pipeline with configurable components at every stage.
The library supports:
- Configurable model architectures: Swap backbones (ResNet, EfficientNet, ViT), detection heads (YOLO, Faster R-CNN, RetinaNet), and segmentation decoders (UNet, DeepLab, FPN) through configuration files rather than code changes.
- Dataset export and loading: Automatic conversion between geospatial formats (GeoTIFF, COG) and ML-ready formats (patches, tiles, COCO-format annotations) with configurable tiling strategies.
- Augmentation pipelines: Built-in augmentation strategies tuned for overhead imagery, including rotations (important for nadir views where orientation is arbitrary), spectral augmentations, and scale variations.
- Training orchestration: Multi-GPU training, mixed precision, learning rate scheduling, and early stopping—all configurable without code modification.
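Taken together, these capabilities are driven by a declarative configuration. The sketch below shows what such a configuration might contain; the key names are hypothetical and are not Phobos's actual schema.

```python
# Illustrative sketch of a declarative training configuration.
# Key names are hypothetical; in practice this would live in a config file
# rather than Python code, and the training pipeline would be built from it.
config = {
    "task": "semantic_segmentation",
    "model": {
        "backbone": "resnet50",        # swappable: resnet, efficientnet, vit, ...
        "decoder": "unet",             # swappable: unet, deeplab, fpn, ...
        "num_classes": 4,
    },
    "data": {
        "source_format": "geotiff",
        "tile_size": 512,              # pixels per training patch
        "tile_overlap": 64,
        "bands": ["red", "green", "blue", "nir"],
    },
    "augmentation": {
        "rotate": {"limit": 180},      # orientation is arbitrary in nadir views
        "flip": True,
        "brightness_contrast": {"p": 0.3},
    },
    "training": {
        "epochs": 50,
        "batch_size": 16,
        "mixed_precision": True,
        "lr_schedule": {"type": "cosine", "base_lr": 1e-3},
        "early_stopping": {"metric": "val_iou", "patience": 5},
    },
}
```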
Dione: Model Management
Training a model is only half the challenge. Deploying it reliably, monitoring its performance, and maintaining it over time are where most production ML systems struggle. Dione handles the post-training lifecycle.
Dione provides:
- Model deployment: Package trained models as inference services with automatic scaling, versioning, and rollback capabilities.
- Human-in-the-loop grading: Route model predictions to human reviewers for quality assessment, capturing systematic error patterns that automated metrics miss.
- Benchmarking: Standardized evaluation on held-out geographic regions and temporal periods, testing for distribution shift and degradation over time.
- Automated testing: Continuous integration for models—run regression tests on new model versions against known-good predictions to catch quality regressions before deployment (a minimal sketch follows this list).
- Manual auditing: Tools for domain experts to review model outputs on specific regions of interest, flag errors, and feed corrections back into the training loop.
- Model compression: Quantization, pruning, and distillation workflows to reduce model size and inference cost for edge deployment scenarios.
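As an illustration of the automated-testing idea, a candidate model can be gated on agreement with a pinned set of known-good predictions. The helper names, fixture format, and IoU threshold below are illustrative rather than Dione's interface.

```python
import numpy as np

# Illustrative model regression test: a candidate model's masks must stay
# close to known-good reference masks on a pinned set of scenes before it
# is allowed to replace the current version.

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter) / float(union) if union else 1.0

def passes_regression_suite(candidate_masks, reference_masks, min_iou=0.90):
    """Fail on any scene where the candidate drifts from the reference."""
    for scene_id, ref in reference_masks.items():
        score = iou(candidate_masks[scene_id], ref)
        if score < min_iou:
            print(f"REGRESSION on {scene_id}: IoU {score:.3f} < {min_iou}")
            return False
    return True

# Tiny synthetic fixtures standing in for real known-good predictions.
reference = {"scene_001": np.array([[0, 1], [1, 1]], dtype=bool)}
candidate = {"scene_001": np.array([[0, 1], [1, 1]], dtype=bool)}
assert passes_regression_suite(candidate, reference)
```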
Titan: Analysis Dashboard
The final component, Titan, provides a visual interface for consuming model outputs and generating actionable insights from geospatial predictions.
Titan enables:
- Map-based visualization: Overlay model predictions on base imagery with interactive exploration of detections, classifications, and segmentation maps.
- Statistical summaries: Aggregate predictions across regions, time periods, or custom spatial units (parcels, census blocks, watersheds) for reporting.
- Temporal analysis: Track how model predictions change across multiple imagery dates, supporting change detection and trend monitoring workflows.
- Export and integration: Generate reports, export results to GIS formats (GeoJSON, Shapefile, GeoPackage), or push predictions to downstream systems via APIs.
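The aggregate-and-export path can be approximated with standard open tooling. The sketch below uses GeoPandas to count detections per parcel and write the summary as GeoJSON; it illustrates the workflow rather than Titan's implementation, and the data is synthetic.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Illustration of aggregating predictions by spatial unit and exporting to a
# GIS format. Detections are synthetic points; parcels are synthetic polygons.
detections = gpd.GeoDataFrame(
    {"class": ["building", "building", "vehicle"]},
    geometry=[Point(0.1, 0.1), Point(0.2, 0.15), Point(0.8, 0.9)],
    crs="EPSG:4326",
)
parcels = gpd.GeoDataFrame(
    {"parcel_id": ["A", "B"]},
    geometry=[box(0, 0, 0.5, 0.5), box(0.5, 0.5, 1.0, 1.0)],
    crs="EPSG:4326",
)

# Assign each detection to the parcel that contains it, then count per parcel.
joined = gpd.sjoin(detections, parcels, predicate="within")
counts = joined.groupby("parcel_id").size().rename("detection_count").reset_index()
summary = parcels.merge(counts, on="parcel_id", how="left").fillna({"detection_count": 0})

# Export the per-parcel summary for downstream GIS tools.
summary.to_file("detections_by_parcel.geojson", driver="GeoJSON")
```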
Design Principles
Several principles guided GeoEngine’s design:
Configuration Over Code
Wherever possible, users specify what they want through configuration files rather than writing code. Model architectures, training hyperparameters, augmentation strategies, and deployment parameters are all declarative. This dramatically lowers the barrier to entry for non-programmers while still allowing power users to extend the system through custom code when needed.
Geographic Awareness Throughout
Unlike general-purpose MLOps platforms that treat images as abstract tensors, GeoEngine maintains geographic context at every stage. Annotations carry coordinates. Training metrics are spatial. Deployment serves georeferenced predictions. This end-to-end spatial awareness prevents the coordinate system errors and projection mismatches that plague ad-hoc geospatial ML pipelines.
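In practice, maintaining that context means every pixel-space result is tied back to map coordinates through the raster's transform and CRS. A small rasterio and pyproj sketch of that bookkeeping follows; the file path is a placeholder for any georeferenced raster.

```python
import rasterio
from rasterio.transform import xy
from pyproj import Transformer

# Sketch of the coordinate bookkeeping that end-to-end geographic awareness
# implies. "scene.tif" is a placeholder path to a georeferenced raster.
with rasterio.open("scene.tif") as src:
    row, col = 1024, 2048                      # a detection in pixel space
    x, y = xy(src.transform, row, col)         # map coordinates in src.crs

    # Reproject to lon/lat so the prediction can be compared with annotations
    # and imagery stored in a different coordinate reference system.
    to_wgs84 = Transformer.from_crs(src.crs, "EPSG:4326", always_xy=True)
    lon, lat = to_wgs84.transform(x, y)

print(f"pixel ({row}, {col}) -> lon/lat ({lon:.6f}, {lat:.6f})")
```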
Reproducibility by Default
Every experiment records the full provenance chain: which imagery, which annotations, which preprocessing, which model configuration, which training run, which evaluation region. Reproducing any result requires only pointing to its experiment ID.
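One way to picture the provenance chain is as a single record keyed by experiment ID. The fields below illustrate the kind of information captured; they are hypothetical and not GeoEngine's actual schema.

```python
import hashlib
import json

# Illustrative provenance record: everything needed to re-run the experiment
# hangs off one experiment ID. Field names and values are hypothetical.
config = {"model": {"backbone": "resnet50", "decoder": "unet"}, "epochs": 50}
config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]

experiment = {
    "experiment_id": "exp-2021-11-0042",
    "imagery_dataset": "sf-summer-2021@v3",    # versioned imagery from sourcing
    "annotation_set": "buildings-sf@v7",       # versioned annotations
    "preprocessing": "tiles-512-overlap-64",
    "config_hash": config_hash,                # pins the exact model/training config
    "training_run": "run-0042",
    "evaluation_regions": ["city_a_holdout", "rural_c_holdout"],
}
print(json.dumps(experiment, indent=2))
```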
Human-in-the-Loop by Design
Rather than treating human review as an afterthought, GeoEngine builds human feedback loops into the core workflow. Annotators, reviewers, and domain experts interact with model outputs at multiple stages, continuously improving data quality and model performance.
Impact and Lessons Learned
We presented GeoEngine at CVPR 2022, demonstrating the platform across multiple use cases including urban change detection, agricultural monitoring, and infrastructure mapping. Several lessons emerged from building and deploying the platform:
- The bottleneck is rarely the model: In most geospatial ML projects, data quality, annotation consistency, and preprocessing correctness matter more than model architecture choices. Investing in robust data infrastructure pays higher returns than chasing state-of-the-art architectures.
- Domain experts need visual feedback loops: The most valuable feature for non-ML users was not automated model training but the ability to see predictions overlaid on imagery and quickly flag errors. This visual feedback loop built trust and enabled rapid iteration.
- Geographic generalization is the real challenge: A model that achieves 95% accuracy in one city may drop to 70% in another due to differences in building styles, vegetation, terrain, and imagery characteristics. Systematic geographic evaluation—built into the platform, not bolted on—is essential.
- Compression matters for deployment: Many geospatial applications require inference at the edge (on drones, in field offices with limited connectivity). Model compression was initially an afterthought but became a critical feature as real-world deployment scenarios emerged.
Geospatial AI has enormous potential to inform decisions about climate, infrastructure, agriculture, and public health. But that potential is realized only when the tools are accessible to the people who understand the problems. GeoEngine was our attempt to lower those barriers, and the lessons we learned continue to inform how we think about democratizing AI for domain-specific applications.