Semi-Supervised Learning and Foundation Models for Earth Observation

Tom Böhnel
8 min read · Dec 13, 2023


High-resolution satellite image

Who is this article for?

If one or more of the following applies to you, you will hopefully find this article interesting:

  • You have a general interest in modern Earth observation methods
  • You want to learn about semi-supervised learning
  • You want to learn how semi-supervised learning is applied in Earth observation
  • You intend to apply a foundation model for your Earth observation project

Previous knowledge and intuition about machine learning are beneficial, but not required to follow this article.

Introduction

Satellites continuously capture images of our planet. Combined with machine learning, there are numerous applications. Some examples:

  • Predicting crop yield of an agricultural area
  • Monitoring forest health
  • Segmenting photovoltaic modules on rooftops
  • Detecting sand mines (I’ll write about this in a future article)

One awesome machine learning technique is semi-supervised learning. Researchers provide foundation models that can be applied to solve a wide range of Earth observation tasks. This article introduces semi-supervised learning and foundation models for satellite imagery.

Semi-Supervised Learning

In machine learning, a computer learns to perform a specific task by observing data. Semi-supervised learning is a technique that utilizes an unlabeled and a labeled dataset. First, a model learns an effective representation of the data using the unlabeled dataset. Then, this representation is used to learn the specific task using the labeled dataset. Unlike in fully supervised learning, a relatively small labeled dataset is often sufficient.

The unlabeled data is used for representation learning, where the model learns an effective representation in a self-supervised manner. The result is a pre-trained model. This pre-trained model is then used in transfer learning, where it is adapted to a specific task using the labeled data.
Semi-supervised learning
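
To make the two steps concrete, here is a minimal, runnable sketch in PyTorch with synthetic tensors standing in for images. The simple reconstruction pretext task and the tiny model sizes are illustrative only; the actual self-supervised methods are discussed below.

```python
# Toy two-step recipe: self-supervised pre-training, then supervised transfer.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

# --- Step 1: representation learning on unlabeled data (self-supervised) ---
decoder = nn.Sequential(nn.Linear(16, 64))
unlabeled = torch.randn(256, 64)  # stand-in for many unlabeled images
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(10):
    recon = decoder(encoder(unlabeled))
    loss = nn.functional.mse_loss(recon, unlabeled)  # reconstruct the input
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Step 2: transfer learning on a small labeled dataset (supervised) ---
head = nn.Linear(16, 5)  # small task-specific head, e.g. 5 land-cover classes
labeled_x, labeled_y = torch.randn(32, 64), torch.randint(0, 5, (32,))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # linear probing: encoder stays frozen
for _ in range(10):
    with torch.no_grad():
        feats = encoder(labeled_x)
    loss = nn.functional.cross_entropy(head(feats), labeled_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```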

Semi-Supervised Learning + Earth Observation = 🧡

Semi-supervised learning is particularly valuable in scenarios where a lot of unlabeled data is available, but labeled data remains scarce. This is the case for many Earth observation applications. Satellites continuously capture new images of the planet, and space agencies like NASA and ESA provide free access to these images. However, annotating this vast amount of imagery is costly and sometimes requires expert knowledge.

Satellite imagery has unique characteristics that semi-supervised learning can exploit. (1) Satellite images always depict the Earth’s surface from the same perspective. (2) Satellites revisit locations and provide time series of images, for example, every 5 days in ESA’s Sentinel-2 mission. (3) While natural images usually have three color channels, red, green, and blue (RGB), many satellite images consist of more channels, from the visible wavelengths to the infrared. Satellite imagery is thus highly structured along the dimensions of geography, time, and wavelength, and semi-supervised learning techniques can leverage this structure. Moreover, satellite images can be combined with other data sources, for example, elevation or weather data.
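
To picture this structure, the snippet below builds a stand-in, Sentinel-2-style time series as a single array. The number of revisits, the band count, and the band ordering are assumptions made for illustration.

```python
# Illustrative only: a multispectral time series over one location as one array.
import numpy as np

# (time, band, height, width) -- 12 revisits, 10 spectral bands, 64x64 pixel tile
series = np.random.rand(12, 10, 64, 64).astype(np.float32)

# Assuming a band ordering where indices 3, 2, 1 are red, green, blue:
rgb_first_visit = series[0, [3, 2, 1]]

# Assuming index 7 is near-infrared: a per-visit vegetation index (NDVI)
ndvi = (series[:, 7] - series[:, 3]) / (series[:, 7] + series[:, 3] + 1e-6)

print(series.shape, rgb_first_visit.shape, ndvi.shape)
```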

Representation Learning (Step 1)

In this first step, a model learns an effective representation of the unlabeled data. It does not rely on human-created annotations; instead, the model creates its training targets autonomously. This is called self-supervised learning (SSL). Let’s look at two SSL methods:

SSL Method: Masked Autoencoding

An autoencoder consists of an encoder and a decoder. The encoder compresses the input, and the decoder reconstructs the original input from the compressed representation. A masked autoencoder (MAE) is a variant of this: a portion of the input data is hidden, and the model learns to reconstruct the missing information. This technique is common in natural language processing (NLP), where an MAE reconstructs missing words in a text. AI models like ChatGPT are trained with similar methods on vast amounts of text from the internet.

It has been discovered that this technique is also effective in computer vision. Here, an image is divided into small non-overlapping patches. A random subset of these patches is hidden, and the MAE is trained to reconstruct the missing parts. While an MAE in NLP typically hides 15% of the words, for vision tasks it is optimal to hide around 75% of the image. The intuition behind the different masking ratios is that a word in a sentence carries more semantic information than a small patch in an image.

The target is a picture of a dog. The input is the same image, but the majority of the image is hidden.
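
The sketch below shows only the masking step, not the autoencoder itself: it splits an image into 16×16 patches and hides a random 75% of them. The band count and image size are arbitrary placeholders.

```python
# Patchify an image and randomly mask 75% of the patches (MAE-style masking).
import torch

image = torch.randn(10, 224, 224)  # e.g. a 10-band satellite image
patch = 16

# Cut the image into non-overlapping 16x16 patches: (10, 14, 14, 16, 16)
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
# Flatten each patch into a vector: (196, 2560)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 10 * patch * patch)

num_patches = patches.shape[0]
keep = int(num_patches * 0.25)          # keep 25%, mask 75%
perm = torch.randperm(num_patches)
visible = patches[perm[:keep]]          # what the encoder gets to see
masked_idx = perm[keep:]                # positions the decoder must reconstruct
print(visible.shape, masked_idx.shape)  # torch.Size([49, 2560]) torch.Size([147])
```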

Intuition

We are not actually interested in the ability to reconstruct hidden information. But while the model is trained on this reconstruction task, it builds an effective representation of the data. This representation is beneficial for downstream applications. The intuition is: while the model learns to reconstruct missing information, it gains a deeper understanding of the data. It learns about the underlying processes that produced the data. This is the case for natural language as well as for satellite imagery.

SSL Method: Contrastive Learning

In contrastive learning, a model is trained to produce similar representations for positive pairs of examples, and dissimilar representations for negative pairs. What are positive/negative pairs, and what are similar/dissimilar representations?

A positive pair is two views of the same data, for example, two images of the same object. Usually, a positive image pair is generated by applying augmentations to an image, for example, random cropping, color distortions, blurring, or flips. A negative pair is two distinct images.

A positive image pair is generated by applying artificial augmentation on one image. The example of a negative pair shows a cat and a dog.

The representation produced by a neural network is a vector, a list of numbers. We quantify the similarity between two representations by measuring the distance between the two vectors.

Contrastive learning minimizes the distance between representations from positive pairs, and maximizes the distance between representations from negative pairs. Again, during this process, the model learns an effective representation of the data.
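
A toy version of such a contrastive objective (a simplified, one-directional variant in the spirit of InfoNCE-style losses) might look as follows. The batch size, dimensionality, and temperature are arbitrary, and the random tensors stand in for representations produced by an encoder.

```python
# Toy contrastive objective: each image is paired with an augmented view of itself
# (positive pair); all other images in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) representations of two views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)  # cosine similarity via dot product
    logits = z1 @ z2.t() / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(z1.shape[0])                      # the diagonal holds the positive pairs
    return F.cross_entropy(logits, targets)                  # pull positives together, push negatives apart

# Example: representations of 8 images and their augmented views, 16-dimensional.
z1, z2 = torch.randn(8, 16), torch.randn(8, 16)
print(contrastive_loss(z1, z2))
```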

In Earth observation, contrastive learning can make use of seasonal changes. Changes in the Earth’s surface over time can be considered a natural augmentation: two images of the same location taken at different times form a positive pair. This replaces the artificial augmentations.

Three images over the same location at different times. Seasonal changes are visible, especially over agricultural area.
Landscape changes over time
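
Building positive pairs from a time series rather than from artificial augmentations could look like the sketch below, where a random tensor stands in for a batch of image time series.

```python
# Seasonal positive pairs: two revisits of the same locations replace augmentations.
import torch

# 8 locations, 12 revisits each, 10 bands, 64x64 pixels (made-up sizes)
series = torch.randn(8, 12, 10, 64, 64)

t1, t2 = torch.randperm(12)[:2].tolist()     # two distinct acquisition dates
view1, view2 = series[:, t1], series[:, t2]  # same locations, different seasons
# Encode both views and feed the representations to a contrastive loss,
# e.g. the contrastive_loss sketch shown above.
```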

Transfer Learning (Step 2)

The result of representation learning is a pre-trained model. Next, transfer learning adapts (= “transfers”) this model for a specific application. Often, the pre-trained model is extended with a custom head designed for the given application. Transfer learning relies on the labeled data and adapts the model in a supervised manner. There exist several strategies for this training process.

In linear probing, the pre-trained model is frozen, meaning that its parameters don’t change during training. Parameters in the head are trainable.

In finetuning, parameters in the pre-trained model are trainable. Some variations are:

  • Partial finetuning: The first layers remain frozen, and the last layers are trainable.
  • Gradual unfreezing: The pre-trained model is initially frozen. Its layers gradually unfreeze during the training process.
  • Layer-wise learning rate decay: Each layer receives an individual learning rate, with earlier layers receiving smaller learning rates.

These finetuning variations fit the intuition that the parameters in the later layers need to adapt more than the parameters in the earlier layers.
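
The sketch below illustrates these strategies in PyTorch, assuming a generic pre-trained encoder (a stack of layers) and a small task-specific head. The layer count, learning rates, and decay factor are placeholders, and each strategy is shown independently; in practice you would pick one.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(*[nn.Linear(32, 32) for _ in range(6)])  # stand-in for a pre-trained model
head = nn.Linear(32, 5)                                          # e.g. a 5-class task head

# 1) Linear probing: freeze the whole encoder, train only the head.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# 2) Partial finetuning: keep early layers frozen, unfreeze the last two layers.
for p in encoder[-2:].parameters():
    p.requires_grad = True

# 3) Layer-wise learning rate decay: later layers get larger learning rates.
base_lr, decay = 1e-3, 0.7
param_groups = [
    {"params": layer.parameters(), "lr": base_lr * decay ** (len(encoder) - 1 - i)}
    for i, layer in enumerate(encoder)
]
param_groups.append({"params": head.parameters(), "lr": base_lr})
optimizer = torch.optim.AdamW(param_groups)
```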

Foundation Models in Earth Observation

A foundation model (FM) is trained on broad data and can be adapted to a wide range of downstream tasks. FMs are prevalent in NLP and are sometimes referred to as ‘generalist AI models‘. Recent efforts aim to build FMs for Earth observation. A robust FM is trained on diverse data; in Earth observation, this means diverse countries, landscapes, seasons, and other characteristics.

Training an FM requires large computational resources. Fortunately, several research teams with large resources have developed and published FMs. We can utilize these models and adapt them to our needs. Given a pre-trained FM, we only need to apply the transfer learning step.

The following presents a selection of FMs for Earth observation, all of which emerged recently.

  • SatMAE comes in two variants. Temporal SatMAE processes time series of RGB images (3 channels). Spectral SatMAE processes single-timestamp images with 10 spectral channels. Both are trained on the fMoW dataset using masked autoencoding.
  • Prithvi is similar to temporal SatMAE, but it uses 6 spectral channels. Its training data is sampled only within the contiguous United States, so its applicability outside that region is probably limited.
  • Scale-MAE is invariant to the scale (meters per pixel) of an image; the model learns relationships between features at different scales. It is trained on the fMoW dataset using masked autoencoding.
  • SSL4EO-S12 is a dataset of satellite images. It has been used to train models with different neural network architectures and SSL methods, including seasonal contrastive learning and masked autoencoding. All of the resulting models can be considered FMs.
  • SatlasNet is pre-trained in a supervised manner. It is the result of training on the annotated dataset SatlasPretrain.
  • Presto (one of my favorites) processes single-pixel time-series data from different sensors. All other approaches in this list consume images with spatial awareness. This is the way we humans consume images: we look at many pixels at once, and most modern machine learning methods do the same. In contrast, Presto analyzes only one pixel (one small area of the Earth’s surface) and how this pixel changes over time. It consumes data from multiple sensors: multispectral imagery, radar, weather, and topography. Presto demonstrates that a single-pixel time series can outperform spatially aware methods. One advantage is computational cost: Presto has orders of magnitude fewer parameters than any other model in this list and can be finetuned without extensive resources.

When comparing FMs, I suggest considering three key aspects: the neural network architecture, the learning technique, and the training dataset.

Ideally, the FM is already ‘almost capable’ of performing the desired task, so that the model only changes slightly during the transfer learning step. When extending an FM with a custom head, this head can be very small compared to the entire model.
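
To see how small such a head can be, the snippet below attaches a linear classifier to a generic transformer encoder and compares parameter counts. The encoder is a stand-in, not any specific foundation model, and the class count is made up.

```python
# Compare the size of a task head to the size of a large pre-trained encoder.
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12), num_layers=12
)
head = nn.Linear(768, 10)  # e.g. a 10-class land-cover classifier

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"encoder: {count(encoder):,} parameters")  # tens of millions
print(f"head:    {count(head):,} parameters")     # a few thousand
```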

Outlook

These are exciting times to be working on machine learning with satellite images. Many recent innovations in machine learning originated in natural language processing and are now being applied to computer vision and satellite imagery. Examples are the transformer architecture, masked autoencoding, and the development and usage of FMs. Future developments promise to be just as exciting!

  • The performance of many FMs for Earth observation is already remarkable today. Yet, some ideas with obvious potential for improvement have not been tried. Some approaches are compatible with each other and can be combined, for example, Scale-MAE and SatMAE. Besides incremental improvements of existing approaches, I’m eagerly awaiting breakthroughs and new methodologies in semi-supervised learning.
  • FMs can be made more accessible to users without expertise in machine learning and to users without extensive computational resources. I will write about this in a future article.
  • How is the pre-training dataset of an FM composed? This is a crucial question. The FM will perform well in geographies that are well represented in this dataset. How many images of a certain country go into this dataset? And who decides this? These questions make the development of FMs a geopolitical matter!

Image Sources: Unsplash, He et al.: Masked Autoencoders Are Scalable Vision Learners
