Looking Beyond a Grainy Picture: Tracking Sand Mines from Space

Tom Böhnel
Feb 25, 2024 · 12 min read


Dozens of trucks loaded with sand
Source: Paul Salopek

🏖🛰

This article presents the development of a sand mining detector: a system that detects sand mining activity in satellite imagery using machine learning.

It is my Master’s thesis project, which I conducted at UC Berkeley’s School of Information. The project is a collaboration between Ando Shah, Suraj R Nair, Joshua Blumenstock, the Veditum India Foundation, and myself.

Why? Our Motivation.

Making sand mining sustainable is one of the main ecological challenges of the 21st century.

Sand is a key material for producing concrete, asphalt, and glass. It is the second most exploited resource on the planet after water. In addition to the construction industry, the climate crisis intensifies the demand because sand is used for land reclamation and flood protection.

While desert sand is too fine and smooth for most applications, riverbed sand is well suited for many of them. Consequently, sand is extracted from riverbeds in huge volumes, with severe ecological consequences: extensive sand mining leads to biodiversity loss and damages hydraulic functions. It amplifies not only floods but also droughts. These consequences are most severe in the Global South, where the demand for sand is also the highest.

Sand mining remains unregulated or under-regulated in many countries. Many sand mining activities are illegal, and in some places, a violent sand mafia controls the market. Even murders have been associated with the sand mafia.

What needs to be done?

As sand is essential for human development, effective policies must balance the socio-economic need for sand against the negative effects of its extraction. Because rivers naturally deposit new sand, sustainable sand management will dynamically relocate sand mining to the areas with the least vulnerable ecosystems. This requires knowledge about the occurrence of sand resources and a better understanding of the consequences of mining.

The UNEP (United Nations Environment Programme) stresses the importance of improved governance of sand resources. One of UNEP’s key recommendations is to map, monitor, and report sand resources. Part of this is developing reporting frameworks to help countries monitor and report on sand extraction (Source). Until now, this challenge has remained unsolved. This project contributes to detecting and monitoring sand extraction activities.

Our Proposal

We propose to detect and monitor sand mining using satellite imagery and machine learning (ML). I developed such a sand mining detector using modern ML techniques. For each pixel of a satellite image, the detector predicts whether sand mining is present. Hopefully, this work will contribute to sustainable sand management. This project focuses on India, but its methods can be applied to any geography.

  • Input: Satellite image.
  • Output: A binary classification between sand mine and not sand mine.

High-resolution satellite image over a mining site on the river Betwa in India

How exactly can the ML-based sand mining detector help?

  • In many places, local populations do not report illegal mining because they fear the sand mafia. The sand mining detector can identify sand mining soon after activity begins and alert local authorities to support law enforcement.
  • The sand mining detector can analyze past mining areas on historical satellite images. Combined with local environmental and socio-economic data, this can contribute to a better understanding of the impacts of sand mining and support policy-making.
  • Similarly, historical data on illegal mining can provide insights into how the sand mafia operates and which sand deposits are at risk of extraction.
  • ... there are many more useful applications to be explored.

How? Data and Methods.

Satellite Imagery

The sand mining detector is based on satellite images from the Sentinel-2 mission of the European Space Agency (ESA). Sentinel imagery is freely available.

  • Temporal resolution: Two satellites orbit Earth, each with a revisit time of 10 days. Jointly, they capture a new image of every location every 5 days (except when clouds block the view, rendering those images unusable).
  • Spectral resolution: While conventional images (e.g., taken by your smartphone) capture three color bands, red, green, and blue (RGB), Sentinel-2 images consist of 13 color bands. These include RGB and infrared, which is invisible to the human eye. For many Earth observation applications, this multi-spectral resolution is valuable. For example, infrared reflectivity is used to assess the photosynthetic activity of forests. We also know that the infrared reflectivity of sand depends on grain properties (e.g., roughness, shape, size). Hence, we hypothesize that the multi-spectral resolution will benefit sand mining detection.
  • Spatial resolution: One pixel covers 10x10 meters of Earth’s surface. (Depending on the color band, the resolution is 10x10, 20x20, or 60x60 meters.) The sketch after this list shows how such a scene can be inspected in code.
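
To make the band structure concrete, here is a minimal sketch of loading an exported Sentinel-2 scene with rasterio and pulling out the RGB channels. The file path, the 13-band stack, and the B1..B12 band order are assumptions for illustration; the reflectance scaling is a rough display heuristic, not part of our pipeline.

```python
import numpy as np
import rasterio

# Hypothetical path to a Sentinel-2 scene exported as a multi-band GeoTIFF.
with rasterio.open("sentinel2_scene.tif") as src:
    image = src.read()             # shape: (bands, height, width)
    print(src.count, src.res)      # e.g., 13 bands at (10.0, 10.0) m/pixel

# Assuming the standard B1..B12 band order (13 bands incl. B8A),
# blue/green/red are B2/B3/B4 at indices 1/2/3.
rgb = image[[3, 2, 1]].astype(np.float32)      # red, green, blue
rgb = np.clip(rgb / 3000.0, 0.0, 1.0)          # rough scaling for display
```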

Introduction to Machine Learning

If you are familiar with neural networks and supervised learning, feel free to skip the following paragraph.

In machine learning, a computer learns from examples. An example is a mapping from an input (x) to a target (y). In our case, (x) is the satellite image and (y) is a pixel-wise classification between mine and not-mine. These examples are referred to as ground truth. A neural network (NN) produces predictions (y’). We aim to minimize the difference between prediction (y’) and target (y). This is an optimization problem. During training, we present a set of labeled examples to the NN; this set is the training dataset. The NN adapts its parameters according to the optimization problem. Ideally, the resulting NN generalizes beyond the training dataset and produces accurate predictions for new, unseen inputs.
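
As a concrete illustration of this optimization loop, here is a minimal PyTorch sketch of supervised training for pixel-wise binary classification. The `model` and `train_loader` objects are placeholders, not our actual implementation:

```python
import torch
import torch.nn as nn

# Minimal supervised training loop (sketch). Inputs x: (B, 13, H, W)
# satellite patches; targets y: (B, 1, H, W) in {0, 1} (mine / not mine).
loss_fn = nn.BCEWithLogitsLoss()   # measures the gap between y' and y
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for x, y in train_loader:
    logits = model(x)              # predictions y'
    loss = loss_fn(logits, y.float())
    optimizer.zero_grad()
    loss.backward()                # gradients of the optimization objective
    optimizer.step()               # adapt parameters to reduce the loss
```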

In the next section, I describe how we created the ground truth dataset to train and evaluate the sand mining detector.

Ground Truth

1. Government reports. 2. Validating mining activity. 3. List of mining sites (lat/lon). 4. Rectangular locations. 5. Two timestamps per location. 6. Annotating mining areas (polygons).

First, we need to identify sand mining locations. This is a joint effort with the Veditum India Foundation. We assess government reports with mining concessions. This leads us to locations with potential sand mining activity, for which we validate whether sand mining occurred. After validation, we have a list of point coordinates with mining sites.

Example of government record with mining concession

Next, we define our study areas. Some of the mining sites are small, covering a few hundred square meters. Others are huge, following a river for several dozen kilometers. Accordingly, we define one or more non-overlapping rectangles around every mining location. These rectangles vary in size. They become our study area.

Third, we select timestamps. Sentinel-2 images are available every five days. For each rectangular area defined above, we choose two timestamps, usually several months apart. We aim to maximize the variability of the landscape between these two timestamps.

A timestamp together with a geographical area yields a unique satellite image. Next, we annotate sand mining areas in these images. We do this by manually drawing polygons over areas where we see signs of sand mining.

Satellite image with annotations of sand mining area

Dataset size: Our dataset consists of 39 rectangles. Their sizes range from 2.5 to 582 sqkm. After choosing two timestamps per rectangle, we obtain 78 satellite images. The total area is 9614 sqkm. We labeled 2.69% of all pixels as sand mine.

Labeling Challenge

Unfortunately, the annotation process is highly ambiguous. Mining locations on riverbeds are exposed to nature, and river water removes signs of mining. Often, it is ambiguous whether the remnants of a sand mine should still be annotated, or unclear where to draw the boundary between a mining and a non-mining area. Even with high-resolution images from Google Earth available, expert human labelers disagree on annotations.

Sometimes, our team spent 30 minutes or more discussing a single small mining area. In addition to the Sentinel-2 RGB image, we looked at high-resolution images from Google Earth Pro (multiple timestamps available), Mapbox (single unknown timestamp), and monthly mid-resolution images from NICFI. Occasionally, we even searched for clues on Google Street View from neighboring streets. Despite these efforts, our annotation confidence remains unsatisfactory.

With sufficient financial resources, a research team could purchase commercial high-resolution satellite imagery, which would make the annotation process easier. Another costly approach involves field visits to the mining sites.

Split into Training and Validation Subsets

We divide the dataset of 78 labeled images into two subsets. The neural network learns from the training set. After training, we evaluate the neural network on the validation set.

We design the split with high geographical separation. We group the 78 images by their geographical location into 9 clusters. Then, we perform the split at the cluster level, meaning that images within one cluster are either completely in training or completely in validation. This ensures that any validation data is far away from any training data. We train on 60 images and evaluate on 18 images.
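
A cluster-level split can be implemented in a few lines. The sketch below clusters hypothetical image centroids with k-means and assigns whole clusters to validation until roughly 18 images are reached; the `centroids` array and the exact assignment heuristic are illustrative assumptions, not our exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# `centroids`: hypothetical (78, 2) array of image centers (lon, lat).
clusters = KMeans(n_clusters=9, random_state=0).fit_predict(centroids)

# Move whole clusters into validation until we reach ~18 images.
val_clusters, val_count = [], 0
for c in np.argsort(np.bincount(clusters)):   # smallest clusters first
    if val_count >= 18:
        break
    val_clusters.append(c)
    val_count += int(np.sum(clusters == c))

val_idx = np.flatnonzero(np.isin(clusters, val_clusters))
train_idx = np.flatnonzero(~np.isin(clusters, val_clusters))
```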

Neural Networks Architectures

In computer vision (CV), there exist two prevalent classes of neural network architectures: CNNs and ViTs. I will briefly introduce both.

  • Convolutional neural networks (CNN) progressively extract features from an image, layer by layer. The initial layers capture simple features such as edges. Subsequent layers combine these features, for example, two edges form a corner. The neural network increasingly extracts higher-level features. The final layers are rich with semantic information. Interestingly, this is similar to how our human brain processes visual information.
  • The transformer emerged in natural language processing (NLP) and is prevalent in this field. AI models such as ChatGPT are based on the transformer. Now, it has also successfully been applied to CV tasks. A vision transformer (ViT) divides an input image into small patches. The neural network processes a sequence of patches in CV like a sequence of words in NLP. The ViT excels at capturing long-range dependencies.

During the 2010s, CNNs were the standard approach for computer vision problems. The ViT was introduced in 2021. Afterward, ViTs were considered superior to CNNs, at least for tasks where a lot of training data is available. Now, a newer study points out that under fair conditions, given the same computational budget, there is no evidence that ViTs perform better than CNNs.

Semi-Supervised Learning and Foundation Models

Given the relatively small labeled dataset, we hypothesize that a semi-supervised learning approach will perform best. I apply semi-supervised learning by utilizing pre-trained foundation models. I suggest reading the following article to learn about semi-supervised learning and foundation models.

The article above lists several foundation models for Earth observation, from which I select two for this project. (1) Spectral SatMAE, a ViT trained using masked autoencoding on the fMoW dataset. (2) SSL4EO-ResNet-MoCo, a CNN trained using seasonal contrastive learning on the SSL4EO-S12 dataset. Both are available in two sizes: SatMAE-base/large and ResNet18/50.

In our sand mining detection task, we want to produce a prediction for every pixel. Therefore, I construct decoder heads for pixel-wise predictions. For SSL4EO-ResNet, I built a decoder inspired by U-Net; for SatMAE, a decoder inspired by this study’s ‘progressive upsampling’ approach.

In addition to the semi-supervised approaches, I train a fully-supervised model. I select the original U-Net. This is a fully convolutional network, a CNN that produces predictions for every pixel. This fully supervised U-Net is the benchmark for the semi-supervised approaches.

Let’s look at the size of the resulting 5 neural networks in terms of the number of parameters.

Number of parameters per model

Note that the SatMAE decoders are quite small: they have fewer than one million parameters. This is possible because SatMAE yields a high-resolution final feature map. This feature map describes every 8x8-pixel patch of the input image with 2304 (=768*3) or 3072 (=1024*3) numbers. My decoder only consumes this final feature map to produce pixel-wise predictions.
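
The following PyTorch sketch shows what such a progressive-upsampling decoder could look like. The channel widths and the 1x1 projection (which keeps the parameter count below one million) are my illustrative choices under the assumptions above, not the thesis’s exact architecture; it assumes the SatMAE-base feature map of shape (B, 2304, H/8, W/8).

```python
import torch.nn as nn

class ProgressiveUpsamplingDecoder(nn.Module):
    """Illustrative pixel-wise decoder over a frozen encoder's feature map.

    Assumes input features of shape (B, 2304, H/8, W/8), as with SatMAE-base
    (3 * 768 channels per 8x8-pixel patch). Three x2 upsampling stages
    recover the full input resolution.
    """

    def __init__(self, in_channels: int = 2304):
        super().__init__()
        self.project = nn.Conv2d(in_channels, 128, kernel_size=1)
        stages = []
        for c_in, c_out in [(128, 64), (64, 32), (32, 16)]:
            stages += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
        self.stages = nn.Sequential(*stages)
        self.head = nn.Conv2d(16, 1, kernel_size=1)   # one logit per pixel

    def forward(self, features):
        return self.head(self.stages(self.project(features)))
```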

I train the combined neural network by linear probing. This means the parameters in the foundation model are frozen; only the parameters in the decoder are trainable. I’ve found approaches with trainable parameters in the foundation model to often be unstable. I believe they require a careful search for feasible hyperparameters (e.g., learning rate). In contrast, linear probing seems to be insensitive to hyperparameter choices, which makes it a practical choice.
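
In code, linear probing reduces to freezing the encoder and optimizing only the decoder. A minimal sketch, where `encoder`, `decoder`, `train_loader`, and `loss_fn` are placeholders:

```python
import torch

# Freeze the foundation model; only the decoder is trainable.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()   # also fixes batch-norm statistics, if any

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for x, y in train_loader:
    with torch.no_grad():            # no gradients through the encoder
        features = encoder(x)
    loss = loss_fn(decoder(features), y.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```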

Training Data Sampling

The labeled satellite images are arbitrary in size; many cover hundreds of square kilometers. We know that sand mining occurs close to rivers. Thus, I want to focus the training effort on the relevant river landscapes. We define an area of interest (AOI), which is the area within a 1 km distance of a river. During training, I sample training data by sampling random windows within this AOI. All windows have a fixed size of 160 x 160 pixels, corresponding to 1600 x 1600 meters on the ground.

Limiting the random window sampling to the AOI reduces training time because we do not ‘waste’ training resources on irrelevant areas far away from rivers.
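
A sketch of this sampling step, assuming a boolean `aoi_mask` marking pixels within 1 km of a river; the helper function and its details are illustrative, not our exact implementation:

```python
import numpy as np

def sample_window(aoi_mask: np.ndarray, size: int = 160, rng=np.random):
    """Sample a random window whose center lies inside the AOI.

    `aoi_mask`: boolean (H, W) array marking pixels within 1 km of a river.
    Returns the (row, col) of the window's top-left corner.
    """
    rows, cols = np.nonzero(aoi_mask)            # candidate center pixels
    i = rng.randint(len(rows))
    r = int(np.clip(rows[i] - size // 2, 0, aoi_mask.shape[0] - size))
    c = int(np.clip(cols[i] - size // 2, 0, aoi_mask.shape[1] - size))
    return r, c

# Usage: crop a 160x160 training window from the image and label arrays.
r, c = sample_window(aoi_mask)
x = image[:, r:r + 160, c:c + 160]
y = labels[r:r + 160, c:c + 160]
```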

Satellite image. Quadrats indicate the location of sampled training windows. All of them are close to the river.
Training data: Sampling random windows within the AOI

Inference Strategy

During inference (when we obtain predictions for new images), I apply a systematic sliding window that covers the entire input image. Predictions at the edge of a window are expected to be worse because the neural network has less context for these edge pixels. My strategy involves discarding edge pixels and compensating for the gaps by letting the windows overlap. After stitching the individual predictions into a large prediction map, this strategy yields a smooth result.

Simple approach: Sliding window without overlapping windows. Chosen strategy: Overlapping windows with discarded edge pixels.
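
A minimal sketch of this overlapping-window strategy; the window size, margin, and stitching details are assumptions for illustration (image borders and non-divisible sizes are left unhandled for brevity):

```python
import numpy as np
import torch

def predict_full_image(model, image, window=160, margin=16):
    """Overlapping sliding-window inference (sketch).

    Windows are shifted by (window - 2*margin) so that, after discarding a
    `margin`-pixel border from each window's prediction, the kept centers
    tile the image seamlessly. `image` is a (C, H, W) float array.
    """
    _, H, W = image.shape
    stride = window - 2 * margin
    out = np.zeros((H, W), dtype=np.float32)
    for r in range(0, H - window + 1, stride):
        for c in range(0, W - window + 1, stride):
            patch = torch.from_numpy(image[:, r:r + window, c:c + window]).float()
            with torch.no_grad():
                pred = torch.sigmoid(model(patch[None]))[0, 0].numpy()
            # Keep only the well-contextualized center of the prediction.
            out[r + margin:r + window - margin,
                c + margin:c + window - margin] = pred[margin:-margin,
                                                       margin:-margin]
    return out
```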

Tools

  • Google Earth Engine (GEE) provides convenient access to various geospatial datasets, including Sentinel-2 images. We preprocess the images in GEE and export them to a storage unit on the Google Cloud Platform (GCP).
  • We use Labelbox to annotate sand mines. First, I populate a Labelbox project with URLs pointing to Sentinel-2 RGB images on GCP. Labelbox allows us to draw and edit polygons on these images. Afterward, we export the polygons as GeoJSONs.
  • PyTorch, of course.
  • Rastervision is a Python library for ML with satellite imagery. It handles many of the geospatial computations, for example associating the global geographical coordinate system with the pixel-wise coordinates.
  • Weights & Biases is an ML developer tool that I use to monitor, analyze, and compare different training runs. I highly recommend it! It is easy to set up and offers a large variety of features. Any ML project benefits from end-to-end training and evaluation pipelines that allow for quick experimentation.

Labelbox and Weights & Biases are commercial products, but free for academic use.
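
For the GEE step, a plausible preprocessing-and-export sketch looks like the following; the region, date range, cloud threshold, and bucket name are placeholders, not our project’s actual configuration:

```python
import ee

ee.Initialize()

region = ee.Geometry.Rectangle([78.0, 25.0, 78.3, 25.3])   # hypothetical AOI

# Build a cloud-filtered monthly composite of Sentinel-2 surface reflectance.
image = (ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED")
         .filterBounds(region)
         .filterDate("2023-01-01", "2023-01-31")
         .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20))
         .median())

task = ee.batch.Export.image.toCloudStorage(
    image=image,
    description="s2_export",
    bucket="my-gcp-bucket",            # hypothetical GCS bucket
    fileNamePrefix="sentinel2/scene",
    region=region,
    scale=10,                          # meters per pixel
    maxPixels=1e9,
)
task.start()
```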

Preliminary Results

How do we quantitatively compare the models? First, we assess precision and recall. Given all predicted sand mining pixels, precision is the percentage of actual sand mining pixels. Given all actual sand mining pixels, recall is the percentage of correctly predicted sand mining pixels. The precision-recall curve describes the trade-off between precision and recall at all possible decision thresholds. The F1-score summarizes a precision and recall pair at one particular decision threshold. The average precision summarizes the entire precision-recall curve.
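
Computed over flattened pixel arrays, these metrics are a few lines with scikit-learn; `y_true` and `y_score` are placeholders for the ground-truth labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# y_true: flattened ground-truth pixels in {0, 1}.
# y_score: predicted sand-mine probabilities in [0, 1].
ap = average_precision_score(y_true, y_score)        # summarizes the PR curve
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))  # single threshold: 0.5
print(f"AP={ap:.3f}  F1@0.5={f1:.3f}")
```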

Let’s look at some quantitative results from an initial investigation. The following table reports the average precision (AP) and the F1 score for a decision threshold at 0.5.

Quantitative results per model (preliminary)

Size matters.

We find that the large foundation model variants (SSL4EO-ResNet50 and SatMAE-large) perform best. Notably, the fully supervised U-Net performs better than the smaller variants of the foundation models (SSL4EO-ResNet18 and SatMAE-base). This is a surprising finding. I conclude that size matters. Possibly, the advantage of large models shrinks when finetuning with trainable encoder weights.

Next, let’s look at some predictions.

The model correctly detects the largest sand mining area. In some areas, the prediction does not match the ground truth. Often, these are areas where human labelers are uncertain about the annotation as well.

The following shows a mining site that was neither in the training nor in the validation set. It demonstrates the detector’s capability to precisely segment a mining site over time and monitor its extent.

Sand mining detector precisely segments a mining site.

Outlook. What's next?

It is unclear whether the sand mining detector’s current accuracy is sufficient to provide value to stakeholders outside of academia. Future work will continue to improve the current system. Some starting points:

  • We can continue the self-supervised pre-training of the foundation model with images of river landscapes. This makes the pre-trained model more specialized for river landscapes.
  • One idea is to account for annotation uncertainty in the training process by associating every annotation with a confidence level. Intuitively, high-confidence annotations should be weighted more than low-confidence ones. This can be realized by placing a lower weight in the loss function on pixels with low annotation confidence (see the sketch after this list).
  • Improving the quality and quantity of the training dataset. The performance of any ML model is determined mostly by its underlying training dataset, and only slightly by the neural network architecture, training strategy, etc. (This is unfortunate, because it’s fun to implement and experiment with new machine learning ideas, while improving a dataset is usually tedious.)
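
For the confidence-weighting idea, a minimal sketch assuming each pixel’s annotation confidence is available as a `conf` tensor in [0, 1] with the same shape as the target:

```python
import torch
import torch.nn as nn

# Per-pixel loss, kept unreduced so we can weight it before averaging.
bce = nn.BCEWithLogitsLoss(reduction="none")

def confidence_weighted_loss(logits, target, conf):
    """Down-weight pixels whose annotation confidence is low (sketch)."""
    per_pixel = bce(logits, target.float())
    return (conf * per_pixel).mean()
```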

Thank you for your interest in this project. We also presented our work at the NeurIPS 2023 workshop on Tackling Climate Change with Machine Learning. If you want to get in touch, don’t hesitate to reach out.
