Blending

Overview

National Meteorological Services need to provide the best possible agro-meteorological and climate risk products. Precipitation and temperature are two of the key variables as they are the main driver of most hydrological and climatological processes, and a multitude of climate products can be derived from basic estimates at daily or ten daily time step.

Data from meteorological stations provides locally very accurate data. However, depending on the density of the observing network, inference about the situation in ungauged areas may be difficult. Frequently there is a requirement from users for products in map form. Gridded products (rasters) are very desirable as they allow analysis and assessments over ungauged regions. In the absence of other resources, these requirements are met by interpolating station point values. The quality of the results depends on the variable being interpolated, the methods used and how sparse the observing network is.

A better way to proceed is by blending the station data with externally provided grids of the same or of a closely related variable (often satellite data). This preserves the information in the station data while using it to correct biases in the gridded variable.

Application to Rainfall

The key feature of rainfall data is its extreme variability even at short spatial scales. Many people are familiar with seeing a rain event in one place while conditions remain dry a few kilometers away. This means that raingauges are poor samplers of the true rainfall field. Gridded Satellite Rainfall Estimates (SREs) have become an increasingly popular instrument as a result of steadily improving quality and availability. However, they suffer from a varying degree of inaccuracies and biases. In particular, they tend to be conditionally biased by overestimating small rainfall events and underestimating large rainfall amounts.

Blending station rainfall and SRE data addresses these problems:

The blended product (like SREs) provides rainfall values at all points of the area of interest.
Values can be easily aggregated for specific areas such as administrative units or river catchments.
The inclusion of station data improves the accuracy of the blended products compared to the SRE and corrects conditional biases in the SRE.

Formulation

Blending is the combination of two estimates into a single value. We use a well known approach where the SRE is taken as a first-guess that is adjusted by observations from station measurements. The approach used is as follows:

$BRE = SRE + 𝑭(SRE-OBS)$

This means that we take the $SRE$ and apply a correction $𝑭$ that is related to the differences between observations $OBS$ and the $SRE$. $𝑭$ is estimated by interpolation (Gaussian process regression or kriging) applied to the residuals, i.e. the point differences between the $SRE$ and the gauge data. We call this blending procedure residual kriging as it allows to correct the gridded satellite estimates by combining them only with station point data through Gaussian processes. To be combined, SRE data needs to be unbiased to have the same mean as station data at each time step. Indeed, to be interpolated through kriging, we want to assume that those residuals follows a mean zero Gaussian process. Thus, we need to first perform an unbiasing of the SRE data such that SRE and station data have the same mean at each time steps in order for the residuals to have a zero mean.

SRE Unbiasing and residuals computation

We unbias the SRE values by introducing a mean ratio bias correction factor $ BC $ computed at each time step $ t $.

$BC(t) = \frac{mean(OBS(t)) + c}{mean(SRE(t)) + c}$

where $ c=1 $.

Unbiased SRE then reads

$SRE_{unbiased}(t) = SRE(t) * BC(t)$

We then compute the residuals between the unbiased SRE and the station data

$residuals(i, t) = SRE_{unbiased}(i, t) - OBS(i, t)$

where $i$ is the station location and $t$ the timestep.

Gaussian process regression - Kriging

The aim is to interpolate the computed residuals over the entire area. The method used to interpolate the residuals is called Kriging.

Kriging is an advanced geostatistical procedure that generates an estimated surface from a scattered set of points. An investigation of the spatial behaviour of the phenomenon needs to be made in order to use the kriging interpolation. Kriging assumes that the distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface. It is a multi-step process: it includes exploratory statistical analysis of the data, variogram modelling, and finally creating the surface by interpolation.

The general formula of kriging is the following:

$\hat{Z}(x_0) = \sum_{i=1}^N w_i(x_0) Z(x_i)$

where

$Z(x_i)$ is the measured value at the $i^{th}$ location
$w_i(x_0)$ is the weight for the measured value at the $i^{th}$ location
$x_0$ is the prediction location
$N$ is the number of measured values \end{itemize}

In Inverse Distance Weighting (IDW), the weight, $w_i$, depends solely on the distance to the prediction location. However, kriging introduces weights that are not only based on the distance between the measured points and the prediction location but also on the overall spatial arrangement of the measured points. To use the spatial arrangement in the weights, the spatial autocorrelation must be quantified with a semivariogram.

To make a prediction with the kriging interpolation method, two tasks are necessary:

Quantify the spatial autocorrelation by creating a semivariogram and fitting the model with a covariance function
Make the predictions (predict the unknown values with Gaussian process regression)

As a result, the data is used twice, first to estimate the spatial autocorrelation of the data and then to make the predictions.

Variography

Semivariograms characterize spatial dependence and variability by quantifying the degree of similarity or dissimilarity between pairs of spatial observations as a function of their separation distance. This mathematical function provides valuable insights into the spatial structure of a dataset, enabling the identification of spatial trends and patterns. This will allow us to find a relationship in the probability that it rains at a given point in function of how it is raining across the country.

Fitting a model, or spatial modelling, is also known as structural analysis, or variography. To begin, an empirical semivariogram is built, computed with the following equation for all pairs of locations separated by distance $h$:

$\gamma(h) = \frac{1}{2|N(h)|} \sum_{N(h)} (Z(x_i) - Z(x_j))^2$

with $h$ the distance between points $x_i$ and $x_j$ and $|N(h)|$ the number of distinct pairs in the set of all pairwise distances $i - j = h$.

The formula involves calculating the difference squared between the values of the paired locations. Often, each pair of locations has a unique distance, and there are often many pairs of points. To plot all pairs quickly becomes unmanageable. Instead of plotting each pair, the pairs are grouped into lag bins. In the case of our problematic, we have three dimensions: latitude, longitude, and time. We only want to compute the semivariogram value spatially. As a result, the semivariogram can be computed either by computing a semivariogram at each date for each pair of points or by dividing the dataset into different time periods (wet-dry season or monthly) and compute the semivariogram for each period.

Rainfall variabillity is varying a lot across months, the relationship between two locations is not the same in the middle of the wet season or during the dry season when rain is an exception. However, for a given month, this variability is similar across years. Thus, in this study we compute one semivariogram per month of the wet season and one characterizing the dry season.

The term "variogram" is often used to denote the semivariogram. Both terms will be used to design a semivariogram hereinafter.

The next step is to fit a model to the points forming the empirical variogram as the empirical variogram does not provide information for all possible directions and distances. For this reason, and to ensure that kriging predictions have positive kriging variances, it is necessary to fit a covariance model - that is, a continuous function or curve - to the empirical variogram. Several models can be used to fit the empirical variogram. We will detail three different covariance kernels here: spherical, matern and exponential kernels as they allow to approximate correctly rainfall variability.

Spherical covariance model

$\gamma(h) = \begin{cases} p \left( \frac{3h}{2a} - \frac{h^3}{2a^3} \right) + n, \quad d < a \\ p + n, \qquad \qquad \qquad d \ge a \\ \end{cases}$

Exponential covariance model

$\gamma(h) = p\left( 1 - e^{-\frac{h}{a}} \right) + n$

Matern covariance model

$\gamma(h) = p\left(1 - \rho\left(\frac{h}{a}\right)\right) + n, \quad with$

$\rho(h) = \frac{2^{1-\nu}}{\Gamma(\nu)} \cdot \left( \sqrt{\nu} \cdot \frac{h}{a} \right) ^{\nu} \cdot K_{\nu}\left( \sqrt{\nu} \cdot \frac{h}{a} \right)$

where $\Gamma$ is the gamma function and $K_{\nu}$ is the modified Bessel function of the second kind.

The empirical variogram $\gamma$ depends on several parameters that need to be fitted for a given covariance kernel. In the equations above, we have:

$\mathbf{h}$ the distance value at which to calculate the variogram.
$\mathbf{n}$ the nugget, intercept of the semivariogram. Theoretically, at zero separation distance (for example, $lag = 0$), the semivariogram value is $0$. However, at an infinitely small separation distance, the semivariogram often exhibits a nugget effect, which is a value greater than $0$.
$\mathbf{a}$ the range, distance at which the model levels out. Sample locations separated by distances closer than the range are spatially autocorrelated, whereas locations further apart than the range are not.
$\mathbf{s}$ the sill, limit of the variogram tending to infinity lag distances. It is the value at which the variogram model attains the range.
$\mathbf{p}$ the partial sill ($p = s - n$)

The choice of which model to use is based on the spatial autocorrelation of the data and on prior knowledge of rainfall variability. A covariance kernel is chosen for each empirical semivariogram (each month of the wet season and dry season) based on climatic experts' visual inspection and statistical results.

Kriging estimation and interpolation

The point residuals are then used to make the prediction, ie. an estimation of the residuals over the entire raster. To create a continuous surface of the phenomenon, predictions are made for each location, pixel centers, over the study area based on the semivariogram and the spatial arrangement of measured values that are nearby.

To do so, Ordinary Kriging is applied wrapping around the GSTools Python library. It provides a raster of interpolated residuals over the whole area.As a last step, a Gaussian Filter (convolution with a Gaussian kernel) of order 0 is applied to smooth softly those residuals and remove artefacts that could be introduced by a too sparse network of input stations.

Blending Rainfall Estimate (BRE)

The final blended product is obtained by subtracting back the interpolated residuals to the unbiased SRE:

$BRE = SRE_{unbiased} - RESIDUALS_{kriging}$