Analyzing spatiotemporal data - a seasonal monitoring example
Quick recap¶
In our first session, we saw how to define our area of interest, load data, and explore it visually.
We define our AOI using the AnalysisArea object.
from hip.analysis import AnalysisArea
BBOX = (37.968750,0.878872,50.976563,12.554564) # Horn of Africa
DATES = "2022-01-01/2023-03-31"
area = AnalysisArea(
    bbox=BBOX,
    datetime_range=DATES
)
We use AnalysisArea.get_dataset() to connect to data sources and load them into memory when appropriate.
spi3m = area.get_dataset(["CHIRPS","R3S_DEKAD"])
spi3m = spi3m.load()
And we can explore the data with tools in the hip.viz accessor.
spi3m.hip.viz.point_timeseries()
In today's session, we look at some hip-analysis features that are particularly relevant for national-scale monitoring applications on medium-resolution data (0.05-0.01 deg), for example seasonal monitoring. These features include the use of administrative boundaries, zonal aggregations, and further mapping and timeseries visualizations.
Note that this tutorial is primarily intended to demonstrate the available functionality. It does not necessarily present a scientifically sound approach to seasonal monitoring.
Define your AOI from administrative boundaries¶
Many analyses are primarily interested in reporting at admin level. To facilitate this, one can define the AnalysisArea object using the country's ISO3 code and the desired administrative level. This will automatically read the administrative boundaries from the VAM GeoAPI, store them in the AnalysisArea object, and configure the appropriate GeoBox.
Let's look at Ethiopia at admin level 1.
DATES = "2022-01-01/2023-03-31"
area = AnalysisArea.from_admin_boundaries(iso3='ETH', admin_level=1, datetime_range=DATES)
area.geobox
GeoBox output (abridged): geographic CRS WGS 84 (EPSG:4326), latitude/longitude axes in degrees.
The preloaded administrative boundaries are stored in a GeoPandas DataFrame under the area.BASE_AREA_DATASET key. They will be used as the default by any AnalysisArea function that needs geometries, unless you explicitly provide alternative geometries.
area.get_dataset([area.BASE_AREA_DATASET])
| Name | geometry | Code | Name |
|---|---|---|---|
| Addis Ababa | POLYGON ((38.65820 9.01460, 38.65940 8.98060, ... | 900532 | Addis Ababa |
| Oromia | POLYGON ((34.12950 9.20560, 34.13180 9.17360, ... | 900539 | Oromia |
| SNNPR | POLYGON ((34.19420 7.09660, 34.19510 7.06000, ... | 900540 | SNNPR |
| Somali | POLYGON ((41.79490 10.99190, 41.76660 10.99410... | 900541 | Somali |
| Tigray | POLYGON ((39.98320 14.44730, 39.95580 14.41530... | 900542 | Tigray |
| Afar | POLYGON ((39.98320 14.44730, 39.99450 14.38800... | 900533 | Afar |
| Amhara | POLYGON ((36.45450 13.76740, 36.42240 13.72880... | 900534 | Amhara |
| B. Gumuz | MULTIPOLYGON (((36.38140 9.47180, 36.38190 9.4... | 900535 | B. Gumuz |
| Dire Dawa | POLYGON ((42.34190 9.72130, 42.29680 9.76080, ... | 900536 | Dire Dawa |
| Gambela | POLYGON ((35.26420 7.54080, 35.26890 7.52180, ... | 900537 | Gambela |
| Harari | POLYGON ((42.07370 9.30730, 42.06370 9.26920, ... | 900538 | Harari |
It is also possible to query any administrative areas of interest at any point in your analysis, using the AnalysisArea.get_admin_boundaries method. Note, however, that these geometries won't have the special status given to the geometries used to create the AnalysisArea object.
area.get_admin_boundaries(iso3='SOM', admin_level=1)
|   | geometry | Code | Name |
|---|---|---|---|
| 0 | MULTIPOLYGON (((43.44780 11.48900, 43.45170 11... | 2688 | Awdal |
| 1 | POLYGON ((42.88530 4.30790, 42.89090 4.29150, ... | 2689 | Bakool |
| 2 | POLYGON ((50.17720 8.32820, 50.18250 8.33160, ... | 2690 | Bari |
| 3 | POLYGON ((43.12090 3.67150, 43.10210 3.25740, ... | 2691 | Bay |
| 4 | POLYGON ((45.60600 2.18480, 45.58760 2.18320, ... | 2692 | Banadir |
| 5 | POLYGON ((46.53630 6.51890, 46.46950 6.45040, ... | 2693 | Galgaduud |
| 6 | POLYGON ((42.88530 4.30790, 42.86240 4.28940, ... | 2694 | Gedo |
| 7 | POLYGON ((44.59210 4.93490, 44.60280 4.87170, ... | 2695 | Hiraan |
| 8 | MULTIPOLYGON (((42.89030 0.00090, 42.90510 0.0... | 2696 | Juba Hoose |
| 9 | POLYGON ((42.98090 1.46170, 43.26100 1.04640, ... | 2697 | Shabelle Hoose |
| 10 | POLYGON ((42.40090 1.95170, 41.64090 1.94170, ... | 2698 | Juba Dhexe |
| 11 | POLYGON ((45.39100 2.15170, 45.46100 2.17160, ... | 2699 | Shabelle Dhexe |
| 12 | POLYGON ((47.55920 7.56190, 47.55620 7.56230, ... | 2700 | Mudug |
| 13 | POLYGON ((49.08110 9.14050, 48.73100 8.78100, ... | 2701 | Nugaal |
| 14 | POLYGON ((47.41320 11.17750, 47.40670 11.17430... | 2702 | Sanaag |
| 15 | POLYGON ((46.04940 10.05040, 46.15100 9.44100,... | 2703 | Sool |
| 16 | POLYGON ((46.04940 10.05040, 46.01100 10.08090... | 2704 | Togdheer |
| 17 | POLYGON ((44.69100 8.77180, 44.69100 9.31100, ... | 2705 | Woqooyi Galbeed |
Zonal statistics¶
We are often interested in getting data over an area rather than at pixel level. We can use the AnalysisArea.zonal_stats method to compute statistics over an area, including 'mean', 'max', 'min', 'sum', 'std', 'var' and 'count'.
Here we will calculate the mean and standard deviation of the monthly rainfall Standardized Precipitation Index (SPI) over the administrative units we used to create the AnalysisArea object.
Technical note: the mean of per-pixel SPIs (calculated here) is not equal to the more accurate SPI of the mean rainfall across pixels.
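To make the distinction concrete, here is a minimal, self-contained illustration (plain NumPy, not hip-analysis code) using a simple z-score as a stand-in for the full SPI fitting procedure; it shows that the mean of per-pixel standardized values generally differs from the standardized value of the spatial mean.

import numpy as np

# Synthetic rainfall: 120 months x 4 pixels (illustrative only)
rng = np.random.default_rng(0)
rain = rng.gamma(shape=2.0, scale=30.0, size=(120, 4))

def zscore(x, axis=0):
    # Simple standardization, used here as a stand-in for the SPI calculation
    return (x - x.mean(axis=axis)) / x.std(axis=axis)

mean_of_z = zscore(rain).mean(axis=1)        # "mean of SPIs over pixels"
z_of_mean = zscore(rain.mean(axis=1))        # "SPI of the mean across pixels"

print(np.abs(mean_of_z - z_of_mean).max())   # generally non-zero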
admin_spi = area.zonal_stats(['CHIRPS','R1S_DEKAD'], stats=['mean','std'])
admin_spi
| zone | time | mean | std |
|---|---|---|---|
| Addis Ababa | 2022-01-01 | -183.894737 | 164.336271 |
| Oromia | 2022-01-01 | -58.526425 | 834.220157 |
| SNNPR | 2022-01-01 | 127.673807 | 737.566795 |
| Somali | 2022-01-01 | -395.969627 | 665.164207 |
| Tigray | 2022-01-01 | 1865.853372 | 1463.635691 |
| ... | ... | ... | ... |
| Amhara | 2023-03-21 | 522.217703 | 799.686746 |
| B. Gumuz | 2023-03-21 | -212.592525 | 446.880870 |
| Dire Dawa | 2023-03-21 | 1920.181818 | 68.074137 |
| Gambela | 2023-03-21 | 132.798561 | 263.439440 |
| Harari | 2023-03-21 | 2180.545455 | 60.233430 |
495 rows × 2 columns
The AnalysisArea.zonal_stats method takes care of the necessary steps for you: it loads the data if you have not done so already and calculates the aggregate statistics using the default admin boundaries (i.e. those used to create the AnalysisArea object).
Custom boundaries can also be provided, either as a GeoPandas GeoDataFrame or as an xarray DataArray. We will see an example of this later in this session. See the method reference for more details.
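As a preview of the GeoDataFrame variant, a hedged sketch could look like the following. The zones= parameter name is borrowed from the DataArray example later in this session and is an assumption here, so check the zonal_stats reference for the exact signature.

# Hypothetical sketch: passing custom geometries (a GeoDataFrame) to zonal_stats.
# The zones= parameter name is assumed; see the method reference.
eth_adm2 = area.get_admin_boundaries(iso3='ETH', admin_level=2)
adm2_spi = area.zonal_stats(['CHIRPS', 'R1S_DEKAD'], zones=eth_adm2, stats=['mean'])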
Visualising zonal statistics¶
We can use the hvplot package to analyse the timeseries interactively.
import hvplot.pandas
admin_spi.hvplot(groupby='zone')
To map zonal statistics we need to combine the calculated statistics with their corresponding geometries. We can use AnalysisArea.join_zonal_stats for this.
zonal_gdf = area.join_zonal_stats(admin_spi['mean']) # Note: can only join one statistic at a time
Now that the data is combined back with the corresponding geometries, we can use the hip.viz accessor to make maps for reporting.
from hdc.colors.rainfall import rxs
date = '2022-03-21'
zonal_gdf.hip.viz.map(
    title=f"1-Month SPI {date}",
    column=date,       # Select one date to plot
    annotate='Name',   # Column to use to add labels to areas
    cmap=rxs.cmap,
    vmin=-2000, vmax=2000,
    legend=True
)
Your turn!¶
In this exercise, you will calculate area statistics over admin areas and visualize the results.
- Create an AnalysisArea object from the administrative units of a country of your choice.
- Choose a data source offered by hip-analysis, and calculate some zonal stats over the admin areas.
- Use hvplot, hip.viz.map, or any other tools you are familiar with, to inspect the results.
Extra credit
- Use AnalysisArea.get_admin_boundaries to source a different set of geometries from the ones used to create the AnalysisArea. Looking at the zonal_stats reference, try to calculate zonal stats over these different geometries instead of the AnalysisArea's default geometries.
Monitoring the season¶
Seasonal monitoring applications are, naturally, most interested in the evolution of key variables during the growing season. Season timings, however, can vary significantly across space. Here we show how to mask out-of-season data on a pixel-by-pixel basis.
To do so, we will use a dataset created by RAM, identifying the (climatological) start and end of up to two seasons for each pixel. These are stored as dekadal indices, where 0 is the first dekad in January and dekads over 35 describe seasons that end in the following year. It can be accessed via the ["WFP_RAM","SOC_EOC"] key.
Technical note: the seasonality dataset may be subject to revision as we improve the methodology used to create it.
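As a quick aid to interpreting this convention, the helper below (illustrative only, not part of hip-analysis) converts a dekadal index into a year offset, month and dekad-of-month, assuming 3 dekads per month and 36 per year.

# Illustrative helper: interpret a dekadal index, where 0 is the first dekad of
# January and indices above 35 wrap into the following calendar year.
def dekad_to_month_dekad(idx: int) -> tuple[int, int, int]:
    year_offset, idx_in_year = divmod(idx, 36)
    month = idx_in_year // 3 + 1          # 1 = January ... 12 = December
    dekad_of_month = idx_in_year % 3 + 1  # 1, 2 or 3
    return year_offset, month, dekad_of_month

print(dekad_to_month_dekad(0))   # (0, 1, 1)  -> first dekad of January
print(dekad_to_month_dekad(8))   # (0, 3, 3)  -> third dekad of March
print(dekad_to_month_dekad(38))  # (1, 1, 3)  -> third dekad of January, next year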
seasons = area.get_dataset(["WFP_RAM","SOC_EOC"])
WARNING:root:nodata attribute is not set in the product '['WFP_RAM', 'SOC_EOC']'. Computing without masking nodata values.
For example, the map below shows the start of season 1, showing a prevalence of season starts in dekads ~5-10 (Feb-Apr) in the south east, and a later season start in the north west.
seasons.sel(soc_eoc_idx='soc_1').index.hip.viz.map(title='Start of season 1', vmin=0)
We can use this general information to extract specific season dates for a given point in time, and then use this information to mask out-of-season data on a pixel-by-pixel basis. The following code block is commented to explain the steps. Note that this sequence of steps is likely to be wrapped into a single function in the near future.
# Imports the necessary packages and methods
import datetime
import numpy as np
from hip.analysis.ops._seasonality import get_season_dates
# Pick the 'point in time'. This routine will find the dates of the most recent completed seasons before this date.
run_date = datetime.datetime(2023,3,31)
# Convert generic dekadal indices to specific season dates before run_date
ss1_dates, ss2_dates = get_season_dates(seasons, date=run_date)
# Pick season 1 or 2, whichever was most recent
seasondates = ss2_dates.where(
    (
        (np.datetime64(run_date) - ss2_dates[:, :, 1])
        < (np.datetime64(run_date) - ss1_dates[:, :, 1])
    ),
    ss1_dates,
)
# Load 1 month SPI data
spi = area.get_dataset(['CHIRPS','R1S_DEKAD'])
# Mask all pixel-timestep combinations that are out-of-season.
spi_season = spi.where(
    (spi.time >= seasondates.sel(soc_eoc_idx="soc"))
    & (spi.time <= seasondates.sel(soc_eoc_idx="eoc")),
    drop=True,  # out-of-season values are filled with NaN by default
)
The spi_season xr.DataArray now has null values whenever a pixel is out of season. We can now pass this data to the AnalysisArea.zonal_stats function to calculate seasonally-masked admin timeseries.
Note how AnalysisArea.zonal_stats can accept either a DataSource key or an xr.DataArray / xr.Dataset as input data.
admin_spi = area.zonal_stats(spi_season, stats=['mean','std'])
Finally, we present a further visualization tool that can be particularly handy when one wants to inspect the evolution of a variable over a panel of units, like a set of administrative areas.
The hip.viz.heatmap accessor can be called on any result of AnalysisArea.zonal_stats to render a heatmap showing the timeseries of a variable across all admin areas.
admin_spi['mean'].hip.viz.heatmap(
    title='Admin 1-Month SPI',
    kwargs={
        'cmap': 'RdBu',
        'vmax': '2000',
        'vmin': '-2000'}
)
Data-driven spatial clustering¶
Administrative units are convenient for reporting and policy responses. However, as they are not climatologically homogeneous, aggregations over their area may hide the heterogeneity of the seasonal evolution within them.
As an alternative approach, we demonstrate how to define areas based on similarity in their underlying data. We use the popular k-means algorithm to derive a spatial clustering. We use this technique here in the context of seasonal monitoring, but of course the method can be used in a variety of applications.
Recall that we already have an spi xr.DataArray at hand.
spi
<xarray.DataArray 'band' (time: 45, latitude: 230, longitude: 300)> Size: 25MB
dask.array<add, shape=(45, 230, 300), dtype=float64, chunksize=(1, 230, 300), chunktype=numpy.ndarray>
Coordinates:
* latitude (latitude) float64 2kB 14.88 14.82 14.78 ... 3.525 3.475 3.425
* longitude (longitude) float64 2kB 33.02 33.07 33.12 ... 47.88 47.92 47.98
spatial_ref int32 4B 4326
* time (time) datetime64[ns] 360B 2022-01-01 2022-01-11 ... 2023-03-21
Attributes:
nodata: nan

The following code clusters the XY space into 10 clusters based on the similarity of their SPI timeseries.
from hip.analysis.ops.clustering import xrKMeans
clustering = xrKMeans(n_clusters=10, random_state=0)
clustering.preprocess_data(spi)
clustering.fit()
xrKMeans(n_clusters=10, random_state=0)
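For orientation, the clustering step is conceptually similar to flattening each pixel's timeseries into a feature vector and running standard k-means on the pixels. The rough sketch below uses plain scikit-learn on the spi DataArray loaded earlier; it is not the xrKMeans implementation, just an illustration of the idea.

import numpy as np
from sklearn.cluster import KMeans

# Flatten (time, latitude, longitude) into one feature vector per pixel
arr = spi.transpose("time", "latitude", "longitude").values   # (T, Y, X)
T, Y, X = arr.shape
features = arr.reshape(T, Y * X).T                            # (pixels, T)

# KMeans cannot handle NaNs, so only cluster fully valid pixels
valid = ~np.isnan(features).any(axis=1)
labels = np.full(Y * X, np.nan)
labels[valid] = KMeans(n_clusters=10, random_state=0, n_init=10).fit_predict(features[valid])
labels_grid = labels.reshape(Y, X)                            # cluster id per pixel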
The resulting clustered areas are stored in the labels_da attribute, and can be visualized with hip.viz.
clustering.labels_da.hip.viz.map(
    title="Areas clustered by SPI",
    cmap='tab10'
)
As we are carrying out an analysis for Ethiopia, we would like to clip these clustered areas to the country's boundaries. AnalysisArea.get_admin_boundaries comes in handy here, as it returns a rasterized geometry when the rasterize=True flag is set. We can then mask the pixels that fall outside of the country boundaries.
# Imports
import xarray as xr
# Get ETH geometries as raster
admin_raster = area.get_admin_boundaries(iso3='ETH', admin_level=1, rasterize=True)
# Turn clustering.labels_da into zones, with null values outside of ETH
zones = xr.where(admin_raster!=-1, clustering.labels_da, np.nan)
# Inspect the results in a map
zones.hip.viz.map(
    title="ETH Areas clustered by SPI",
    cmap='tab10'
)
Finally, we can re-run our zonal statistics over the clustered areas. To do so, note how we pass an argument to the zones parameter of AnalysisArea.zonal_stats.
spi_custom_zones = area.zonal_stats(spi_season, zones=zones, stats=['mean','std'])
Here we display the same heatmap seen previously. However, now each row represents a spatial area that has experienced the displayed timeseries sequence homogeneously throughout its extent. Though not obvious in this case, this approach may help uncover specific data sequences that might get hidden when aggregating over climatologically arbitrary administrative boundaries.
spi_custom_zones['mean'].hip.viz.heatmap(
    title='1-Month SPI clusters',
    kwargs={
        'cmap': 'RdBu',
        'vmax': '2000',
        'vmin': '-2000'
    }
)
The following code helps understand where a given zone is found.
zone_id = 6
(zones==zone_id).hip.viz.map(cmap='Reds')
Your turn!¶
By now you have seen many of the features offered by hip-analysis, including a large DataSource collection; the ability to access administrative boundaries or to cluster custom areas based on data; the zonal_stats method; and a variety of visualization tools in hip.viz.
This exercise consists of attempting a minimal end-to-end analysis. Look at the DataSource catalogue and identify a dataset that might help answer a question of interest. Use the hip-analysis tools you have heard about so far, along with any other tools you already know, and carry out an end-to-end workflow around your topic of interest, including data loading, calculations and visualization. A bare-bones starter skeleton you can adapt is sketched below.
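If you would like a scaffold to start from, the skeleton below is one possible path; the country, dates and dataset keys are placeholders to replace with your own choices from the catalogue.

from hip.analysis import AnalysisArea

# Placeholders: swap the country, dates and dataset key for your own choices
area = AnalysisArea.from_admin_boundaries(
    iso3="KEN",
    admin_level=1,
    datetime_range="2022-01-01/2023-03-31",
)

data = area.get_dataset(["CHIRPS", "R1S_DEKAD"])
stats = area.zonal_stats(data, stats=["mean"])

stats["mean"].hip.viz.heatmap(title="My indicator by admin area")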
Good luck! We are here to answer any questions!