Load datasets¶
We start by defining our area of interest.
from hip.analysis import AnalysisArea
_BBOX = (9.316406, 32.546813, 41.835938, 46)
_DATES = "2020-01-01/2020-01-10"
area = AnalysisArea(bbox=_BBOX, datetime_range=_DATES)
You can then easily load a variety of datasets for this area using the AnalysisArea.get_dataset() method. Each dataset is identified by a key, a list of two strings. Here is an example of how to load the CHIRPS rainfall estimates.
rainfall = area.get_dataset(["CHIRPS", "rfh_dekad"])
rainfall
<xarray.DataArray 'band' (time: 1, latitude: 270, longitude: 651)>
dask.array<add, shape=(1, 270, 651), dtype=float64, chunksize=(1, 270, 651), chunktype=numpy.ndarray>
Coordinates:
* latitude (latitude) float64 45.98 45.93 45.88 ... 32.62 32.58 32.52
* longitude (longitude) float64 9.325 9.375 9.425 ... 41.73 41.78 41.83
spatial_ref int32 4326
* time (time) datetime64[ns] 2020-01-01
Attributes:
nodata: nan
By default, the data will be loaded for the bounding box and dates used to define the AnalysisArea. However, these can be overridden in the get_dataset call. This example overrides the single date of 2020-01-01 with the full 2019 range. The geographical area of interest can be changed in the same way.
rainfall = area.get_dataset(["CHIRPS", "rfh_dekad"], dates="2019-01-01/2019-12-31")
rainfall
<xarray.DataArray 'band' (time: 36, latitude: 270, longitude: 651)>
dask.array<add, shape=(36, 270, 651), dtype=float64, chunksize=(1, 270, 651), chunktype=numpy.ndarray>
Coordinates:
* latitude (latitude) float64 45.98 45.93 45.88 ... 32.62 32.58 32.52
* longitude (longitude) float64 9.325 9.375 9.425 ... 41.73 41.78 41.83
spatial_ref int32 4326
* time (time) datetime64[ns] 2019-01-01 2019-01-11 ... 2019-12-21
Attributes:
nodata: nan
To discover all the available datasets you can ... FEATURE TO BE ADDED IN A FUTURE PR
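The bounding box can be narrowed per call as well. The sketch below mirrors the dates override above; note that the bbox keyword name is an assumption here, so check the get_dataset API reference before relying on it.
# Hypothetical sketch: override the area for a single load, assuming
# get_dataset accepts a bbox keyword analogous to the dates keyword above.
rainfall_west = area.get_dataset(
    ["CHIRPS", "rfh_dekad"], bbox=(9.316406, 32.546813, 20.0, 40.0)
)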
Raster datasets¶
Raster data will be loaded into an xarray.DataArray or xarray.Dataset. You can learn more about how to use xarray here.
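As a quick taste of what xarray offers, the sketch below applies a few standard xarray operations to the rainfall DataArray loaded above. Since the data is backed by a lazy dask array, nothing is actually read until .compute() is called.
# Standard xarray operations on the rainfall DataArray loaded above.
rainfall_mean = rainfall.mean(dim="time")  # dekadal mean per pixel (lazy)
january = rainfall.sel(time=slice("2019-01-01", "2019-01-31"))  # subset in time
january_total = january.sum(dim="time").compute()  # trigger the dask computation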
Loading from STAC APIs¶
Most raster data is loaded from STAC APIs. hip-analysis takes care of interacting with the API for you, and in most cases, all you need to provide is the key for the dataset. However, if you want finer-grained control over how the data is loaded, you can pass arguments to the loading call via the gridded_load_kwargs key of the load_config argument of the get_dataset call. This example shows how to pass a custom dask chunks parameter, requesting the output to be returned in 100 x 100 chunks along the spatial dimensions.
LAT, LON = (22, 77)
BBOX = (LON, LAT, LON + 0.2, LAT + 0.2)
DATES = "2018-02-13/2018-02-14"
area = AnalysisArea(bbox=BBOX, hi_res=True, datetime_range=DATES)
area.get_dataset(
["LANDSAT", "NDVI"],
load_config={"gridded_load_kwargs": {"chunks": {"x": 100, "y": 100}}},
)
<xarray.DataArray 'LANDSAT-NDVI' (time: 1, y: 748, x: 699)>
dask.array<where, shape=(1, 748, 699), dtype=float64, chunksize=(1, 100, 100), chunktype=numpy.ndarray>
Coordinates:
* y (y) float64 2.457e+06 2.457e+06 ... 2.434e+06 2.434e+06
* x (x) float64 7.062e+05 7.062e+05 ... 7.271e+05 7.271e+05
spatial_ref int32 32643
* time (time) datetime64[ns] 2018-02-13T05:14:28.910728
Attributes:
nodata: nan
The next example shows how to pass a custom resampling method to get_dataset via the gridded_load_kwargs key of the load_config argument. You can choose one from the following options:
"nearest", "cubic", "bilinear", "cubic_spline", "lanczos", "average", "mode", "gauss", "max", "min", "med", "q1", "q3"
If no method is specified, the default "nearest" method is used. The resampling parameter can be either a string or a dictionary if you want to specify a method per band; in that case, the key corresponds to the name of the data variable, and the value to the chosen method. If you need the variable names in order to specify a resampling method per band, refer to the dataset documentation, or load the data for a reduced time interval over a small area and inspect it. This example uses NDVI from MODIS at 250m, loaded at high resolution (i.e. 30m) with resampling carried out using the "cubic_spline" method.
xx = area.get_dataset(
["MODIS", "250M_16_DAYS_NDVI"],
load_config={
"gridded_load_kwargs": {
"resampling": "cubic_spline",
}
},
)
xx.isel(time=-1).plot.imshow()
WARNING:root:nodata attribute is not set in the product '['MODIS', '250M_16_DAYS_NDVI']'. Computing without masking nodata values.
<matplotlib.image.AxesImage at 0x7f8e1d2cec20>
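The dictionary form is useful for products with several data variables. As an illustration, the sketch below borrows the DEA crop_mask product and bounding box used later on this page; its variables (mask, prob, filtered) each get their own resampling method.
# Sketch: per-band resampling, keyed by data variable name.
# Product and bounding box borrowed from the DEA crop_mask example below.
dea_area = AnalysisArea(
    bbox=(32.244186, -2.698206, 32.470093, -2.526723),
    hi_res=True,
    datetime_range=None,
)
crop = dea_area.get_dataset(
    ["DEA", "crop_mask"],
    load_config={
        "gridded_load_kwargs": {
            # categorical bands keep "mode"; the continuous probability band
            # can be interpolated smoothly with "bilinear"
            "resampling": {"mask": "mode", "prob": "bilinear", "filtered": "mode"},
        }
    },
)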
hip-analysis connects to multiple STAC APIs: WFP's Humanitarian Data Cube, Microsoft's Planetary Computer and Digital Earth Africa. For the most commonly used datasets, the connection to the appropriate STAC API is taken care of by hip-analysis. For example, the above ["LANDSAT", "NDVI"] key does not reveal where the data is loaded from (psst... it's Planetary Computer). hip-analysis also takes care of some pre-processing operations which, depending on the data source, include scaling and offsetting the data as needed, masking no-data values, and removing outliers.
However, datasets that are not already mapped to a curated key can still be loaded from the above STAC APIs via the get_dataset method. For example, the code below loads the crop_mask product from Digital Earth Africa without it being explicitly listed among the curated datasets in hip-analysis. Note, however, that the data will be loaded as-is, without any preprocessing or cleaning.
BBOX = (32.244186, -2.698206, 32.470093, -2.526723)
DATES = None
area = AnalysisArea(bbox=BBOX, hi_res=True, datetime_range=DATES)
crop_mask = area.get_dataset(["DEA", "crop_mask"])
crop_mask
WARNING:root:nodata attribute is not set in the product '['DEA', 'crop_mask']'. Computing without masking nodata values.
<xarray.Dataset>
Dimensions: (y: 633, x: 839, time: 1)
Coordinates:
* y (y) float64 9.721e+06 9.721e+06 ... 9.702e+06 9.702e+06
* x (x) float64 4.16e+05 4.16e+05 4.16e+05 ... 4.411e+05 4.411e+05
spatial_ref int32 32736
* time (time) datetime64[ns] 2019-01-01
Data variables:
mask (time, y, x) float32 dask.array<chunksize=(1, 633, 839), meta=np.ndarray>
prob (time, y, x) float32 dask.array<chunksize=(1, 633, 839), meta=np.ndarray>
filtered (time, y, x) float32 dask.array<chunksize=(1, 633, 839), meta=np.ndarray>
Loading from zarr files¶
hip-analysis also interacts with s3 buckets that contain zarr files. Importantly, the method to load the data is exactly the same as with STAC APIs: all you need to provide is the key for the dataset. The aim is that you do not have to worry about the source of the dataset you want to load, and the following examples show that the parameters used above can be used in the same way.
These datasets are also registered in the datasources documentation; the only difference is that the source of the dataset is not an external organization, but WFP-RAM's database. Similarly, if you want finer-grained control over how the data is loaded, you can pass arguments to the reprojection function via the gridded_load_kwargs key of the load_config argument of the get_dataset call. This example also shows how to pass a custom dask chunks parameter, requesting the output to be returned in 200 x 360 chunks along the spatial dimensions.
LAT, LON = (22, 77)
BBOX = (LON, LAT, LON + 0.2, LAT + 0.2)
DATES = "2018-02-13/2018-02-14"
area = AnalysisArea(bbox=BBOX, hi_res=True, datetime_range=DATES)
area.get_dataset(
["WFP_RAM", "M1I_RELATIVE_THRESHOLD_WAPOR"],
load_config={
"gridded_load_kwargs": {"chunks": {"longitude": 360, "latitude": 200}}
},
)
<xarray.Dataset> Size: 4MB
Dimensions: (y: 748, x: 699)
Coordinates:
* y (y) float64 6kB 2.457e+06 2.457e+06 ... 2.434e+06 2.434e+06
* x (x) float64 6kB 7.062e+05 7.062e+05 ... 7.271e+05 7.271e+05
spatial_ref int32 4B 32643
Data variables:
band (y, x) float64 4MB dask.array<chunksize=(200, 360), meta=np.ndarray>
The next example shows, as detailed in the STAC APIs section above, how to pass a custom resampling method to get_dataset via the gridded_load_kwargs key of the load_config argument. You can choose one from the following options:
"nearest", "cubic", "bilinear", "cubic_spline", "lanczos", "average", "mode", "gauss", "max", "min", "med", "q1", "q3"
If no method is specified, the default "nearest" method is used. The resampling parameter can be either a string or a dictionary if you want to specify a method per band; in that case, the key corresponds to the name of the data variable, and the value to the chosen method. If you need the variable names in order to specify a resampling method per band, refer to the dataset documentation, or load the data for a reduced time interval over a small area and inspect it. This example shows ECMWF forecasts at 1 degree resampled to a 0.25 degree resolution using the "lanczos" method.
BBOX = (9.316406, 32.546813, 41.835938, 46)
DATES = "2020-01-01/2020-06-01"
area = AnalysisArea(bbox=BBOX, resolution=0.25, datetime_range=DATES)
xx = area.get_dataset(
["ECMWF", "RFH_FORECASTS_SEAS5_ISSUE1"],
load_config={
"gridded_load_kwargs": {
"resampling": "lanczos",
}
},
)
xx.isel(time=-1, ensemble=0).plot.imshow()
<matplotlib.image.AxesImage at 0x1ff1d4d2830>
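Since the SEAS5 forecasts carry an ensemble dimension (a single member is selected with ensemble=0 above), a natural next step is to average across members. The sketch below uses plain xarray for this.
# Average across forecast ensemble members, then plot the last time step.
ensemble_mean = xx.mean(dim="ensemble")
ensemble_mean.isel(time=-1).plot.imshow()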