Technical design for developers

Accessing data

The data access layer is the one that has seen the most development. The package currently gives users one-line-of-code access to many raw and derived spectral indices from MODIS, Landsat and Sentinel. Extending the catalogue to household survey data and admin boundaries is on the near-term roadmap.

Data access is centered around the concepts of Datasource and Dataset. The package contains a growing set of preconfigured Datasources of different types (databases, external APIs, custom zarr files, etc.) which can be called upon, even multiple times, to yield specific Datasets. The advantage is that possible filter arguments on the Dataset call (e.g. date_range) are decoupled from the general abstraction offered by the Datasource.
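
As a toy illustration of that split (the class and method names below are assumptions made for the sketch, not hip-analysis's actual API), the same Datasource can be asked for several Datasets with different filters:

```python
# Toy sketch of the Datasource / Dataset split; names are illustrative, not the real API.
from dataclasses import dataclass

import pandas as pd


@dataclass
class Datasource:
    """A reusable, preconfigured pointer to some data (database, API, zarr store, ...)."""

    name: str

    def get_data(self, date_range: tuple) -> pd.DataFrame:
        """Filter arguments such as date_range belong to the Dataset call,
        so one Datasource can be queried many times with different filters."""
        # A real implementation would query the underlying store here.
        dates = pd.date_range(date_range[0], date_range[1], freq="MS")
        return pd.DataFrame({"date": dates, "value": range(len(dates))})


ndvi_source = Datasource("sentinel2_ndvi_smoothed")              # one abstraction ...
ndvi_2021 = ndvi_source.get_data(("2021-01-01", "2021-12-31"))   # ... many Datasets
ndvi_2022 = ndvi_source.get_data(("2022-01-01", "2022-12-31"))
```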

Note that the concept of Datasource is broader than a simple connector to raw data. Datasources are intended to be wrappers over deterministic, though potentially complex, data extractions. For example, the cloud-masked, smoothed Normalized Difference Vegetation Index from Sentinel-2 is still considered a Datasource, even though its creation involves data processing as well as extraction. This allows datasets to be exposed to users while wrapping up and abstracting the details of how the data is sourced or constructed: the user simply calls higher-order API methods like get_bare_soil.
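
A derived-product Datasource can therefore bundle extraction and processing behind one deterministic call. The sketch below is illustrative only; the band stub and the get_bare_soil signature are assumptions, not the package's internals:

```python
# Illustrative only: a derived-index accessor that hides extraction and processing.
import numpy as np
import xarray as xr


def _extract_bands(date_range):
    """Stand-in for the extraction step (a real Datasource would hit a satellite archive)."""
    rng = np.random.default_rng(0)
    return {band: xr.DataArray(rng.random((3, 4, 4)), dims=("time", "y", "x"))
            for band in ("swir", "red", "nir", "blue")}


def get_bare_soil(date_range):
    """Hypothetical higher-order accessor: extraction, masking and index maths are
    wrapped up inside; the user only sees the finished Dataset."""
    b = _extract_bands(date_range)
    # Bare Soil Index: contrasts (SWIR + RED) against (NIR + BLUE).
    return ((b["swir"] + b["red"]) - (b["nir"] + b["blue"])) / (
        (b["swir"] + b["red"]) + (b["nir"] + b["blue"])
    )
```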

Given the computational cost of extraction, resulting Datasets are automatically cached locally on disk, without requiring any user input. This frees up RAM otherwise held by intermediate objects, while allowing the user to quickly retrieve a previous dataset in downstream analyses.
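
A generic sketch of that transparent caching pattern follows; the cache location and key scheme are assumptions for illustration, not the package's actual mechanism:

```python
# Generic disk-cache sketch; cache path and key scheme are illustrative assumptions.
from pathlib import Path

import pandas as pd

CACHE_DIR = Path(".hip_cache")   # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)


def expensive_extraction(source_name, date_range):
    """Placeholder for the real (slow) Datasource extraction."""
    dates = pd.date_range(date_range[0], date_range[1], freq="MS")
    return pd.DataFrame({"date": dates, "value": range(len(dates))})


def cached_get_data(source_name, date_range):
    """Return the Dataset from disk if this exact request was made before,
    otherwise compute it once and persist it."""
    path = CACHE_DIR / f"{source_name}_{date_range[0]}_{date_range[1]}.pkl"
    if path.exists():                         # fast re-load in downstream analyses
        return pd.read_pickle(path)
    data = expensive_extraction(source_name, date_range)
    data.to_pickle(path)                      # intermediate RAM can now be released
    return data
```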

Resulting Dataset objects are standard types (xarray, geopandas, pandas, dask, etc.) to maintain interoperability with commonly used libraries.

Data analysis logic

The data processing logic lives in two main submodules: analyses and ops. These map, respectively, to the package's higher-level and lower-level APIs.

Analyses are end-to-end, multi-step, user-facing methods. For example, this could be a routine drought analysis that we run upon request from the field. These are the methods we might imagine being used by an analyst. They may be bound to custom object types that come with hip-analysis, in particular the AnalysisArea class.
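
A rough sketch of the calling pattern is shown below; the AnalysisArea fields, the analysis signature and the return value are all assumptions for illustration:

```python
# Toy sketch of the analysis layer; field names and signatures are assumptions.
from dataclasses import dataclass


@dataclass
class AnalysisArea:          # stand-in for the real hip-analysis class
    iso3: str                # e.g. a country code
    admin_level: int         # e.g. 2 for districts


def drought_analysis(area: AnalysisArea, season: str) -> dict:
    """End-to-end, user-facing method: it would fetch the relevant Datasets,
    chain several ops (anomalies, thresholds, ...) and return a finished product."""
    # ... data access and ops would run here ...
    return {"area": area.iso3, "season": season, "drought_flag": False}


result = drought_analysis(AnalysisArea(iso3="MOZ", admin_level=2), season="2023-2024")
```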

Operations are lower-level, modular, generic analytical steps; an analysis is composed of several operations. This code is more developer-facing (data scientists, data engineers) and uses exclusively established data structures (xarray, numpy, pandas, etc.). Operations come in many forms: cloud masking, detrending, composite index creation, linear regression, significance tests, threshold detection, etc.
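
For instance, an operation might be a small, generic function over established data structures. The detrending op below is an illustrative example written against xarray, not code taken from the package:

```python
# Illustrative op: remove a per-pixel linear trend along the time dimension.
import numpy as np
import xarray as xr


def detrend_linear(da: xr.DataArray, dim: str = "time") -> xr.DataArray:
    """Generic, reusable step: works on any DataArray that has the given dimension."""
    coeffs = da.polyfit(dim=dim, deg=1)
    trend = xr.polyval(da[dim], coeffs.polyfit_coefficients)
    return da - trend


# Small synthetic (time, y, x) cube with an upward trend plus noise.
time = np.arange(24, dtype=float)
cube = xr.DataArray(
    0.1 * time[:, None, None] + np.random.default_rng(0).normal(size=(24, 3, 3)),
    dims=("time", "y", "x"),
    coords={"time": time},
)
detrended = detrend_linear(cube)
```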

Computation

Hip-analysis does not provision compute infrastructure, but expects a Dask cluster (which can also be a simple local Dask cluster). This is necessary to scale an algorithm to large areas.
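
For local development or small areas, a local cluster from dask.distributed is enough; the snippet below is standard Dask usage, independent of hip-analysis:

```python
# Standard Dask usage: spin up a local cluster and attach a client to it.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # laptop-scale setup
client = Client(cluster)           # subsequent dask-backed computations run on this cluster
print(client.dashboard_link)       # live dashboard for inspecting task progress
```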

Some analyses on large datasets, especially spatial datasets, require a high degree of compute optimization. For this reason, ops are often optimized using numba.
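
A typical pattern (generic numba usage, not a specific op from the package) is to JIT-compile the per-pixel inner loop:

```python
# Generic numba pattern: compile a per-pixel loop that would be slow in pure Python.
import numpy as np
from numba import njit


@njit(cache=True)
def count_below_threshold(stack, threshold):
    """For a (time, y, x) array, count per pixel how many time steps fall below threshold."""
    t, ny, nx = stack.shape
    out = np.zeros((ny, nx), dtype=np.int64)
    for k in range(t):
        for i in range(ny):
            for j in range(nx):
                if stack[k, i, j] < threshold:
                    out[i, j] += 1
    return out


counts = count_below_threshold(np.random.default_rng(0).random((10, 64, 64)), 0.2)
```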