Summary

Digital Soil Mapping (DSM) for Mongolia for four properties: pH, Humus, Bulk Density, and Electrical Conductivity.
We present a guided workflow and a visualization of results. Only the guided workflow is presented in this document; a second document with the visualization of results will follow. The workflow consists of several connected scripts that prepare the data for modelling, fit the model and assess its performance, and produce the final prediction of soil property maps. A configuration file named config.R, located in the config/ folder, stores important variables that are needed by almost all of the scripts in the workflow. In config.R you will have to specify, for instance, the variable on which to run the workflow and the project_dir (the directory that contains all the data of the project: inputs, outputs, reports).

For this project we map the topsoil (depth 0 - 20 cm). The soil input data is provided in layers, each with a top depth and a bottom depth. These layers need to be combined into a single value for each profile. We do this by computing a weighted average, where the weights are the widths (bottom depth - top depth) of the layers.
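For instance, a profile with three layers in the 0 - 20 cm range collapses to a single value like this (a minimal illustration with hypothetical values; the actual computation is done in the preprocessing script below):

# Hypothetical profile with three layers within 0 - 20 cm
layers <- data.frame(top = c(0, 5, 12), bottom = c(5, 12, 20), pH = c(6.1, 6.4, 6.8))
layers$width <- layers$bottom - layers$top   # layer widths: 5, 7, 8
weighted.mean(layers$pH, layers$width)       # depth-weighted topsoil pH: 6.485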

Data preparation

This section shows the preparation of the input data for the DSM modelling. A first script cleans the soil observations dataset and writes the resulting cleaned dataset to disk. A second script preprocesses the cleaned input data: it filters the soil layers so that only topsoil (depth up to 20 cm) is considered and then applies a weighted average over the layers for each of the soil variables of the project. A third script uses the locations of the soil profiles to extract the values of the covariates at those locations.

Data cleaning script

List of imported packages for this section

library(dplyr)
library(tidyr)
library(stringr)
source("config/config.R")# modify according to your config (ex. data directory)

Read the soil input data

input_data <- read.csv(file.path(project_dir,"data/input_data.csv"))

Clean the data: the column Depth has several special characters that have to be erased. Once Depth is clean we can separate it into top and bottom.

input_data <- input_data %>% 
    mutate(Depth = str_replace_all(Depth, "([()?ABCD .])", "")) %>%
    separate(col = Depth, into = c("top","bottom")) %>%
    mutate(top = as.numeric(top), bottom = as.numeric(bottom))
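As a hypothetical example of what the replacement does to a raw value:

str_replace_all("0-20 (A)", "([()?ABCD .])", "")  # "0-20", then separated into top = 0, bottom = 20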

Compute depth: depth is the mean of top and bottom

input_data <- input_data %>% 
    mutate(depth = rowMeans(dplyr::select(.,top,bottom),na.rm = TRUE))

Filter Bulk Density: The BulkDensityMax parameter is set in the config file config/config.R

input_data$BulkDensity <- ifelse(input_data$BulkDensity <= BulkDensityMax, 
                                 input_data$BulkDensity,NA)

Write file to disk

write.csv(input_data,
          file.path(project_dir,"data/inpt_data_clean.csv"),
          row.names=FALSE)

Data preprocessing script

List of imported packages for this section

library(dplyr)
library(tidyr)
library(magrittr)
source('config/config.R')

Read the data (cleaned by the script scripts/data_preparation/input_data_clean.R)

observations <- read.csv(file.path(project_dir,'data/inpt_data_clean.csv'))

Convert the soil variables to numeric, select the layer variables (layer_vars) and the variables of interest (vois, as defined in config/config.R), keep only layers within maxDepth, and create a new variable width = bottom - top

observations <- observations %>%
    mutate_at(vois, as.numeric) %>%
    mutate_at(vois, round, 3) %>%
    select(all_of(layer_vars), all_of(vois)) %>%
    filter(bottom <= maxDepth) %>%
    mutate(width = (bottom - top))

Compute the weighted average of all soil variables (as defined in the variable vois in config/config.R)

observations_wavg <- observations %>%
    group_by(Prof_Code) %>%
    summarise_at(vois, ~weighted.mean(., width)) %>%
    mutate_at(vois, round, 3)

Write file to disk

write.csv(observations_wavg,
          file = paste0(project_dir, "data/input_data_preprocessed.csv"),
          row.names = FALSE, quote = FALSE)

Regression matrix script

List of imported packages for this section

library(terra)
library(sf)
library(dplyr)
source("config/config.R")

List covariates

covars_files <- list.files(normalizePath(paste0(project_dir,'/covars_sg250m/')),
                           pattern = '\\.tif$',
                           full.names = TRUE)

Read covariates stack with the {terra} package

covariates_stack <- rast(covars_files)
names(covariates_stack) <- tools::file_path_sans_ext(basename(covars_files))

Read soil observations

observations <- read.csv(file.path(project_dir,'data/inpt_data_clean.csv'))

Get only the information about profiles (exclude layers)

profiles <- distinct(select(observations,Prof_Code,lon,lat,fold),
                     Prof_Code,.keep_all = TRUE)

Convert profiles into a spatial point object with {terra}

profiles_spatial <- vect(profiles, geom = c('lon','lat'),
                             crs="+proj=longlat +datum=WGS84")

Reproject the spatial points to the same CRS as the covariates stack

profiles_spatial <- project(profiles_spatial,covariates_stack)

Perform a spatial overlay to get the values of the covariates at the locations of the soil observations. This operation can take several minutes.

regression_matrix <- terra::extract(covariates_stack,profiles_spatial) # returns an ID column plus one column per covariate

Combine the covariates data with the profiles

regression_matrix <- cbind(profiles,
                           data.frame(geom(profiles_spatial))[c('x','y')],
                           regression_matrix)

Write file to disk

write.csv(regression_matrix, 
          file.path(project_dir,'data/regression_matrix.csv'), 
          row.names = FALSE)

If the file is large you can save it as an .RDS instead

#saveRDS(regression_matrix,file.path(project_dir,"data/regression_matrix.RDS"))

Feature selection

This section shows the step of selecting the most relevant covariates using Recursive Feature Elimination (RFE), a feature selection algorithm from machine learning and statistics. This reduces the number of covariates used while maintaining high model performance and lowering the computational demand.

Feature selection script

List of imported packages and functions

library(tidyverse)
library(caret)
source('config/config.R')
source('R/rangerFuncs.R')

Read the covariates list, the soil observations input data, and the regression matrix

lst_covs_rfe <- read.table("config/covs_list_corr_0.85_all")[[1]]
observations <- read.csv(file.path(project_dir,
                                   'data/input_data_preprocessed.csv'))
df <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))

Join the soil observation data with the regression matrix (containing the values of the covariates at the soil profiles locations)

df <- left_join(observations,df)

Sanity check: only select the covariates names that are in the regression matrix

covars_list <- names(df)[names(df) %in% lst_covs_rfe]

Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates

df <- drop_na(df,all_of(voi),all_of(covars_list))

Define control options for the Recursive Feature Elimination (RFE)

ctrl <- rfeControl(functions = rangerFuncs,method = "repeatedcv",
                   number = 2,returnResamp="all",
                   saveDetails = TRUE, verbose = TRUE,
                   allowParallel = TRUE)

Define the sizes parameter for RFE using the number of covariates and custom thresholds

ncvrts <- length(covars_list)
if (ncvrts <= 15) {
    sizes <- unique(c(seq(0, ncvrts, 3), ncvrts))
} else if (ncvrts > 15 & ncvrts <= 75) {
    sizes <- unique(c(seq(15, ncvrts, 5), ncvrts))
} else {
    sizes <- unique(c(seq(15, 75, 5), seq(75, ncvrts, 20), ncvrts))
}
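For example, with 40 covariates the middle branch applies and RFE would evaluate the following subset sizes:

unique(c(seq(15, 40, 5), 40))  # 15 20 25 30 35 40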

If the log transformation parameter (defined in the config) is active, apply a log transformation to the soil variable of interest

if (trnsfrm == "log"){voi_vctr <- log(df[[voi]])}else{voi_vctr <- df[[voi]]}

Run the RFE using the caret package

rfe_profile <- rfe(df[covars_list],voi_vctr,sizes = sizes, rfeControl = ctrl)

Save the created object to disk

saveRDS(rfe_profile, file=paste0(project_dir, 
                     "data/rfe_profile_",voi,"_trnsfrm-",trnsfrm,".RDS"))

Also save the list of selected covariates in a .csv

write.table(predictors(rfe_profile),
            file=paste0(project_dir, 
                 "data/rfe_predictors_",voi,"_trnsfrm-",trnsfrm,".csv"), 
            row.names=FALSE, col.names=FALSE,quote = FALSE)
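The RFE results can also be inspected directly, for instance with the print and plot methods that caret provides for rfe objects:

#print(rfe_profile)            # performance for each subset size
#plot(rfe_profile, type = "o") # RMSE versus number of covariates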

Modelling

This section shows the model fitting and model validation. We use the Random Forest algorithm to model one of the soil variables (dependent variable) as a function of the spatial covariates (independent variables).

Model Validation script

This script performs a 10-fold cross-validation to assess the model performance. It uses the {tidymodels} framework.

List of imported packages and functions

library(ranger)
library(dplyr)
library(tidyr)
library(tidymodels)
library(Metrics)
source('config/config.R')
source('R/ccc_with_bias.R')
source('R/MEC.R')

Define the output file path: an .RDS file which will contain the results of the cross validation

outfile = file.path(project_dir,"outputs",voi,
                    paste0("qrf_model_cv","_trnsfrm-",trnsfrm,".RDS"))

Read the preprocessed soil observation data and the regression matrix

observations <- read.csv(file.path(project_dir,
                                   'data/input_data_preprocessed.csv'))
rgrssn_mtrx <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))

Join the regression matrix and the soil observations together

rgrssn_mtrx <- left_join(observations,rgrssn_mtrx)

Read the selected covariates for the soil variable of interest.

cvrts <- readLines(paste0(project_dir, "data/rfe_predictors_",
                              voi,"_trnsfrm-",trnsfrm,".csv"))

Sanity check: only select the covariates names that are in the regression matrix

cvrts <- names(rgrssn_mtrx)[names(rgrssn_mtrx) %in% cvrts]

Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates

rgrssn_mtrx <- drop_na(rgrssn_mtrx, all_of(voi), all_of(cvrts))

Prepare a formula for the model: if the log transformation parameter (defined in the config) is active, the formula applies a log transformation to the soil variable of interest

if (trnsfrm == "log"){
    fmla=as.formula(paste0("log(",voi,")","~",paste0(cvrts,collapse="+")))
}else {
    fmla=as.formula(paste0(voi,"~",paste0(cvrts,collapse="+")))
}
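As a hypothetical example of the resulting formula:

# with voi = "Humus", cvrts = c("dem", "ndvi") and trnsfrm = "log":
# fmla is log(Humus) ~ dem + ndvi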

Prepare the dataset for cross-validation. The groups used to split the data are defined in the column fold of the dataframe and have been created beforehand by ISRIC.

splitted_regr_matr <- group_vfold_cv(data=rgrssn_mtrx,group = fold)

Specify model and engine (Random Forest with quantile regression) and other hyperparameters (ntree)

rf_mod <- rand_forest(trees = ntree) %>% #, mtry = mtry
          set_engine("ranger",importance = "impurity",
                     quantreg=TRUE,seed = rndseed) %>% 
          set_mode("regression")

Specify the tidymodels workflow by creating an empty workflow object and adding the just-created model and the formula

rf_wf <- workflow() %>%
         add_model(rf_mod) %>%
         add_formula(fmla)

Define custom metrics to assess model performance, in this case ccc with bias and mec (the model efficiency coefficient, commonly defined as 1 - SSE/SST). These have been pre-defined in the R/ccc_with_bias.R and R/MEC.R scripts.

multi_metric <- metric_set(yardstick::rmse, rsq, ccc_with_bias,mec)

Run the cross validation using the created workflow and the specified metrics

rf_fit_rs <- rf_wf %>% 
             fit_resamples(splitted_regr_matr,
             metrics = multi_metric,
             control = control_resamples(save_pred = TRUE))


Optionally inspect the cross-validation results:

#collect_metrics(rf_fit_rs)
#show_best(rf_fit_rs,metric = "mec")

Save the cross validation results into an .RDS object for later evaluation

saveRDS(rf_fit_rs,outfile)

Model fit script

This script performs the model fitting using the soil observation values and the covariates at the locations of the soil profiles

List of imported packages and functions

library(ranger)
library(dplyr)
library(tidyr)
source('config/config.R')

Define the output file path: an .RDS file which will contain the results of the model fitting

outfile = file.path(project_dir,"outputs",voi,
                    paste0("qrf_model","_trnsfrm-",trnsfrm,".RDS"))

Read the preprocessed soil observation data and the regression matrix

observations <- read.csv(file.path(project_dir,
                                   'data/input_data_preprocessed.csv'))
rgrssn_mtrx <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))

Join the regression matrix and the soil observations together

rgrssn_mtrx <- left_join(observations,rgrssn_mtrx)

Read the selected covariates for the soil variable of interest.

cvrts <- readLines(paste0(project_dir, "data/rfe_predictors_",
                              voi,"_trnsfrm-",trnsfrm,".csv"))

Sanity check: only select the covariates names that are in the regression matrix

cvrts <- names(rgrssn_mtrx)[names(rgrssn_mtrx) %in% cvrts]

Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates

rgrssn_mtrx <- drop_na(rgrssn_mtrx, all_of(voi), all_of(cvrts))

Prepare a formula for the model: if the log transformation parameter (defined in the config) is active, the formula applies a log transformation to the soil variable of interest

if (trnsfrm == "log"){
    fmla=as.formula(paste0("log(",voi,")","~",paste0(cvrts,collapse="+")))
}else {
    fmla=as.formula(paste0(voi,"~",paste0(cvrts,collapse="+")))
}

Fit the Random Forest model

mdl_fit <- ranger(fmla, rgrssn_mtrx,
                  seed = rndseed,
                  importance = "impurity",
                  quantreg = TRUE,
                  mtry = mtry,
                  num.trees = ntree)
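Optionally, the impurity-based variable importance of the fitted model can be inspected with ranger's importance() accessor:

#head(sort(ranger::importance(mdl_fit), decreasing = TRUE), 10)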

Create the directory (if it does not exist) to store the fitted model and write the fitted model to disk in an .RDS file

dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)
saveRDS(mdl_fit,outfile)

Prediction

This section shows how to use the fitted model to predict over the study area (Mongolia), using the spatial covariates as independent variables for the model.

The script is designed to split the computation into a number of tiles (portions of the study area) so that less memory is used and parallel computation is possible.

Prediction script

List of imported packages and functions

library(terra)
library(ranger)
library("future.apply")
source("config/config.R")
source("R/rast_split_in_tiles.R")

Prepare for parallel computation by choosing the number of workers

plan(multisession(workers = 4)) ## Run in parallel on 4 cores

Read the fitted Random Forest model on the soil variable of interest (voi)

mdl_fit <- readRDS(file.path(project_dir,"outputs",voi,
                    paste0("qrf_model","_trnsfrm-",trnsfrm,".RDS")))

Read the selected covariates for the soil variable of interest and the list of covariate files (the GeoTIFFs)

cvrts_lst <- readLines(paste0(project_dir, "data/rfe_predictors_",
                              voi,"_trnsfrm-",trnsfrm,".csv"))
cvrts_files <- normalizePath(paste0(project_dir,'/covars_sg250m/',
                                    cvrts_lst,".tif"))

Define the location of the landmask file. This is a GeoTIFF that is used to mask out areas where soil is not present (water bodies, rocky areas, cities, etc.)

landmask_file <- file.path(project_dir,
                            "mask/landmask_Mongolia_homolosine_250.tif")

Read the covariates files using the rast function from the {terra} package

cvrts <- rast(cvrts_files)

Create splits to reduce the size of the maps handled at once, using a custom function that can be found in R/rast_split_in_tiles.R. This splits the extent of the covariates into a user-defined number of tiles.

splits <- split_extent(extent = ext(cvrts), ntiles = ntiles,split_type="rect")
nsplits <- length(splits)
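The actual tiling logic lives in R/rast_split_in_tiles.R; for intuition only, a minimal sketch of a rectangular split over a numeric extent c(xmin, xmax, ymin, ymax) could look like this (split_extent_sketch is a hypothetical stand-in, not the project function):

# Minimal sketch: split an extent into nx * ny rectangular tiles
split_extent_sketch <- function(extent, nx, ny) {
    xs <- seq(extent[1], extent[2], length.out = nx + 1)
    ys <- seq(extent[3], extent[4], length.out = ny + 1)
    tiles <- list()
    for (i in seq_len(nx)) {
        for (j in seq_len(ny)) {
            tiles[[length(tiles) + 1]] <- c(xs[i], xs[i + 1], ys[j], ys[j + 1])
        }
    }
    tiles
}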

To run the prediction on the tiles we define a function (read_predict) and apply it to the list of tiles with future_lapply(). The function read_predict reads the covariates only within the area specified by the selected extent, then masks the covariates using the landmask. Once the covariates are read into a raster stack they are converted to a data.frame and used to predict. Both the quantiles 0.05, 0.5, 0.95 and the mean are predicted. The predicted values are then converted to raster and written as GeoTIFFs.

read_predict <- function(i,splits,nsplits,
                         cvrts_files,landmask_file,
                         voi,trnsfrm){

    landmask <- rast(landmask_file)
    NAflag(landmask) <- 0 # assign the correct NA value to the mask

    cvrts <- rast(cvrts_files)
    names(cvrts) <- tools::file_path_sans_ext(basename(cvrts_files))
    print(paste("Working on:",voi,i,"of",nsplits))

    e <- ext(splits[[i]][1],splits[[i]][2],splits[[i]][3],splits[[i]][4])

    window(cvrts) <- NULL
    window(cvrts) <- e        # limit subsequent reads to the tile extent
    window(landmask) <- NULL
    window(landmask) <- e

    # Here we read into memory so this operation could take some time
    covariates_window_mskd <- mask(cvrts,landmask)
    covariates_window_mskd_df <- as.data.frame(covariates_window_mskd, 
                                               xy=TRUE, na.rm=TRUE)

    if (nrow(covariates_window_mskd_df) != 0) { # check that the tile is not empty
    
    rm(covariates_window_mskd)
    gc()

    preds_qntl <- predict(mdl_fit,covariates_window_mskd_df, 
                     type = 'quantiles',quantiles = c(0.05, 0.5, 0.95))

    for (predname in colnames(preds_qntl$predictions)) {
        
        bandname <- paste0(voi,"_",sub("quantile= ","",predname))
        outfile <-file.path(project_dir,"outputs",voi,"predictions",
                    paste0("Q",sub("quantile= ","",predname)),
                    paste0("ntiles_",nsplits),
                    paste0(bandname,"_",sprintf("%05d", i),".tif"))
        
        if (trnsfrm == "log"){
            prdctn_vls <- exp(preds_qntl$predictions[,predname])*10
        }
        else if(trnsfrm == "none"){
            prdctn_vls <- preds_qntl$predictions[,predname]*10
        }

        dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)

        pred_raster <- rast(cbind(covariates_window_mskd_df[c("x","y")],
                           prdctn_vls),
                    type="xyz", 
                    crs=homolosine_wkt, digits=0)

        names(pred_raster) <- bandname
        writeRaster(pred_raster, outfile, overwrite=TRUE, 
            wopt= list(gdal=c("COMPRESS=LZW"), datatype='INT2U'))
    }

    rm(preds_qntl)
    rm(pred_raster)
    rm(prdctn_vls)
    gc()

    preds_mean <- predict(mdl_fit,covariates_window_mskd_df, type = 'response')

    bandname <- paste0(voi,"_","mean")
    outfile <-file.path(project_dir,"outputs",voi,"predictions","mean",
                  paste0("ntiles_",nsplits),
                  paste0(bandname,"_",sprintf("%05d", i),".tif"))
    dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)

    if (trnsfrm == "log"){
            prdctn_vls <- exp(preds_mean$predictions+0.5*var(preds_mean$predictions))*10
        }else if(trnsfrm == "none"){
            prdctn_vls <- preds_mean$predictions*10
        }

    pred_raster <- rast(cbind(covariates_window_mskd_df[c("x","y")],
                              prdctn_vls),
                        type="xyz", 
                        crs=homolosine_wkt, digits=0)

    names(pred_raster) <- bandname
    writeRaster(pred_raster, outfile, overwrite=TRUE, 
                wopt= list(gdal=c("COMPRESS=LZW"), datatype='INT2U'))

    rm(preds_mean)
    rm(pred_raster)
    rm(prdctn_vls)
    gc()

return(paste0(bandname,"_",sprintf("%05d", i),"of",nsplits,".tif"))

}

}

Apply the function read_predict to the tiles in parallel with future_lapply()

y <- future_lapply(seq_along(splits), FUN = read_predict,
                   splits,nsplits,cvrts_files,landmask_file,voi,trnsfrm,
                   future.packages = c("ranger","terra"),
                   future.seed=TRUE
                   )
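After the parallel run it is good practice to release the workers by resetting the plan:

plan(sequential)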

Mosaic script

List of imported packages and functions

library(gdalUtilities)
source("config/config.R")

Define the folder where the predictions for the soil variable of interest are stored, which maps (quantiles or mean) to select, and the directory where the tiles are

prdctn_dir <- paste0(project_dir,"/outputs/",voi,"/predictions/")
stats <- c("Q0.05","Q0.5","Q0.95","mean")
ntile_dir <- "ntiles_240/"

Loop through all the maps and create a .vrt to mosaic the tiles. Finally, warp the .vrt to the desired Coordinate Reference System and save it as a Cloud Optimized GeoTIFF

for (stat in stats) {

    stat_dir <- paste0(prdctn_dir,"/",stat,"/",ntile_dir)

    tifs <- list.files(stat_dir, pattern = "\\.tif$", full.names = TRUE)
    outvrt <- file.path(prdctn_dir,paste0(voi,"_",stat,".vrt"))
    outcog <- file.path(prdctn_dir,paste0(voi,"_",stat,"_cog_","32648",".tif"))

    gdalbuildvrt(tifs,outvrt)
    #gdal_translate(outvrt, outcog ,of= "COG" ,co= "COMPRESS=LZW")
    gdalwarp(outvrt, outcog,t_srs =  "EPSG:32648",
             of= "COG" ,co= "COMPRESS=LZW")
}

Appendix

Config file script

Here follows the content of the configuration file, which creates the variables used by the scripts in the workflow

The directory where project data is stored

project_dir = "DSM_Mongolia/"

The variable of interest

voi = "Humus"

The complete list of variables of interest

vois = c( "pH","EC","BulkDensity","Humus","BD")#,"depth"

vois_to_log = c("Humus")#

if (voi %in% vois_to_log) {trnsfrm = "log"} else {trnsfrm = "none"}

The list of variables related to the profile/layer

layer_vars = c("Prof_Code","top","bottom")
laver_id = "Prof_Code"

Set the maximum depth used to filter the input data

maxDepth = 20

Set the maximum allowed value for Bulk Density

BulkDensityMax = 2

Read the Homolosine Coordinate Reference System definition (Well Known Text)

homolosine_wkt <- readLines("config/homolosine.txt")

Random forest hyperparameters

mtry= NULL
ntree= 500

Seed for random processes

rndseed = 424242
set.seed(rndseed)

Number of tiles used to split the prediction process

ntiles <- 250

Functions

In addition to the config file, a number of custom functions have been created to deal with specific parts of the workflow. They can be found in the R folder of the repository: R/rangerFuncs.R, R/ccc_with_bias.R, R/MEC.R, and R/rast_split_in_tiles.R.