Digital Soil Mapping (DSM) for Mongolia for four properties: pH, Humus, Bulk Density, Electrical Conductivity.
We present a guided workflow and visualization of results. In this document only the guided workflow is presented; a second document with the visualization of results will follow. The workflow consists of several connected scripts that prepare the data for modelling, fit the model and assess its performance, and produce the final prediction of the soil property maps. A configuration file named config.R, located in the config/ folder, stores important variables that are needed by almost all of the scripts in the workflow. In config.R you will, for instance, have to specify on which variable to run the workflow and the project_dir (the directory which contains all the data of the project: inputs, outputs, reports).
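For orientation, the layout of project_dir assumed throughout, reconstructed from the paths used in the scripts below (your actual layout may differ), is roughly:
DSM_Mongolia/            # example project_dir
  data/                  # input_data.csv, cleaned and preprocessed tables, regression_matrix.csv
  covars_sg250m/         # the covariate GeoTIFFs
  mask/                  # landmask_Mongolia_homolosine_250.tif
  outputs/<voi>/         # fitted models, cross validation results, predictions/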
For this project we map the topsoil (depth 0 - 20 cm). The soil input data is provided in layers that have a top depth and a bottom depth. These layers need to be combined into a single value for each profile. We do this by performing a weighted average, where the weights are the widths (bottom depth - top depth) of the layers.
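As a toy illustration (made-up values) of this weighted average:
# Two layers, 0-10 cm and 10-20 cm, with pH 6.2 and 6.8
layers <- data.frame(top = c(0, 10), bottom = c(10, 20), pH = c(6.2, 6.8))
layers$width <- layers$bottom - layers$top
weighted.mean(layers$pH, layers$width) # (6.2*10 + 6.8*10) / 20 = 6.5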
This section shows the preparation of the input data for the DSM modelling. A script cleans the soil observations dataset and writes the resulting cleaned dataset to disk. Another script preprocesses the cleaned input data: in this workflow the preprocessing consists of filtering the soil layers so that only topsoil (depth up to 20 cm) is considered, and then applying a weighted average over the layers for each of the soil variables of the project. An additional script uses the locations of the soil profiles and extracts the values of the covariates at those locations.
List of imported packages for this section
library(dplyr)
library(tidyr)
library(stringr)
source("config/config.R")# modify according to your config (ex. data directory)
Read the soil input data
input_data <- read.csv(file.path(project_dir,"data/input_data.csv"))
Clean data: the column Depth contains several special characters that have to be removed. Once Depth is clean we can separate it into top and bottom
input_data <- input_data %>%
mutate(Depth = str_replace_all(Depth, "([()?ABCD .])", "")) %>%
separate(col = Depth, into = c("top","bottom")) %>%
mutate(top = as.numeric(top), bottom = as.numeric(bottom))
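For illustration, with some hypothetical raw Depth values (the actual dataset may differ), the regular expression removes parentheses, question marks, the horizon letters A-D, spaces, and dots:
str_replace_all(c("A 0-20", "(20-40?)", "B 40-60."), "([()?ABCD .])", "")
# "0-20" "20-40" "40-60" ; separate() then splits top and bottom on the "-"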
Compute depth: depth is the mean of top and bottom
input_data <- input_data %>%
mutate(depth = rowMeans(dplyr::select(.,top,bottom),na.rm = TRUE))
Filter Bulk Density: The BulkDensityMax parameter is set in the config file config/config.R
input_data$BulkDensity <- ifelse(input_data$BulkDensity <= BulkDensityMax,
input_data$BulkDensity,NA)
Write file to disk
write.csv(input_data,
file.path(project_dir,"data/inpt_data_clean.csv"),
row.names=FALSE)
List of imported packages for this section
library(dplyr)
library(tidyr)
library(magrittr)
source('config/config.R')
Read the data (cleaned by the script scripts/data_preparation/input_data_clean.R)
observations <- read.csv(file.path(project_dir,'data/inpt_data_clean.csv'))
Convert the soil variables to numeric and select the layer_vars and vois columns (as defined in config/config.R). Create a new variable width = bottom - top
observations <- observations %>%
mutate_at(vois, as.numeric) %>%
mutate_at(vois, round,3) %>%
select(all_of(layer_vars),all_of(vois)) %>%
filter(bottom <= maxDepth) %>%
mutate(width = (bottom - top))
Compute the weighted average of all soil variables (as defined in the variable vois in config/config.R)
observations_wavg <- observations%>%
group_by(Prof_Code) %>%
summarise_at(vois,~weighted.mean(.,width))%>%
mutate_at(vois,round,3)
Write file to disk
write.csv(observations_wavg,
file=paste0(project_dir, "data/input_data_preprocessed.csv"),
row.names=FALSE,quote = FALSE)
List of imported packages for this section
library(terra)
library(sf)
library(dplyr)
source("config/config.R")
List covariates
covars_files <- list.files(normalizePath(paste0(project_dir,'/covars_sg250m/')),
pattern = '\\.tif$',
full.names = TRUE)
Read covariates stack with the {terra} package
covariates_stack <- rast(covars_files)
names(covariates_stack) <- tools::file_path_sans_ext(basename(covars_files))
Read soil observations
observations <- read.csv(file.path(project_dir,'data/inpt_data_clean.csv'))
Get only the information about profiles (exclude layers)
profiles <- distinct(select(observations,Prof_Code,lon,lat,fold),
Prof_Code,.keep_all = TRUE)
Convert profiles into a spatial point object with {terra}
profiles_spatial <- vect(profiles, geom = c('lon','lat'),
crs="+proj=longlat +datum=WGS84")
Reproject the spatial points to the same CRS as the covariates stack
profiles_spatial <- project(profiles_spatial,covariates_stack)
Perform a spatial overlay to get the values of the covariates at the locations of the soil observations; this operation can take several minutes
regression_matrix <- terra::extract(covariates_stack,profiles_spatial)
Combine the covariates data with the profiles
regression_matrix <- cbind(profiles,
data.frame(geom(profiles_spatial))[c('x','y')],
regression_matrix)
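Note that terra::extract() prepends a point-index column named ID to its output, which is therefore carried into regression_matrix. If it is not needed downstream, it can optionally be dropped:
regression_matrix$ID <- NULL # drop the extract() point index (optional)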
Write file to disk
write.csv(regression_matrix,
file.path(project_dir,'data/regression_matrix.csv'),
row.names = FALSE)
If the file is large you can save it as an .RDS instead
#saveRDS(regression_matrix,file.path(project_dir,"data/regression_matrix.RDS"))
This section shows the step of selecting the most relevant covariates using Recursive Feature Elimination (RFE), a feature selection algorithm from machine learning and statistics. This reduces the number of covariates used while maintaining high model performance and lowering the computational demand.
List of imported packages and functions
library(tidyverse)
library(caret)
source('config/config.R')
source('R/rangerFuncs.R')
Read the covariates list, the soil observations input data, and the regression matrix
lst_covs_rfe <- read.table("config/covs_list_corr_0.85_all")[[1]]
observations <- read.csv(file.path(project_dir,
'data/input_data_preprocessed.csv'))
df <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))
Join the soil observation data with the regression matrix (containing the values of the covariates at the soil profile locations)
df <- left_join(observations,df) # joins on the shared Prof_Code column
Sanity check: keep only the covariate names that are present in the regression matrix
covars_list <- names(df)[names(df) %in% lst_covs_rfe]
Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates
df <- drop_na(df,all_of(voi),all_of(covars_list))
Define control options for the Recursive Feature Elimination (RFE)
ctrl <- rfeControl(functions = rangerFuncs,method = "repeatedcv",
number = 2,returnResamp="all",
saveDetails = TRUE, verbose = TRUE,
allowParallel = TRUE)
Define the sizes parameter for RFE using the number of covariates and custom thresholds
ncvrts <- length(covars_list)
if (ncvrts<=15) {
sizes=unique(c(seq(0,ncvrts,3),ncvrts))
} else if (ncvrts > 15 & ncvrts <= 75) {
sizes=unique(c(seq(15,ncvrts,5),ncvrts))
} else if (ncol(df[covars_list]) > 75) {
sizes=unique(c(seq(15,75,5),seq(75,ncvrts,20),ncvrts))
}
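As a concrete illustration (with a hypothetical count of 40 candidate covariates), the middle branch would give:
unique(c(seq(15, 40, 5), 40)) # -> 15 20 25 30 35 40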
If the log-transformation parameter (defined in the config) is active, apply a log transformation to the soil variable of interest
if (trnsfrm == "log"){voi_vctr <- log(df[[voi]])}else{voi_vctr <- df[[voi]]}
Run the RFE using the caret package
rfe_profile <- rfe(df[covars_list],voi_vctr,sizes = sizes, rfeControl = ctrl)
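Optionally, the RFE results can be inspected before saving (standard caret accessors):
print(rfe_profile) # performance for each subset size
plot(rfe_profile, type = c("g", "o")) # performance profile across sizes
head(predictors(rfe_profile)) # the selected covariates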
Save the created object to disk
saveRDS(rfe_profile,file=paste0(project_dir,
"data/rfe_profile_",voi,"_trnsfrm-",trnsfrm,".RDS"))
Also save the list of selected covariates in a .csv
write.table(predictors(rfe_profile),
file=paste0(project_dir,
"data/rfe_predictors_",voi,"_trnsfrm-",trnsfrm,".csv"),
row.names=FALSE, col.names=FALSE,quote = FALSE)
This section shows the model fitting and model validation. We use the Random Forest algorithm to model one of the soil variables (dependent variable) and the spatial covariates (independent variables).
This script performs a 10-fold cross validation to assess the model performance, using the {tidymodels} framework.
List of imported packages and functions
library(ranger)
library(dplyr)
library(tidyr)
library(tidymodels)
library(Metrics)
source('config/config.R')
source('R/ccc_with_bias.R')
source('R/MEC.R')
Define the output file path: an .RDS file which will contain the results of the cross validation
outfile = file.path(project_dir,"outputs",voi,
paste0("qrf_model_cv","_trnsfrm-",trnsfrm,".RDS"))
Read the preprocessed soil observation data and the regression matrix
observations <- read.csv(file.path(project_dir,
'data/input_data_preprocessed.csv'))
rgrssn_mtrx <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))
Join the regression matrix and the soil observations together
rgrssn_mtrx <- left_join(observations,rgrssn_mtrx) # joins on the shared Prof_Code column
Read the selected covariates for the soil variable of interest.
cvrts <- readLines(paste0(project_dir, "data/rfe_predictors_",
voi,"_trnsfrm-",trnsfrm,".csv"))
Sanity check: keep only the covariate names that are present in the regression matrix
cvrts <- names(rgrssn_mtrx)[names(rgrssn_mtrx) %in% cvrts]
Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates
rgrssn_mtrx <- drop_na(rgrssn_mtrx, all_of(voi), all_of(cvrts))
Prepare a formula for the model. If the log-transformation parameter (defined in the config) is active, the soil variable of interest is log-transformed in the formula
if (trnsfrm == "log"){
fmla=as.formula(paste0("log(",voi,")","~",paste0(cvrts,collapse="+")))
}else {
fmla=as.formula(paste0(voi,"~",paste0(cvrts,collapse="+")))
}
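For instance, with voi = "Humus" (log-transformed according to the config) and two hypothetical covariates named dem and ndvi, the resulting formula would read log(Humus) ~ dem + ndvi.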
Prepare the dataset for cross validation. The groups used to split the data are defined in the column fold of the dataframe and have been created beforehand by ISRIC
splitted_regr_matr <- group_vfold_cv(data=rgrssn_mtrx,group = fold)
Specify model and engine (Random Forest with quantile regression) and other hyperparameters (ntree)
rf_mod <- rand_forest(trees = ntree) %>% #, mtry = mtry
set_engine("ranger",importance = "impurity",
quantreg=TRUE,seed = rndseed) %>%
set_mode("regression")
Specify the tidymodels workflow, creating an empty workflow object and adding the just-created model and the formula
rf_wf <- workflow() %>%
add_model(rf_mod) %>%
add_formula(fmla)
Define custom metrics to assess model performance, in this case ccc_with_bias (concordance correlation coefficient with bias) and mec (model efficiency coefficient). These have been pre-defined in the R/ccc_with_bias.R and R/MEC.R scripts
multi_metric <- metric_set(yardstick::rmse, rsq, ccc_with_bias,mec)
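The definitions of these metrics are not reproduced in this document. As a minimal sketch, assuming mec is the model efficiency coefficient 1 - SSE/SST and using yardstick's (>= 1.0) developer API, such a custom metric could be defined roughly as follows; the actual code in R/MEC.R may differ:
library(rlang)
library(yardstick)
# Vector version: model efficiency = 1 - SSE / total sum of squares
mec_vec <- function(truth, estimate, na_rm = TRUE, ...) {
  if (na_rm) {
    ok <- !is.na(truth) & !is.na(estimate)
    truth <- truth[ok]
    estimate <- estimate[ok]
  }
  1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)
}
# data.frame method, registered so that metric_set() can use it
mec <- function(data, truth, estimate, na_rm = TRUE, ...) UseMethod("mec")
mec <- new_numeric_metric(mec, direction = "maximize")
mec.data.frame <- function(data, truth, estimate, na_rm = TRUE, ...) {
  numeric_metric_summarizer(
    name = "mec", fn = mec_vec, data = data,
    truth = !!enquo(truth), estimate = !!enquo(estimate), na_rm = na_rm
  )
}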
Run the cross validation using the created workflow and the specified metrics
rf_fit_rs <- rf_wf %>%
fit_resamples(splitted_regr_matr,
metrics = multi_metric,
control = control_resamples(save_pred = TRUE))
#collect_metrics(rf_fit_rs)
#show_best(rf_fit_rs,metric = "mec")
Save the cross validation results into an .RDS object for later evaluation
dir.create(dirname(outfile), showWarnings = FALSE, recursive = TRUE) # make sure the output directory exists
saveRDS(rf_fit_rs,outfile)
This script performs the model fitting using the soil observation values and the covariates at the locations of the soil profiles
List of imported packages and functions
library(ranger)
library(dplyr)
library(tidyr)
source('config/config.R')
Define the output file path: an .RDS file which will contain the results of the model fitting
outfile = file.path(project_dir,"outputs",voi,
paste0("qrf_model","_trnsfrm-",trnsfrm,".RDS"))
Read the preprocessed soil observation data and the regression matrix
observations <- read.csv(file.path(project_dir,
'data/input_data_preprocessed.csv'))
rgrssn_mtrx <- read.csv(file.path(project_dir,'data/regression_matrix.csv'))
Join the regression matrix and the soil observations together
rgrssn_mtrx <- left_join(observations,rgrssn_mtrx) # joins on the shared Prof_Code column
Read the selected covariates for the soil variable of interest.
cvrts <- readLines(paste0(project_dir, "data/rfe_predictors_",
voi,"_trnsfrm-",trnsfrm,".csv"))
Sanity check: keep only the covariate names that are present in the regression matrix
cvrts <- names(rgrssn_mtrx)[names(rgrssn_mtrx) %in% cvrts]
Remove all the rows that have NAs in the soil variable of interest (voi) or in any of the covariates
rgrssn_mtrx <- drop_na(rgrssn_mtrx, all_of(voi), all_of(cvrts))
Prepare a formula for the model. If the log-transformation parameter (defined in the config) is active, the soil variable of interest is log-transformed in the formula
if (trnsfrm == "log"){
fmla=as.formula(paste0("log(",voi,")","~",paste0(cvrts,collapse="+")))
}else {
fmla=as.formula(paste0(voi,"~",paste0(cvrts,collapse="+")))
}
Fit the Random Forest model
mdl_fit = ranger(fmla,rgrssn_mtrx,
seed = rndseed,
importance="impurity",
quantreg=TRUE,
mtry=mtry,
num.trees=ntree
)
Create the directory (if it does not exist) to store the fitted model and write the fitted model to disk in an .RDS file
dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)
saveRDS(mdl_fit,outfile)
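Optionally, since the model was fitted with importance = "impurity", the variable importance can be inspected right away (a quick check, not part of the saved outputs):
head(sort(ranger::importance(mdl_fit), decreasing = TRUE), 10) # top 10 covariates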
This section shows how to use the fitted model to predict over the study area (Mongolia), using the spatial covariates as independent variables for the model.
The script is designed to split the computation into a number of tiles (portions of the study area) so that less memory is used and parallel computation is possible.
List of imported packages and functions
library(terra)
library(ranger)
library("future.apply")
source("config/config.R")
source("R/rast_split_in_tiles.R")
Prepare for parallel computation by choosing the number of cores
plan(multisession(workers = 4)) ## Run in parallel on 4 cores
Read the fitted Random Forest model on the soil variable of interest (voi)
mdl_fit <- readRDS(file.path(project_dir,"outputs",voi,
paste0("qrf_model","_trnsfrm-",trnsfrm,".RDS")))
Read the selected covariates for the soil variable of interest and the list of covariate files (the GeoTIFFs)
cvrts_lst <- readLines(paste0(project_dir, "data/rfe_predictors_",
voi,"_trnsfrm-",trnsfrm,".csv"))
cvrts_files <- normalizePath(paste0(project_dir,'/covars_sg250m/',
cvrts_lst,".tif"))
Define the location of the landmask file. This is a GeoTIFF that is used to mask out areas where soil is not present (water bodies, rocky areas, cities, etc.)
landmask_file <- file.path(project_dir,
"mask/landmask_Mongolia_homolosine_250.tif")
Read the covariates files using the rast function from the {terra} package
cvrts <- rast(cvrts_files)
Create splits to reduce the size of the rasters handled at once, using a custom function that can be found in R/rast_split_in_tiles.R. This splits the extent of the covariates into a (user-defined) number of tiles; a rough sketch of such a splitter is shown after the next code lines.
splits <- split_extent(extent = ext(cvrts), ntiles = ntiles,split_type="rect")
nsplits <- length(splits)
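The implementation of split_extent is not reproduced here. As a minimal sketch, assuming a near-square rectangular grid (the real function in R/rast_split_in_tiles.R may differ), a splitter returning the c(xmin, xmax, ymin, ymax) vectors consumed by read_predict below could look like:
# Sketch only: divide an extent into ~ntiles rectangles
split_extent_sketch <- function(extent, ntiles, split_type = "rect") {
  e <- as.vector(extent) # SpatExtent -> numeric: xmin, xmax, ymin, ymax
  nx <- ceiling(sqrt(ntiles)) # number of tiles along x
  ny <- ceiling(ntiles / nx)  # number of tiles along y
  xs <- seq(e[1], e[2], length.out = nx + 1) # tile boundaries along x
  ys <- seq(e[3], e[4], length.out = ny + 1) # tile boundaries along y
  splits <- list()
  for (ix in seq_len(nx)) {
    for (iy in seq_len(ny)) {
      splits[[length(splits) + 1]] <- c(xs[ix], xs[ix + 1], ys[iy], ys[iy + 1])
    }
  }
  splits
}
Note that, depending on the grid arithmetic, the number of tiles actually produced (nsplits) may differ slightly from the requested ntiles, which is why the output paths below are built from nsplits.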
To run the prediction on the tiles we define a function (read_predict) and apply it to the list of tiles with future_lapply(). The function read_predict reads the covariates only within the selected extent and then masks them using the landmask. Once read into a raster stack, the covariates are converted to a data.frame and used to predict. Both the quantiles (0.05, 0.5, 0.95) and the mean are predicted. The predicted values are then converted to rasters and written as GeoTIFFs
read_predict <- function(i, splits, nsplits,
cvrts_files, landmask_file,
voi, trnsfrm){
landmask <- rast(landmask_file)
NAflag(landmask)<-0 # assing the correct NA value to the mask
cvrts <- rast(cvrts_files)
names(cvrts) <- tools::file_path_sans_ext(basename(cvrts_files))
print(paste("Working on:",voi,i,"of",nsplits))
e <- ext(splits[[i]][1],splits[[i]][2],splits[[i]][3],splits[[i]][4])
window(cvrts) <- NULL
window(cvrts) <- e
window(landmask) <- NULL
window(landmask) <- e
# Here we read into memory so this operation could take some time
covariates_window_mskd <- mask(cvrts,landmask)
covariates_window_mskd_df <- as.data.frame(covariates_window_mskd,
xy=TRUE, na.rm=TRUE)
if (nrow(covariates_window_mskd_df)!=0) {#check that the tile is not empty
rm(covariates_window_mskd)
gc()
preds_qntl <- predict(mdl_fit,covariates_window_mskd_df,
type = 'quantiles',quantiles = c(0.05, 0.5, 0.95))
for (predname in colnames(preds_qntl$predictions)) {
bandname <- paste0(voi,"_",sub("quantile= ","",predname))
outfile <-file.path(project_dir,"outputs",voi,"predictions",
paste0("Q",sub("quantile= ","",predname)),
paste0("ntiles_",nsplits),
paste0(bandname,"_",sprintf("%05d", i),".tif"))
if (trnsfrm == "log"){
prdctn_vls <- exp(preds_qntl$predictions[,predname])*10
}
else if(trnsfrm == "none"){
prdctn_vls <- preds_qntl$predictions[,predname]*10
}
dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)
pred_raster <- rast(cbind(covariates_window_mskd_df[c("x","y")],
prdctn_vls),
type="xyz",
crs=homolosine_wkt, digits=0)
names(pred_raster) <- bandname
writeRaster(pred_raster, outfile, overwrite=TRUE,
wopt= list(gdal=c("COMPRESS=LZW"), datatype='INT2U'))
}
rm(preds_qntl)
rm(pred_raster)
rm(prdctn_vls)
gc()
preds_mean <- predict(mdl_fit,covariates_window_mskd_df, type = 'response')
bandname <- paste0(voi,"_","mean")
outfile <-file.path(project_dir,"outputs",voi,"predictions","mean",
paste0("ntiles_",nsplits),
paste0(bandname,"_",sprintf("%05d", i),".tif"))
dir.create(dirname(outfile), showWarnings = FALSE,recursive = TRUE)
if (trnsfrm == "log"){
prdctn_vls <- exp(preds_mean$predictions+0.5*var(preds_mean$predictions))*10
}else if(trnsfrm == "none"){
prdctn_vls <- preds_mean$predictions*10
}
pred_raster <- rast(cbind(covariates_window_mskd_df[c("x","y")],
prdctn_vls),
type="xyz",
crs=homolosine_wkt, digits=0)
names(pred_raster) <- bandname
writeRaster(pred_raster, outfile, overwrite=TRUE,
wopt= list(gdal=c("COMPRESS=LZW"), datatype='INT2U'))
rm(preds_mean)
rm(pred_raster)
rm(prdctn_vls)
gc()
return(paste0(bandname,"_",sprintf("%05d", i),"of",nsplits,".tif"))
}
}
Apply the function read_predict on the tiles
y <- future_lapply(seq_along(splits), FUN = read_predict,
splits,nsplits,cvrts_files,landmask_file,voi,trnsfrm,
future.packages = c("ranger","terra"),
future.seed=TRUE
)
List of imported packages and functions
library(gdalUtilities)
source("config/config.R")
Define the folder that holds the predictions for the soil variable of interest, which maps (quantiles or mean) to mosaic, and the directory where the tiles are
prdctn_dir <- paste0(project_dir,"/outputs/",voi,"/predictions/")
stats <- c("Q0.05","Q0.5","Q0.95","mean")
ntile_dir <- "ntiles_240/"
Loop through all the maps and create a .vrt to mosaic the tiles. Finally, warp the .vrt to the desired Coordinate Reference System and save it as a Cloud Optimized GeoTIFF
for (stat in stats) {
stat_dir <- paste0(prdctn_dir,"/",stat,"/",ntile_dir)
tifs <- list.files(stat_dir,pattern = "\\.tif$",full.names = TRUE)
outvrt <- file.path(prdctn_dir,paste0(voi,"_",stat,".vrt"))
outcog <- file.path(prdctn_dir,paste0(voi,"_",stat,"_cog_","32648",".tif"))
gdalbuildvrt(tifs,outvrt)
#gdal_translate(outvrt, outcog ,of= "COG" ,co= "COMPRESS=LZW")
gdalwarp(outvrt, outcog,t_srs = "EPSG:32648",
of= "COG" ,co= "COMPRESS=LZW")
}
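As a quick optional check (not part of the workflow), a finished mosaic can be opened with {terra} to verify its extent, resolution, and CRS:
library(terra) # loaded here only for this check
chk <- rast(file.path(prdctn_dir, paste0(voi, "_mean_cog_32648.tif")))
print(chk) # the CRS should now be EPSG:32648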
Here follows the content of the configuration file, which defines the variables used by the scripts in the workflow
The directory where project data is stored
project_dir = "DSM_Mongolia/"
The variable of interest
voi = "Humus"
The complete list of variables of interest
vois = c( "pH","EC","BulkDensity","Humus","BD")#,"depth"
vois_to_log = c("Humus")#
if (voi %in% vois_to_log){trnsfrm = "log"}else{trnsfrm = "none"}
The list of variables related to the profile/layer
layer_vars = c("Prof_Code","top","bottom")
layer_id = "Prof_Code"
Set the maximum depth (cm) used to filter the input data
maxDepth = 20
Set the maximum allowed value for Bulk Density
BulkDensityMax = 2
Read the Homolosine Well-Known Text (WKT) coordinate reference system definition
homolosine_wkt <- readLines("config/homolosine.txt")
Random forest hyperparameters
mtry= NULL
ntree= 500
Seed for random processes
rndseed = 424242
set.seed(rndseed)
Number of tiles used to split the prediction process
ntiles <- 250
In addition to the config file, a number of custom functions have been created to deal with specific parts of the workflow. These can be found in the R/ folder of the repository: R/rangerFuncs.R, R/ccc_with_bias.R, R/MEC.R, and R/rast_split_in_tiles.R.