Skip to content
Snippets Groups Projects
Commit 9a9063fb authored by Peters, Wouter's avatar Peters, Wouter
Browse files

more cleaning up

parent 5b790696
Branches
No related tags found
No related merge requests found
%% Cell type:markdown id:3548bae8-e690-4664-9fc0-ddf5ed9f7ba9 tags:
# DATA ASSIMILATION BASIC exercises
---------
<FONT COLOR=red>
# Part 1: Optimizing the land sink of CO$_2$
<FONT COLOR=black>
----------
##### Wouter Peters, December 2021
##### Wouter Peters, modifications for ICOS Summerschool, May 2023
##### version 2.0
### Goal
* Quantify the mismatch between observations and model, understand the L1 and L2 norm (Exercise 4)
* Understand the role of a cost function, and learn to construct it (Exercise 5)
* Apply the minimum least-squares solution and understand the role of observation uncertainty (Exercise 6)
<b>Tip:</b>
You can go through this practical at your own pace:
1. For a novice user, it is fine to just read the instructions, execute the cells, and try to answer the questions. Sometimes you might need to modify a value in the code and run a cell multiple times. In that case focus on the part of the cell that looks like this:
```python
1| ############### YOUR INPUT BELOW ################
2|
3| SomeVariable = [1,2,3] <--- You make a change
4|
5| ############### YOUR INPUT ENDS ################
```
2. For a regular user, it might be nice to read and understand the python code in the cells. In that case look at the parts indicated by:
```python
1| ############### PYTHON CODE STARTS BELOW ################
```
3. Expert users are challenged to also modify and write code to explore the material further
%% Cell type:code id:9c3e9e7d-6cf3-4121-866c-d4fc404ee22b tags:
``` python
#### PLEASE EXECUTE THIS CELL ONCE UPON STARTUP, IT LOADS A SET OF NEEDED PYTHON LIBRARIES ####
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mpda
import seaborn as sns
%matplotlib inline
#The following is the full observation set you can pick from in this practical:
full_obsset = ['mlo', # Mauna Loa
'mhd', # Mace Head, Ireland
'rpb', # Ragged Point Barbados
'smo', # Samoa
'cgo', # Cape Grim Observatory, Australia
'spo', # South Pole
'alt', # Alert, Alaska, USA
'zep', # Zeppelin, Ny-Alesund, Norway
#'pal', # Hyytiala, Finland substituted by Pallas (!)
#'crz', # Crozet Island
'thd' # Trinidad Head, California, USA
]
```
%% Cell type:markdown id:75259f02-f224-40c4-bb28-9b642e3fad73 tags:
## Introduction
The optimal land uptake found in Exercise 3 of the [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb) was determined by trial and error, and by judging the difference to the
observations using the human eye. In data assimilation this "forward modeling" of the problem is very important. It
gives the researcher a good feeling for the system, and helps build an expectation of the outcome of a more formal optimization
process. Once the forward modeling yields a satisfactory system, an optimization algorithm can be used to fine-tune
the solution and find the numerically best values for a set of unknowns.
Often, the best numerical solution is one that best reproduces observations, quantified by for example the root-mean-square-error (RMSE):
<div class="alert alert-block alert-warning">
$ RMSE=\sqrt{\frac{1}{(N-1)}\sum (y^{0}-H(x))^2 } $.........(Equation 1)
with:
$y^{0}$ = Observations [ppm]
$x$ = the unknown scaling factor for the land-sink [-]
$H$ = a linear operator on $x$ that makes it comparable to $y^{0}$: MOGUNTIA
$H(x)$ = MOGUNTIA calculated value of the CO$_2$ mole fraction given $x$ [ppm]
$N$ = number of observations
</div>
We often tend to say that the value of the land sink $x$ that is "optimal", is the one that minimizes the RMSE to observations ($y^{0}$) once run through the MOGUNTIA model ($H$). The RMSE is an example of a so-called L2-norm, in which deviations from the observed value are weighted in quadrature. If we had used the absolute difference:
<div class="alert alert-block alert-warning">
$ \frac{1}{(N-1)}\sum(| y^{0}-H(x) |) $ .........(Equation 2)
</div>
this would have put less emphasis on larger deviations. We call this an L1-norm.
Defining what the 'optimum' means in the solution of your problem is thus a very important first step. This choice leads to the choice of optimization method, the restrictions on the type of errors you use, and often the numerical methods available to you. In many atmospheric application the basis for your optimization is a so-called cost function, and we often restrict ourselves to an L2-norm:
<div class="alert alert-block alert-warning">
$ J(x) = (y^{0}-H(x)) R^{-1} (y^{0}-H(x))^{T} $ .........(Equation 2)
where we have two new symbols, and we refer to y and H(x) now as vectors:
$ J $ = the cost assigned to a proposed solution $x$
$ R $ = the covariance of the model-data mismatch [ppm]$^{2}$
</div>
The NxN matrix $R$ represents the uncertainty incurred when comparing each observation to its modeled value, due to for example observational errors, transport model errors, model sampling errors, etc. These errors provide the weight for each measurements in the overall cost calculated. We often refer to these values as **'model-data-mismatch'**, instead of only 'observation error'. The latter would be much smaller usually than the full errors in R.
**In Exercise 4, we will calculate the L1 and L2 norm and value of J for the solutions you created in Exercise 3 of [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb).**
---
%% Cell type:markdown id:5707b3a2-2e23-4331-ac84-4b41d0b5f917 tags:
### Exercise 4: Inspecting the optimum land sink
#### Estimated time to complete: 45 mins
In the cell below, you can investigate the two metrics above (RMSE and J) in the runs you did so far.
<div class="alert alert-block alert-info">
Note that if your Virtual Machine was stopped, the output will have been deleted so you might need to run your results from Exercise 3 again in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb)
</div>
---
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Inspect the cell, and see if you understand the Python code provided.
* Print the values of the ABME and RMSE and cost function J for some of the runs you've done in Exercise 3. Can you recognize the "best" simulation from the metrics?
* How does the metric depend on the set of observations you include in the set? Try to add an "independent" site to the set that makes the RMSE go very high.
* Write down the optimum land sink and the cost function minimum you attained, on the blackboard up front. You can do a few extra runs in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb) from Exercise 3 if you feel you can do better...
</FONT>
</div>
%% Cell type:code id:daac13de-f0d2-49d9-aee8-962efa796ac6 tags:
``` python
############### YOUR INPUT BELOW ################
############################# Define a set of sites to assess with our metrics: ############################
obsset = ['mhd','mlo']
mdm = [0.5,0.5] # in ppm, these are the errors we put on the diagonal of the R-matrix
assert len(obsset) == len(mdm),'Please specify as many model-data mismatch values as sites in the obsset'
############################# Get the observations and modeled values for a given experiment ############################
y,Hx,info = mpda.get_concentrations('FOSSIL2',obsset) # specify the name of your run to assess here, rerun if needed
############### YOUR INPUT ENDS ################
############################ Compute the metrics ############################
R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations
ABME = np.abs((y-Hx)).mean() # make L1 for this run
RMSE = np.sqrt(((y-Hx)**2).mean()) # make RMSE for this run
J = np.dot(np.transpose(y-Hx),np.linalg.inv(R)).dot(y-Hx) # make J for this run
print(f'Absolute mean error : {ABME:f} [ppm] ')
print(f'Root-mean-square error : {RMSE:f} [ppm] ')
print(f'Value of cost function : {J:f} [-] ')
############################# plot the outcomes per site ############################
fig, axs = plt.subplots(len(obsset))
fig.suptitle('Observations and modeled values')
istart=0
for i,site in enumerate(obsset):
nn=info[site]
iend=istart+nn
axs[i].plot(np.arange(nn),Hx[istart:iend],lw=2,color='blue',label=site)
axs[i].errorbar(np.arange(nn),y[istart:iend],yerr=np.sqrt(R.diagonal()[istart:iend]),lw=3,color='red',label='obs')
axs[i].legend()
axs[i].set_ylabel('CO2 [ppm]')
istart+=nn
```
%% Cell type:markdown id:ec9a30ad-da76-4307-af0d-22064cc78430 tags:
## Exercise 5: Finding the cost function minimum
By now, each student has been able to find a best estimate of the land sink. But not all solutions are the same, and not all have the same RMSE or cost. The discussion on how this came to be has been done now. So time to bundle forces.
<figure>
<img src=https://cdn.analyticsvidhya.com/wp-content/uploads/2021/03/Screenshot-from-2021-03-03-17-20-40.png width="500" height="400">
<figcaption> <i>Figure 3: Many optimization methods use quadratic cost functions, with a minimum defined in the N-dimensions of the problem being solved. Two dimensions we can visualize recognizably still
</i></figcaption>
</figure>
The true cost function you have been inspecting is of course quadratic: a hyperbolic curve with a theoretical minimim at (x_opt, 0.0), if all observations are matched perfectly for the optimum land sink x_opt. So far, you have all focused on this minimum point only.
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Discuss with the whole class a strategy to collectively explore the quadratic cost function of the inverse problem that yuo have so far solved alone (i.e., estimating the value **SINK EXTRA_LAND**.
* Focus not only on finding the minimum, but also the rest of the shape of the hyperbolic curve. What could its curvature tell you?
</FONT>
</div>
___
%% Cell type:markdown id:13efda1a-9b09-4ddd-ac56-fd90ca61c4e8 tags:
### End of exercises, rest of practical is OPTIONAL
%% Cell type:markdown id:654b62ea-4922-4e57-9842-53f5e2545a40 tags:
<details><summary>CLICK TO SEE THE SOLUTION FOR MANY VALUES OF EXTRA_LAND</summary>
<p>
```python
Jvals=[]
obsset = ['mlo','mhd','rpb','smo','cgo','spo','alt','zep','thd']
mdm = np.array([0.2,0.5,0.5,0.5,0.5,0.5,1,1,1])
print(len(obsset),len(mdm))
y_full,Hx_b, info = mpda.get_concentrations('basefunc_base',obsset,version='ext') # Get the run without an extra sink
y = y_full-Hx_b # presubtract it
H= mpda.get_H(obsset,glob=True) # Get the linearized MOGUNTIA model matrix H
R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations
print(len(y),H.shape, R.shape)
solutions = np.arange(+1,3,0.2) # try solutions from -1 to +5 in steps of 0.5 PgC/yr
# You code it from here...
invR = np.linalg.inv(R)
for x in solutions:
Hx = H*x # create mole fractions in [ppm] for solution x
J = np.dot(np.transpose(y-Hx),invR).dot(y-Hx) # make J for this run
Jvals.append(J)
Jvals=np.array(Jvals)
best = Jvals.argmin()
print(f'x_a : {solutions[best]:4.3f}\nJ : {Jvals[best]:4.3f}')
plt.plot(solutions,Jvals)
'''
%% Cell type:code id:f0d14ad3-3873-4e2b-aa25-df7569fade1c tags:
``` python
Jvals=[]
obsset = ['mhd','mlo']
mdm = [0.5,0.5] # in ppm, these are the errors we put on the diagonal of the R-matrix
print(len(obsset),len(mdm))
y_full,Hx_b, info = mpda.get_concentrations('basefunc_base',obsset,version='ext') # Get the run without an extra sink
y = y_full-Hx_b # presubtract it
H= mpda.get_H(obsset,glob=True) # Get the linearized MOGUNTIA model matrix H
R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations
print(len(y),H.shape, R.shape)
solutions = np.arange(+2.2,2.6,0.001) # try solutions from -1 to +5 in steps of 0.5 PgC/yr
invR = np.linalg.inv(R)
print(R),(invR)
for x in solutions:
Hx = H*x # create mole fractions in [ppm] for solution x
J = np.dot(np.transpose(y-Hx),invR).dot(y-Hx) # make J for this run
Jvals.append(J)
Jvals=np.array(Jvals)
best = Jvals.argmin()
print(f'x_a : {solutions[best]:4.3f}\nJ : {Jvals[best]:4.3f}')
plt.plot(solutions,Jvals)
```
%% Cell type:markdown id:d445b7b9-1f89-4ee5-bc59-e8540661ee82 tags:
## Exercise 6: The minimum least-squares solution
<P><FONT COLOR=darkblue>
With the relatively simple cost function, we can also create our first real 'optimal' solution of the scaling factor for the land sink. This is not the one that depends on trial-and-error, but one that is actually the mathematical minimum of the quadratic J.
It is given by the Ordinary Least Squares (OLS) solution used also in simple linear regression:
<P>
<div class="alert alert-block alert-warning">
$x^{a}=(H^{T}RH)^{-1}H^{T}Ry^{0}$
</div>
with the subscript $a$ referring to the "analysis", representing the optimal value of the state $x$ after using all observations $y^{0}$ with each of their weights $R$.
<P>
<figure>
<img src=https://cdn-images-1.medium.com/max/500/0*gglavDlTUWKn4Loe width=400 height=400 align=center>
<figcaption> <i>Figure 4: The Ordinary Least Squares fitting of a straight line through your data is done by scientists worldwide. Only few realize that the math behind Python and R functions like "optimize", "lingress", "linfit", "polyfit" etc is just a simple variant of the algebraic solution given in the cell below
</i></figcaption>
</figure>
In the cell below we create this minimum least-squares solution for the unknown value of EXTRA_LAND, given the model-data mismatch R of the requested observation set.
<P>
<div class="alert alert-block alert-info">
<b>Note</b>
Note that we do not actually run the full MOGUNTIA model to create it, but instead we have computed H, the matrix operator that exactly reproduces the transport that MOGUNTIA would do for us with an extra SINK of 1 PgC. We were able to do this because MOGUNTIA transport is fully linear (a doubling of emissions exactly doubles all mole fractions), and allows us to use this relation between one parameter (EXTRA_LAND) and all possible CO$_2$ sites. Moreover, we included the initial condition of CO2=369 ppm and only fossil fuels in the `basefunc_base` run, so that we can subtract it from the measurements before optimizing the residuals.
</div>
---
<div class="alert alert-block alert-warning">
<FONT COLOR='RED'>
<b>To do</b>
* Investigate how the solution depends on the value of R (model-data mismatch)
* And what about the observation set, how does it influence x_a? Can you use one site? Which one would you pick?
* Compare the cost function and RMSE to the solution you created with the whole class in Exercise 5. Did you get close?
</FONT>
</div>
---
%% Cell type:code id:fb83f341-7d7a-4842-aeaf-d3b2d52e3a46 tags:
``` python
############### YOUR INPUT BELOW ################
############################# Define a set of (>1) sites to use in the optimization: ############################
obsset = ['mhd','mlo']
mdm = [0.5,0.5] # in ppm, these are the errors we put on the diagonal of the R-matrix
assert len(obsset) == len(mdm),'Please specify as many model-data mismatch values as sites in the obsset'
############################# Get H, the observations and modeled values for a given experiment ############################
############### YOUR INPUT ENDS ################
############### PYTHON CODE BELOW ################
H= mpda.get_H(obsset,glob=True) # Get the linearized MOGUNTIA model matrix H
# Get a run with no extra sinks (but with fossil fuels, ocean and biosphere),
# and pre-subtract it from the observations (y).
# Now x_a is the remaining sink, and y [in ppm] is the extra uptake needed
y_full,Hx_b, info = mpda.get_concentrations('basefunc_base',obsset,version='ext') # Get the run without an extra sink
y = y_full-Hx_b # presubtract it
############################# We define the weight of each observation using the matrix R: ############################
R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations
W = np.linalg.inv(R)
################### MLS solution algebra given below ###################
w1=np.dot(np.transpose(H),np.dot(W,H)) # bracket term XWX^T
w2=1./w1 # inverse bracket (XWX^T)-1
w3=np.dot(w2,np.transpose(H)) # times X^T
w4 = np.dot(w3,W)
w5 = np.dot(w4,y) # times y
x_a=w5
############################################################################
print (f'ML solution for x_a : {x_a:4.3f}')
############################ Compute the metrics ############################
Hx = H*x_a
RMSE = np.sqrt(((y-Hx)**2).mean()) # make RMSE for this run
J = np.dot(np.transpose(y-Hx),np.linalg.inv(R)).dot(y-Hx) # make J for this run
#print(len(y),len(Hx))
print(f'RMSE [ppm] : {RMSE:f} \nJ [-] : {J:f}')
############################# plot the outcomes per site ############################
nplots=max([len(obsset),2])
fig, axs = plt.subplots(nplots)
fig.suptitle('Observations and modeled values')
istart=0
for i,site in enumerate(obsset):
nn=info[site]
iend=istart+nn
axs[i].plot(np.arange(nn),(Hx_b+Hx)[istart:iend],lw=2,color='blue',label=site)
axs[i].errorbar(np.arange(nn),(Hx_b+y)[istart:iend],yerr=np.sqrt(R.diagonal()[istart:iend]),lw=3,color='red',label='obs')
axs[i].legend()
axs[i].set_ylabel('CO2 [ppm]')
istart+=nn
```
%% Cell type:markdown id:88b97089-e3a8-46e6-a479-715ba39bbef1 tags:
## Exercise 7: Adding a prior term to the cost function
The cost function J above is quite simple, and does not allow for other information than the observations to play a role in the final solution. However, quite often we start data assimilation with a 'first-guess' or other type of a-priori information on the state of a system. And just like with observations, we want our final solution to stay close to such information too.
Including a-priori information in the cost function is an essential component of data assimilation, and sets it aside from simple curve-fitting or linear regression. This is because the first-guess state of a system can be predicted by information from a previous moment in time, or from expert information. Such information can be added to the cost function:
<div class="alert alert-block alert-warning">
$J(x)=(y^{0}-H(x))R^{-1}(y^{0}-H(x))^{T}+(x-x^{p})P^{-1}(x-x^{p})^{T}$
where:
$x^{p}$ = prior information on the state (obtained from expert knowledge or a model of the state)
$P$ = covariance matrix representing the full error structure of the prior state
</div>
The optimal solution (minimum cost J) thus depends on the relative uncertainties assumed in the model-data comparison (R) and in the uncertainties in the prior state (P). If one becomes very small (low errors), deviations will incur high costs.
In the cell below, we once again calculate the cost function for various values of EXTRA_LAND but now with a prior term added.
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
---
* Read and understand the cell below, then execute it. Which term in J is smaller? What does it mean?
* Change the values of P, and/or R and execute the cell again. What is needed to make J1 ~ J2 (a balanced cost function)?
* Can you think of a way to make the solution (minimum cost) stick to the prior value of 1.0? Try it.
</div>
---
%% Cell type:code id:038e8191-0e2b-4a42-a618-b5d9ca69ff79 tags:
``` python
############### YOUR INPUT BELOW ################
x_p = np.array([1.0]) # We think the solution should stay close to 1.0 PgC/yr, so we make this the prior
P = 1.0e-1*np.array([0.16]) # We give it a reasonable error of 40%, which is 1.0±0.4 (and 0.4**2 = 0.16)
############################# Define a set of sites to assess with our metrics: ############################
obsset = ['spo','mlo','smo']
mdm = 1.0*np.array([0.5,0.5,0.5]) # in ppm
assert len(obsset) == len(mdm),'Please specify as many model-data mismatch values as sites in the obsset'
############### YOUR INPUT ENDS ################
############### PYTHON CODE BELOW ################
############################# Get H, the observations and modeled values for a given experiment ############################
H= mpda.get_H(obsset,glob=True) # Get the linearized MOGUNTIA model matrix H
y_full,Hx_b, info = mpda.get_concentrations('basefunc_base',obsset,version='ext') # Get the run without an extra sink
y = y_full-Hx_b # presubtract it
R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations
############### YOUR INPUT ENDS ################
############### PYTHON CODE BELOW ################
Jvals=[]
solutions = np.arange(+0.7,+2.7,0.1) # try solutions from -1 to +5 in steps of 0.5 PgC/yr
invR = np.linalg.inv(R)
for x in solutions:
#print(x)
Hx = H*x # create mole fractions in [ppm] for solution x
J1 = np.dot(np.transpose(y-Hx),invR).dot(y-Hx) # make J for this run
J2 = np.dot(np.transpose(x-x_p),1./P[0]).dot(x-x_p)
J = J1 + J2
#print(x,J1,J2,J)
Jvals.append([J1,J2,J])
J1=np.array(Jvals)[:,0]
J2=np.array(Jvals)[:,1]
J=np.array(Jvals)[:,2]
best = J.argmin()
print(len(y))
print(f'x_p : {x_p[0]:4.3f}')
print(f'x_a : {solutions[best]:4.3f}')
print(f'J1 (obs) : {J1[best]:4.3f}')
print(f'J2 (prior) : {J2[best]:4.3f}')
print(f'J : {J[best]:4.3f}')
plt.plot(solutions,J,linewidth=3,label='J (total)')
plt.plot(solutions,J1,'--',label='J1 (obs)')
plt.plot(solutions,J2,'--',label='J2 (prior)')
plt.plot([solutions[best],solutions[best]],[-1.0*J.max()/8.0,J.max()/2.0],
color='red',lw=2,label=f'minimum J ({solutions[best]:4.3f})')
plt.legend()
```
%% Cell type:markdown id:8c92d968-d5f6-44fd-9c83-7a568d718bf7 tags:
---------
<FONT COLOR=red>
## This concludes the BASIC Notebook for Data Assimilation.
## Feel free to proceed to [the Data Assimilation Advanced Notebook](./DA-advanced.ipynb) next.
<FONT COLOR=black>
----------
%% Cell type:code id:d8634cdd-4d71-456a-a271-9f557a4e31be tags:
``` python
```
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment