"##### Wouter Peters, modifications for ICOS Summerschool, May 2023\n",
"\n",
" \n",
"##### version 2.0\n",
"\n",
"### Goal\n",
"\n",
"* Quantify the mismatch between observations and model, understand the L1 and L2 norm (Exercise 4)\n",
"* Understand the role of a cost function, and learn to construct it (Exercise 5)\n",
"* Apply the minimum least-squares solution and understand the role of observation uncertainty (Exercise 6)\n",
" \n",
"\n",
"<b>Tip:</b> \n",
"\n",
"You can go through this practical at your own pace: \n",
"\n",
" 1. For a novice user, it is fine to just read the instructions, execute the cells, and try to answer the questions. Sometimes you might need to modify a value in the code and run a cell multiple times. In that case focus on the part of the cell that looks like this:\n",
"\n",
"```python\n",
"1| ############### YOUR INPUT BELOW ################\n",
"2| \n",
"3| SomeVariable = [1,2,3] <--- You make a change \n",
"4| \n",
"5| ############### YOUR INPUT ENDS ################ \n",
"``` \n",
" \n",
" 2. For a regular user, it might be nice to read and understand the python code in the cells. In that case look at the parts indicated by:\n",
" 3. Expert users are challenged to also modify and write code to explore the material further \n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c3e9e7d-6cf3-4121-866c-d4fc404ee22b",
"metadata": {},
"outputs": [],
"source": [
"#### PLEASE EXECUTE THIS CELL ONCE UPON STARTUP, IT LOADS A SET OF NEEDED PYTHON LIBRARIES ####\n",
"\n",
"\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"import glob\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import mpda\n",
"import seaborn as sns\n",
"\n",
"%matplotlib inline\n",
"\n",
"#The following is the full observation set you can pick from in this practical:\n",
"\n",
"full_obsset = ['mlo', # Mauna Loa\n",
" 'mhd', # Mace Head, Ireland\n",
" 'rpb', # Ragged Point Barbados\n",
" 'smo', # Samoa\n",
" 'cgo', # Cape Grim Observatory, Australia\n",
" 'spo', # South Pole\n",
" 'alt', # Alert, Alaska, USA\n",
" 'zep', # Zeppelin, Ny-Alesund, Norway\n",
" #'pal', # Hyytiala, Finland substituted by Pallas (!)\n",
" #'crz', # Crozet Island\n",
" 'thd' # Trinidad Head, California, USA\n",
" ]\n"
]
},
{
"cell_type": "markdown",
"id": "75259f02-f224-40c4-bb28-9b642e3fad73",
"metadata": {},
"source": [
"\n",
"## Introduction \n",
"\n",
"\n",
"The optimal land uptake found in Exercise 3 of the [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb) was determined by trial and error, and by judging the difference to the \n",
"observations using the human eye. In data assimilation this \"forward modeling\" of the problem is very important. It \n",
"gives the researcher a good feeling for the system, and helps build an expectation of the outcome of a more formal optimization \n",
"process. Once the forward modeling yields a satisfactory system, an optimization algorithm can be used to fine-tune \n",
"the solution and find the numerically best values for a set of unknowns.\n",
"\n",
"Often, the best numerical solution is one that best reproduces observations, quantified by for example the root-mean-square-error (RMSE):\n",
"$x$ = the unknown scaling factor for the land-sink [-]\n",
"\n",
"$H$ = a linear operator on $x$ that makes it comparable to $y^{0}$: MOGUNTIA\n",
"\n",
"$H(x)$ = MOGUNTIA calculated value of the CO$_2$ mole fraction given $x$ [ppm]\n",
"\n",
"$N$ = number of observations\n",
" \n",
"</div> \n",
" \n",
"We often tend to say that the value of the land sink $x$ that is \"optimal\", is the one that minimizes the RMSE to observations ($y^{0}$) once run through the MOGUNTIA model ($H$). The RMSE is an example of a so-called L2-norm, in which deviations from the observed value are weighted in quadrature. If we had used the absolute difference:\n",
"this would have put less emphasis on larger deviations. We call this an L1-norm. \n",
" \n",
"Defining what the 'optimum' means in the solution of your problem is thus a very important first step. This choice leads to the choice of optimization method, the restrictions on the type of errors you use, and often the numerical methods available to you. In many atmospheric application the basis for your optimization is a so-called cost function, and we often restrict ourselves to an L2-norm:\n",
"where we have two new symbols, and we refer to y and H(x) now as vectors:\n",
"\n",
"$ J $ = the cost assigned to a proposed solution $x$\n",
"\n",
"$ R $ = the covariance of the model-data mismatch [ppm]$^{2}$\n",
"\n",
" \n",
"</div> \n",
" \n",
"The NxN matrix $R$ represents the uncertainty incurred when comparing each observation to its modeled value, due to for example observational errors, transport model errors, model sampling errors, etc. These errors provide the weight for each measurements in the overall cost calculated. We often refer to these values as **'model-data-mismatch'**, instead of only 'observation error'. The latter would be much smaller usually than the full errors in R.\n",
" \n",
"\n",
"\n",
"**In Exercise 4, we will calculate the L1 and L2 norm and value of J for the solutions you created in Exercise 3 of [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb).**\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "5707b3a2-2e23-4331-ac84-4b41d0b5f917",
"metadata": {},
"source": [
"### Exercise 4: Inspecting the optimum land sink\n",
"#### Estimated time to complete: 45 mins\n",
"\n",
"\n",
"\n",
"In the cell below, you can investigate the two metrics above (RMSE and J) in the runs you did so far. \n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
" \n",
" Note that if your Virtual Machine was stopped, the output will have been deleted so you might need to run your results from Exercise 3 again in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb)\n",
"* Inspect the cell, and see if you understand the Python code provided. \n",
"\n",
"* Print the values of the ABME and RMSE and cost function J for some of the runs you've done in Exercise 3. Can you recognize the \"best\" simulation from the metrics? \n",
"\n",
"* How does the metric depend on the set of observations you include in the set? Try to add an \"independent\" site to the set that makes the RMSE go very high. \n",
" \n",
"* Write down the optimum land sink and the cost function minimum you attained, on the blackboard up front. You can do a few extra runs in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb) from Exercise 3 if you feel you can do better... \n",
"\n",
"</FONT>\n",
"</div> "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daac13de-f0d2-49d9-aee8-962efa796ac6",
"metadata": {},
"outputs": [],
"source": [
"############### YOUR INPUT BELOW ################\n",
"\n",
"\n",
"############################# Define a set of sites to assess with our metrics: ############################\n",
"\n",
"obsset = ['mhd','mlo']\n",
"mdm = [0.5,0.5] # in ppm, these are the errors we put on the diagonal of the R-matrix\n",
"\n",
"assert len(obsset) == len(mdm),'Please specify as many model-data mismatch values as sites in the obsset'\n",
"\n",
"############################# Get the observations and modeled values for a given experiment ############################\n",
"\n",
"y,Hx,info = mpda.get_concentrations('FOSSIL2',obsset) # specify the name of your run to assess here, rerun if needed\n",
"\n",
"############### YOUR INPUT ENDS ################\n",
"\n",
"\n",
"############################ Compute the metrics ############################\n",
"\n",
"R = mpda.make_R(info,mdm) # 0.5 ppm^2 measurement uncertainty for all observations \n",
"\n",
"ABME = np.abs((y-Hx)).mean() # make L1 for this run\n",
"RMSE = np.sqrt(((y-Hx)**2).mean()) # make RMSE for this run\n",
"J = np.dot(np.transpose(y-Hx),np.linalg.inv(R)).dot(y-Hx) # make J for this run\n",
"\n",
"print(f'Absolute mean error : {ABME:f} [ppm] ')\n",
"## Exercise 5: Finding the cost function minimum\n",
"\n",
"By now, each student has been able to find a best estimate of the land sink. But not all solutions are the same, and not all have the same RMSE or cost. The discussion on how this came to be has been done now. So time to bundle forces.\n",
" <figcaption> <i>Figure 3: Many optimization methods use quadratic cost functions, with a minimum defined in the N-dimensions of the problem being solved. Two dimensions we can visualize recognizably still\n",
"</i></figcaption>\n",
"</figure>\n",
"\n",
"The true cost function you have been inspecting is of course quadratic: a hyperbolic curve with a theoretical minimim at (x_opt, 0.0), if all observations are matched perfectly for the optimum land sink x_opt. So far, you have all focused on this minimum point only. \n",
"* Discuss with the whole class a strategy to collectively explore the quadratic cost function of the inverse problem that yuo have so far solved alone (i.e., estimating the value **SINK EXTRA_LAND**.\n",
"\n",
"* Focus not only on finding the minimum, but also the rest of the shape of the hyperbolic curve. What could its curvature tell you?\n",
"\n",
"</FONT>\n",
"</div>\n",
"___"
]
},
{
"cell_type": "markdown",
"id": "13efda1a-9b09-4ddd-ac56-fd90ca61c4e8",
"metadata": {
"tags": []
},
"source": [
"### End of exercises, rest of practical is OPTIONAL\n"
]
},
{
"cell_type": "markdown",
"id": "654b62ea-4922-4e57-9842-53f5e2545a40",
"metadata": {},
"source": [
"<details><summary>CLICK TO SEE THE SOLUTION FOR MANY VALUES OF EXTRA_LAND</summary>\n",
"## Exercise 6: The minimum least-squares solution\n",
"\n",
"<P><FONT COLOR=darkblue>\n",
"\n",
"With the relatively simple cost function, we can also create our first real 'optimal' solution of the scaling factor for the land sink. This is not the one that depends on trial-and-error, but one that is actually the mathematical minimum of the quadratic J. \n",
" \n",
"It is given by the Ordinary Least Squares (OLS) solution used also in simple linear regression:\n",
"with the subscript $a$ referring to the \"analysis\", representing the optimal value of the state $x$ after using all observations $y^{0}$ with each of their weights $R$.\n",
" <figcaption> <i>Figure 4: The Ordinary Least Squares fitting of a straight line through your data is done by scientists worldwide. Only few realize that the math behind Python and R functions like \"optimize\", \"lingress\", \"linfit\", \"polyfit\" etc is just a simple variant of the algebraic solution given in the cell below \n",
"</i></figcaption>\n",
"</figure>\n",
"\n",
" \n",
"In the cell below we create this minimum least-squares solution for the unknown value of EXTRA_LAND, given the model-data mismatch R of the requested observation set. \n",
"<P>\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
"<b>Note</b> \n",
" \n",
"Note that we do not actually run the full MOGUNTIA model to create it, but instead we have computed H, the matrix operator that exactly reproduces the transport that MOGUNTIA would do for us with an extra SINK of 1 PgC. We were able to do this because MOGUNTIA transport is fully linear (a doubling of emissions exactly doubles all mole fractions), and allows us to use this relation between one parameter (EXTRA_LAND) and all possible CO$_2$ sites. Moreover, we included the initial condition of CO2=369 ppm and only fossil fuels in the `basefunc_base` run, so that we can subtract it from the measurements before optimizing the residuals. \n",
"## Exercise 7: Adding a prior term to the cost function\n",
"\n",
"The cost function J above is quite simple, and does not allow for other information than the observations to play a role in the final solution. However, quite often we start data assimilation with a 'first-guess' or other type of a-priori information on the state of a system. And just like with observations, we want our final solution to stay close to such information too.\n",
"\n",
" \n",
"Including a-priori information in the cost function is an essential component of data assimilation, and sets it aside from simple curve-fitting or linear regression. This is because the first-guess state of a system can be predicted by information from a previous moment in time, or from expert information. Such information can be added to the cost function:\n",
"$x^{p}$ = prior information on the state (obtained from expert knowledge or a model of the state)\n",
"\n",
"$P$ = covariance matrix representing the full error structure of the prior state\n",
"\n",
"</div> \n",
" \n",
"The optimal solution (minimum cost J) thus depends on the relative uncertainties assumed in the model-data comparison (R) and in the uncertainties in the prior state (P). If one becomes very small (low errors), deviations will incur high costs.\n",
"\n",
"In the cell below, we once again calculate the cost function for various values of EXTRA_LAND but now with a prior term added. \n",
<div class="alert alert-block alert-info">
Note that if your Virtual Machine was stopped, the output will have been deleted, so you might need to rerun your simulations from Exercise 3 in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb).
</div>
---
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Inspect the cell, and see if you understand the Python code provided.
* Print the values of the ABME, the RMSE, and the cost function J for some of the runs you did in Exercise 3. Can you recognize the "best" simulation from the metrics?
* How do the metrics depend on the set of observations you include? Try to add an "independent" site to the set that makes the RMSE go very high.
* Write down the optimum land sink and the cost function minimum you attained on the blackboard up front. You can do a few extra runs in [the MOGUNTIA CO2 Notebook](./MOGUNTIA-CO2.ipynb) from Exercise 3 if you feel you can do better...
</FONT>
</div>
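The three metrics used in this exercise can be sketched with plain NumPy. This is only an illustration on made-up numbers: the arrays `y` and `Hx` are hypothetical, and building R as a diagonal matrix of squared model-data mismatch values is an assumption about what `mpda.make_R` does, not a description of it.

```python
import numpy as np

# Hypothetical observed (y) and modeled (Hx) CO2 mole fractions [ppm]
y = np.array([370.0, 371.0, 372.0, 373.0])
Hx = np.array([370.5, 370.0, 372.0, 376.0])

# Assumed model-data mismatch of 0.5 ppm per observation; here R is taken
# to be diagonal with the variances [ppm^2] (an assumption for this sketch)
mdm = np.full(y.size, 0.5)
R = np.diag(mdm**2)

residual = y - Hx

ABME = np.abs(residual).mean()                 # L1-type metric: mean absolute error
RMSE = np.sqrt((residual**2).mean())           # L2-type metric: root-mean-square error
J = residual @ np.linalg.solve(R, residual)    # cost: residual^T R^-1 residual

print(f"Absolute mean error     : {ABME:f} [ppm]")
print(f"Root-mean-square error  : {RMSE:f} [ppm]")
print(f"Cost function J         : {J:f} [-]")
```

The single large residual of 3 ppm contributes more strongly to the RMSE and to J than to the ABME, which is the difference between quadratic and absolute weighting described in the introduction.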
By now, each student has been able to find a best estimate of the land sink. But not all solutions are the same, and not all have the same RMSE or cost. We have now discussed how these differences came about, so it is time to join forces.
<figcaption> <i>Figure 3: Many optimization methods use quadratic cost functions, with a minimum defined in the N dimensions of the problem being solved. In two dimensions we can still visualize this in a recognizable way.
</i></figcaption>
</figure>
The true cost function you have been inspecting is of course quadratic: a parabolic curve with a theoretical minimum at (x_opt, 0.0), if all observations are matched perfectly for the optimum land sink x_opt. So far, you have all focused on this minimum point only.
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Discuss with the whole class a strategy to collectively explore the quadratic cost function of the inverse problem that you have so far solved alone (i.e., estimating the value of **SINK EXTRA_LAND**).
* Focus not only on finding the minimum, but also on the rest of the shape of the parabolic curve. What could its curvature tell you?
</FONT>
</div>
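One possible strategy is to evaluate J for a range of sink values and fit a parabola through the results. The sketch below does this for a made-up linear problem: the sensitivities `h`, the "true" value `x_true`, and the 0.5 ppm mismatch are all hypothetical and do not come from MOGUNTIA. For a cost function of the form J = (y - Hx)^T R^-1 (y - Hx), the second derivative of J equals 2 H^T R^-1 H, so a sharply curved parabola corresponds to a small uncertainty on the estimated sink.

```python
import numpy as np

# Hypothetical sensitivities of three sites to a unit change in x [ppm per unit x]
h = np.array([-0.8, -1.0, -1.2])
x_true = 2.0
y_res = h * x_true                       # noise-free synthetic residual observations [ppm]
R_inv = np.diag(1.0 / np.full(3, 0.5) ** 2)   # inverse of a diagonal R with 0.5 ppm mismatch

def cost(x):
    """Quadratic cost J(x) = r^T R^-1 r for a scalar state x."""
    r = y_res - h * x
    return r @ R_inv @ r

# "Collective exploration": every student contributes one (x, J) pair
x_grid = np.linspace(0.0, 4.0, 9)
J_grid = np.array([cost(x) for x in x_grid])

# Fit J = a x^2 + b x + c and read off minimum and curvature
a, b, c = np.polyfit(x_grid, J_grid, 2)
x_min = -b / (2.0 * a)
curvature = 2.0 * a                      # second derivative of J
sigma_x = np.sqrt(2.0 / curvature)       # implied 1-sigma uncertainty of x

print(f"minimum at x = {x_min:.3f}, curvature = {curvature:.3f}, sigma_x = {sigma_x:.3f}")
```

With real runs the points will not lie exactly on a parabola if the set of observations differs between students, which is itself a useful point for discussion.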
With the relatively simple cost function, we can also create our first real 'optimal' solution of the scaling factor for the land sink. This is not the one that depends on trial and error, but one that is actually the mathematical minimum of the quadratic J.
It is given by the least-squares solution that is also used in simple linear regression (with equal weights for all observations it reduces to Ordinary Least Squares, OLS):
<P>
<div class="alert alert-block alert-warning">
$x^{a}=(H^{T}R^{-1}H)^{-1}H^{T}R^{-1}y^{0}$
</div>
with the superscript $a$ referring to the "analysis", representing the optimal value of the state $x$ after using all observations $y^{0}$, each weighted by its uncertainty in $R$.
<figcaption> <i>Figure 4: The Ordinary Least Squares fitting of a straight line through your data is done by scientists worldwide. Few realize that the math behind Python and R functions like "optimize", "linregress", "linfit", "polyfit" etc. is just a simple variant of the algebraic solution given in the cell below.
</i></figcaption>
</figure>
In the cell below we create this minimum least-squares solution for the unknown value of EXTRA_LAND, given the model-data mismatch R of the requested observation set.
<P>
<div class="alert alert-block alert-info">
<b>Note</b>
Note that we do not actually run the full MOGUNTIA model to create it. Instead we have computed H, the matrix operator that exactly reproduces the transport that MOGUNTIA would do for us with an extra SINK of 1 PgC. We were able to do this because MOGUNTIA transport is fully linear (a doubling of emissions exactly doubles all mole fractions), which allows us to use this relation between one parameter (EXTRA_LAND) and all possible CO$_2$ sites. Moreover, we included the initial condition of CO$_2$ = 369 ppm and only fossil fuels in the `basefunc_base` run, so that we can subtract it from the measurements before optimizing the residuals.
</div>
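The algebra of the least-squares solution can be sketched in a few lines. All numbers below are invented for illustration: `H` is a hypothetical N x 1 operator (ppm change per unit of EXTRA_LAND), `y0` holds hypothetical residual observations after subtracting a base run, and R is assumed to be diagonal with the squared model-data mismatch on its diagonal. The last lines check the result against `np.linalg.lstsq` applied to the same problem with each row divided by its mismatch, to show that the two are equivalent.

```python
import numpy as np

# Hypothetical linear operator: ppm change at 3 sites per unit of EXTRA_LAND
H = np.array([[-0.8], [-1.0], [-1.2]])
# Hypothetical residual observations (observed minus base run) [ppm]
y0 = np.array([-1.7, -1.9, -2.5])
# Assumed model-data mismatch per site [ppm]; the third site is trusted less
mdm = np.array([0.5, 0.5, 1.0])
R_inv = np.diag(1.0 / mdm**2)

# x_a = (H^T R^-1 H)^-1 H^T R^-1 y0, solved without forming the inverse explicitly
A = H.T @ R_inv @ H
x_a = np.linalg.solve(A, H.T @ R_inv @ y0)

# Same answer from a standard least-squares routine on the "whitened" problem
H_w = H / mdm[:, None]
y_w = y0 / mdm
x_lstsq = np.linalg.lstsq(H_w, y_w, rcond=None)[0]

print(f"x_a = {x_a[0]:.4f}, lstsq = {x_lstsq[0]:.4f}")
```

Increasing the mismatch of one site reduces its weight in `R_inv`, so the solution moves toward the value preferred by the remaining sites.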
---
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Investigate how the solution depends on the value of R (model-data mismatch).
* And what about the observation set, how does it influence x_a? Can you use one site? Which one would you pick?
* Compare the cost function and RMSE to the solution you created with the whole class in Exercise 5. Did you get close?
</FONT>
</div>
## Exercise 7: Adding a prior term to the cost function
The cost function J above is quite simple, and does not allow for other information than the observations to play a role in the final solution. However, quite often we start data assimilation with a 'first-guess' or other type of a-priori information on the state of a system. And just like with observations, we want our final solution to stay close to such information too.
Including a-priori information in the cost function is an essential component of data assimilation, and sets it apart from simple curve-fitting or linear regression. The first-guess state of a system can be predicted from information at a previous moment in time, or taken from expert knowledge. Such information can be added to the cost function:
<div class="alert alert-block alert-warning">
$J(x) = \left(y^{0} - H(x)\right)^{T} R^{-1} \left(y^{0} - H(x)\right) + \left(x - x^{p}\right)^{T} P^{-1} \left(x - x^{p}\right)$
$x^{p}$ = prior information on the state (obtained from expert knowledge or a model of the state)
$P$ = covariance matrix representing the full error structure of the prior state
</div>
The optimal solution (minimum cost J) thus depends on the relative uncertainties assumed in the model-data comparison (R) and in the prior state (P). If one of them becomes very small (low errors), deviations from the corresponding information will incur high costs.
In the cell below, we once again calculate the cost function for various values of EXTRA_LAND but now with a prior term added.
---
<div class="alert alert-block alert-warning">
<b>To do</b>
<FONT COLOR=red>
* Read and understand the cell below, then execute it. Which term in J is smaller? What does it mean?
* Change the values of P and/or R, and execute the cell again. What is needed to make J1 ~ J2 (a balanced cost function)?
* Can you think of a way to make the solution (minimum cost) stick to the prior value of 1.0? Try it.
</FONT>
</div>
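The balance between the observation term and the prior term can be explored with a small scalar example. The sketch below is not the notebook's own cell: the sensitivities `h`, residual observations `y0`, mismatch values `mdm`, and the function names are hypothetical, and the cost function is assumed to be the sum of an observation term weighted by R and a prior term weighted by P, without a factor 1/2. Setting the derivative of that cost to zero gives the analytical minimum used in `analysis`.

```python
import numpy as np

# Hypothetical setup: 3 sites, scalar state x (scaling of the land sink)
h = np.array([-0.8, -1.0, -1.2])       # ppm per unit x
y0 = np.array([-1.7, -1.9, -2.5])      # residual observations [ppm]
mdm = np.array([0.5, 0.5, 1.0])        # model-data mismatch [ppm]
R_inv = np.diag(1.0 / mdm**2)
x_p = 1.0                              # prior value of the state

def cost_terms(x, P):
    """Return the observation term and the prior term of J separately."""
    r = y0 - h * x
    J_obs = r @ R_inv @ r
    J_prior = (x - x_p) ** 2 / P
    return J_obs, J_prior

def analysis(P):
    """Minimum of J for a scalar state with prior variance P."""
    return (h @ R_inv @ y0 + x_p / P) / (h @ R_inv @ h + 1.0 / P)

x_balanced = analysis(0.25)   # prior and observations both contribute
x_tight = analysis(1e-8)      # very certain prior: solution sticks to x_p
x_loose = analysis(1e8)       # very uncertain prior: observations dominate

for label, P in [("balanced", 0.25), ("tight prior", 1e-8), ("loose prior", 1e8)]:
    x_a = analysis(P)
    J_obs, J_prior = cost_terms(x_a, P)
    print(f"{label:12s} x_a = {x_a:.4f}  J_obs = {J_obs:.4f}  J_prior = {J_prior:.4f}")
```

A very small P makes any departure from the prior expensive, so the minimum stays at the prior value, while a very large P recovers the observation-only least-squares solution.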