Commit b988e748 authored by Johannes Kruisselbrink's avatar Johannes Kruisselbrink

Update README

parent 3c3588fb
```
@@ -13,7 +13,6 @@ from datetime import datetime
import textwrap
import os

# Small utility to turn a text into a Markdown hyperlink :-)
def print_as_link(text):
    return f'[{text}]({text})'

@@ -136,10 +135,10 @@ capeg['Name'] = capeg['targetL1'].str[max_len]
# Set the reference
capeg['Reference'] = ''
# Remove Acute rows without ARfD
capeg.drop(capeg.loc[(capeg['AcuteChronic'] == 'Acute') & ((capeg['arfd'] == 'na') | capeg['arfd'].isna())].index, inplace=True)
# Remove Chronic rows without ADI
capeg.drop(capeg.loc[(capeg['AcuteChronic'] == 'Chronic') & (capeg['adi'].isna() | capeg['arfd'].isna())].index, inplace=True)
#capeg.to_excel('dump.xlsx', sheet_name='Dump', index=False)
```
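The boolean-mask drop used in the snippet above can be illustrated on a small toy frame. The column names follow the snippet, but the values here are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the CAPEG sheet; values are made up for illustration.
capeg = pd.DataFrame({
    'AcuteChronic': ['Acute', 'Acute', 'Chronic', 'Chronic'],
    'arfd': ['0.1', 'na', None, '0.2'],
    'adi': [None, '0.05', None, '0.3'],
})

# Remove Acute rows whose ARfD is 'na' or missing, mirroring the
# drop pattern used in the script.
acute_mask = (capeg['AcuteChronic'] == 'Acute') & \
             ((capeg['arfd'] == 'na') | capeg['arfd'].isna())
capeg.drop(capeg.loc[acute_mask].index, inplace=True)

print(len(capeg))  # one Acute row is dropped, three rows remain
```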
## Introduction
This script creates an MCRA-compliant dataset from the data of the Cumulative Assessment Groups of Pesticides as proposed by Nielsen et al. (2012). It uses the [CAPEG 1.2 database](https://efsa.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.2903%2Fsp.efsa.2012.EN-269&file=269eax2-sup-0002.zip) from the supplementary material as source data and generates MCRA-compliant tables containing the assessment group definitions, as well as a substances catalogue and an effects catalogue.
These are the input and output files of the script. All names are defaults, and can be changed by the user on the command line.
Elsa Nielsen, Pia Nørhede, Julie Boberg, Louise Krag Isling, Stine Kroghsbo, Niels Hadrup, Lea Bredsdorff, Alicja Mortensen, John Christian Larsen, 2012. Identification of Cumulative Assessment Groups of Pesticides. EFSA Supporting Publication 2012; 9( 4):EN-269, 303 pp. doi:10.2903/sp.efsa.2012.EN-269
# How to use the convert script
## Quick start
### Install required packages
Either install from the requirements file:
```
pip install -r requirements.txt
```
or install the packages directly:
```
pip install pandas xlrd tabulate openpyxl requests
```
### Run the script in trial (-x) and verbose (-v) mode
```
python.exe Convert-EUProcessingFactorsDB.py -x -v
```
Again in trial mode, now using all input files from the Example directory:
```
python.exe Convert-EUProcessingFactorsDB.py -x -v -s -g
```
### Run the script with the default names
This example also performs a substance conversion (-s) and a food translation (-g).
```
python.exe Convert-EUProcessingFactorsDB.py -s -g -v
python.exe Convert-DTUCAG.py -x -v
```
### Run the script with specific input files
```
python.exe Convert-EUProcessingFactorsDB.py -v -t ProcessingTypes.csv -p ProcTypeTranslations.csv -f FoodTranslations.csv
python.exe Convert-DTUCAG.py -v -e MyEffects.csv -s MySubstances.csv
```
### Questions?
```
python.exe Convert-EUProcessingFactorsDB.py -h
```
## Introduction
This script takes data from the [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1) and combines it with two user-supplied files (a food translations file and a processing translations file) to produce an MCRA processing factors file with food codes and processing type codes in the desired coding system. In this way, data from the EU Processing Factors file can be used in MCRA analyses.
These are the input and output files of the script. All names are defaults, and can be changed by the user on the command line.
* Input files:
  * The [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
  * The processing translation input file, [ProcTypeTranslations.csv](ProcTypeTranslations.csv)
  * The food translation input file, [FoodTranslations.csv](FoodTranslations.csv)
  * The processing types input file to augment info in the report, [ProcessingTypes.csv](ProcessingTypes.csv) (not used in any data processing)
  * An optional substances sheet (-s), to augment the output with the ``CASNumber``.
  * An optional FoodComposition file (-g), to augment the output with A-codes.
* Output files:
  * The goal of this script: the file [ProcessingFactors.csv](ProcessingFactors.csv) with the new MCRA processing factors. By default this file is contained in a zip file, [ProcessingFactors.zip](ProcessingFactors.zip).
  * A small markdown report is also created, usually called [Report.md](Report.md); within the zip file it is called Readme.md.
  * A csv file with a summary (and counts) of *the remaining data* of the EU sheet, called [Mismatches.csv](Mismatches.csv).
In essence, the script does the following:
* The script will try to match the first column (``FromFC``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) to the column ``KeyFacets Code`` of the EU sheet. If a match is found, the second column (``FCToProcType``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) becomes the field ``idProcessingType``.
* Then the script will try to match both the ``FromFX`` and ``FXToRpc`` columns of [FoodTranslations.csv](FoodTranslations.csv) with the columns ``Matrix FoodEx2 Code`` and ``Matrix Code`` from the EU sheet, *for all rows that didn't already match in the previous step*. If a match is found, the value of ``FXToProcType`` is copied to ``idProcessingType``.
* If no substance file was given, the field ``ParamCode Active Substance`` is simply copied to ``idSubstance``. If a substance file was given, the dashes are stripped from the ``CASNumber`` column in the substance file, and the column ``ParamCode Active Substance`` in the EFSA sheet is matched to ``code`` in the substances sheet. If a match is found, the modified (dash-less) ``CASNumber`` is copied to ``idSubstance``.
* If a food compositions file was given, an additional translation is done. This table needs to have the layout of the MCRA FoodComposition table.
  * Only records where ``idToFood`` starts with ``P`` and ``idFromFood`` contains a dash (-) are used
  * The ``idFromFood`` column is split on the dash (-)
  * A new column is temporarily added, combining ``idToFood`` and the right part of the split ``idFromFood``
  * For all matches of the new column with the field ``idFoodProcessed`` in ``ProcessingFactors``, the field ``idFoodProcessed`` is replaced by the field ``idFromFood`` from the FoodComposition table, and duplicates are also added
* Finally the output file [ProcessingFactors.csv](ProcessingFactors.csv) (contained within [ProcessingFactors.zip](ProcessingFactors.zip)) will be written, together with some reports.
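The matching steps above can be sketched with pandas merges. The column names come from the text; the miniature frames and their values are invented for illustration:

```python
import pandas as pd

# Invented miniature versions of the EU sheet and the translation files.
eu = pd.DataFrame({
    'KeyFacets Code': ['F28.A07GV', 'F28.A07XD'],
    'ParamCode Active Substance': ['RF-0001', 'RF-0002'],
})
proc_trans = pd.DataFrame({
    'FromFC': ['F28.A07GV'],
    'FCToProcType': ['PT-Cooking'],
})
substances = pd.DataFrame({
    'code': ['RF-0001'],
    'CASNumber': ['50-00-0'],
})

# Step 1: match FromFC against 'KeyFacets Code' to obtain idProcessingType.
out = eu.merge(proc_trans, left_on='KeyFacets Code',
               right_on='FromFC', how='left')
out = out.rename(columns={'FCToProcType': 'idProcessingType'})

# Substance step: strip the dashes from CASNumber, match
# 'ParamCode Active Substance' to 'code', and fall back to the
# original code where no match is found.
substances['cas'] = substances['CASNumber'].str.replace('-', '', regex=False)
out = out.merge(substances[['code', 'cas']],
                left_on='ParamCode Active Substance',
                right_on='code', how='left')
out['idSubstance'] = out['cas'].fillna(out['ParamCode Active Substance'])
```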
## Prerequisites
To use the Python script, the following libraries are necessary:
* [pandas](https://pandas.pydata.org/)
* [xlrd](https://pypi.org/project/xlrd/)
* [tabulate](https://pypi.org/project/tabulate/)
* [openpyxl](https://pypi.org/project/openpyxl/)
* [requests](https://pypi.org/project/requests/)
Install all the libraries at once with
```
pip install pandas xlrd tabulate openpyxl requests
```
## Usage
The script assumes defaults for all filenames. The ``-h`` (help) option displays these defaults. So the following produces help information:
```
python.exe convert-script -h
```
These are the command line options that are supported.
```
usage: Convert-EUProcessingFactorsDB.py [-h] [-v] [-x] [-e [EFSA_FILE]]
[-t [PROCESSING_TYPE_FILE]]
[-p [PROCESSING_TRANSLATION_FILE]]
[-f [FOOD_TRANSLATION_FILE]]
[-s [SUBSTANCE_TRANSLATION_FILE]]
[-g [FOOD_COMPOSITION_FILE]]
[-o [PROCESSING_FACTOR_FILE]]
Converts the EFSA Zenodo Excel sheet into an MCRA conforming format, using
some external translation files.
optional arguments:
-h, --help show this help message and exit
-v, --verbosity Show verbose output
-x, --example Uses input files from the Example subdir.
-e [EFSA_FILE], --efsa_file [EFSA_FILE]
The EFSA Zenodo Excel sheet (.xlsx); either file or
URL. (default: https://zenodo.org/record/1488653/files
/EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
-t [PROCESSING_TYPE_FILE], --processing_type_file [PROCESSING_TYPE_FILE]
The (input) processing type file - format: csv (Comma
Separated). (default: ProcessingTypes.csv)
-p [PROCESSING_TRANSLATION_FILE], --processing_translation_file [PROCESSING_TRANSLATION_FILE]
The (input) processing translation file - format: csv
(Comma Separated). (default: ProcTypeTranslations.csv)
-f [FOOD_TRANSLATION_FILE], --food_translation_file [FOOD_TRANSLATION_FILE]
The (input) food translation file - format: csv (Comma
Separated). (default: FoodTranslations.csv)
-s [SUBSTANCE_TRANSLATION_FILE], --substance_translation_file [SUBSTANCE_TRANSLATION_FILE]
The (input) substance translation file - format: tsv
(Tab Separated), file not required. (default:
SubstanceTranslations.tsv)
-g [FOOD_COMPOSITION_FILE], --food_composition_file [FOOD_COMPOSITION_FILE]
The (input) food composition file - format: xlsx
(Excel), file not required. (default:
FoodCompositions.xlsx)
-o [PROCESSING_FACTOR_FILE], --processing_factor_file [PROCESSING_FACTOR_FILE]
The (output) processing factor file - format: csv
(Comma Separated). (default: ProcessingFactors.zip)
For example: use Convert-EUProcessingFactorsDB.py -v -x for a verbose example.
```
For help on the DTU CAG script:
```
python.exe Convert-DTUCAG.py -h
```
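The option handling shown above can be reproduced with a standard ``argparse`` setup. This is a minimal sketch of a few of the options, not the script's actual source:

```python
import argparse

# Sketch of the command line interface described in the help text above;
# only a subset of the options is shown.
parser = argparse.ArgumentParser(
    description='Converts the EFSA Zenodo Excel sheet into an MCRA '
                'conforming format, using some external translation files.')
parser.add_argument('-v', '--verbosity', action='store_true',
                    help='Show verbose output')
parser.add_argument('-x', '--example', action='store_true',
                    help='Uses input files from the Example subdir.')
parser.add_argument('-t', '--processing_type_file', nargs='?',
                    default='ProcessingTypes.csv',
                    help='The (input) processing type file (csv).')
parser.add_argument('-o', '--processing_factor_file', nargs='?',
                    default='ProcessingFactors.zip',
                    help='The (output) processing factor file.')

# Parse an explicit argument list instead of sys.argv, for demonstration.
args = parser.parse_args(['-v', '-t', 'MyTypes.csv'])
```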
## Coding
Check your changes using ``pycodestyle``, for example.
```
pip install pycodestyle                          # To install the program
pycodestyle .\Convert-EUProcessingFactorsDB.py   # To check whether the code complies.
pycodestyle .\Convert-DTUCAG.py                  # To check whether the code complies.
```