From 2417b84b2257d078cc1eb1bfc599f4d8210635b5 Mon Sep 17 00:00:00 2001
From: Hans van den Heuvel <hans1.vandenheuvel@wur.nl>
Date: Thu, 26 Mar 2020 15:08:13 +0100
Subject: [PATCH] Readme in Convert-EUProcessingFactorsDB updated.

---
 Convert-EUProcessingFactorsDB/README.md | 102 ++++++++++--------------
 1 file changed, 44 insertions(+), 58 deletions(-)

diff --git a/Convert-EUProcessingFactorsDB/README.md b/Convert-EUProcessingFactorsDB/README.md
index 6263274..ec339eb 100644
--- a/Convert-EUProcessingFactorsDB/README.md
+++ b/Convert-EUProcessingFactorsDB/README.md
@@ -38,7 +38,7 @@ These are the input and output files of the script. All names are defaults, and
 * A small markdown report is also created, usally called [Report.md](Report.md), but within the zip file is called Readme.md.
 * A csv file with a summary (and counts) of *the remaining data* of the EU sheet, called [Mismatches.csv](Mismatches.csv).
 
-The following is happening in the script, essentially ([more details here](#detailed-workings))
+The following is happening in the script, essentially
 * The script wil try to match the first column (``FromFC``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) to the column ``KeyFacets Code`` of the EU sheet. If a match is found, then the second column (``FCToProcType``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) will become the field ``idProcessingType``.
 * Then the script will try to match both the ``FromFX`` and ``FXToRpc`` column of [FoodTranslations.csv](FoodTranslations.csv) with the columns ``Matrix FoodEx2 Code`` and ``Matrix Code`` from the EU sheet, *for all rows that didn't already match in the previous step*. If a match was found, then the value of ``FXToProcType`` will be copied to ``idProcessingType``.
 * If no substance file was given, then just copy the field ``ParamCode Active Substance`` to ``idSubstance``. But if a substance was given, then strip the dash from the ``'CASNumber`` column in the substance file, and match the column ``ParamCode Active Substance`` in the EFSA sheet to ``code`` in the substances sheet. If a match was found then copy the modified (without dash) ``CASNumber`` to ``idSubstance``.
@@ -69,61 +69,49 @@ python.exe convert-script -h
 ```
 
 Theses are command line options that are supported.
- * ``-h`` : shows help, use this to see which default file names are used.
- * ``-e EFSA_FILE_OR_URL`` : uses ``EFSA_FILE_OR_URL`` as input Excel sheet. This may be a filename, or the URL, but it should be the format as in the [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
- * ``-f FOOD_TRANSLATION_FILE`` : uses ``FOOD_TRANSLATION_FILE`` as a food translation file (``.csv``). This file should have the following format
-   * ``FromFX,FXToRpc,FXToProcType``; this line should be the first line (a header). Values are read as string. Values separated by comma.
-   * Lines starting with \# will be ignored and this can be used to insert comments.
- * ``-m MISMATCH_FILE`` : this uses ``MISMATCH_FILE`` as an output file for the mismatches. Supported format: ``.csv``.
- * ``-o PROCESSING_FACTOR_FILE`` : this uses ``PROCESSING_FACTOR_FILE`` as an output file for the MCRA formatted Processing Factors file. Supported formats: ``.zip``, ``.xlsx``, ``.csv``. If ``.zip`` is chosen as a format (default) then within the zipfile a ``.csv`` will be written with the MCRA conforming filename. Also a ``Readme.md`` file will be placed, which is just a copy of the report file (see option ``-r``)
- * ``-p PROCESSING_TRANSLATION_FILE`` : Uses ``PROCESSING_TRANSLATION_FILE`` as a processing type translation file (``.csv``). This file should have the following format
-   * ``FromFC,FCToProcType``; this line should be the first line (a header). Values are read as string. Values seperated by comma.
-   * Lines starting with \# will be ignored and this can be used to insert comments.
- * ``-r REPORT_FILE`` : this uses ``REPORT_FILE`` as an output report file (a Markdown file). A copy will be placed in the ``PROCESSING_FACTOR_FILE`` (option ``-o``) as ``Readme.md`` *if a zip file was chosen there* as an output file.
- * ``-t PROCESSING_TYPE_FILE``] : this uses ``PROCESSING_TYPE_FILE`` as input to augment the data in the output file. The format is defined by MCRA.
- * ``-v`` : writes verbose output. Multiple levels (1-3) of verbosity are possible, by using more ``v``'s. E.g. ``-vv`` or ``-vvv``.
-
-## Detailed workings
-
-The script is basically one long file, with sequential actions happening. No iteration is used, because the data processing is handed over to the ``pandas`` library. The script is diveded into five phases. If the ``-vv`` verbosity is used, these phases will be displayed as output. This is also extensively documented (commented) within the python file itself.
-
-The pandas dataprocessing can be thought of here as an SQL database. The script will read the EU Excel sheet into a database. Using left joins, and copying of columns the sheet/database is extended. Finally a selection of the newly created columns will be exported.
-
-Below a detailed description.
-
-* **PHASE 0. Initialization**
- * Libraries are imported
- * Command line arguments are parsed
- * Objects created/adjusted
-* **PHASE 1. Read input files**
- * Script reads the [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
- * Script reads the (MCRA formatted) files:
-   * A food translation file, [Foodtranslations.csv](Foodtranslations.csv)
-   * A processing translation file, [ProcTypeTranslations.csv](ProcTypeTranslations.csv)
-   * Only for information, a processing translation file, [ProcessingTypes.csv](ProcessingTypes.csv)
-* **PHASE 2. Processing data**
- * Script will ``left join`` column ``KeyFacets Code`` from the EU sheet with the ``FromFC`` column of [ProcTypeTranslations.csv](ProcTypeTranslations.csv).
- * The result will ``left join`` column ``Matrix FoodEx2 Code`` from the EU sheet with the ``FromFX`` column of [Foodtranslations.csv](Foodtranslations.csv).
- * Copy existing columns
-
-|From                      |To                 |
-|:-------------------------|:------------------|
-|ParamCode Active Substance|idSubstance        |
-|ParamName Active Substance|SubstanceName      |
-|Matrix Code               |idFoodUnProcessed  |
-|Raw Primary Commodity     |FoodUnprocessedName|
-|Median PF                 |Nominal            |
-
-*
- * Add empty columns: ``Upper``,``NominalUncertaintyUpper``,``UpperUncertaintyUpper``
- * Next, if the first ``left join`` was succesfull (i.e ``FCToProcType`` contains a value), then make a copy of ``FCToProcType`` to a new field, ``idProcessingType``
- * Next, if the second ``left join`` was succesfull (i.e ``FCToProcType`` does NOT contain a value, and ``FXToProcType`` does), then make a copy of ``FXToProcType`` to ``idProcessingType``
- * Do a ``left join`` on column ``idProcessingType`` from the sheet with column ``idProcessingType`` from the file [ProcessingTypes.csv](ProcessingTypes.csv)
- * Now, if column ``idProcessingType`` has an entry, ``idFoodUnProcessed`` will be concatenated with a dash ``-`` and with ``idProcessingType`` and the result will be placed into ``idFoodProcessed``
-* **PHASE 3. Exporting data**
- * The columns ``idProcessingType``, ``idSubstance``, ``SubstanceName``, ``idFoodProcessed``, ``idFoodUnProcessed``, ``FoodUnprocessedName``, ``Nominal``, ``Upper``, ``NominalUncertaintyUpper``, ``UpperUncertaintyUpper``, ``KeyFacets Interpreted``, ``Matrix Code Interpreted``, ``MCRA_ProcessingType_Description`` are exported, for all rows in which either ``FCToProcType`` or ``FXToProcType`` has an entry.
-* **PHASE 4. Analysing data and creating report**
- * This has to be expanded further in this readme file.
+```
+usage: Convert-EUProcessingFactorsDB.py [-h] [-v] [-x] [-e [EFSA_FILE]]
+                                        [-t [PROCESSING_TYPE_FILE]]
+                                        [-p [PROCESSING_TRANSLATION_FILE]]
+                                        [-f [FOOD_TRANSLATION_FILE]]
+                                        [-s [SUBSTANCE_TRANSLATION_FILE]]
+                                        [-g [FOOD_COMPOSITION_FILE]]
+                                        [-o [PROCESSING_FACTOR_FILE]]
+
+Converts the EFSA Zendono Excel sheet into an MCRA conforming format, using
+some external translation files.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -v, --verbosity       Show verbose output
+  -x, --example         Uses input files from the Example subdir.
+  -e [EFSA_FILE], --efsa_file [EFSA_FILE]
+                        The EFSA Zendono Excel sheet (.xlsx); either file or
+                        URL. (default: https://zenodo.org/record/1488653/files
+                        /EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
+  -t [PROCESSING_TYPE_FILE], --processing_type_file [PROCESSING_TYPE_FILE]
+                        The (input) processing type file - format: csv (Comma
+                        Seperated). (default: ProcessingTypes.csv)
+  -p [PROCESSING_TRANSLATION_FILE], --processing_translation_file [PROCESSING_TRANSLATION_FILE]
+                        The (input) processing translation file - format: csv
+                        (Comma Seperated). (default: ProcTypeTranslations.csv)
+  -f [FOOD_TRANSLATION_FILE], --food_translation_file [FOOD_TRANSLATION_FILE]
+                        The (input) food translation file - format: csv (Comma
+                        Seperated). (default: FoodTranslations.csv)
+  -s [SUBSTANCE_TRANSLATION_FILE], --substance_translation_file [SUBSTANCE_TRANSLATION_FILE]
+                        The (input) substance translation file - format: tsv
+                        (Tab Seperated), file not required. (default:
+                        SubstanceTranslations.tsv)
+  -g [FOOD_COMPOSITION_FILE], --food_composition_file [FOOD_COMPOSITION_FILE]
+                        The (input) food composition file - format: xlsx
+                        (Excel), file not required. (default:
+                        FoodComposition.xlsx)
+  -o [PROCESSING_FACTOR_FILE], --processing_factor_file [PROCESSING_FACTOR_FILE]
+                        The (output) processing factor file - format: csv
+                        (Comma Seperated). (default: ProcessingFactors.zip)
+
+For example: use Convert-EUProcessingFactorsDB.py -v -x for a verbose example.
+```
 
 ## Coding
 
@@ -134,5 +122,3 @@ Check your changes using ``pycodestyle`` for example.
 pip install pycodestyle # To install the programm
 pycodestyle .\Convert-EUProcessingFactorsDB.py # To check whether the code complies.
 ```
-
-At the moment only one line is not according to the guidelines, a commented line with the URL of the EU website. This one execption is allowed.
\ No newline at end of file
-- 
GitLab