@@ -38,7 +38,7 @@ These are the input and output files of the script. All names are defaults, and
* A small markdown report is also created, usually called [Report.md](Report.md), but within the zip file it is called Readme.md.
* A csv file with a summary (and counts) of *the remaining data* of the EU sheet, called [Mismatches.csv](Mismatches.csv).
The following is happening in the script, essentially ([more details here](#detailed-workings))
The following is happening in the script, essentially
* The script will try to match the first column (``FromFC``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) to the column ``KeyFacets Code`` of the EU sheet. If a match is found, then the second column (``FCToProcType``) of [ProcTypeTranslations.csv](ProcTypeTranslations.csv) will become the field ``idProcessingType``.
* Then the script will try to match both the ``FromFX`` and ``FXToRpc`` column of [FoodTranslations.csv](FoodTranslations.csv) with the columns ``Matrix FoodEx2 Code`` and ``Matrix Code`` from the EU sheet, *for all rows that didn't already match in the previous step*. If a match was found, then the value of ``FXToProcType`` will be copied to ``idProcessingType``.
* If no substance file was given, then just copy the field ``ParamCode Active Substance`` to ``idSubstance``. But if a substance file was given, then strip the dash from the ``CASNumber`` column in the substance file, and match the column ``ParamCode Active Substance`` in the EFSA sheet to ``code`` in the substances sheet. If a match was found, then copy the modified (without dash) ``CASNumber`` to ``idSubstance``.
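The substance step can be sketched with ``pandas`` as follows. This is only an illustration, not the script's actual code: the function name ``add_id_substance`` and the toy values are made up, and it assumes that "strip the dash" means removing every dash from the CAS number.

```python
import pandas as pd

# Toy stand-ins for the EU sheet and the substance file (values are made up).
eu = pd.DataFrame({"ParamCode Active Substance": ["RF-0001", "RF-0002", "RF-0003"]})
substances = pd.DataFrame({
    "code": ["RF-0001", "RF-0002"],
    "CASNumber": ["1912-24-9", "50-00-0"],
})


def add_id_substance(eu, substances=None):
    """Return a copy of `eu` with an `idSubstance` column (hypothetical helper)."""
    out = eu.copy()
    if substances is None:
        # No substance file: copy the parameter code unchanged.
        out["idSubstance"] = out["ParamCode Active Substance"]
        return out
    lookup = substances[["code", "CASNumber"]].copy()
    # Assumption: all dashes are removed, e.g. "50-00-0" becomes "50000".
    lookup["idSubstance"] = lookup["CASNumber"].astype(str).str.replace("-", "", regex=False)
    # Left join keeps every EU row; unmatched rows get a missing idSubstance.
    out = out.merge(
        lookup[["code", "idSubstance"]],
        how="left",
        left_on="ParamCode Active Substance",
        right_on="code",
    )
    return out.drop(columns="code")


with_file = add_id_substance(eu, substances)
without_file = add_id_substance(eu)
```

In this sketch, rows without a match in the substance file keep an empty ``idSubstance``; whether the script does the same is not stated here.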
...
...
@@ -69,61 +69,49 @@ python.exe convert-script -h
```
These are the command line options that are supported.
* ``-h`` : shows help, use this to see which default file names are used.
* ``-e EFSA_FILE_OR_URL`` : uses ``EFSA_FILE_OR_URL`` as input Excel sheet. This may be a filename or a URL, but it should be in the same format as the [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1).
* ``-f FOOD_TRANSLATION_FILE`` : uses ``FOOD_TRANSLATION_FILE`` as a food translation file (``.csv``). This file should have the following format:
  * ``FromFX,FXToRpc,FXToProcType``; this line should be the first line (a header). Values are read as strings and are separated by commas.
  * Lines starting with \# will be ignored, which can be used to insert comments.
* ``-m MISMATCH_FILE`` : uses ``MISMATCH_FILE`` as an output file for the mismatches. Supported format: ``.csv``.
* ``-o PROCESSING_FACTOR_FILE`` : uses ``PROCESSING_FACTOR_FILE`` as an output file for the MCRA formatted Processing Factors file. Supported formats: ``.zip``, ``.xlsx``, ``.csv``. If ``.zip`` is chosen as a format (default), then a ``.csv`` with the MCRA conforming filename will be written within the zip file. A ``Readme.md`` file will also be placed there, which is just a copy of the report file (see option ``-r``).
* ``-p PROCESSING_TRANSLATION_FILE`` : uses ``PROCESSING_TRANSLATION_FILE`` as a processing type translation file (``.csv``). This file should have the following format:
  * ``FromFC,FCToProcType``; this line should be the first line (a header). Values are read as strings and are separated by commas.
  * Lines starting with \# will be ignored, which can be used to insert comments.
* ``-r REPORT_FILE`` : uses ``REPORT_FILE`` as an output report file (a Markdown file). A copy will be placed in the ``PROCESSING_FACTOR_FILE`` (option ``-o``) as ``Readme.md`` *if a zip file was chosen there* as an output file.
* ``-t PROCESSING_TYPE_FILE`` : uses ``PROCESSING_TYPE_FILE`` as input to augment the data in the output file. The format is defined by MCRA.
* ``-v`` : writes verbose output. Multiple levels (1-3) of verbosity are possible by using more ``v``'s, e.g. ``-vv`` or ``-vvv``.
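For readers who want to see how such options are typically wired up, here is a minimal ``argparse`` sketch. It is not taken from the script: the ``dest`` names are hypothetical, and no default file names are set because the real defaults are only shown by ``-h``.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed above (names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="convert-script")
    # -h / --help is added by argparse automatically.
    parser.add_argument("-e", dest="efsa_file", metavar="EFSA_FILE_OR_URL")
    parser.add_argument("-f", dest="food_translation_file", metavar="FOOD_TRANSLATION_FILE")
    parser.add_argument("-m", dest="mismatch_file", metavar="MISMATCH_FILE")
    parser.add_argument("-o", dest="processing_factor_file", metavar="PROCESSING_FACTOR_FILE")
    parser.add_argument("-p", dest="processing_translation_file", metavar="PROCESSING_TRANSLATION_FILE")
    parser.add_argument("-r", dest="report_file", metavar="REPORT_FILE")
    parser.add_argument("-t", dest="processing_type_file", metavar="PROCESSING_TYPE_FILE")
    # action="count" turns -v, -vv, -vvv into verbosity levels 1, 2, 3.
    parser.add_argument("-v", dest="verbosity", action="count", default=0)
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration only.
args = build_parser().parse_args(["-vv", "-o", "out.zip"])
```

Counting repeated ``-v`` flags with ``action="count"`` is one common way to get the multiple verbosity levels described above.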
## Detailed workings
The script is basically one long file, with sequential actions happening. No iteration is used, because the data processing is handed over to the ``pandas`` library. The script is divided into five phases. If the ``-vv`` verbosity is used, these phases will be displayed as output. This is also extensively documented (commented) within the Python file itself.
The pandas data processing can be thought of here as an SQL database. The script will read the EU Excel sheet into a database. Using left joins and copying of columns, the sheet/database is extended. Finally, a selection of the newly created columns will be exported.
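To make the SQL analogy concrete, the sketch below reads a small translation table the way the options section describes it (values as strings, ``#`` lines ignored) and left joins it onto a toy sheet. The codes and the in-memory CSV text are invented for illustration; the script's own reading code may differ.

```python
import io

import pandas as pd

# In-memory stand-in for a translation file such as ProcTypeTranslations.csv.
csv_text = """FromFC,FCToProcType
# this comment line is skipped by the reader
F28.A07XD,10
F28.A07KQ,20
"""

# dtype=str keeps every value a string; comment="#" drops comment lines
# (note that pandas also cuts off anything after a "#" inside a data line).
translations = pd.read_csv(io.StringIO(csv_text), comment="#", dtype=str)

# Toy stand-in for the EU sheet.
eu = pd.DataFrame({"KeyFacets Code": ["F28.A07XD", "F28.UNKNOWN"]})

# Equivalent in spirit to: SELECT * FROM eu LEFT JOIN translations
#                          ON eu."KeyFacets Code" = translations.FromFC
joined = eu.merge(translations, how="left", left_on="KeyFacets Code", right_on="FromFC")
```

Every row of the sheet survives the left join; rows without a translation simply get missing values in the added columns.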
A detailed description follows below.
* **PHASE 0. Initialization**
  * Libraries are imported
  * Command line arguments are parsed
  * Objects created/adjusted
* **PHASE 1. Read input files**
  * Script reads the [EU Processing Factors file](https://zenodo.org/record/1488653/files/EU_Processing_Factors_db_P.xlsx.xlsx?download=1)
  * Script reads the (MCRA formatted) files:
    * A food translation file, [Foodtranslations.csv](Foodtranslations.csv)
    * A processing translation file, [ProcTypeTranslations.csv](ProcTypeTranslations.csv)
    * Only for information, a processing types file, [ProcessingTypes.csv](ProcessingTypes.csv)
* **PHASE 2. Processing data**
  * Script will ``left join`` column ``KeyFacets Code`` from the EU sheet with the ``FromFC`` column of [ProcTypeTranslations.csv](ProcTypeTranslations.csv).
  * The result will ``left join`` column ``Matrix FoodEx2 Code`` from the EU sheet with the ``FromFX`` column of [Foodtranslations.csv](Foodtranslations.csv).
  * Next, if the first ``left join`` was successful (i.e. ``FCToProcType`` contains a value), then make a copy of ``FCToProcType`` to a new field, ``idProcessingType``
  * Next, if the second ``left join`` was successful (i.e. ``FCToProcType`` does NOT contain a value, and ``FXToProcType`` does), then make a copy of ``FXToProcType`` to ``idProcessingType``
  * Do a ``left join`` on column ``idProcessingType`` from the sheet with column ``idProcessingType`` from the file [ProcessingTypes.csv](ProcessingTypes.csv)
  * Now, if column ``idProcessingType`` has an entry, ``idFoodUnProcessed`` will be concatenated with a dash ``-`` and with ``idProcessingType`` and the result will be placed into ``idFoodProcessed``
* **PHASE 3. Exporting data**
  * The columns ``idProcessingType``, ``idSubstance``, ``SubstanceName``, ``idFoodProcessed``, ``idFoodUnProcessed``, ``FoodUnprocessedName``, ``Nominal``, ``Upper``, ``NominalUncertaintyUpper``, ``UpperUncertaintyUpper``, ``KeyFacets Interpreted``, ``Matrix Code Interpreted``, ``MCRA_ProcessingType_Description`` are exported, for all rows in which either ``FCToProcType`` or ``FXToProcType`` has an entry.
* **PHASE 4. Analysing data and creating report**
  * This has to be expanded further in this readme file.
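The core of phases 2 and 3 can be sketched roughly as follows. This is an approximation under stated assumptions, not the script itself: all codes are invented, ``idFoodUnProcessed`` is assumed to be present already (how it is filled is not covered here), the join with ProcessingTypes.csv is left out, and only a few of the exported columns are shown.

```python
import pandas as pd

# Toy stand-ins for the EU sheet and the two translation files.
eu = pd.DataFrame({
    "KeyFacets Code": ["FC1", "FC9", "FC9"],
    "Matrix FoodEx2 Code": ["FX1", "FX2", "FX9"],
    "idFoodUnProcessed": ["RPC1", "RPC2", "RPC3"],  # assumed to exist already
})
proc_translations = pd.DataFrame({"FromFC": ["FC1"], "FCToProcType": ["P10"]})
food_translations = pd.DataFrame({
    "FromFX": ["FX1", "FX2"],
    "FXToRpc": ["RPC1", "RPC2"],
    "FXToProcType": ["P20", "P30"],
})

# PHASE 2: two consecutive left joins.
df = eu.merge(proc_translations, how="left", left_on="KeyFacets Code", right_on="FromFC")
df = df.merge(food_translations, how="left", left_on="Matrix FoodEx2 Code", right_on="FromFX")

# The first join wins; the second join is only used where the first gave nothing.
df["idProcessingType"] = df["FCToProcType"].fillna(df["FXToProcType"])

# str.cat leaves the result missing when idProcessingType is missing.
df["idFoodProcessed"] = df["idFoodUnProcessed"].str.cat(df["idProcessingType"], sep="-")

# PHASE 3: export only rows where at least one of the joins matched.
matched = df["FCToProcType"].notna() | df["FXToProcType"].notna()
exported = df.loc[matched, ["idProcessingType", "idFoodProcessed", "idFoodUnProcessed"]]
```

In this toy data, the first row matches both translation files and takes its processing type from the first join, the second row falls back to the food translation, and the third row matches neither and is therefore not exported.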