Updated tutorial part 4 - Re-arranged setup of tutorial page

8264099d · Flege, Patrick · 2691c3d3 · 8264099d
Commit 8264099d authored 6 months ago by Flege, Patrick
--- a/docs/source/tutorial/tutorial_part4.rst
+++ b/docs/source/tutorial/tutorial_part4.rst
@@ -5,8 +5,8 @@ This part guides you through some example cases, from raw data to PanVA
 instance.


-Data sets
---------
+Data Packages
+-------------

 Multiple data packages are available, which contain all the information for
 creating a pangenome using PanTools and creating a PanVA instance from it,
@@ -44,37 +44,44 @@ Steps to generate PanVA input

 These example cases run through the following steps:

-* Downloading publicly available data
-   * Genome and structural annotation data
-   * Accession data for arabidopsis and tomato
-* Preprocessing the data for Pantools
-   * Filtering the minimum sequence size of the fasta file
+1: Downloading publicly available data
+   * Acquire genome and structural annotation data
+   * Accession data for *Arabidopsis*, *Pectobacterium*, and yeast
+2: Preprocessing the data for PanTools
+   * Filtering the minimum sequence size of genomes in the FASTA file
   * Filtering the minimum ORF size of CDS features in the annotation
-   * Create functional annotations for extracted protein sequences
-* Constructing and annotating a pangenome using Pantools
+   * Extract protein sequences by matching CDS features to genomic sequences
+   * Create functional annotations for extracted protein sequences (optional)
+   * Generate statistics for raw and filtered data
+3: Constructing and annotating a pangenome using PanTools
   * Build the pangenome
   * Add structural annotations, functional annotations and phenotypes
   * Add vcf information or phasing information if available
   * Create homology groups
-* Running the necessary analysis steps in order to create a PanVA instance
+4: Running the necessary analysis steps in order to create a PanVA instance
   * Gene classification
   * K-mer classification
   * Multiple sequence alignment
   * Group info
-* Create a PanVA instance
+5: Create a PanVA instance
   * Preprocessing data for PanVA
   * Set up the PanVA instance

-Instructions to create a pangenome and PanVA instance for *Arabidopsis*.
-The other packages follow the same logic. Every package contains a README
-with all the exact commands, so make sure to check those.
+This tutorial contains instructions to create a pangenome and PanVA instance for different species. Every package contains a README
+with the exact commands, so check it if you get stuck.
 Both Snakemake pipelines used in this workflow create conda environments
 in the data package directory. If you want to re-use these pipelines for
 different data, use the ``--conda-prefix`` Snakemake option to set a directory
 where the conda environments will be stored.
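
As a rough sketch of how ``--conda-prefix`` might be used, the snippet below assembles a call to the QC pipeline that stores its conda environments in a shared directory. The directory ``~/snakemake-conda-envs`` and the helper name ``print_qc_command`` are assumptions made for illustration; ``<target-dataset>`` and ``<threads>`` stay as placeholders, as elsewhere in this tutorial. The command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical shared directory for conda environments (an assumption; any path works).
CONDA_ENVS_DIR="${HOME}/snakemake-conda-envs"

# Print a Snakemake call that keeps its conda environments in the given directory.
# The <target-dataset> and <threads> placeholders still have to be filled in by hand.
print_qc_command() {
    printf '%s\n' "snakemake --use-conda --conda-prefix $1 --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>"
}

print_qc_command "${CONDA_ENVS_DIR}"
```

Pointing several data packages at the same directory lets Snakemake reuse environments it has already built instead of recreating them in each package.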

-Download the data package
-~~~~~~~~~~~~~~~~~~~~~~~~~
+1: Downloading publicly available data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Goal:
+   * Acquire genome and structural annotation data
+   * Accession data for *Arabidopsis*, *Pectobacterium*, and yeast
+
+1.1: Download the data package
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. code:: bash

@@ -85,18 +92,29 @@ All commands should be run from the root directory of the package.
 Run the whole package from RAM-disk or SSD, or set the path of the results to
 RAM-disk/SSD in the configs.

-Download the raw data
-~~~~~~~~~~~~~~~~~~~~~
+1.2: Download the raw data
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To download the raw data for each of the packages linked above, follow the first steps of the instructions in the package's README file.
+Each README is located at the root of the decompressed tar archive.

-To download corresponding raw-data for each of the above-linked packages, follow the instructions laid out in the respective README file.
-For all packages, those can be found at the root of the decompressed tar-files.

-Clone the PanUtils data filtering pipeline
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+2: Preprocessing the data for PanTools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Goal:
+   * Filtering the minimum sequence size of genomes in the FASTA file
+   * Filtering the minimum ORF size of CDS features in the annotation
+   * Extract protein sequences by matching CDS features to genomic sequences
+   * Create functional annotations for extracted protein sequences (optional)
+   * Generate statistics for raw and filtered data
+
+2.1: Clone the PanUtils data filtering pipeline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The
 :ref:`data filtering pipeline <getting_started/diy_pangenomics:Quality control pipeline>`
-filters out small sequences, matches fasta with gff contents and removes CDS
+filters out small sequences, matches FASTA with GFF contents and removes CDS
 features with an ORF below the cutoff value. It also extracts protein sequences and
 creates functional annotations from them.

@@ -104,21 +122,58 @@ creates functional annotations from them.

   $ git clone https://github.com/PanUtils/pantools-qc-pipeline.git

+2.2.1: Activate or create Snakemake *(Linux and non-Apple-silicon machines)*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Activate or create a Snakemake environment (works with Python
-versions <= 3.11):
+versions <= 3.11).

 .. code:: bash

   $ mamba create -c conda-forge -c bioconda -n snakemake snakemake

-Filter the raw data:
+If you are using an ARM-based machine (such as an M4-based Mac), create the new environment so that it is compatible with Intel-based packages,
+because many dependencies in conda are not yet available for ARM systems. On a Mac, consider installing *Rosetta 2* for this purpose.
+
+2.2.2: Activate or create Snakemake *(Apple silicon machines)*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Install Rosetta 2 first:
+
+.. code:: bash
+
+   $ softwareupdate --install-rosetta
+
+Then create the Snakemake environment with this command:
+
+.. code:: bash
+
+   $ CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snakemake snakemake
+
+This command ensures that packages are downloaded for an Intel-based architecture. Afterwards, restart your shell with the "Open using Rosetta" setting enabled.
+To do this via the GUI, go to Applications/Utilities/Terminal, click on "Get Info" and select the option to open the terminal using Rosetta.
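
One way to confirm which architecture the restarted shell reports is to inspect ``uname -m``, which prints ``x86_64`` in a shell running under Rosetta and ``arm64`` in a native Apple silicon shell. The following is a minimal sketch; the helper name ``describe_arch`` is made up for this example.

```shell
#!/bin/sh
# Map the machine name reported by `uname -m` to a short description.
describe_arch() {
    case "$1" in
        x86_64)
            echo "x86_64: Intel (x86-64) shell, native or under Rosetta" ;;
        arm64|aarch64)
            echo "$1: native ARM shell" ;;
        *)
            echo "$1: unrecognised architecture" ;;
    esac
}

# Describe the shell this script is running in.
describe_arch "$(uname -m)"
```

If the output still mentions a native ARM shell after the restart, the Rosetta setting for the terminal has probably not taken effect.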
+
+2.3: Filter the raw data and create functional annotations for extracted protein sequences
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filter the raw data and create protein sequences from the root of your data package:

 .. code:: bash

   $ snakemake --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>

-Clone the PanTools pipeline v4
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+3 & 4: Constructing and annotating a pangenome using PanTools & running the necessary analysis steps in order to create a PanVA instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Goal (3, pangenome construction):
+   * Build the pangenome
+   * Add structural annotations, functional annotations and phenotypes
+   * Add VCF information or phasing information if available
+   * Create homology groups
+
+Goal (4, PanVA-specific analyses):
+   * Gene classification
+   * K-mer classification
+   * Multiple sequence alignment
+   * Group info
+
+3.1: Clone the PanTools pipeline v4
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The :ref:`PanTools pipeline
 <getting_started/diy_pangenomics:PanTools v4 pipeline>` contains all PanTools
@@ -129,18 +184,31 @@ above.

   $ git clone https://github.com/PanUtils/pantools-pipeline-v4.git

-Run all required PanTools functions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+3.2: Run PanTools to generate a pangenome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The analyses that build the complete pangenome run together with the PanVA-specific analyses.
+Both are started by the same command, shown in section 4.1.

-The snakemake rule PanVA runs all functions to create a complete PanVA instance.
+4.1: Run PanTools for PanVA-specific analyses
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Snakemake rule PanVA (``panva``) runs all PanTools functions needed to create a complete PanVA instance.
+This single command therefore covers both step 3 and step 4.

 .. code:: bash

   $ snakemake panva --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml --cores <threads>

-Clone the export-to-panva python script
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This will create a pangenome database from which PanVA files can be generated.
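
Because this rule can run for a long time, it may help to preview the planned jobs first with Snakemake's ``--dry-run`` option. The sketch below builds such a call and runs it only if ``snakemake`` is on the ``PATH``. The dataset name ``example``, the thread count and the helper name ``panva_dry_run_command`` are assumptions for illustration, not values taken from the data packages.

```shell
#!/bin/sh
DATASET="example"   # hypothetical dataset name; replace with your own
THREADS=4           # hypothetical number of cores

# Build the dry-run variant of the panva call for a dataset and core count.
panva_dry_run_command() {
    printf '%s\n' "snakemake panva --dry-run --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/${1}_pantools.yaml --cores ${2}"
}

cmd="$(panva_dry_run_command "$DATASET" "$THREADS")"
echo "$cmd"

# Execute the preview only when Snakemake is available in the active environment.
if command -v snakemake >/dev/null 2>&1; then
    sh -c "$cmd" || echo "dry run failed; check the config path and dataset name" >&2
fi
```

A dry run lists the jobs Snakemake would schedule without executing them, so a wrong config path or dataset name shows up before any long-running step starts.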
+
+5: Create a PanVA instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Goal:
+   * Preprocessing data for PanVA
+   * Set up the PanVA instance

+5.1: Clone the export-to-panva python script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The export script reads data from the pangenome database and converts it to
 the proper format for PanVA.

@@ -148,23 +216,28 @@ the proper format for PanVA.

   $ git clone https://github.com/PanUtils/export-to-panva.git

-Create a conda environment for the script
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+5.2: Create a conda environment for the script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are on an Apple silicon Mac, make sure to create an environment that can handle Intel-based dependencies.

 .. code:: bash

   $ mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml
   $ conda activate export-to-panva

-Run the export script
-~~~~~~~~~~~~~~~~~~~~~
+5.3: Run the export script
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+Finally, run the export script from the root of the data package to create the inputs for PanVA:

 .. code:: bash

   $ python3 export-to-panva/scripts/pan_to_va.py config/<target-dataset>_panva.ini

-Create a PanVA instance
-~~~~~~~~~~~~~~~~~~~~~~~
+
+5.4: Create a PanVA instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 With the output of the export script, you should be able to create a PanVA
 instance for your dataset using the instructions from