diff --git a/CHANGELOG.md b/CHANGELOG.md
index a6a8837cce0d79a0f6b4aec0cf7eac7c65af5033..14d2e35060869104dd1c62e746075df3cb3416eb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ All notable changes to Pantools will be documented in this file.
 
 ### Changed
 - No longer accept tab characters in fasta input files (!232).
+- Updated and reordered the tutorial section of the documentation (!228).
 
 ### Fixed
 - `pantools blast` no longer crashes if not all mrna nodes contain protein IDs (!226).
diff --git a/docs/source/analysis/classification.rst b/docs/source/analysis/classification.rst
index 235bd21c77852d62f6e204b633feb4f775bc667e..18138fe4faca9ee304bfd2b46e4d374c9b20e49f 100644
--- a/docs/source/analysis/classification.rst
+++ b/docs/source/analysis/classification.rst
@@ -243,7 +243,7 @@ simulation.
 
 Heaps' law (a power law) can be fitted to the number of new genes
 observed when increasing the pangenome by one random genome. The formula
-for the power law model is :math:`n = k * N^{-a}`, where *n* is the
+for the power law model is :math:`\Delta n = k * N^{-a}`, where *Δn* is the
 newly discovered genes, *N* is the total number of genomes, and *k* and
 *a* are the fitting parameters. A pangenome can be considered open when
 *a* < 1 and closed if *a* > 1.
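+
+As a hedged illustration with made-up numbers (not taken from a real fit):
+with :math:`k = 500` and :math:`a = 0.6`, the 50th added genome is expected
+to contribute :math:`\Delta n = 500 \cdot 50^{-0.6} \approx 48` new genes,
+and since *a* < 1 such a pangenome would be considered open.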
diff --git a/docs/source/analysis/explore.rst b/docs/source/analysis/explore.rst
index 9cbe53d65f09eba75c1f91f5396bd31866a222fe..5924ac7d1e48901ffe387588077e00cd7f856fab 100644
--- a/docs/source/analysis/explore.rst
+++ b/docs/source/analysis/explore.rst
@@ -41,7 +41,7 @@ members.
      * - <databaseDirectory>
        - Path to the database root directory.
      * - <homologyFile>
-       - A text file with homology group node identifiers, seperated by a
+       - A text file with homology group node identifiers, separated by a
          comma.
 
 **Options**
@@ -180,10 +180,10 @@ annotation node and extract the nucleotide and protein sequence.
 
      * - ``--functions``
        - One or multiple function identifiers (GO, InterPro, PFAM,
-         TIGRFAM), seperated by a comma.
+         TIGRFAM), separated by a comma.
      * - ``--nodes``/``-n``
        - One or multiple identifiers of function nodes (GO, InterPro,
-         PFAM, TIGRFAM), seperated by a comma.
+         PFAM, TIGRFAM), separated by a comma.
      * - ``--include``/``-i``
        - Only include a selection of genomes.
      * - ``--exclude``/``-e``
@@ -283,9 +283,9 @@ specific GO terms of the hierarchy to a sequence.
      :widths: 30 70
 
      * - ``--functions``
-       - One or multiple GO term identifiers, seperated by a comma.
+       - One or multiple GO term identifiers, separated by a comma.
      * - ``--nodes``/``-n``
-       - One or multiple identifiers of 'GO' nodes, seperated by a comma.
+       - One or multiple identifiers of 'GO' nodes, separated by a comma.
 
 **Example commands**
   .. code:: bash
@@ -322,9 +322,9 @@ their location in the hierarchy is reported.
      :widths: 30 70
 
      * - ``--functions``
-       - One or multiple GO term identifiers, seperated by a comma.
+       - One or multiple GO term identifiers, separated by a comma.
      * - ``--nodes``/``-n``
-       - One or multiple identifiers of 'GO' nodes, seperated by a comma.
+       - One or multiple identifiers of 'GO' nodes, separated by a comma.
      * - ``--include``/``-i``
        - Only include a selection of genomes.
      * - ``--exclude``/``-e``
@@ -378,7 +378,7 @@ Report all available information of one or multiple homology groups.
          To find Phobius (P) or SignalP (S) annotations, include:
          'secreted' (P/S), 'receptor' (P/S), or 'transmembrane' (P).
      * - ``--genes``/``-g``
-       - One or multiple gene names, seperated by a comma.
+       - One or multiple gene names, separated by a comma.
      * - ``--node``
        - Retrieve the nucleotide nodes belonging to genes in homology groups
 
diff --git a/docs/source/analysis/go_enrichment.rst b/docs/source/analysis/go_enrichment.rst
index 71d2c7b1e4f5407d413ad4d6fc6fa8f0a1dd391c..00ea8c6cc66bec16a130cc3d3d2b854cff7e6629 100644
--- a/docs/source/analysis/go_enrichment.rst
+++ b/docs/source/analysis/go_enrichment.rst
@@ -69,9 +69,9 @@ the second and third rank this will be 0.0010 and 0.0015, respectively.
      :widths: 30 70
 
      * - ``--homology-file``/``-H``
-       - A text file with homology group node identifiers, seperated by a comma.
+       - A text file with homology group node identifiers, separated by a comma.
      * - ``--nodes``/``-n``
-       - mRNA node identifiers, seperated by a comma on the command line.
+       - mRNA node identifiers, separated by a comma on the command line.
      * - ``--include``/``-i``
        - Only include a selection of genomes.
      * - ``--exclude``/``-e``
diff --git a/docs/source/analysis/phylogeny.rst b/docs/source/analysis/phylogeny.rst
index 4df7d04dd188f8c614d0349d31a258994dca558c..f90c3885be3a0d53d082b7b56958e2db521c36fa 100644
--- a/docs/source/analysis/phylogeny.rst
+++ b/docs/source/analysis/phylogeny.rst
@@ -695,7 +695,7 @@ for the visualization of phylogenies in iTOL. Phenotypes must already be
 included in the pangenome with the
 :ref:`add_phenotypes <construction/annotate:add phenotypes>` functionality. How
 to use the template files in iTOL can be found in one of the
-:ref:`tutorials <tutorial/tutorial_part5:create itol templates>`.
+:ref:`tutorials <tutorial/tutorial_part3:create itol templates>`.
 
 If you run this function without a ``--phenotype`` argument, templates
 are created for trees that contain only genome numbers as node labels.
diff --git a/docs/source/construction/annotate.rst b/docs/source/construction/annotate.rst
index 76786703b3a2e143f8110ffb7bcc8ca58035aa41..249f9ecb8523c50876508deea24ac96b7a57cf22 100644
--- a/docs/source/construction/annotate.rst
+++ b/docs/source/construction/annotate.rst
@@ -40,6 +40,9 @@ selected and the second annotation of genome 2 and 3.
   if the GFF3 file contains a trans-spliced gene that has alternative
   splicing, it will not be able to handle it (it will only annotate one
   mRNA).
+  The PanUtils project contains a :ref:`data filtering pipeline
+  <getting_started/diy_pangenomics:Quality control pipeline>` that fixes
+  the most common issues encountered by the parser.
 
 **Parameters**
   .. list-table::
diff --git a/docs/source/construction/group.rst b/docs/source/construction/group.rst
index e0c80dccb7ff90e47841f24756e274fe3c12967a..5d87e4b2711e2d2c6b5bbdef12eb6321d29c34cc 100644
--- a/docs/source/construction/group.rst
+++ b/docs/source/construction/group.rst
@@ -148,7 +148,7 @@ argument to the command.
 | **What BUSCO benchmark set to use**
 
   -  When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9
-     set, and untar it with ``tar -xvzf``. Include the entire directory in
+     set, and untar it with ``tar xvzf``. Include the entire directory in
      the command using the ``--input-file`` argument.
   -  For **BUSCO v4** or **v5**, you only have to provide the odb10 database
      name with the ``--input-file`` argument, the database is downloaded
@@ -493,5 +493,5 @@ settings that were used.
   -  **current_pantools_homology_groups.txt**, overview of the active
      homology groups. Each line represents one homology group. The line
      starts with the homology group (database) identifier followed by a
-     colon and the rest are mRNA IDs (from gff/genbank) seperated by a
+     colon and the rest are mRNA IDs (from gff/genbank) separated by a
      space.
diff --git a/docs/source/developer_guide/install.rst b/docs/source/developer_guide/install.rst
index 0144964c79d2182127f533db0b6fcdd6b7f24a6e..c0820be5fff059198bbd3a4d0e133f18bc907f2a 100644
--- a/docs/source/developer_guide/install.rst
+++ b/docs/source/developer_guide/install.rst
@@ -56,7 +56,7 @@ downloaded from https://github.com/refresh-bio/KMC/releases.
 
 .. code:: bash
 
-   $ tar -xvzf KMC* #uncompress the KMC binaries
+   $ tar xvzf KMC* #uncompress the KMC binaries
 
    # Edit your ~/.bashrc to include KMC to your PATH
    $ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
@@ -76,7 +76,7 @@ https://micans.org/mcl under License & software.
 .. code:: bash
 
    $ wget https://micans.org/mcl/src/mcl-14-137.tar.gz
-   $ tar -xvzf mcl-*
+   $ tar xvzf mcl-*
    $ cd mcl-14-137
    $ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer
    $ make install
@@ -405,7 +405,7 @@ The following commands can be used to test and build the documentation:
 .. code:: bash
 
    # Test the documentation
-   sphinx-lint sphinx-lint -e=all --max-line-length=80
+   sphinx-lint -e=all --max-line-length=80
 
    # Build the documentation
    sphinx-build -W docs/source output
diff --git a/docs/source/getting_started/diy_pangenomics.rst b/docs/source/getting_started/diy_pangenomics.rst
new file mode 100644
index 0000000000000000000000000000000000000000..89dba3ba9bdfa3f7faf76c9d7878e986cf86a5b9
--- /dev/null
+++ b/docs/source/getting_started/diy_pangenomics.rst
@@ -0,0 +1,114 @@
+DIY Pangenomics
+===============
+
+Most likely, you want to do pangenomics using your own data, on your own
+system. To support this, we offer several tools:
+
+* `PanTools <https://git.wur.nl/bioinformatics/pantools>`_: core application
+  to construct and analyze pangenomes
+
+* `PanUtils <https://github.com/PanUtils>`_: utilities to support
+  pangenomics (e.g. QC)
+
+* `PanBrowse <https://github.com/PanBrowse>`_: applications for pangenome
+  visualization (e.g. PanVA)
+
+When working with these software tools, consider the following tips:
+
+* Separate issues in the (technical) setup from issues in the data, so try
+  the tool on known (test) data first.
+
+* Garbage in, garbage out! Therefore, perform rigorous quality control on the
+  input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline.
+
+* Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before
+  scaling up.
+
+PanTools
+--------
+
+PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package
+for pangenome construction and analysis. It works across the tree of life,
+from small viruses to large plant, animal, or human genomes. Because it is
+not alignment-based but employs a compacted De Bruijn graph (cDBG), it
+works over larger evolutionary distances, e.g. at the species, genus or
+family level.
+
+
+PanUtils
+--------
+
+We have been working on several pangenome utilities, collected under
+https://github.com/PanUtils. This organization currently contains three
+pipelines: one for quality control, one for pangenome construction and
+analysis, and one for preparing a PanVA instance.
+
+Quality control pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+This is a Snakemake pipeline for running quality control on your input
+data. We highly recommend this pipeline for checking your files, as
+annotation files can be very tricky to work with (and therefore sometimes
+cause issues for tools like PanTools). The pipeline can be found at
+https://github.com/PanUtils/pantools-qc-pipeline, which includes a README
+on how to install and run it. In summary, one needs to create a conda
+environment with a recent version of Snakemake (8.25.5 at the time of
+writing). Please follow the instructions on which configuration to set and
+where all files should be placed.
+
+The pipeline has four different workflows:
+
+* raw_statistics: creates an overview of the statistics of the input data
+
+* filter: can be used for filtering genome and annotation files
+
+* proteins: creates protein files including statistics
+
+* functions: creates functional annotations for all proteins in the protein
+  files
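+
+As a hedged sketch of the overall setup (the exact Snakemake targets and
+configuration are described in the pipeline's README; the commands below
+are illustrative, not verified):
+
+.. code:: bash
+
+   # get the pipeline and a Snakemake environment
+   $ git clone https://github.com/PanUtils/pantools-qc-pipeline.git
+   $ cd pantools-qc-pipeline
+   $ conda create -n snakemake -c conda-forge -c bioconda snakemake
+   $ conda activate snakemake
+
+   # adjust the configuration as described in the README, then run e.g.
+   # the filter workflow (assuming workflows are exposed as targets)
+   $ snakemake --cores 4 filter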
+
+
+PanTools v4 pipeline
+^^^^^^^^^^^^^^^^^^^^
+
+The PanTools Snakemake pipeline runs through all major
+PanTools functionalities. You can use this pipeline for inspiration on what
+can be done with PanTools or to actually run PanTools.
+Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code,
+including a README that summarizes how to install, configure and run it.
+
+.. warning::
+ It is not recommended to use this pipeline with new experimental data;
+ adjustments to data and PanTools command settings might have to be made
+ during the process, for which this pipeline lacks the flexibility.
+
+The pipeline has three different workflows:
+
+* all_panproteome: runs through all major PanTools functionalities available
+  for a panproteome
+* all_pangenome: runs through all major PanTools functionalities available for
+  a pangenome
+* panva: runs all PanTools functions necessary to set up a PanVA instance
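+
+Installation and invocation are analogous to the QC pipeline above. A hedged
+sketch, assuming the workflow is selected as the Snakemake target (check the
+README for the actual invocation):
+
+.. code:: bash
+
+   $ snakemake --cores 16 all_pangenome   # or all_panproteome / panva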
+
+PanVA conversion
+^^^^^^^^^^^^^^^^
+
+If you have already created a pangenome database with PanTools and want to
+convert it to the data structure required for setting up a PanVA instance,
+please use this repository: https://github.com/PanUtils/export-to-panva. It
+contains a README on how to install and run it. In summary, one needs to
+create an environment with mamba from a predefined YAML file, configure the
+settings in a config.ini file, and then run the Python script, which takes
+the config.ini as input.
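+
+A hedged sketch of those steps (the YAML file and script names are
+placeholders, not the real names; take those from the repository's README):
+
+.. code:: bash
+
+   $ git clone https://github.com/PanUtils/export-to-panva.git
+   $ cd export-to-panva
+   $ mamba env create -f <environment>.yml
+   $ mamba activate <environment-name>
+   $ python <export_script>.py config.ini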
+
+PanBrowse
+---------
+
+We work on methods for the (interactive) visualization of complex
+pangenomes. PanBrowse (https://github.com/PanBrowse) is where we host the
+visualization applications.
+
+PanVA
+^^^^^
+
+`PanVA <https://github.com/PanBrowse/PanVA>`_ is a web application
+allowing users to visually and interactively explore sequence variants
+in pangenomes (generated by PanTools). It provides context for these
+variants by displaying their corresponding annotations, phylogenetic
+and phenotypic information.
diff --git a/docs/source/getting_started/technical_setup.rst b/docs/source/getting_started/technical_setup.rst
new file mode 100644
index 0000000000000000000000000000000000000000..4010e0a97fe7b5b7859d0332ec0c61442f7c2642
--- /dev/null
+++ b/docs/source/getting_started/technical_setup.rst
@@ -0,0 +1,137 @@
+Technical setup validation
+==========================
+
+Performance is important when running PanTools on larger datasets. Therefore,
+it is recommended to test your PanTools installation on one or more of the
+test data sets below, before applying it to your own data.
+
+PanTools pangenome construction commands make a large number of
+transactions to the database, so for larger pangenomes the database should
+be built on a location that can handle this. We recommend using an SSD or a
+RAM-disk (/dev/shm/, available on Linux machines). If pangenome
+construction takes too much time, the location of the database might be the
+issue. For reference, building a small yeast data set of 10 genomes takes a
+little over two minutes on our RAM-disk, while it takes nearly 30 minutes
+on a network disk! Larger datasets exacerbate this gap in runtime.
+
+The disk space needed for construction depends on the input data (genome
+size and number of genomes): for 8 *Arabidopsis* genomes ±20GB should
+suffice, for 3 lettuce genomes ±250GB. The Java heap space needed depends
+on the same parameters; in general it is advisable to set it larger than
+expected, because Java running out of heap space makes the tool crash
+(e.g. for lettuce we used -Xmx200g -Xms200g).
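+
+As a hedged illustration of building on the RAM-disk (the yeast data set and
+heap sizes are those of the first test case below; the /dev/shm path is only
+an example location):
+
+.. code:: bash
+
+   # check how much space the RAM-disk offers (Linux)
+   $ df -h /dev/shm
+
+   # write the pangenome database to the RAM-disk, with a generous Java heap
+   $ pantools -Xms20g -Xmx50g build_pangenome -t16 /dev/shm/yeast_db \
+        yeast_10strains/metadata/genome_locations.txt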
+
+.. warning::
+ If you want to do everything in a single directory, make sure to download the
+ data on the disk where you want to write the output. If the input data is on
+ a network disk, adjust the commands to write the pangenome database on a
+ local disk (RAM or SSD).
+
+Ten yeast genomes
+-----------------
+
+Download the data from:
+https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
+
+.. code:: bash
+
+   $ wget https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
+   $ tar xvzf yeast_10strains.tgz
+
+Build the pangenome:
+
+.. code:: bash
+
+   $ pantools -Xms20g -Xmx50g build_pangenome -t16 yeast_db yeast_10strains/metadata/genome_locations.txt
+
+Add annotations:
+
+.. code:: bash
+
+   $ pantools -Xms20g -Xmx50g add_annotations yeast_db yeast_10strains/metadata/annotation_locations.txt
+
+Group proteins:
+
+.. code:: bash
+
+   $ pantools -Xms20g -Xmx50g group yeast_db -t16 --relaxation=3
+
+Run time statistics (16 threads on RAM-disk):
+
+.. csv-table::
+   :file: /tables/runtime_statistics_yeast.csv
+   :header-rows: 1
+   :delim: ,
+
+Two *Arabidopsis* genomes
+-------------------------
+
+Start this use case with a bash script that downloads the data. Get the
+script from
+https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh
+and run it to download the *Arabidopsis* data:
+
+.. code:: bash
+
+   $ wget https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh
+   $ bash download_2_ara.sh <threads>
+
+Build the pangenome:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=17 ara_db a_thaliana_2genomes/metadata/genome_locations.txt
+
+Add annotations:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g add_annotations ara_db a_thaliana_2genomes/metadata/annotation_locations.txt
+
+Group proteins:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g group ara_db -t50 --relaxation=3
+
+Run time statistics (50 threads on RAM-disk):
+
+.. csv-table::
+   :file: /tables/runtime_statistics_arabidopsis.csv
+   :header-rows: 1
+   :delim: ,
+
+Three tomato genomes
+--------------------
+
+Download the data from here:
+https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz
+
+Download and extract the tomato data:
+
+.. code:: bash
+
+   $ wget https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz
+   $ tar xvzf tomato_3genomes.tar.gz
+
+Build the pangenome:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=13 tomato_db tomato_3genomes/metadata/genome_locations.txt
+
+Add annotations:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g add_annotations tomato_db tomato_3genomes/metadata/annotation_locations.txt
+
+Group proteins:
+
+.. code:: bash
+
+   $ pantools -Xms40g -Xmx80g group tomato_db -t50 --relaxation=3
+
+Run time statistics (50 threads on RAM-disk):
+
+.. csv-table::
+   :file: /tables/runtime_statistics_tomato.csv
+   :header-rows: 1
+   :delim: ,
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6ec402e36fad3a3b49ae974a5684fee6d23fdc61..023b48b8d6efd81b7958b4951fa174b43b807237 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -91,12 +91,24 @@ Contents
    :caption: Getting Started
    :maxdepth: 1
 
+   getting_started/diy_pangenomics
    Install <getting_started/install>
+   getting_started/technical_setup
    getting_started/differences
    getting_started/workflows
    getting_started/selection
    getting_started/query
 
+.. toctree::
+   :caption: Tutorial PanTools
+   :maxdepth: 1
+
+   Tutorial 1 - Pangenome construction <tutorial/tutorial_part1>
+   Tutorial 2 - Pangenome analysis <tutorial/tutorial_part2>
+   Tutorial 3 - Phylogeny <tutorial/tutorial_part3>
+   Tutorial 4 - Visualization (PanVA) <tutorial/tutorial_part4>
+   Tutorial 5 - Neo4j browser <tutorial/tutorial_part5>
+
 .. toctree::
    :caption: Pangenome Construction
    :maxdepth: 1
@@ -123,16 +135,6 @@ Contents
    analysis/support
    analysis/dn_ds
 
-.. toctree::
-   :caption: Tutorial
-   :maxdepth: 1
-
-   Tutorial 1 - Install PanTools <tutorial/tutorial_part1>
-   Tutorial 2 - Construct pangenome <tutorial/tutorial_part2>
-   Tutorial 3 - Neo4j browser <tutorial/tutorial_part3>
-   Tutorial 4 - Characterization <tutorial/tutorial_part4>
-   Tutorial 5 - Phylogeny <tutorial/tutorial_part5>
-
 .. toctree::
    :caption: Developer guide
    :maxdepth: 1
@@ -140,4 +142,3 @@ Contents
    Installation <developer_guide/install>
    Release steps <developer_guide/release>
    Database structure <developer_guide/database>
-
diff --git a/docs/source/tables/runtime_statistics_arabidopsis.csv b/docs/source/tables/runtime_statistics_arabidopsis.csv
new file mode 100644
index 0000000000000000000000000000000000000000..e0a849e7f4f4d1a1572959e9fb7f33dae24d0e9b
--- /dev/null
+++ b/docs/source/tables/runtime_statistics_arabidopsis.csv
@@ -0,0 +1,4 @@
+command,running time (h:m:s),CPU time (s)
+build_pangenome,0:27:38,3102.41
+add_annotations,0:00:52,93.56
+group,0:02:25,507.60
\ No newline at end of file
diff --git a/docs/source/tables/runtime_statistics_tomato.csv b/docs/source/tables/runtime_statistics_tomato.csv
new file mode 100644
index 0000000000000000000000000000000000000000..3eb9c21b6f0cdcecacdab2561f22ab8955177fba
--- /dev/null
+++ b/docs/source/tables/runtime_statistics_tomato.csv
@@ -0,0 +1,4 @@
+command,running time (h:m:s),CPU time (s)
+build_pangenome,20:28:33,257610.72
+add_annotations,0:01:35,181.19
+group,0:03:02,5861.83
\ No newline at end of file
diff --git a/docs/source/tables/runtime_statistics_yeast.csv b/docs/source/tables/runtime_statistics_yeast.csv
new file mode 100644
index 0000000000000000000000000000000000000000..d478542cdbc2981b1a3f54c112ebcf479add7fe4
--- /dev/null
+++ b/docs/source/tables/runtime_statistics_yeast.csv
@@ -0,0 +1,4 @@
+command,running time (h:m:s),CPU time (s)
+build_pangenome,0:02:23,276.50
+add_annotations,0:00:35,51.82
+group,0:01:58,1351.73
\ No newline at end of file
diff --git a/docs/source/tutorial/tutorial_part1.rst b/docs/source/tutorial/tutorial_part1.rst
index 939f06fa013c18acf3a1a50f929bb3a664ed13dc..68d89890b5492c472684216699c4434322cf0440 100644
--- a/docs/source/tutorial/tutorial_part1.rst
+++ b/docs/source/tutorial/tutorial_part1.rst
@@ -1,5 +1,191 @@
-Part 1. Install PanTools
-========================
+Part 1. Build your own pangenome using PanTools
+===============================================
 
-For instructions on how to install PanTools,
-see :doc:`/getting_started/install`.
+Install PanTools from bioconda following the
+:ref:`install <getting_started/install:Install from bioconda>`
+instructions.
+
+To demonstrate the main functionalities of PanTools we use a small
+chloroplast dataset to avoid long construction times.
+
+.. csv-table::
+   :file: /tables/chloroplast_datasets.csv
+   :header-rows: 1
+   :delim: ;
+
+Download the chloroplast fasta and gff files
+`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
+or via wget.
+
+.. code:: bash
+
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
+   $ tar xvzf chloroplasts.tar.gz #unpack the archive
+
+--------------
+
+BUILD, ANNOTATE and GROUP
+-------------------------
+
+We start with building a pangenome using four of the five chloroplast
+genomes. For this you need a text file which directs PanTools to the
+FASTA files. Call your text file **genome_locations.txt** and include
+the following lines:
+
+.. code:: text
+
+   YOUR_PATH/C_sativus.fasta
+   YOUR_PATH/O_sativa.fasta
+   YOUR_PATH/S_lycopersicum.fasta
+   YOUR_PATH/S_tuberosum.fasta
+
+Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The text
+file should only contain full paths to FASTA files, no additional spaces or
+empty lines. Then run PanTools with the
+:ref:`build_pangenome <construction/build:build pangenome>` function and
+include the text file
+
+.. code:: bash
+
+   $ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt
+
+Did the program run without any error messages? Congratulations, you’ve
+built your first pangenome!
+
+Adding additional genomes
+~~~~~~~~~~~~~~~~~~~~~~~~~
+PanTools has the ability to add additional genomes to an already
+existing pangenome. To test the function of PanTools, prepare a text
+file containing the path to the Maize chloroplast genome. Call your
+text file **fifth_genome_location.txt** and include the following
+line to the file:
+
+.. code:: text
+
+   YOUR_PATH/Z_mays.fasta
+
+Run PanTools on the new text file using the
+:ref:`add_genomes <construction/build:add genomes>` function
+
+.. code:: bash
+
+   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt
+
+Adding annotations
+~~~~~~~~~~~~~~~~~~
+
+To add gene annotations to the pangenome, prepare a text file
+containing paths to the GFF files. Call your text file
+**annotation_locations.txt** and include the following lines in the
+file:
+
+.. code:: text
+
+   1 YOUR_PATH/C_sativus.gff3
+   2 YOUR_PATH/O_sativa.gff3
+   3 YOUR_PATH/S_lycopersicum.gff3
+   4 YOUR_PATH/S_tuberosum.gff3
+   5 YOUR_PATH/Z_mays.gff3
+
+Run PanTools using the :ref:`add_annotations <construction/annotate:add
+annotations>` function and include the new text file. Also add the
+``--connect`` flag to attach the annotations to the nucleotide nodes;
+this is useful for exploring the pangenome in the Neo4j browser later
+in the tutorial.
+
+.. code:: bash
+
+   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
+
+Homology grouping
+~~~~~~~~~~~~~~~~~
+
+PanTools can infer homology between the protein sequences of a pangenome
+and cluster them into homology groups with the
+:ref:`group <construction/group:group>` function. The ``--relaxation``
+parameter must be set to control the sensitivity. If you don't know the
+best setting for the grouping relaxation, you can use the
+:ref:`busco_protein <construction/group:busco protein>` and
+:ref:`optimal grouping <construction/group:optimal grouping>` functions
+instead, but for now we will use a relaxation of 4.
+
+.. code:: bash
+
+   $ pantools group --relaxation=4 chloroplast_DB
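+
+If you are unsure which relaxation suits your own data, the
+:ref:`busco_protein <construction/group:busco protein>` and
+:ref:`optimal grouping <construction/group:optimal grouping>` manuals
+describe how to estimate it. As a hedged sketch (the BUSCO set and the
+exact arguments are placeholders, not verified commands):
+
+.. code:: bash
+
+   $ pantools busco_protein --input-file=<odb10_set> chloroplast_DB
+   $ pantools optimal_grouping chloroplast_DB <buscoDirectory>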
+
+--------------
+
+Adding phenotypes
+-----------------
+
+Phenotype values can be integer, double, string or boolean values.
+Create a text file **phenotypes.txt**.
+
+.. code:: text
+
+   Genome,Solanum
+   1,false
+   2,false
+   3,true
+   4,true
+   5,false
+
+And use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add the
+information to the pangenome.
+
+.. code:: bash
+
+   $ pantools add_phenotypes chloroplast_DB phenotypes.txt
+
+--------------
+
+RETRIEVE functions
+------------------
+
+Now that the construction is complete, let's quickly validate that the
+construction was successful and the database can be used. To retrieve
+some genomic regions, prepare a text file containing genomic
+coordinates. Create the file **regions.txt** and include the following
+for each region: genome number, contig number, start and stop position,
+separated by a single space
+
+.. code:: text
+
+   1 1 200 500
+   2 1 300 700
+   3 1 1 10000
+   3 1 1 10000 -
+   4 1 9999 15000
+   5 1 100000 110000
+
+Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
+function and include the new text file
+
+.. code:: bash
+
+   $ pantools retrieve_regions chloroplast_DB regions.txt
+
+Take a look at the extracted regions that are written to the
+**chloroplast_DB/retrieval/regions/** directory.
+
+To retrieve entire genomes, prepare a text file **genome_numbers.txt**
+and include each genome number on a separate line in the file
+
+.. code:: text
+
+   1
+   3
+   5
+
+Use the **retrieve_regions** function again but include the new text
+file
+
+.. code:: bash
+
+   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt
+
+Genome files are written to the same directory as before. Take a look at one
+of the three genomes you have just retrieved.
+
+In :doc:`part 2 <tutorial_part2>` of the tutorial we analyze a bacterial
+pangenome using the various PanTools functionalities.
diff --git a/docs/source/tutorial/tutorial_part2.rst b/docs/source/tutorial/tutorial_part2.rst
index f8fa9079322847b0ae859a3e1898956f6a53f8f5..d9331df7d2e3d7bd6ae92ec4f048606e02b7d4be 100644
--- a/docs/source/tutorial/tutorial_part2.rst
+++ b/docs/source/tutorial/tutorial_part2.rst
@@ -1,188 +1,435 @@
-Part 2. Build your own pangenome using PanTools
-===============================================
+Part 2. Pangenome Analysis
+==========================
 
-To demonstrate the main functionalities of PanTools we use a small
-chloroplasts dataset to avoid long construction times.
+Input data
+----------
 
 .. csv-table::
-   :file: /tables/chloroplast_datasets.csv
+   :file: /tables/bacteria_datasets.csv
    :header-rows: 1
    :delim: ;
 
-Download the chloroplast fasta and gff files
-`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
-or via wget.
+To demonstrate how to use the PanTools functionalities we use a small
+dataset of six bacteria to avoid long runtimes. Download a
+pre-constructed pangenome or test your new skills and construct a
+pangenome yourself using the fasta and gff files.
+
+Option 1: Download separate genome and annotation files
 
 .. code:: bash
 
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
-   $ tar -xvzf chloroplasts.tar.gz #unpack the archive
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz
+   $ tar xvzf pecto_dickeya_input.tar.gz
+   $ gzip -d pecto_dickeya_input/annotations/*
+   $ gzip -d pecto_dickeya_input/genomes/*
+   $ gzip -d pecto_dickeya_input/functions/*
+
+   $ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt
+   $ pantools add_annotations pecto_dickeya_DB pecto_dickeya_input/annotations.txt
+   $ pantools group --relaxation=4 pecto_dickeya_DB
+
+Option 2: Download the pre-constructed pangenome
+
+.. code:: bash
 
-We assume a PanTools alias was set during the
-:ref:`installation <getting_started/install:set pantools alias>`. This allows
-PanTools to be executed with ``pantools`` rather than the full path to the jar
-file. If you don’t have an alias, either set one or replace the pantools
-command with the full path to the .jar file in the tutorials.
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
+   $ tar xvzf pecto_dickeya_DB.tar.gz
 
 --------------
 
-BUILD, ANNOTATE and GROUP
--------------------------
+Adding phenotype/metadata to the pangenome
+------------------------------------------
 
-We start with building a pangenome using four of the five chloroplast
-genomes. For this you need a text file which directs PanTools to the
-FASTA files. Call your text file **genome_locations.txt** and include
-the following lines:
+Before starting with the analysis, we will add some phenotype data to
+the pangenome. Phenotypes allow you to find similarities for a group of
+genomes sharing a phenotype, as well as to identify variation between
+different phenotypes. Below is a text file with data for three
+phenotypes. The third phenotype, *low_temperature*, is in this case a
+made-up example! It states whether the strain is capable of growing at
+(extremely) low temperatures. The phenotype file can be found inside the
+database directory, or you can create a new file using the text from the
+box below. Add the phenotype information to the pangenome using
+:ref:`add_phenotypes <construction/annotate:add phenotypes>`.
 
 .. code:: text
 
-   YOUR_PATH/C_sativus.fasta
-   YOUR_PATH/O_sativa.fasta
-   YOUR_PATH/S_lycopersicum.fasta
-   YOUR_PATH/S_tuberosum.fasta
+   Genome, species, strain_name, low_temperature
+   1,P. odoriferum,P. odoriferum Q166, false
+   2,P. fontis, P. fontis M022, true
+   3,P. polaris,P. polaris S4.16.03.2B, false
+   4,P. brasiliense, P. brasiliense S2, true
+   5,P. brasiliense, P. brasiliense Y49, false
+   6,D. dadantii, D. dadantii 3937,?
 
-Make sure that ‘*YOUR_PATH*’ is the full path to the input files! Then
-run PanTools with the :ref:`build_pangenome <construction/build:build
-pangenome>` function and include the text file
+.. code:: bash
+
+   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt
+
+--------------
+
+Metrics and general statistics
+------------------------------
+
+After building or uncompressing the pangenome, run the
+:ref:`metrics <analysis/metrics:metrics>` functionality to produce
+various statistics that help you verify the construction was error-free.
 
 .. code:: bash
 
-   $ pantools build_pangenome chloroplast_DB genome_locations.txt
+   $ pantools metrics pecto_dickeya_DB
 
-Did the program run without any error messages? Congratulations, you’ve
-built your first pangenome! If not? Make sure your Java version is up to
-date and kmc is executable. The text file should only contain full paths
-to FASTA files, no additional spaces or empty lines.
+Open **metrics_per_genome.csv** with a spreadsheet tool (Excel,
+LibreOffice, Google Sheets) and make sure the columns are split on
+commas. You may notice the many empty columns in this table, as these
+types of annotations or features are not included in the database (yet).
+Functional annotations are incorporated later in this tutorial. Columns
+for features like exon and intron will remain empty because bacterial
+coding sequences are not interrupted.
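+
+If you prefer the command line over a spreadsheet tool, a quick way to
+inspect such a comma-separated table is with standard Unix tools (the file
+is written somewhere inside the database directory; locate it first):
+
+.. code:: bash
+
+   $ column -s, -t metrics_per_genome.csv | less -S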
 
-Adding additional genomes
-~~~~~~~~~~~~~~~~~~~~~~~~~
-PanTools has the ability to add additional genomes to an already
-existing pangenome. To test the function of PanTools, prepare a text
-file containing the path to the Maize chloroplast genome. Call your
-text file **fifth_genome_location.txt** and include the following
-line to the file:
+--------------
 
-.. code:: text
+Gene classification
+-------------------
 
-   YOUR_PATH/Z_mays.fasta
+With the :ref:`gene_classification <analysis/classification:gene
+classification>` functionality you are able to organize the gene repertoire
+into the core, accessory or unique part of the pangenome.
 
-Run PanTools on the new text file and use the
-:ref:`add_genomes <construction/build:add genomes>` function
+-  **Core**, a gene is present in all genomes
+-  **Unique**, a gene is present in a single genome
+-  **Accessory**, a gene is present in some but not all genomes
 
 .. code:: bash
 
-   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt
+   $ pantools gene_classification pecto_dickeya_DB
 
-Adding annotations
-To include gene annotations to the pangenome, prepare a text file
-containing paths to the GFF files. Call your text file
-**annotation_locations.txt** and include the following lines into the
-file:
+.. figure:: /figures/classify_table.png
+   :width: 600
+   :align: center
 
-.. code:: text
+Take a look in **gene_classification_overview.txt**. Here you can find
+the number of classified homology groups and genes at the pangenome level
+as well as for individual genomes.
+
+Open **additional_copies.csv** with a spreadsheet tool. This file can be
+useful to identify duplicated genes in relation to other genomes.
+
+The default criterion to call a group core is presence in all genomes,
+whereas unique groups may only be present in one genome. These two
+categories are highly influenced by annotation quality, especially in
+large pangenomes. Luckily, the threshold for core and unique groups can
+easily be adjusted. Let’s consider genes to be core when present in only
+five of the six genomes by setting the ``--core-threshold`` argument.
+
+.. code:: bash
+
+   $ pantools gene_classification --core-threshold=85 pecto_dickeya_DB
+
+Look in **gene_classification_overview.txt** again to observe the
+increase of core groups/genes at the cost of accessory groups.
+
+For this pangenome, the *Dickeya* genome is considered an outgroup to
+the five *Pectobacterium* genomes. While this outgroup is needed to root
+and analyze phylogenetic trees (tutorial part 3), it affects the number of
+classified groups for all other genomes. Use ``--include`` or
+``--exclude`` to leave out the *Dickeya* genome.
+
+.. code:: bash
+
+   $ pantools gene_classification --include=1-5 pecto_dickeya_DB
+   $ pantools gene_classification --exclude=6 pecto_dickeya_DB
+
+Take a look in **gene_classification_overview.txt** one more time to
+examine the effect of excluding this genome. The total number of groups
+in the analysis is now lower, but the number of core and unique genes
+has increased for the five remaining genomes.
+
+When phenotype information is used in the analysis, three additional
+categories can be assigned to a group:
+
+-  **Shared**, a gene present in all genomes of a phenotype
+-  **Exclusive**, a gene is only present in a certain phenotype
+-  **Specific**, a gene present in all genomes of a phenotype and is
+   also exclusive
+
+Include a ``--phenotype`` argument to find genes that are exclusive to
+a certain species.
+
+.. code:: bash
+
+   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
+
+Open **gene_classification_phenotype_overview.txt** to see the number of
+classified groups for the species phenotype.
+
+Open **phenotype_disrupted.csv** in a spreadsheet tool. This file
+explains exactly why a homology group is labeled as phenotype shared
+and not specific.
+
+Open **phenotype_additional_copies.csv** in a spreadsheet tool.
+Similarly to *phenotype_additional.csv*, this file shows groups where all
+genomes of a certain phenotype have additional gene copies compared to (at
+least one of) the other phenotypes.
 
-   1 YOUR_PATH/C_sativus.gff3
-   2 YOUR_PATH/O_sativa.gff3
-   3 YOUR_PATH/S_lycopersicum.gff3
-   4 YOUR_PATH/S_tuberosum.gff3
-   5 YOUR_PATH/Z_mays.gff3
+Each time you run the
+:ref:`gene_classification <analysis/classification:gene classification>`
+function, multiple files are created that contain node identifiers of a certain
+homology group category. These files can be given to other PanTools
+functions for a downstream analysis, for example, sequence alignment,
+phylogeny, or GO enrichment. We will use one of the files later in this
+tutorial.
 
-Run PanTools using the :ref:`add_annotations <construction/annotate:add
-annotations>` function and include the new text file
+--------------
+
+Pangenome structure
+-------------------
+
+With the previous functionality we identified the core, accessory and
+unique parts of the pangenome. Now we will use the
+:ref:`pangenome_structure <analysis/classification:pangenome structure>`
+function to observe how these numbers are reached by simulating the growth
+of the pangenome. Simulating the growth helps explain whether a pangenome
+should be considered open or closed. A pangenome is called open as long as
+a significant number of new (unique) genes is added to the total gene
+repertoire. The openness of a pangenome is usually tested using Heaps'
+law.
 
 .. code:: bash
 
-   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
+   $ pantools pangenome_structure pecto_dickeya_DB
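+
+On a realistically sized pangenome you would typically raise the number of
+threads and simulation loops. A hedged example, assuming ``-t`` sets the
+threads and ``--loops`` the number of loops (check the pangenome structure
+manual for the exact option names):
+
+.. code:: bash
+
+   $ pantools pangenome_structure -t24 --loops=10000 pecto_dickeya_DB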
 
-PanTools attached the annotations to our nucleotide nodes so now we can
-cluster them.
+The current test set of six bacterial genomes is not representative of a
+full-sized pangenome. Therefore we prepared the results of the
+structure simulation on a set of 197 *Pectobacterium* genomes. The
+runtime of the analysis using 10,000 loops and 24 threads was 1 minute
+and 54 seconds. Download the files below, unpack the archive and take a
+look at them.
+
+.. code:: bash
 
-Homology grouping
-~~~~~~~~~~~~~~~~~
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz
+   $ tar xvzf pectobacterium_structure.tar.gz
 
-PanTools can infer homology between the protein sequences of a
-pangenome and cluster them into homology groups. Multiple parameters
-can be set to influence the sensitivity but for now we use the
-:ref:`group <construction/group:group>` functionality with default
-settings.
+You still have to run the R scripts to create the output figures
+and determine the openness of the pangenome.
 
 .. code:: bash
 
-   $ pantools group chloroplast_DB
+   $ cd pectobacterium_structure
+   $ Rscript pangenome_growth.R
+   $ Rscript gains_losses_median_and_average.R
+   $ Rscript heaps_law.R
+
+Take a look at the plots. In **core_accessory_unique_size.png**, the
+number of classified groups is plotted for every genome combination that
+occurred during the simulation. In the **core_accessory_size.png** plots,
+the number of unique groups is combined with the accessory groups.
+
+The **gains_losses.png** files display the average and median group gain
+and loss between different pangenome sizes. The line of the core starts
+below zero, meaning that for every random genome added, the core genome
+decreases by a number of genes.
+
+.. raw:: html
+
+   <details>
+   <summary>
+      The alpha value of heaps' law is written to
+      <b><i>heaps_law_alpha.txt</i></b>. Click here to reveal the result.
+   </summary>
+   <p>
+   <br>
+      With an alpha of 0.53 the <i>Pectobacterium</i> pangenome has an
+      open structure, which is typical for many bacterial species due to
+      their extensive exchange of genetic information. Performing this
+      analysis on a set of closely related animal or plant genomes usually
+      results in a closed pangenome, indicating a limited gene pool.
+   </p>
+   </details>
 
 --------------
 
-Adding phenotypes (requires PanTools v3)
-----------------------------------------
+Functional annotations
+----------------------
 
-Phenotype values can be Integers, Double, String or Boolean values.
-Create a text file **phenotypes.txt**.
+PanTools is able to incorporate functional annotations into the
+pangenome by reading the output of various functional annotation tools. In
+this tutorial we only include annotations from InterProScan. Please see
+the :ref:`add_functions <construction/annotate:add functions>`
+manual to check which other tools are supported. To include the
+annotations, create a file **functions.txt** using the text from the box
+below and pass it as a command-line argument.
 
 .. code:: text
 
-   Genome,Solanum
-   1,false
-   2,false
-   3,true
-   4,true
-   5,false
+   1 YOUR_PATH/GCF_002904195.1.gff3
+   2 YOUR_PATH/GCF_000803215.1.gff3
+   3 YOUR_PATH/GCF_003595035.1.gff3
+   4 YOUR_PATH/GCF_000808375.1.gff3
+   5 YOUR_PATH/GCF_000808115.1.gff3
+   6 YOUR_PATH/GCA_000147055.1.gff3
+
+.. code:: bash
+
+   $ pantools add_functions pecto_dickeya_DB functions.txt
+
+PanTools will ask you to download the InterPro database. Follow the
+steps and execute the program again.
+
+The complete GO, PFAM, InterPro and TIGRFAM databases are now
+integrated in the graph database. Genes with a predicted function
+have gained a relationship to the corresponding function node. Retrieving
+a set of genes that share a function is now possible through a simple
+Cypher query. If you run **metrics** again, statistics for these types of
+functional annotation are calculated. To create a summary table for
+each type of functional annotation, run
+:ref:`function_overview <construction/annotate:function overview>`.
+
+.. code:: bash
+
+   $ pantools function_overview pecto_dickeya_DB
+
+In **function_overview_per_group.csv** you can navigate to a homology
+group or gene to see the connected functions. You can also search in the
+opposite direction: use one of the created overview files for a type of
+functional annotation and quickly navigate to a function of interest to
+find which genes are connected to it.
+
+GO enrichment
+~~~~~~~~~~~~~
+
+We go back to the output files from gene classification that only
+contain node identifiers. We can retrieve group functions by providing
+one of the files to
+:ref:`group_info <analysis/explore:group info>` with the
+``--homology-file``/``-H`` argument. However, interpreting groups by
+assessing each one individually is not very practical. A common
+approach to discover interesting genes from a large set is
+GO enrichment. This statistical method enables the identification of
+genes sharing a function that are significantly over- or
+under-represented in a given gene set compared to the rest of the
+genome. Let's perform a :ref:`analysis/go_enrichment:go enrichment`
+on the homology groups specific to *P. brasiliense*. Save the group
+identifiers from the box below to a file called **brasiliense_groups.csv**.
+
+.. code:: text
+
+   Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes
+   1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946
+
+.. code:: bash
+
+   $ pantools group_info -H brasiliense_groups.csv pecto_dickeya_DB
+   $ pantools go_enrichment -H brasiliense_groups.csv pecto_dickeya_DB
+
+Open **go_enrichment.csv** with a spreadsheet tool. This file holds the GO
+terms found in at least one of the genomes, the p-value of the
+statistical test and whether the term is actually enriched after the
+multiple testing correction. As this differs per genome, a function
+might be enriched in one genome but not in another.
+
+A directory with separate output files is created for each genome; open
+**go_enrichment.csv** for genome 4 or 5 in a spreadsheet. Also take
+a look at the PDF files that visualize part of the Gene Ontology
+hierarchy.
+
+Classifying functional annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-And use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add the
-information to the pangenome.
+Similarly to classifying gene content, functional annotations can be
+categorized using
+:ref:`functional_classification <analysis/classification:functional classification>`.
+This tool provides an easy way to identify functions shared by a group
+of genomes of a certain phenotype but can also be used to identify
+core or unique functions. The functionality uses the same set of
+arguments as **gene_classification**. You can go through the same
+steps again to see the effect of changing the arguments.
 
 .. code:: bash
 
-   $ pantools add_phenotypes chloroplast_DB phenotypes.txt
+   $ pantools functional_classification pecto_dickeya_DB
+   $ pantools functional_classification --core-threshold=85 pecto_dickeya_DB
+   $ pantools functional_classification --exclude=6 pecto_dickeya_DB
+   $ pantools functional_classification --phenotype=species pecto_dickeya_DB
 
 --------------
 
-RETRIEVE functions
+Sequence alignment
 ------------------
 
-Now that the construction is complete, lets quickly validate if the
-construction was successful and the database can be used. To retrieve
-some genomic regions, prepare a text file containing genomic
-coordinates. Create the file **regions.txt** and include the following
-for each region: genome number, contig number, start and stop position
-and separate them by a single space
+In the final part of this tutorial we will test the alignment function
+by aligning homology groups. PanTools is able to align genomic regions,
+genes and proteins to identify SNPs or amino acid changes with
+:doc:`msa </analysis/msa>`.
 
-.. code:: text
+Start with the alignment of protein sequences from the 12 *P.
+brasiliense* specific homology groups.
+
+.. code:: bash
+
+   $ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
 
-   1 1 200 500
-   2 1 300 700
-   3 1 1 10000
-   3 1 1 10000 -
-   4 1 9999 15000
-   5 1 100000 110000
+Go to the **pecto_dickeya_DB/alignments/grouping_v1/groups/** directory
+and select one of the homology groups. Check whether you can find the
+following files
 
-Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
-function and include the new text file
+-  The alignments are written to **prot_trimmed.fasta** and
+   **prot_trimmed.afa**.
+-  A gene tree is written to **prot_trimmed.newick**
+-  **prot_trimmed_variable_positions.csv** located in the
+   **var_inf_positions** subdirectory. This matrix holds every variable
+   position of the alignment; the rows are the positions in the alignment
+   and the columns are the 20 amino acids and gaps.
+-  The identity and similarity (for proteins) are calculated between
+   sequences and written to tables in the **similarity_identity**
+   subdirectory.
+
+Run the function again but include the ``--no-trimming`` argument.
 
 .. code:: bash
 
-   $ pantools retrieve_regions chloroplast_DB regions.txt
+   $ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
 
-Take a look at the extracted regions that are written to the
-**chloroplast_DB/retrieval/regions/** directory.
+The output files are now generated right after the first alignment,
+without trimming the sequences first. The file names differ from those of
+the trimmed alignments by the absence of the '*_trimmed*' substring.
 
-To retrieve entire genomes, prepare a text file **genome_numbers.txt**
-and include each genome number on a separate line in the file
+Run the function again but exclude the ``--mode=protein`` and
+``--no-trimming`` arguments. When no additional arguments are included
+in the command, both nucleotide and protein sequences are aligned two
+consecutive times.
 
-.. code:: text
+.. code:: bash
+
+   $ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
 
-   1
-   3
-   5
+Again, the same type of files is generated, but the output files from the
+nucleotide sequences can be recognized by the '*nuc\_*' substring. The
+matrix in **nuc_trimmed_variable_positions.csv** now only has columns
+for the four nucleotides and gaps.
 
-Use the **retrieve_regions** function again but include the new text
-file
+Finally, run the function one more time but include a phenotype. This
+allows you to identify phenotype specific SNPs or amino acid changes.
 
 .. code:: bash
 
-   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt
+   $ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB
+
+Open the **nuc**- or **prot_trimmed_phenotype_specific_changes.info**
+file inside one of the homology group output directories.
+
+--------------
+
+Besides the functionalities in this tutorial, PanTools has more useful
+functions that may aid you in retrieving more specific information from
+the pangenome.
 
-Genome files are written to same directory as before. Take a look at one
-of the three genomes you have just retrieved.
+-  Identify shared k-mers between genomes with
+   :ref:`kmer_classification <analysis/classification:k-mer classification>`.
+-  Find co-localized genes in a set of homology groups:
+   :ref:`locate_genes <analysis/explore:locate genes>`.
+-  Map short reads against the pangenome with :ref:`map
+   <analysis/mapping:map>`.
 
-In :doc:`part 3 <tutorial_part3>` of the tutorial we explore the
-pangenome you just built using the Neo4j browser and the Cypher language.
+In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the
+phylogenetic methods implemented in PanTools.
diff --git a/docs/source/tutorial/tutorial_part3.rst b/docs/source/tutorial/tutorial_part3.rst
index 93949bde0d367c790d43df68222521b1f098a242..dfb755355e7225df89b0923bfd61d0c3a6b7dcee 100644
--- a/docs/source/tutorial/tutorial_part3.rst
+++ b/docs/source/tutorial/tutorial_part3.rst
@@ -1,291 +1,259 @@
-Part 3. Explore the pangenome using the Neo4j browser
-=====================================================
+Part 3. Phylogeny
+=================
 
-Did you skip :doc:`part 2 <tutorial_part2>` of the tutorial or were you
-unable to build the chloroplast pangenome? Download the pre-constructed
-pangenome
-`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
-or via wget.
+Part 3 preparation
+------------------
+
+If you did not follow part 2 of the tutorial, download the
+pre-constructed pangenome
+`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
 
 .. code:: bash
 
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
-   $ tar -xvzf chloroplast_DB.tar.gz
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
+   $ tar xvzf pecto_dickeya_DB.tar.gz
 
 --------------
 
-Configuring Neo4j
------------------
+Adding phenotype/metadata to the pangenome
+------------------------------------------
 
-Set the full path to the chloroplast pangenome database by opening
-neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the
-following line in the config file. Please make sure there is always only
-a single uncommented line with 'dbms.directories.data'.
+Before we construct the trees, we will add some phenotype data to the
+pangenome. Once we have a phylogeny, this information can be included in
+the tree or used to color parts of it. Below is a text file with data for
+three phenotypes. The third phenotype, *low_temperature*, is in this
+case a made-up example! It states whether the strain is capable of
+growing at (extremely) low temperatures. The phenotype file can be found
+inside the database directory; add the information to the pangenome by
+using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
 
 .. code:: text
 
-   #dbms.directories.data=/YOUR_PATH/any_other_database
-   dbms.directories.data=/YOUR_PATH/chloroplast_DB
-
-| **Allowing non-local connections**
-| To be able to run Neo4j on a server and have access to it from
-  anywhere, some additional lines in the config file must be changed.
-
--  **Uncomment** the four following lines in
-   neo4j-community-3.5.30/conf/neo4j.conf.
--  Replace 7686, 7474, and 7473 by three different numbers that are not
-   in use by other people on your server. In this way, everyone can have
-   their own database running at the same time.
-
-.. code:: text
-
-   #dbms.connectors.default_listen_address=0.0.0.0
-   #dbms.connector.bolt.listen_address=:7687
-   #dbms.connector.http.listen_address=:7474
-   #dbms.connector.https.listen_address=:7473
-
-Lets start up the Neo4j server!
-
-.. code:: bash
-
-   $ neo4j start
-
-Start Firefox (or a web browser of your own preference) and let it run
-on the background.
+   Genome, species, strain_name, low_temperature
+   1,P. odoriferum,P. odoriferum Q166, false
+   2,P. fontis, P. fontis M022, true
+   3,P. polaris,P. polaris S4.16.03.2B, false
+   4,P. brasiliense, P. brasiliense S2, true
+   5,P. brasiliense, P. brasiliense Y49, false
+   6,D. dadantii, D. dadantii 3937,?
 
 .. code:: bash
 
-   $ firefox &
-
-In case you did not change the config to allow non-local connections,
-browse to *http://localhost:7474*. Whenever you did change the config
-file, go to *server_address:7474*, where 7474 should be replaced with the
-number you chose earlier.
-
-If the database startup was successful, a login terminal will appear in
-the webpage. Use '*neo4j*' both as username and password. After logging
-in, you are requested to set a new password.
+   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt
 
 --------------
 
-Exploring nodes and edges in Neo4j
-----------------------------------
-
-Go through the following steps to become proficient in using the Neo4j
-browser and the underlying PanTools data structure. If you have any
-difficulty trouble finding a node, relationship or any type of
-information, download and use `this visual guide
-<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
-
-1.  Click on the database icon on the left. A menu with all node types
-    and relationship types will appear.
-2.  Click on the '*gene*' button in the node label section. This
-    automatically generated a query. Execute the query.
-3.  The **LIMIT** clause prevents large numbers of nodes popping up to
-    avoid your web browser from crashing. Set LIMIT to 10 and execute the
-    query.
-4.  Hover over the nodes, click on them and take a look at the values
-    stored in the nodes. All these features (except ID) were extracted
-    from the GFF annotation files. ID is an unique number automatically
-    assigned to nodes and relationships by Neo4j.
-5.  Double-click on the **matK** gene node, all nodes with a connection
-    to this gene node will appear. The nodes have distinct colors as
-    these are different node types, such as **mRNA**, **CDS**,
-    **nucleotide**. Take a look at the node properties to observe that
-    most values and information is specific to a certain node type.
-6.  Double-click on the *matK* mRNA node, a **homology_group** node
-    should appear. These type of nodes connect homologous genes in the
-    graph. However, you can see this gene did not cluster with any other
-    gene.
-7.  Hover over the **start** relation of the *matK* gene node. As you
-    can see information is not only stored in nodes, but also in
-    relationships! A relationship always has a certain direction, in
-    this case the relation starts at the gene node and points to a
-    nucleotide node. Offset marks the location within the node.
-8.  Double-click on the **nucleotide** node at the end of the 'start'
-    relationship. An in- and outgoing relation appear that connect to
-    other nucleotide nodes. Hover over both the relations and compare
-    them. The relations holds the genomic coordinates and shows this
-    path only occurs in contig/sequence 1 of genome 1.
-9.  Follow the outgoing **FF**-relationship to the next nucleotide node
-    and expand this node by double-clicking. Three nodes will pop up
-    this time. If you hover over the relations you see the coordinates
-    belong to other genomes as well. You may also notice the
-    relationships between nucleotide nodes is always a two letter
-    combination of F (forward) and R (reverse) which state if a sequence
-    is reverse complemented or not. The first letter corresponds to the
-    sequence of the node at the start of the relation where the second
-    letters refers to the sequence of the end node.
-10. Finally, execute the following query to call the database scheme to
-    see how all node types are connected to each other: *CALL
-    db.schema()*. The schema will be useful when designing your own
-    queries!
-
---------------
-
-Query the pangenome database using CYPHER
------------------------------------------
-
-Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
-represent patterns. Nodes are represented by circles and relationships
-by arrows.
-
--  The **MATCH** clause allows you to specify the patterns Neo4j will
-   search for in the database.
--  With **WHERE** you can add constraints to the patterns described.
--  In the **RETURN** clause you define which parts of the pattern to
-   display.
+Constructing a phylogeny
+------------------------
 
-Cypher queries
-~~~~~~~~~~~~~~
+In this tutorial we will construct three phylogenies, each based on a
+different type of variation: SNPs, genes and k-mers. Take a look at the
+phylogeny manuals to get an understanding of how the three methods work
+and how they differ from each other.
 
-**Match and return 100 nucleotide nodes**
+1. :ref:`analysis/phylogeny:core phylogeny`
+2. :ref:`analysis/phylogeny:gene distance tree`
+3. :ref:`analysis/phylogeny:k-mer distance tree`
 
-.. code:: text
-
-   MATCH (n:nucleotide) RETURN n LIMIT 100
+Core SNP phylogeny
+~~~~~~~~~~~~~~~~~~
 
-**Find all the genome nodes**
+The core SNP phylogeny will run various Maximum Likelihood models on
+parsimony-informative sites of single-copy orthologous sequences. A site
+is parsimony-informative when there are at least two types of
+nucleotides that occur with a minimum frequency of two. The informative
+sites are automatically identified by aligning the sequences; however,
+PanTools does not know in advance which sequences are single-copy
+orthologous. You can identify these conserved sequences by running
+:ref:`gene_classification <analysis/classification:gene classification>`.
 
-.. code:: text
+.. code:: bash
 
-   MATCH (n:genome) RETURN n
+   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
 
-**Find the pangenome node**
+Open **gene_classification_overview.txt** and take a look at the
+statistics. As you can see, there are 2134 single-copy ortholog groups.
+Normally, all of these groups are aligned to identify SNPs, but for this
+tutorial we’ll select only a few groups to speed up the following steps.
+You can do this in two different ways:
 
-.. code:: text
+Option 1: Open **single_copy_orthologs.csv** and remove all node
+identifiers after the first 20 homology groups, then save the file.
 
-   MATCH (n:pangenome) RETURN n
+.. code:: bash
 
-**Match and return 100 genes**
+   $ pantools core_phylogeny --mode=ML pecto_dickeya_DB
 
-.. code:: text
+Option 2: Open **single_copy_orthologs.csv** and select the first 20
+homology_group node identifiers. Place them in a new file, sco_groups.txt,
+and pass this file to the function, for example as sketched below.
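+
+The exact location and layout of **single_copy_orthologs.csv** are not spelled
+out in this tutorial; assuming it sits in the **gene_classification/** output
+directory and lists the node identifiers comma-separated on a single line, the
+selection of Option 2 could be made with a one-liner like this sketch:
+
+.. code:: bash
+
+   # keep only the first 20 comma-separated homology group identifiers
+   $ cut -d ',' -f 1-20 pecto_dickeya_DB/gene_classification/single_copy_orthologs.csv > sco_groups.txt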
 
-   MATCH (g:gene) RETURN g LIMIT 100
+.. code:: bash
 
-**Match and return 100 genes and order them by length**
+   $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB
 
-.. code:: text
+The sequences of the homology groups are aligned twice. After the
+initial alignment, input sequences are trimmed based on the longest
+start and end gap of the alignment. The parsimony-informative positions
+are taken from the second alignment and concatenated into a single
+sequence per genome. When opening **informative.fasta** you will find
+six sequences; their length equals the number of parsimony-informative
+sites.
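+
+If you want to verify this from the command line, the following generic shell
+commands (not part of the PanTools workflow) count the sequences and report
+the length of each one:
+
+.. code:: bash
+
+   # number of sequences in the concatenated alignment
+   $ grep -c '>' pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta
+   # length of each (possibly line-wrapped) sequence
+   $ awk '/^>/{if (s) print h, length(s); h=$1; s=""} !/^>/{s=s $0} END{print h, length(s)}' \
+       pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta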
 
-   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
+.. code:: bash
 
-**The same query as before but results are now returned in a table**
+   $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000
 
-.. code:: text
+IQ-tree generates several files; the tree we will continue with later in
+this tutorial is called **informative.fasta.treefile**. The
+**informative.fasta.iqtree** file reports the best-fit model for the
+data. This file also shows the number of sites that were used, as sites
+with gaps (which IQ-tree does not allow) were changed into singleton or
+constant sites.
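+
+As a quick check, you can pull the selected model straight from the report;
+this assumes the report contains a "Best-fit model" line, as recent IQ-tree
+versions write it:
+
+.. code:: bash
+
+   $ grep 'Best-fit model' pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.iqtree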
 
-   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
+Gene distance tree
+~~~~~~~~~~~~~~~~~~
 
-**Return genes which are longer as 100 but shorter than 250 bp** (this
-can also be applied to other features such as exons introns or CDS)
+To create a phylogeny based on gene distances (absence/presence), we can
+simply execute the Rscript that was created by
+:ref:`gene_classification <analysis/classification:gene classification>`.
 
-.. code:: text
+.. code:: bash
 
-   MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100
+   $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R
 
-**Find genes located on first genome**
+The resulting tree is called **gene_distance.tree**.
 
-.. code:: text
+K-mer distance tree
+~~~~~~~~~~~~~~~~~~~
 
-   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
+To obtain a k-mer distance phylogeny, the k-mers must first be counted
+with the
+:ref:`kmer_classification <analysis/classification:k-mer classification>`
+function. Afterwards, the tree can be constructed by executing the
+Rscript.
 
-**Find genes located on first genome and first sequence**
+.. code:: bash
 
-.. code:: text
+   $ pantools kmer_classification pecto_dickeya_DB
+   $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
 
-   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
+The resulting tree is written to **genome_kmer_distance.tree**.
 
 --------------
 
-Homology group queries
-~~~~~~~~~~~~~~~~~~~~~~
+Renaming tree nodes
+-------------------
 
-**Return 100 homology groups**
-
-.. code:: text
+So far, we used three different types of distances (SNPs, genes,
+k-mers) and two different methods (ML, NJ) to create three phylogenetic
+trees. First, let’s take a look at the text files (see the commands
+below). The **informative.fasta.treefile** only contains genome numbers,
+bootstrap values and branch lengths, but lacks any metadata. The
+**gene_distance.tree** file shows this information as well, plus the
+species names, because we included species as a phenotype during
+:ref:`gene_classification <analysis/classification:gene classification>`.
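+
+A quick way to look at the Newick strings is to print the files. The location
+of **gene_distance.tree** is assumed here to be the gene_classification output
+directory, next to the Rscript that produced it:
+
+.. code:: bash
+
+   $ cat pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
+   $ cat pecto_dickeya_DB/gene_classification/gene_distance.tree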
 
-   MATCH (h:homology_group) RETURN h LIMIT 100
+Let’s add the strain identifiers to the core SNP tree to make the
+final figure more informative. Use the
+:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
+rename the tree nodes.
 
-**Match homology groups which contain two members**
-
-.. code:: text
-
-   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
-
-**Match homology groups and 'walk' to the genes and corresponding start
-and end node**
-
-.. code:: text
-
-   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-
-Turn off autocomplete by clicking on the button on the bottom right. The
-graph falls apart because relations were not assigned to variables.
-
-**The same query as before but now the relations do have variables**
-
-.. code:: text
-
-   MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-
-When you turn off autocomplete again only the '*is_similar_to*' relation
-disappears since we did not call it
-
-**Find homology group that belong to the rpoC1 gene**
-
-.. code:: text
-
-   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
-
-**Find genes on genome 1 which don't show homology**
+.. code:: bash
 
-.. code:: text
+   $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
 
-   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
+Take a look at **informative.fasta_RENAMED.treefile**; the strain
+identifiers have been added to the tree.
 
 --------------
 
-Structural variant detection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-**Find SNP bubbles (for simplification we only use the FF relation)**
-
-.. code:: text
-
-   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
-
-**The same query but returning the results in a table**
-
-.. code:: text
+Visualizing the tree in iTOL
+----------------------------
+
+Go to https://itol.embl.de and click on “Upload a tree” under the
+**ANNOTATE** box. On this page you can paste the tree directly into the
+**tree text:** textbox or click the button to upload the .newick file.
+
+.. figure:: /figures/tutorial_phylogeny1.png
+   :width: 600
+   :align: center
+
+Basic iTOL controls
+-------------------
+
+-  The default way of visualizing a tree is the rectangular view.
+   Depending on the number of genomes, the circular view can be easier
+   to interpret. You can change the view by clicking on the “Display
+   Mode” buttons.
+-  Increase the font size and branch width to improve readability.
+-  When visualizing a Maximum Likelihood (ML) tree, bootstrap values can
+   be displayed by clicking the “Display” button next to
+   **Bootstrap/metadata** in the Advanced tab of the Control window.
+   This enables you to visualize the values as text or a symbol on the
+   branch, or by coloring the branch or adjusting its width.
+
+.. figure:: /figures/tutorial_phylogeny2.png
+   :width: 600
+   :align: center
+
+-  When you have a known outgroup or one of the genomes is a clear
+   outlier in the tree, you should reroot the tree. Hover over the name
+   and click it so a pop-up menu appears. Click “tree structure” followed
+   by “Reroot the tree here”.
+
+.. figure:: /figures/tutorial_phylogeny3.png
+   :width: 600
+   :align: center
+
+-  Clicking on the name of a node in the tree allows you to color the
+   name, branch, or background of that specific node.
+-  When you’re happy with the way your tree looks, go to the Export tab
+   of the Control window. Select the desired output format, click on the
+   “Full image” button and export the tree to a figure.
+-  Refresh the webpage to go back to the default view of your tree.
+
+Create iTOL templates
+---------------------
+
+In iTOL it is possible to add colors to the tree by coloring the
+terminal nodes or adding an outer ring. The PanTools function
+:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
+to create templates that allow for easy coloring (with a maximum of 20
+possible colors). If the function is run without any additional
+argument, templates are created for trees that only contain genome
+numbers (e.g. the k-mer distance tree). Here we want to color the
+(renamed) core SNP tree with the ‘low_temperature’ phenotype. Therefore,
+``--phenotype=strain_name`` must be included in the command, so that the
+template labels match the renamed tree.
 
-   MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
-
-Functions such as **count()**, **sum()** and **stDev()** can be used in
-a query.
-
-**The same SNP query but count the hits instead of displaying them**
-
-.. code:: text
+.. code:: bash
 
-   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
+   $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
+   $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
 
---------------
+Copy the two low_temperature.txt files from the label/strain_name/ and
+ring/strain_name/ directories to your personal computer (see the sketch
+below). Then drag and drop the ring template file onto the tree
+visualization webpage.
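+
+If you are working on a remote server, a generic way to copy the templates to
+your own machine is ``scp``; the user, server and template path below are
+placeholders, so point them to wherever create_tree_template wrote its output:
+
+.. code:: bash
+
+   # run these on your personal computer
+   $ scp <user>@<server>:<path-to-templates>/label/strain_name/low_temperature.txt .
+   $ scp <user>@<server>:<path-to-templates>/ring/strain_name/low_temperature.txt .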
 
-Hopefully you know have some feeling with the Neo4j browser and cypher
-and you're inspired to create your own queries!
+.. figure:: /figures/tutorial_phylogeny4.png
+   :width: 600
+   :align: center
 
-When you're done working in the browser, close the database (by using
-the command line again).
+The resulting tree should look like this when the tree is rooted with
+the *Dickeya* genome, bootstrap values are displayed as text and the
+ring color template is included.
 
-.. code:: text
+.. figure:: /figures/tutorial_phylogeny_color.png
+   :width: 600
+   :align: center
 
-   $ neo4j stop
+Tree coloring is especially useful for large datasets. An example is
+shown in the figure below, where members of the same species share a
+color.
 
-| More information on Neo4j and the cypher language:
-| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
-| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
-| `Neo4j API <https://neo4j.com/developer/>`_
+.. figure:: /figures/tutorial_phylogeny_tree.png
+   :width: 600
+   :align: center
 
 --------------
-
-In :doc:`part 4 <tutorial_part4>` of the tutorial we explore some of the
-functionalities to analyze the pangenome.
diff --git a/docs/source/tutorial/tutorial_part4.rst b/docs/source/tutorial/tutorial_part4.rst
index fafb5b4680c75697f6c2603f900fc7d5e69dd53f..ea5c20ead32184a25db9716b8db213d928947888 100644
--- a/docs/source/tutorial/tutorial_part4.rst
+++ b/docs/source/tutorial/tutorial_part4.rst
@@ -1,464 +1,212 @@
-Part 4. Characterization
-========================
+Part 4 - Visualization using PanVA
+==================================
 
-Part 4 preparation
-------------------
+This part guides you through the process of getting from raw data to a
+PanVA instance.
 
-PanTools v3 is required to follow this part of the tutorial. In
-addition, MAFFT and R (and a few packages) need to be installed and set
-to your $PATH. Everything should already be correctly installed if you
-use the conda environment. Validate if the tools are executable by using
-the following commands.
 
-.. code:: bash
-
-   $ Rscript --help
-   $ mafft -h
-
-We assume a PanTools alias was set during the
-:ref:`installation <getting_started/install:set pantools alias>`. This allows
-PanTools to be executed with ``pantools`` rather than the full path to the jar
-file. If you don’t have an alias, either set one or replace the pantools
-command with the full path to the .jar file in the tutorials.
+Data Packages
+-------------
 
-Input data
-----------
+Currently one small data package is available, which contains all the
+information for creating a pangenome using PanTools and creating a PanVA
+instance from it, all the way from raw data to pangenome browser. This data
+package is different from the one in the performance test, so don't reuse
+that one here!
 
-.. csv-table::
-   :file: /tables/bacteria_datasets.csv
-   :header-rows: 1
-   :delim: ;
+  .. list-table::
+     :widths: 25 25 25 25
+     :header-rows: 1
 
-To demonstrate how to use the PanTools functionalities we use a small
-dataset of six bacteria to avoid long runtimes. Download a
-pre-constructed pangenome or test your new skills and construct a
-pangenome yourself using the fasta and gff files.
+     * - data type
+       - link
+       - contents
+       - database size
+     * - Fungi
+       - `Yeast <https://www.bioinformatics.nl/pangenomics/data/
+         yeast_panva.tar.gz>`_ [4.3K]
+       - 10 genomes
+       - 4G
 
-Option 1: Download separate genome and annotation files
 
-.. code:: bash
-
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz
-   $ tar -xvzf pecto_dickeya_input.tar.gz
-   $ gzip -d pecto_dickeya_input/annotations/*
-   $ gzip -d pecto_dickeya_input/genomes/*
-   $ gzip -d pecto_dickeya_input/functions/*
+Steps to generate PanVA input
+-----------------------------
 
-   $ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt
-   $ pantools add_annotations --connect pecto_dickeya_DB pecto_dickeya_input/annotations.txt
-   $ pantools group --relaxation=4 pecto_dickeya_DB
 
-Option 2: Download the pre-constructed pangenome
+1: Download the data package
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code:: bash
 
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
-   $ tar -xvzf pecto_dickeya_DB.tar.gz
-
---------------
+   $ wget https://www.bioinformatics.nl/pangenomics/data/<target-dataset>.tar.gz
+   $ tar xvzf <target-dataset>.tar.gz
 
-Adding phenotype/metadata to the pangenome
-------------------------------------------
+Every package contains a README with all the exact commands, so make sure to
+check it if you're stuck.
+For all packages, the README can be found at the root of the decompressed TAR
+file.
+The Snakemake pipelines used in this workflow create conda environments
+in the data package directory. If you want to re-use these pipelines for
+different data, use the ``--conda-prefix`` Snakemake option to set a directory
+where the conda environments will be stored (see the sketch below).
+All commands should be run from the root directory of the package.
+Run the whole package from a RAM disk or SSD, or point the result paths to a
+RAM disk/SSD in the configs.
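+
+As a sketch, ``--conda-prefix`` is simply appended to the Snakemake calls used
+later in this part; the environment directory below is a placeholder:
+
+.. code:: bash
+
+   $ snakemake --use-conda --conda-prefix /path/to/shared/conda-envs \
+       --snakefile pantools-qc-pipeline/workflow/Snakefile \
+       --configfile config/<target-dataset>_qc.yaml --cores <threads>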
 
-Before starting with the analysis, we will add some phenotype data to
-the pangenome. Phenotypes allow you to find similarities for a group of
-genomes sharing a phenotype as well as identifying variation between
-different phenotypes. Below is a textfile with data for three
-phenotypes. The third phenotype, *low_temperature*, is in this case a
-made up example! It states whether the strain is capable of growing on
-(extreme) low temperatures. The phenotype file can be found inside the
-database directory or create a new file using the text from the box
-below. Add the phenotype information to the pangenome using
-:ref:`add_phenotypes <construction/annotate:add phenotypes>`.
+2: Downloading data for the pangenome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: text
+Goal:
+   * Acquire genome and structural annotation data
 
-   Genome, species, strain_name, low_temperature
-   1,P. odoriferum,P. odoriferum Q166, false
-   2,P. fontis, P. fontis M022, true
-   3,P. polaris,P. polaris S4.16.03.2B, false
-   4,P. brasiliense, P. brasiliense S2, true
-   5,P. brasiliense, P. brasiliense Y49, false
-   6,D. dadantii, D. dadantii 3937,?
+To download the corresponding raw data for each of the above-linked packages,
+follow the first steps of the instructions laid out in the README file.
+The download command looks like this:
 
 .. code:: bash
 
-   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt
+    $ ./<target-dataset>_download.sh
 
---------------
+Make sure the ``gunzip`` command is available on your system.
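+
+You can quickly check whether it is installed, for example:
+
+.. code:: bash
+
+   $ command -v gunzip   # prints the path to gunzip if it is available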
 
-Metrics and general statistics
-------------------------------
 
-After building or uncompressing the pangenome, run the
-:ref:`metrics <analysis/metrics:metrics>` functionality to produce
-various statistics that should verify an errorless construction.
+3: Preprocessing the data for PanTools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: bash
-
-   $ pantools metrics pecto_dickeya_DB
-
-Open **metrics_per_genome.csv** with a spreadsheet tool (Excel,
-Libreoffice, Google sheets) and make sure the columns are split on
-commas. You may easily notice the many empty columns in this table as
-these type of annotations or features are not included in the database
-(yet). Functional annotations are incorporated later in this tutorial.
-Columns for features like exon and intron will remain empty as bacterial
-coding sequences are not interrupted.
+Goal:
+   * Filter genome sequences in the FASTA file on a minimum sequence size
+   * Filter CDS features in the annotation on a minimum ORF size
+   * Extract protein sequences by matching CDS features to genomic sequences
+   * Create functional annotations for the extracted protein sequences (optional)
+   * Generate statistics for raw and filtered data
 
---------------
+This requires the following three steps:
 
-Gene classification
--------------------
+**Step 1**: Clone the PanUtils data filtering pipeline
 
-With the :ref:`gene_classification <analysis/classification:gene
-classification>` functionality you are able to organize the gene repertoire
-into the core, accessory or unique part of the pangenome.
-
--  **Core**, a gene is present in all genomes
--  **Unique**, a gene is present in a single genome
--  **Accessory**, a gene is present in some but not all genomes
+The
+:ref:`data filtering pipeline <getting_started/diy_pangenomics:Quality control pipeline>`
+filters out small sequences, matches FASTA with GFF contents and removes CDS
+features with an ORF below the cutoff value. It also extracts protein
+sequences and creates functional annotations from them.
 
 .. code:: bash
 
-   $ pantools gene_classification pecto_dickeya_DB
-
-.. figure:: /figures/classify_table.png
-   :width: 600
-   :align: center
+   $ git clone https://github.com/PanUtils/pantools-qc-pipeline.git
 
-Take a look in **gene_classification_overview.txt**. Here you can find
-the number of classified homology groups and genes on a pangenome level
-but also for individual genomes.
+**Step 2**: Activate or create a Snakemake environment
 
-Open **additional_copies.csv** with a spreadsheet tool. This file can be
-useful to identify duplicated genes in relation to other genomes.
-
-The default criteria to call a group core is presence in all genomes
-where unique is only allowed to be present in one genome. These two
-categories are highly influenced by annotation quality, especially in
-large pangenomes. Luckily, the threshold for core and unique groups can
-easily be adjusted. Let’s consider genes to be core when present in only
-five of the six genomes by setting the ``--core-threshold`` argument.
+Activate or create a Snakemake environment (works with Python
+versions <= 3.11).
 
 .. code:: bash
 
-   $ pantools gene_classification --core-threshold=85 pecto_dickeya_DB
+   $ mamba create -c conda-forge -c bioconda -n snakemake snakemake=8
 
-Look in **gene_classification_overview.txt** again to observe the
-increase of core groups/genes at the cost of accessory groups.
+.. note::
 
-For this pangenome, the *Dickeya* genome is considered an outgroup to
-the five *Pectobacterium* genomes. While this outgroup is needed to root
-and analyze phylogenetic trees (tutorial part 5), it affects the number
-classified groups for the all other genomes. Use ``--include`` or
-``--exclude`` to exclude the *Dickeya* genome.
+    If you are using an ARM-based machine (such as an M4-based Mac), make sure
+    the new environment is compatible with Intel-based packages. Many
+    dependencies in conda are not yet compatible with ARM systems. Consider,
+    for example, installing *Rosetta 2*.
 
-.. code:: bash
+    .. code:: bash
 
-   $ pantools gene_classification --include=1-5 pecto_dickeya_DB
-   $ pantools gene_classification --exclude=6 pecto_dickeya_DB
+       $ softwareupdate --install-rosetta
 
-Take a look in **gene_classification_overview.txt** one more time to
-examine the effect of excluding this genome. The total number of groups
-in the analysis is lower now but the number of core and unique genes
-have increased for the five remaining genomes.
+    Please use this command to set up your environment:
 
-When phenotype information is used in the analysis, three additional
-categories can be assigned to a group:
+    .. code:: bash
 
--  **Shared**, a gene present in all genomes of a phenotype
--  **Exclusive**, a gene is only present in a certain phenotype
--  **Specific**, a gene present in all genomes of a phenotype and is
-   also exclusive
+       $ CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snakemake snakemake=8
 
-Include a ``--phenotype`` argument to find genes that are exclusive for
-a certain species.
+    This command ensures that packages are downloaded for an Intel architecture.
+    Afterwards, restart your shell with the "Open using Rosetta" setting
+    enabled. For this, go to Applications/Utilities/Terminal, click on
+    "Get Info" and select the option to open the terminal using Rosetta.
 
-.. code:: bash
+**Step 3**: Filter raw data and create functional annotations for protein
+sequences
 
-   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
-
-Open **gene_classification_phenotype_overview.txt** to see the number of
-classified groups for the species phenotype.
-
-Open **phenotype_disrupted.csv** in a spreadsheet tool. This file
-explains exactly why a homology groups is labeled as phenotype shared
-and not specific.
-
-Open **phenotype_additional_copies.csv** in a spreadsheet tool.
-Similarly to *phenotype_additional.csv* this file shows groups where all
-genomes of a certain phenotype have additional gene copies to (at least
-one of) the other phenotypes.
-
-Each time you run the
-:ref:`gene_classification <analysis/classification:gene classification>`
-function, multiple files are created that contain node identifiers of a certain
-homology group category. These files can be given to other PanTools
-functions for a downstream analysis, for example, sequence alignment,
-phylogeny, or GO enrichment. We will use one of the files later in this
-tutorial.
-
---------------
-
-Pangenome structure
--------------------
-
-With the previous functionality we identified the core, accessory and
-unique parts of the pangenome. Now we will use the
-:ref:`pangenome_size_genes <analysis/classification:pangenome structure>`
-function to observe how these numbers are reached by simulating the growth of
-the pangenome. Simulating the growth helps explaining if a pangenome should
-be considered open or closed. An pangenome is called open as long as a
-significant number of new (unique) genes are added to the total gene
-repertoire. The openness of a pangenome is usually tested using Heap’s
-law. Heaps’ law (a power law) can be fitted to the number of new genes
-observed when increasing the pangenome by one random genome. The formula
-for the power law model is n = k x N-a, where n is the newly discovered
-genes, N is the total number of genomes, and k and a are the fitting
-parameters. A pangenome can be considered open when a < 1 and closed if
-a > 1.
-
-The outcome of the function can again be controlled through command line
-arguments. Genomes can be excluded from the analysis with ``--exclude``.
-You can set the number of iterations with ``--loops``.
+Filter the raw data and create protein sequences:
 
 .. code:: bash
 
-   $ pantools pangenome_structure pecto_dickeya_DB
-
-The current test set of six bacterial genomes is not representative of a
-full-sized pangenome. Therefore we prepared the results for the
-structure simulation on a set of 197 *Pectobacterium* genomes. The
-runtime of the analysis using 10.000 loops and 24 threads was 1 minute
-and 54 seconds. Download the files here, unpack the archive and take a
-look at the files.
-
-.. code:: bash
+   $ snakemake --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads>
 
-   $ wget wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz
-   $ tar -xvf pectobacterium_structure.tar.gz
+4: Running all pangenome analyses using PanTools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Normally you still have to run the R scripts to create the output figures
-and determine the openness of the pangenome.
+Goal:
+   * Build the pangenome
+   * Add structural annotations, functional annotations and phenotypes
+   * Add VCF information or phasing information if available
+   * Create homology groups
+   * Run the necessary analysis steps (gene classification, k-mer
+     classification, multiple sequence alignment, group info) in order to create
+     a PanVA instance
 
-.. code:: bash
+**Step 1**: Clone the PanTools pipeline v4
 
-   cd pectobacterium_structure
-   $ Rscript pangenome_growth.R
-   $ Rscript gains_losses_median_and_average.R
-   $ Rscript heaps_law.R
-
-Take a look at the plot. In **core_accessory_unique_size.png**, the
-number of classified groups are plotted for any of the genome
-combination that occurred during the simulation. For the
-**core_accessory_size.png** plots, the number of unique groups is
-combined with accessory groups.
-
-The **gains_losses.png** files display the average and mean group gain
-and loss between different pangenome sizes. The line of the core starts
-below zero, meaning for every random genome added, the core genome
-decreases by a number of *X* genes.
-
-.. raw:: html
-
-   <details>
-   <summary>
-      The alpha value of heaps' law is written to
-      <b><i>heaps_law_alpha.txt</i></b>. Click here to reveal the result.
-   </summary>
-   <p>
-   <br>
-      With an alpha of 0.53 the <i>Pectobacterium</i> pangenome has an
-      open structure, which is typical for many bacterial species due
-      their extensive exchange of genetic information. Performing this
-      analysis on a set of closely related animal or plant genomes usually
-      results in a closed pangenome, indicating a limited gene pool.
-   </p>
-   </details>
-
---------------
-
-Functional annotations
-----------------------
-
-PanTools is able to incorporate functional annotations into the
-pangenome by reading output of various functional annotation tools. In
-this tutorial we only include annotations from InterProScan. Please see
-the :ref:`add_functions <construction/annotate:add functions>`
-manual to check which other tools are available. To include the annotations,
-create a file **functions.txt** using text from the box below and add it
-to the command line argument.
-
-.. code:: text
-
-   1 YOUR_PATH/GCF_002904195.1.gff3
-   2 YOUR_PATH/GCF_000803215.1.gff3
-   3 YOUR_PATH/GCF_003595035.1.gff3
-   4 YOUR_PATH/GCF_000808375.1.gff3
-   5 YOUR_PATH/GCF_000808115.1.gff3
-   6 YOUR_PATH/GCA_000147055.1.gff3
+The :ref:`PanTools pipeline
+<getting_started/diy_pangenomics:PanTools v4 pipeline>` contains all PanTools
+functions. We will use it here to create a pangenome and run all steps discussed
+above.
 
 .. code:: bash
 
-   $ pantools add_functions pecto_dickeya_DB functions.txt
+   $ git clone https://github.com/PanUtils/pantools-pipeline-v4.git
 
-PanTools will ask you to download the InterPro database. Follow the
-steps and execute the program again.
+**Step 2**: Run PanTools for PanVA-specific analyses
 
-The complete GO, PFAM, Interpro and TIGRFAM, databases are now
-integrated in the graph database after. Genes with a predicted function
-have gained a relationship to that function node. Retrieving a set of
-genes that share a function is now possible through a simple cypher
-query. If you would run **metrics** again, statistics for these type
-functional annotations are calculated. To create a summary table for
-each type of functional annotation, run
-:ref:`function_overview <construction/annotate:function overview>`.
+The Snakemake rule ``panva`` takes care of creating a pangenome and running
+all functions necessary to create a PanVA instance. All of these are started
+with a single command, outlined below.
 
 .. code:: bash
 
-   $ pantools function_overview pecto_dickeya_DB
-
-In **function_overview_per_group.csv** you can navigate to a homology
-group or gene to see the connected functions. You can also search in the
-opposite direction, use one of the created overview files for a type of
-functional annotation and quickly navigate to a function of interest to
-find which genes are connected.
+   $ snakemake panva --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml --cores <threads>
 
-GO enrichment
-~~~~~~~~~~~~~
+This will create a pangenome database from which PanVA files can be generated.
 
-We go back to the output files from gene classification that only
-contain node identifiers. We can retrieve group functions by providing
-one the files to
-:ref:`group_info <analysis/explore:group info>` with the
-``--homology-groups`` argument. However, interpreting groups by
-assessing each one individually is not very practical. A common
-approach to discover interesting genes from a large set is
-GO-enrichment. This statistical method enables the identification of
-genes sharing a function that are significantly over or
-under-represented in a given gene set compared to the rest of the
-genome. Let’s perform a :ref:`analysis/go_enrichment:go enrichment`
-on homology groups of the core genome.
+5: Generate input for PanVA instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: text
+Goal:
+   * Preprocess the data for PanVA
 
-   Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes
-   1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946
+**Step 1**: Clone the export-to-panva Python script
 
 .. code:: bash
 
-   $ pantools group_info -H brasiliense_groups.csv pecto_dickeya_DB
-   $ pantools go_enrichment -H brasiliense_groups.csv pecto_dickeya_DB
-
-Open **go_enrichment.csv** with a spreadsheet tool. This file holds GO
-terms found in at least one of the genomes, the p-value of the
-statistical test and whether it is actually enriched after the multiple
-testing correction. as this is different for each genome a function
-might enriched in one genome but not in another.
-
-A directory with seperate output files is created for each genome, open
-**go_enrichment.csv** for the genome 4 or 5 in a spreedsheet. Also take
-a look at the PDF files that visualize part of the Gene ontology
-hierarchy.
+   $ git clone https://github.com/PanUtils/export-to-panva.git
 
-Classifying functional annotations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Similarly to classifying gene content, functional annotations can be
-categorized using
-:ref:`functional_classification <analysis/classification:functional classification>`.
-This tool provides an easy way to identify functions shared by a group
-of genomes of a certain phenotype but can also be used to identify
-core or unique functions. The functionality uses the same set of
-arguments as **gene_classification**. You can go through the same
-steps again to see the effect of changing the arguments.
-
-.. code:: bash
-
-   $ pantools functional_classification pecto_dickeya_DB
-   $ pantools functional_classification --core-threshold=85 pecto_dickeya_DB
-   $ pantools functional_classification --exclude=6 pecto_dickeya_DB
-   $ pantools functional_classification --phenotype=species pecto_dickeya_DB
-
---------------
-
-Sequence alignment
-------------------
-
-In the final part of this tutorial we will test the alignment function
-by aligning homology groups. PanTools is able to align genomic regions,
-genes and proteins to identify SNPs or amino acid changes with
-:doc:`msa </analysis/msa>`.
-
-Start with the alignment of protein sequences from the 12 *P.
-brasiliense* specific homology groups.
-
-.. code:: bash
-
-   $ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
-
-Go to the **pecto_dickeya_DB/alignments/grouping_v1/groups/** directory
-and select one of homology groups and check if you can find the
-following files
-
--  The alignments are written to **prot_trimmed.fasta** and
-   **prot_trimmed.afa**.
--  A gene tree is written to **prot_trimmed.newick**
--  **prot_trimmed_variable_positions.csv** located in the
-   **var_inf_positions** subdirectory. This matrix holds every variable
-   position of the alignment; the rows are the position in the alignment
-   and the columns are the 20 amino acids and gaps.
--  The identity and similarity (for proteins) is calculated between
-   sequences and written to tables in the **similarity_identity**
-   subdirectory.
-
-Run the function again but include the ``--no-trimming`` argument.
+**Step 2**: Create a conda environment for the script
 
 .. code:: bash
 
-   $ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
-
-The output files are generated right after the first alignment without
-trimming the sequences first. The file names differ from the trimmed
-alignments by the ’*_trimmed*’ substring.
+   $ mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml
+   $ conda activate export-to-panva
 
-Run the function again but exclude the ``--mode=protein`` and
-``--no-trimming`` arguments. When no additional arguments are included
-to the command, both nucleotide and protein sequences are aligned two
-consecutive times.
+.. note::
+   Make sure to create an environment that can handle Intel-based
+   dependencies if you are on an Apple Silicon Mac (see the sketch below).
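+
+   As a sketch, mirroring the Intel-architecture workaround shown for the
+   Snakemake environment earlier in this part, the environment could be
+   created with:
+
+   .. code:: bash
+
+      # force Intel (osx-64) packages; only needed on Apple Silicon machines
+      $ CONDA_SUBDIR=osx-64 mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml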
 
-.. code:: bash
-
-   $ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB
-
-Again, the same type of files are generated but the output files from
-nucleotide sequence can be recognized by the ‘*nuc\_*’ substrings. The
-matrix in **nuc_trimmed_variable_positions.csv** now only has columns
-for the four nucleotides and gaps.
+**Step 3**: Run the export script
 
-Finally, run the function one more time but include a phenotype. This
-allows you to identify phenotype specific SNPs or amino acid changes.
+The export script reads data from the pangenome database and converts it to
+the proper format for PanVA. Run the following command from the root of the
+data package to create the PanVA inputs:
 
 .. code:: bash
 
-   $ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB
-
-Open the **nuc**- or **prot_trimmed_phenotype_specific_changes.info**
-file inside one of the homology group output directories.
-
---------------
+   $ python3 export-to-panva/scripts/pan_to_va.py config/<target-dataset>_panva.ini
 
-Besides the functionalities in this tutorial, PanTools has more useful
-functions that may aid you in retrieving more specific information from
-the pangenome.
+6: Create a PanVA instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
--  Identify shared k-mers between genomes with
-   :ref:`kmer_classification <analysis/classification:k-mer classification>`.
--  Find co-localized genes in a set of homology groups:
-   :ref:`locate_genes <analysis/explore:locate genes>`.
--  Mapping short reads against the pangenome with :ref:`map
-   <analysis/mapping:map>`.
+Goal:
+   * Set up the PanVA instance
 
-In :doc:`part 5 <tutorial_part5>` of the tutorial we explore some of the
-phylogenetic methods implemented in PanTools.
+With the output of the export script, you should be able to create a PanVA
+instance for your dataset using the instructions from
+`PanVA's Technical Test
+<https://panbrowse.github.io/PanVA/v0.0.0/technical-test.html>`_.
diff --git a/docs/source/tutorial/tutorial_part5.rst b/docs/source/tutorial/tutorial_part5.rst
index fcaaee486109df024348ce683c5f8b86f3435012..5dbe08a6c0942422af2fb264f436d1476a47f620 100644
--- a/docs/source/tutorial/tutorial_part5.rst
+++ b/docs/source/tutorial/tutorial_part5.rst
@@ -1,275 +1,290 @@
-Part 5. Phylogeny
-=================
+Part 5. Explore the pangenome using the Neo4j browser
+=====================================================
 
-Part 5 preparation
-------------------
-
-Pantools v3 is required to follow this part of the tutorial. In
-addition, MAFFT, FastTree, IQ-tree, R (and the ape R package) need to be
-installed and set to your $PATH. Validate if the tools are executable by
-using the following commands.
+Did you skip :doc:`part 1 <tutorial_part1>` of the tutorial or were you
+unable to build the chloroplast pangenome? Download the pre-constructed
+pangenome
+`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
+or via wget.
 
 .. code:: bash
 
-   pantools --version
-   Rscript --help
-   mafft -h
-   iqtree -h
-   fasttree -h
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
+   $ tar xvzf chloroplast_DB.tar.gz
 
-If you did not follow part 4 of the tutorial, download the
-pre-constructed pangenome
-`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
+--------------
 
-.. code:: bash
+Configuring Neo4j
+-----------------
 
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
-   $ tar -xvzf pecto_dickeya_DB.tar.gz
+Set the full path to the chloroplast pangenome database by opening
+neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and including the
+following line in the config file. Please make sure there is always only
+a single uncommented 'dbms.directories.data' line.
 
---------------
+.. code:: text
 
-Adding phenotype/metadata to the pangenome
-------------------------------------------
+   #dbms.directories.data=/YOUR_PATH/any_other_database
+   dbms.directories.data=/YOUR_PATH/chloroplast_DB
 
-Before we construct the trees, we will add some phenotype data to the
-pangenome. Once the we have a phylogeny, the information can be included
-or be used to color parts of the tree. Below is a textfile with data for
-three phenotypes. The third phenotype, *low_temperature*, is in this
-case a made up example! It states whether the strain is capable of
-growing on (extreme) low temperatures. The phenotype file can be found
-inside the database directory, add the information to the pangenome by
-using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
+| **Allowing non-local connections**
+| To be able to run Neo4j on a server and have access to it from
+  anywhere, some additional lines in the config file must be changed.
+
+-  **Uncomment** the following four lines in
+   neo4j-community-3.5.30/conf/neo4j.conf.
+-  Replace 7687, 7474 and 7473 by port numbers not in use on your system.
+   This way everyone can have their own database running at the same time.
 
 .. code:: text
 
-   Genome, species, strain_name, low_temperature
-   1,P. odoriferum,P. odoriferum Q166, false
-   2,P. fontis, P. fontis M022, true
-   3,P. polaris,P. polaris S4.16.03.2B, false
-   4,P. brasiliense, P. brasiliense S2, true
-   5,P. brasiliense, P. brasiliense Y49, false
-   6,D. dadantii, D. dadantii 3937,?
+   #dbms.connectors.default_listen_address=0.0.0.0
+   #dbms.connector.bolt.listen_address=:7687
+   #dbms.connector.http.listen_address=:7474
+   #dbms.connector.https.listen_address=:7473
+
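+For example, after uncommenting the lines and picking free ports (7690, 7475
+and 7476 below are arbitrary examples), the block could look like this:
+
+.. code:: text
+
+   dbms.connectors.default_listen_address=0.0.0.0
+   dbms.connector.bolt.listen_address=:7690
+   dbms.connector.http.listen_address=:7475
+   dbms.connector.https.listen_address=:7476
+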
+Let's start up the Neo4j server!
+
+.. code:: bash
+
+   $ neo4j start
+
+Start Firefox (or a web browser of your own preference) and let it run
+in the background.
 
 .. code:: bash
 
-   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt
+   $ firefox &
+
+If you did not change the config to allow non-local connections, browse
+to *http://localhost:7474*. If you did change the config file, go to
+*server_address:7474*, where 7474 should be replaced with the port number
+you chose earlier and *server_address* is the address of the machine
+running Neo4j.
+
+If the database startup was successful, a login screen will appear on
+the webpage. Use '*neo4j*' as both username and password. After logging
+in, you are requested to set a new password.
 
 --------------
 
-Constructing a phylogeny
-------------------------
+Exploring nodes and edges in Neo4j
+----------------------------------
+
+Go through the following steps to become proficient in using the Neo4j
+browser and the underlying PanTools data structure. If you have any
+trouble finding a node, relationship or any other type of information,
+download and use `this visual guide
+<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
+
+1.  Click on the database icon on the left. A menu with all node types
+    and relationship types will appear.
+2.  Click on the '*gene*' button in the node label section. This
+    automatically generates a query. Execute the query.
+3.  The **LIMIT** clause prevents large numbers of nodes from popping up,
+    which could crash your web browser. Set LIMIT to 10 and execute the
+    query.
+4.  Hover over the nodes, click on them and take a look at the values
+    stored in the nodes. All these features (except ID) were extracted
+    from the GFF annotation files. ID is a unique number automatically
+    assigned to nodes and relationships by Neo4j.
+5.  Double-click on the **matK** gene node; all nodes with a connection
+    to this gene node will appear. The nodes have distinct colors as
+    these are different node types, such as **mRNA**, **CDS** and
+    **nucleotide**. Take a look at the node properties to observe that
+    most values and information are specific to a certain node type.
+6.  Double-click on the *matK* mRNA node; a **homology_group** node
+    should appear. This type of node connects homologous genes in the
+    graph. However, you can see this gene did not cluster with any other
+    gene.
+7.  Hover over the **start** relation of the *matK* gene node. As you
+    can see, information is not only stored in nodes, but also in
+    relationships! A relationship always has a certain direction; in
+    this case the relation starts at the gene node and points to a
+    nucleotide node. Offset marks the location within the node.
+8.  Double-click on the **nucleotide** node at the end of the 'start'
+    relationship. An incoming and an outgoing relation appear that
+    connect to other nucleotide nodes. Hover over both relations and
+    compare them. The relations hold the genomic coordinates and show
+    that this path only occurs in contig/sequence 1 of genome 1.
+9.  Follow the outgoing **FF** relationship to the next nucleotide node
+    and expand this node by double-clicking. Three nodes will pop up
+    this time. If you hover over the relations, you see the coordinates
+    belong to other genomes as well. You may also notice that the
+    relationships between nucleotide nodes are always a two-letter
+    combination of F (forward) and R (reverse), which states whether a
+    sequence is reverse complemented or not. The first letter corresponds
+    to the sequence of the node at the start of the relation, while the
+    second letter refers to the sequence of the end node.
+10. Finally, execute the following query to call the database schema and
+    see how all node types are connected to each other: *CALL
+    db.schema()* (also shown in the box below). The schema will be useful
+    when designing your own queries!
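+
+The schema query from step 10, ready to copy into the query bar:
+
+.. code:: text
+
+   CALL db.schema()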
 
-In this tutorial we will construct three phylogenies, each based on a
-different type of variation: SNPs, genes and k-mers. Take a look at the
-phylogeny manuals to get an understanding how the three methods work and
-how they differ from each other.
+--------------
 
-1. :ref:`analysis/phylogeny:core phylogeny`
-2. :ref:`analysis/phylogeny:gene distance tree`
-3. :ref:`analysis/phylogeny:k-mer distance tree`
+Query the pangenome database using Cypher
+-----------------------------------------
 
-Core SNP phylogeny
-------------------
+Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
+represent patterns. Nodes are represented by circles and relationships
+by arrows.
 
-The core SNP phylogeny will run various Maximum Likelihood models on
-parsimony informative sites of single-copy orthologous sequences. A site
-is parsimony-informative when there are at least two types of
-nucleotides that occur with a minimum frequency of two. The informative
-sites are automatically identified by aligning the sequences; however,
-it does not know which sequences are single-copy orthologous. You can
-identify these conserved sequences by running
-:ref:`gene_classification <analysis/classification:gene classification>`.
+-  The **MATCH** clause allows you to specify the patterns Neo4j will
+   search for in the database.
+-  With **WHERE** you can add constraints to the patterns described.
+-  In the **RETURN** clause you define which parts of the pattern to
+   display.
 
-.. code:: bash
+Cypher queries
+~~~~~~~~~~~~~~
 
-   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
+**Match and return 100 nucleotide nodes**
 
-Open **gene_classification_overview.txt** and take a look at statistics.
-As you can see there are 2134 single-copy ortholog groups. Normally, all
-of these groups are aligned to identify SNPs but for this tutorial we’ll
-make a selection of only a few groups to accelerate the steps. You can
-do this in two different ways:
+.. code:: text
 
-Option 1: Open **single_copy_orthologs.csv** and remove all node
-identifiers after the first 20 homology groups and save the file.
+   MATCH (n:nucleotide) RETURN n LIMIT 100
 
-.. code:: bash
+**Find all the genome nodes**
 
-   $ pantools core_phylogeny --mode=ML pecto_dickeya_DB
+.. code:: text
 
-Option 2: Open **single_copy_orthologs.csv** and select the first 20
-homology_group node identifiers. Place them in a new file sco_groups.txt
-and include this file to the function.
+   MATCH (n:genome) RETURN n
 
-.. code:: bash
+**Find the pangenome node**
 
-   $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB
+.. code:: text
 
-The sequences of the homology groups are being aligned two consecutive
-times. After the initial alignment, input sequences are trimmed based on
-the longest start and end gap of the alignment. The parsimony
-informative positions are taken from the second alignment and
-concatenated into a sequence. When opening **informative.fasta** you can
-find 6 sequences, the length of the sequences being the number of
-parsimony-informative sites.
+   MATCH (n:pangenome) RETURN n
 
-.. code:: bash
+**Match and return 100 genes**
 
-   $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000
+.. code:: text
 
-IQ-tree generates several files, the tree that we later on in the
-tutorial will continue with is called **informative.fasta.treefile**.
-When examining the **informative.fasta.iqtree** file you can find the
-best fit model of the data. This file also shows the number of sites
-that were used, as sites with gaps (which IQ-tree does not allow) were
-changed into singleton or constant sites.
+   MATCH (g:gene) RETURN g LIMIT 100
 
-Gene distance tree
-~~~~~~~~~~~~~~~~~~
+**Match and return 100 genes and order them by length**
 
-To create a phylogeny based on gene distances (absence/presence), we can
-simply execute the Rscript that was created by
-:ref:`gene_classification <analysis/classification:gene classification>`.
+.. code:: text
 
-.. code:: bash
+   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
 
-   $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R
+**The same query as before but results are now returned in a table**
 
-The resulting tree is called **gene_distance.tree**.
+.. code:: text
 
-K-mer distance tree
-~~~~~~~~~~~~~~~~~~~
+   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
 
-To obtain a k-mer distance phylogeny, the k-mers must first be counted
-with the :ref:`kmer_classification <analysis/classification:k-mer
-classification>`
-function. Afterwards, the tree can be constructed by executing the
-Rscript.
+**Return genes which are longer than 100 but shorter than 250 bp** (this
+can also be applied to other features such as exons, introns or CDS; see
+the sketch after the query)
 
-.. code:: bash
+.. code:: text
 
-   $ pantools kmer_classification pecto_dickeya_DB
-   $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
+   MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100
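+
+As a sketch of the "other features" mentioned above, assuming **CDS** nodes
+also carry a ``length`` property (not shown explicitly in this tutorial):
+
+.. code:: text
+
+   MATCH (c:CDS) WHERE c.length > 100 AND c.length < 250 RETURN * LIMIT 100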
 
-The resulting tree is written to **genome_kmer_distance.tree**.
+**Find genes located on the first genome**
+
+.. code:: text
+
+   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
+
+**Find genes located on the first genome and first sequence**
+
+.. code:: text
+
+   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
 
 --------------
 
-Renaming tree nodes
--------------------
+Homology group queries
+~~~~~~~~~~~~~~~~~~~~~~
+
+**Return 100 homology groups**
 
-So far, we used three different types of distances (SNPs, genes,
-k-mers), and two different methods (ML, NJ) to create three phylogenetic
-trees. First, lets take a look at the text files. The
-**informative.fasta.treefile** only contain genome numbers, bootstrap
-values and branch lengths but is lacking the metadata. Examining
-**gene_distance.tree** file also shows this information but the species
-names as well, because we included this as a phenotype during
-:ref:`gene_classification <analysis/classification:gene classification>`.
+.. code:: text
 
-Let’s include the strain identifiers to the core snp tree to make the
-final figure more informative. Use the
-:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
-rename the tree nodes.
+   MATCH (h:homology_group) RETURN h LIMIT 100
 
-.. code:: bash
+**Match homology groups which contain two members**
+
+.. code:: text
+
+   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
+
+**Match homology groups and 'walk' to the genes and corresponding start
+and end node**
+
+.. code:: text
+
+   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
+
+Turn off autocomplete by clicking on the button on the bottom right. The
+graph falls apart because relations were not assigned to variables.
+
+**The same query as before but now the relations do have variables**
+
+.. code:: text
+
+   MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
+
+When you turn off autocomplete again, only the '*is_similar_to*' relation
+disappears, since we did not assign it to a variable.
+
+**Find the homology group that belongs to the rpoC1 gene**
+
+.. code:: text
 
-   $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
+   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
 
-Take a look at **informative.fasta_RENAMED.treefile**, strain
-identifiers have been added to the tree.
+**Find genes on genome 1 which don't show homology**
+
+.. code:: text
+
+   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
 
 --------------
 
-Visualizing the tree in iTOL
-----------------------------
-
-Go to https://itol.embl.de and click on “Upload a tree” under the
-**ANNOTATE** box. On this page you can paste the tree directly into the
-**tree text:** textbox or can click the button to upload the .newick
-file.
-
-.. figure:: /figures/tutorial_phylogeny1.png
-   :width: 600
-   :align: center
-
-Basic controls ITOL
--------------------
-
--  The default way of visualizing a tree is the rectangular view.
-   Depending on the number of genomes, the circular view can be easier
-   to interpret. You can the view by clicking on the “Display Mode”
-   buttons.
--  Increase the font size and branch width to improve readability
--  When visualizing a Maximum likelihood (ML) tree, bootstrap values can
-   be displayed by clicking the “Display” button next to
-   **Bootstrap/metadata** in the Advanced tab of the Control window.
-   This enables you to visualize the values as text or symbol on the
-   branch. or by coloring the branch or adjusting the width.
-
-.. figure:: /figures/tutorial_phylogeny2.png
-   :width: 600
-   :align: center
-
--  When you have a known out group or one of the genomes is a clear
-   outlier in the tree, you should reroot the tree. Hover over the name,
-   click it so a pop-up menu appears. Click “tree structure” followed by
-   “Reroot the tree here”.
-
-.. figure:: /figures/tutorial_phylogeny3.png
-   :width: 600
-   :align: center
-
--  Clicking on the name of a node in the tree allows you to color the
-   name, branch, or background of that specific node.
--  When you’re happy the way your tree looks, go to the Export tab of
-   the Control window. Select the desired output format, click on the
-   “Full image” button and export the file to a figure.
--  Refresh the webpage to go back to the default view of your tree.
-
-Create iTOL templates
----------------------
-
-In iTOL it is possible to add colors to the tree by coloring the
-terminal nodes or adding an outer ring. The PanTools function
-:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
-to create templates that allows for easy coloring (with maximum of 20
-possible colors). If the function is run without any additional
-argument, templates are created for trees that only contain genome
-numbers (e.g. k-mer distance tree). Here we want to color the (renamed)
-core SNP tree with the ‘low_temperature’ phenotype. Therefore, the
-``--phenotype`` strain_name must be included to the function.
+Structural variant detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: bash
+**Find SNP bubbles (for simplification we only use the FF relation)**
 
-   $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
-   $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
+.. code:: text
+
+   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
+
+**The same query but returning the results in a table**
+
+.. code:: text
 
-Copy the two low_temperature.txt files from the label/strain_name/ and
-ring/strain_name/ directories to your personal computer. Click and move
-the ring template file into the tree visualization webpage.
+   MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
 
-.. figure:: /figures/tutorial_phylogeny4.png
-   :width: 600
-   :align: center
+Functions such as **count()**, **sum()** and **stDev()** can be used in
+a query.
 
-The resulting tree should look this when: the tree is rooted with the
-*Dickeya* genome, bootstrap values are displayed as text and the ring
-color template was included.
+**The same SNP query but count the hits instead of displaying them**
 
-.. figure:: /figures/tutorial_phylogeny_color.png
-   :width: 600
-   :align: center
+.. code:: text
+
+   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
+
+--------------
+
+Hopefully you now have a feel for the Neo4j browser and Cypher, and
+you're inspired to create your own queries!
+
+When you're done working in the browser, close the database (by using
+the command line again).
+
+.. code:: bash
 
-Tree coloring is especially useful for large datasets. An example is
-shown in the figure below, where members of the same species share a
-color.
+   $ neo4j stop
 
-.. figure:: /figures/tutorial_phylogeny_tree.png
-   :width: 600
-   :align: center
+| More information on Neo4j and the cypher language:
+| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
+| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
+| `Neo4j API <https://neo4j.com/developer/>`_
 
 --------------
 
-..
-   In :doc:`part 6 <tutorial_part6>` of the tutorial we go through a couple of use cases with the pangenomic read mapping functionality implemented in PanTools.
+In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the
+functionalities to analyze the pangenome.