Merge branch 'update_tutorial' into 'develop'

Update tutorial See merge request !228

Merge branch 'update_tutorial' into 'develop'
934cbb5f · Workum, Dirk-Jan van · 90ea877a · d901debc · 934cbb5f · 934cbb5f
Commit 934cbb5f authored 8 months ago by Workum, Dirk-Jan van
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,7 @@ All notable changes to Pantools will be documented in this file.
 ### Changed
 - No longer accept tab characters in fasta input files (!232).
+- Updated and reordered the tutorial section of the documentation (!228).
 ### Fixed
 - `pantools blast` no longer crashes if not all mrna nodes contain protein IDs (!226).

--- a/docs/source/analysis/classification.rst
+++ b/docs/source/analysis/classification.rst
@@ -243,7 +243,7 @@ simulation.
 Heaps' law (a power law) can be fitted to the number of new genes
 observed when increasing the pangenome by one random genome. The formula
-for the power law model is :math:`n = k * N^{-a}`, where *n* is the
+for the power law model is :math:`\Delta n = k * N^{-a}`, where *Δn* is the
 newly discovered genes, *N* is the total number of genomes, and *k* and
 *a* are the fitting parameters. A pangenome can be considered open when
 *a* < 1 and closed if *a* > 1.

--- a/docs/source/analysis/explore.rst
+++ b/docs/source/analysis/explore.rst
@@ -41,7 +41,7 @@ members.
     * - <databaseDirectory>
       - Path to the database root directory.
     * - <homologyFile>
-       - A text file with homology group node identifiers, seperated by a
+       - A text file with homology group node identifiers, separated by a
         comma.
 **Options**
@@ -180,10 +180,10 @@ annotation node and extract the nucleotide and protein sequence.
     * - ``--functions``
       - One or multiple function identifiers (GO, InterPro, PFAM,
-         TIGRFAM), seperated by a comma.
+         TIGRFAM), separated by a comma.
     * - ``--nodes``/``-n``
       - One or multiple identifiers of function nodes (GO, InterPro,
-         PFAM, TIGRFAM), seperated by a comma.
+         PFAM, TIGRFAM), separated by a comma.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
@@ -283,9 +283,9 @@ specific GO terms of the hierarchy to a sequence.
     :widths: 30 70
     * - ``--functions``
-       - One or multiple GO term identifiers, seperated by a comma.
+       - One or multiple GO term identifiers, separated by a comma.
     * - ``--nodes``/``-n``
-       - One or multiple identifiers of 'GO' nodes, seperated by a comma.
+       - One or multiple identifiers of 'GO' nodes, separated by a comma.
 **Example commands**
  .. code:: bash
@@ -322,9 +322,9 @@ their location in the hierarchy is reported.
     :widths: 30 70
     * - ``--functions``
-       - One or multiple GO term identifiers, seperated by a comma.
+       - One or multiple GO term identifiers, separated by a comma.
     * - ``--nodes``/``-n``
-       - One or multiple identifiers of 'GO' nodes, seperated by a comma.
+       - One or multiple identifiers of 'GO' nodes, separated by a comma.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``
@@ -378,7 +378,7 @@ Report all available information of one or multiple homology groups.
         To find Phobius (P) or SignalP (S) annotations, include:
         'secreted' (P/S), 'receptor' (P/S), or 'transmembrane' (P).
     * - ``--genes``/``-g``
-       - One or multiple gene names, seperated by a comma.
+       - One or multiple gene names, separated by a comma.
     * - ``--node``
       - Retrieve the nucleotide nodes belonging to genes in homology groups

--- a/docs/source/analysis/go_enrichment.rst
+++ b/docs/source/analysis/go_enrichment.rst
@@ -69,9 +69,9 @@ the second and third rank this will be 0.0010 and 0.0015, respectively.
     :widths: 30 70
     * - ``--homology-file``/``-H``
-       - A text file with homology group node identifiers, seperated by a comma.
+       - A text file with homology group node identifiers, separated by a comma.
     * - ``--nodes``/``-n``
-       - mRNA node identifiers, seperated by a comma on the command line.
+       - mRNA node identifiers, separated by a comma on the command line.
     * - ``--include``/``-i``
       - Only include a selection of genomes.
     * - ``--exclude``/``-e``

--- a/docs/source/analysis/phylogeny.rst
+++ b/docs/source/analysis/phylogeny.rst
@@ -695,7 +695,7 @@ for the visualization of phylogenies in iTOL. Phenotypes must already be
 included in the pangenome with the
 :ref:`add_phenotypes <construction/annotate:add phenotypes>` functionality. How
 to use the template files in iTOL can be found in one of the
-:ref:`tutorials <tutorial/tutorial_part5:create itol templates>`.
+:ref:`tutorials <tutorial/tutorial_part3:create itol templates>`.
 If you run this function without a ``--phenotype`` argument, templates
 are created for trees that contain only genome numbers as node labels.

--- a/docs/source/construction/annotate.rst
+++ b/docs/source/construction/annotate.rst
@@ -40,6 +40,9 @@ selected and the second annotation of genome 2 and 3.
  if the GFF3 file contains a trans-spliced gene that has alternative
  splicing, it will not be able to handle it (it will only annotate one
  mRNA).
+  The PanUtils project contains a :ref:`data filtering pipeline
+  <getting_started/diy_pangenomics:Quality control pipeline>` that fixes most
+  common issues encountered by the parser.
 **Parameters**
  .. list-table::

--- a/docs/source/construction/group.rst
+++ b/docs/source/construction/group.rst
@@ -148,7 +148,7 @@ argument to the command.
 | **What BUSCO benchmark set to use**
  -  When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9
-     set, and untar it with ``tar -xvzf``. Include the entire directory in
+     set, and untar it with ``tar xvzf``. Include the entire directory in
     the command using the ``--input-file`` argument.
  -  For **BUSCO v4** or **v5**, you only have to provide the odb10 database
     name with the ``--input-file`` argument, the database is downloaded
@@ -493,5 +493,5 @@ settings that were used.
  -  **current_pantools_homology_groups.txt**, overview of the active
     homology groups. Each line represents one homology group. The line
     starts with the homology group (database) identifier followed by a
-     colon and the rest are mRNA IDs (from gff/genbank) seperated by a
+     colon and the rest are mRNA IDs (from gff/genbank) separated by a
     space.
--- a/docs/source/developer_guide/install.rst
+++ b/docs/source/developer_guide/install.rst
@@ -56,7 +56,7 @@ downloaded from https://github.com/refresh-bio/KMC/releases.
 .. code:: bash
-   $ tar -xvzf KMC* #uncompress the KMC binaries
+   $ tar xvzf KMC* #uncompress the KMC binaries
   # Edit your ~/.bashrc to include KMC to your PATH
   $ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
@@ -76,7 +76,7 @@ https://micans.org/mcl under License & software.
 .. code:: bash
   $ wget https://micans.org/mcl/src/mcl-14-137.tar.gz
-   $ tar -xvzf mcl-*
+   $ tar xvzf mcl-*
   $ cd mcl-14-137
   $ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer
   $ make install
@@ -405,7 +405,7 @@ The following commands can be used to test and build the documentation:
 .. code:: bash
   # Test the documentation
-   sphinx-lint sphinx-lint -e=all --max-line-length=80
+   sphinx-lint -e=all --max-line-length=80
   # Build the documentation
   sphinx-build -W docs/source output

--- a/docs/source/getting_started/diy_pangenomics.rst
+++ b/docs/source/getting_started/diy_pangenomics.rst
+DIY Pangenomics
+===============
+Likely you want to do pangenomics using your own data, on your own system.
+To support this, we offer several tools:
+* `PanTools`_ : core application to construct and analyze pangenomes
+* `PanUtils`_: utilities to support pangenomics (e.g. QC)
+* `PanBrowse`_: applications for pangenome visualization (e.g. PanVA)
+When working with these software tools, consider the following tips:
+* Separate issues in the (technical) setup from issues in the data, so try
+  the tool on known (test) data first.
+* Garbage in, garbage out! Therefore, perform rigorous quality control on the
+  input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline.
+* Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before
+  scaling up.
+PanTools
+--------
+PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package
+for pangenome construction and analysis. It works across the tree of life,
+from small viruses to large plant, animal, or human genomes. Because it is not
+alignment based, but employs a compacted De Bruijn Graph (cDBG) it works over
+large(r) evolutionary distances, e.g. at species, genus or family level.
+PanUtils
+--------
+We have been working on several pangenome utilities, collected in
+https://github.com/PanUtils. The repository currently contains three pipelines:
+one for quality control, one for pangenome construction and one for analysis
+and preparing a PanVA instance.
+Quality control pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^
+This is a Snakemake pipeline that can be used for running quality control on
+the data. We highly recommend this pipeline for checking your files, as
+annotation files can be very tricky to work with (and therefore sometimes
+cause issues for tools like PanTools). This pipeline can be found at
+https://github.com/PanUtils/pantools-qc-pipeline which includes a README on
+how to install and run it. In summary, one needs to create a conda environment
+with the most recent version of Snakemake (at the moment of writing this is
+8.25.5). Please follow the instructions on which configuration to set and
+where all files should be placed.
+The pipeline has four different workflows:
+* raw_statistics: creates an overview of the statistics of the input data
+* filter: can be used for filtering genome and annotation files
+* proteins: creates protein files including statistics
+* functions: creates functional annotations for all proteins in the protein
+  files
+PanTools v4 pipeline
+^^^^^^^^^^^^^^^^^^^^
+The PanTools Snakemake pipeline runs through all major
+PanTools functionalities. You can use this pipeline for inspiration on what
+can be done with PanTools or to actually run PanTools.
+Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code,
+including README that summarizes how to install, configure and run it.
+.. warning::
+ It is not recommended to use this pipeline with new experimental data;
+ adjustments to data and PanTools command settings might have to be made
+ during the process, for which this pipeline lacks the flexibility.
+The pipeline has three different workflows:
+* all_panproteome: runs through all major PanTools functionalities available
+  for a panproteome
+* all_pangenome: runs through all major PanTools functionalities available for
+  a pangenome
+* panva: runs all PanTools functions necessary to set up a PanVA instance
+PanVA conversion
+^^^^^^^^^^^^^^^^
+If one has already created a pangenome database with PanTools that needs to be
+converted to the data structure necessary for setting up a PanVA instance,
+please use this repository: https://github.com/PanUtils/export-to-panva. It
+contains a README on how to install and run it. In summary, one needs to
+create an environment using mamba from a predefined YAML file, configure
+settings in a config.ini file and then run the python script which takes the
+config.ini as input.
+PanBrowse
+---------
+We work on visualization methods for (interactive) visualization of
+complex pangenomes. PanBrowse (https://github.com/PanBrowse) is where
+we host the visualization applications.
+PanVA
+^^^^^
+`PanVA <https://github.com/PanBrowse/PanVA>`_ is a web application
+allowing users to visually and interatively explore sequence variants
+in pangenomes (generated by PanTools). It provides context for these
+variants by displaying their corresponding annotations, phylogenetic
+and phenotypic information.
--- a/docs/source/getting_started/technical_setup.rst
+++ b/docs/source/getting_started/technical_setup.rst
+Technical setup validation
+==========================
+Performance is important when running PanTools on larger datasets. Therefore,
+it is recommended to test your PanTools installation on one or more of the
+test data sets below, before applying it to your own data.
+PanTools pangenome construction commands make a large amount of
+transactions to the database and should be built on a location that can
+handle this for larger pangenomes. We recommend using an SSD or a RAM-disk
+(/dev/shm/, available on Linux machines). If pangenome construction takes
+too much time, the location of the database might be the issue. For reference,
+the building time of a small yeast data set of 10 genomes on our RAM-disk takes
+a little over two minutes, while it takes nearly 30 minutes on a network disk!
+Larger datasets exacerbate this gap in runtime. Disk size needed for
+construction depends on input data (genome size and number of genomes),
+for 8 arabidopsis genomes ±20GB should suffice, for 3 lettuce genomes ±250GB.
+Java heap space needed depends on the same parameters: in general it is
+recommendable to set this to larger than expected because java running out
+of heap space makes the tool crash (e.g. for lettuce we used
+-Xmx200g -Xms200g).
+.. warning::
+ If you want to do everything in a single directory, make sure to download the
+ data on the disk where you want to write the output. If the input data is on
+ a network disk, adjust the commands to write the pangenome database on a
+ local disk (RAM or SSD).
+Ten yeast genomes
+-----------------
+Download the data from:
+https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
+.. code:: bash
+   $ wget https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
+   $ tar zxvf yeast_10strains.tgz
+Build the pangenome:
+.. code:: bash
+   $ pantools -Xms20g -Xmx50g build_pangenome -t16 yeast_db yeast_10strains/metadata/genome_locations.txt
+Add annotations:
+.. code:: bash
+   $ pantools -Xms20g -Xmx50g add_annotations yeast_db yeast_10strains/metadata/annotation_locations.txt
+Group proteins:
+.. code:: bash
+   $ pantools -Xms20g -Xmx50g group yeast_db -t16 --relaxation=3
+Run time statistics (16 threads on RAM-disk):
+.. csv-table::
+   :file: /tables/runtime_statistics_yeast.csv
+   :header-rows: 1
+   :delim: ,
+Two *Arabidopsis* genomes
+-------------------------
+Start this use case with a bash script to download the data. Download the
+script from here:
+https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh
+To download the *Arabidopsis* data, run the script:
+.. code:: bash
+   $ bash download_2_ara.sh <threads>
+Build the pangenome:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=17 ara_db a_thaliana_2genomes/metadata/genome_locations.txt
+Add annotations:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g add_annotations ara_db a_thaliana_2genomes/metadata/annotation_locations.txt
+Group proteins:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g group ara_db -t50 --relaxation=3
+Run time statistics (50 threads on RAM-disk)
+.. csv-table::
+   :file: /tables/runtime_statistics_arabidopsis.csv
+   :header-rows: 1
+   :delim: ,
+Three tomato genomes
+--------------------
+Download the script from here:
+https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz
+Extract the tomato data:
+.. code:: bash
+   $ tar xvzf tomato_3genomes.tar.gz
+Build the pangenome:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=13 tomato_db tomato_3genomes/metadata/genome_locations.txt
+Add annotations:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g add_annotations tomato_db tomato_3genomes/metadata/annotation_locations.txt
+Group proteins:
+.. code:: bash
+   $ pantools -Xms40g -Xmx80g group tomato_db -t50 --relaxation=3
+Run time statistics (50 threads on RAM-disk)
+.. csv-table::
+   :file: /tables/runtime_statistics_tomato.csv
+   :header-rows: 1
+   :delim: ,
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -91,12 +91,24 @@ Contents
   :caption: Getting Started
   :maxdepth: 1
+   getting_started/diy_pangenomics
   Install <getting_started/install>
+   getting_started/technical_setup
   getting_started/differences
   getting_started/workflows
   getting_started/selection
   getting_started/query
+.. toctree::
+   :caption: Tutorial PanTools
+   :maxdepth: 1
+   Tutorial 1 - Pangenome construction <tutorial/tutorial_part1>
+   Tutorial 2 - Pangenome analysis <tutorial/tutorial_part2>
+   Tutorial 3 - Phylogeny <tutorial/tutorial_part3>
+   Tutorial 4 - Visualization (PanVA) <tutorial/tutorial_part4>
+   Tutorial 5 - Neo4j browser <tutorial/tutorial_part5>
 .. toctree::
   :caption: Pangenome Construction
   :maxdepth: 1
@@ -123,16 +135,6 @@ Contents
   analysis/support
   analysis/dn_ds
-.. toctree::
-   :caption: Tutorial
-   :maxdepth: 1
-   Tutorial 1 - Install PanTools <tutorial/tutorial_part1>
-   Tutorial 2 - Construct pangenome <tutorial/tutorial_part2>
-   Tutorial 3 - Neo4j browser <tutorial/tutorial_part3>
-   Tutorial 4 - Characterization <tutorial/tutorial_part4>
-   Tutorial 5 - Phylogeny <tutorial/tutorial_part5>
 .. toctree::
   :caption: Developer guide
   :maxdepth: 1
@@ -140,4 +142,3 @@ Contents
   Installation <developer_guide/install>
   Release steps <developer_guide/release>
   Database structure <developer_guide/database>
--- a/docs/source/tables/runtime_statistics_arabidopsis.csv
+++ b/docs/source/tables/runtime_statistics_arabidopsis.csv
+command,running time (h:m:s),CPU time (s)
+build_pangenome,0:27:38,3102.41
+add_annotations,0:00:52,93.56
+group,0:02:25,507.60
\ No newline at end of file
--- a/docs/source/tables/runtime_statistics_tomato.csv
+++ b/docs/source/tables/runtime_statistics_tomato.csv
+command,running time (h:m:s),CPU time (s)
+build_pangenome,20:28:33,257610.72
+add_annotations,0:01:35,181.19
+group,0:03:02,5861.83
\ No newline at end of file
--- a/docs/source/tables/runtime_statistics_yeast.csv
+++ b/docs/source/tables/runtime_statistics_yeast.csv
+command,running time (h:m:s),CPU time (s)
+build_pangenome,0:02:23,276.50
+add_annotations,0:00:35,51.82
+group,0:01:58,1351.73
\ No newline at end of file
--- a/docs/source/tutorial/tutorial_part1.rst
+++ b/docs/source/tutorial/tutorial_part1.rst
-Part 1. Install PanTools
+Part 1. Build your own pangenome using PanTools
-========================
+===============================================
-For instructions on how to install PanTools,
+Install PanTools from bioconda following the
-see :doc:`/getting_started/install`.
+:ref:`install <getting_started/install:Install from bioconda>`
+instructions.
+To demonstrate the main functionalities of PanTools we use a small
+chloroplasts dataset to avoid long construction times.
+.. csv-table::
+   :file: /tables/chloroplast_datasets.csv
+   :header-rows: 1
+   :delim: ;
+Download the chloroplast fasta and gff files
+`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
+or via wget.
+.. code:: bash
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
+   $ tar xvzf chloroplasts.tar.gz #unpack the archive
+--------------
+BUILD, ANNOTATE and GROUP
+-------------------------
+We start with building a pangenome using four of the five chloroplast
+genomes. For this you need a text file which directs PanTools to the
+FASTA files. Call your text file **genome_locations.txt** and include
+the following lines:
+.. code:: text
+   YOUR_PATH/C_sativus.fasta
+   YOUR_PATH/O_sativa.fasta
+   YOUR_PATH/S_lycopersicum.fasta
+   YOUR_PATH/S_tuberosum.fasta
+Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The text
+file should only contain full paths to FASTA files, no additional spaces or
+empty lines. Then run PanTools with the
+:ref:`build_pangenome <construction/build:build pangenome>` function and
+include the text file
+.. code:: bash
+   $ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt
+Did the program run without any error messages? Congratulations, you’ve
+built your first pangenome!
+Adding additional genomes
+~~~~~~~~~~~~~~~~~~~~~~~~~
+PanTools has the ability to add additional genomes to an already
+existing pangenome. To test the function of PanTools, prepare a text
+file containing the path to the Maize chloroplast genome. Call your
+text file **fifth_genome_location.txt** and include the following
+line to the file:
+.. code:: text
+   YOUR_PATH/Z_mays.fasta
+Run PanTools on the new text file and using the
+:ref:`add_genomes <construction/build:add genomes>` function
+.. code:: bash
+   $ pantools add_genomes chloroplast_DB fifth_genome_location.txt
+Adding annotations
+~~~~~~~~~~~~~~~~~~
+To include gene annotations to the pangenome, prepare a text file
+containing paths to the GFF files. Call your text file
+**annotation_locations.txt** and include the following lines into the
+file:
+.. code:: text
+   1 YOUR_PATH/C_sativus.gff3
+   2 YOUR_PATH/O_sativa.gff3
+   3 YOUR_PATH/S_lycopersicum.gff3
+   4 YOUR_PATH/S_tuberosum.gff3
+   5 YOUR_PATH/Z_mays.gff3
+Run PanTools using the :ref:`add_annotations <construction/annotate:add
+annotations>` function and include the new text file. Also add the
+``--connect`` flag to attach the annotations to the nucleotide nodes;
+this is useful for exploring the pangenome in the Neo4j browser later
+in the tutorial.
+.. code:: bash
+   $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
+Homology grouping
+~~~~~~~~~~~~~~~~~
+PanTools can infer homology between the protein sequences of a pangenome
+and cluster them into homology groups with the
+:ref:`group <construction/group:group>` function. The ``--relaxation``
+parameter must be set to influence the sensitivity,  If you don't know the
+best setting for the grouping relaxation, you can use the
+:ref:`busco_protein <construction/group:busco protein>` and
+:ref:`optimal grouping <construction/group:optimal grouping>` functions
+instead, but for now we will use a relaxation of 4.
+.. code:: bash
+   $ pantools group --relaxation=4 chloroplast_DB
+--------------
+Adding phenotypes
+-----------------
+Phenotype values can be Integers, Double, String or Boolean values.
+Create a text file **phenotypes.txt**.
+.. code:: text
+   Genome,Solanum
+   1,false
+   2,false
+   3,true
+   4,true
+   5,false
+And use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add the
+information to the pangenome.
+.. code:: bash
+   $ pantools add_phenotypes chloroplast_DB phenotypes.txt
+--------------
+RETRIEVE functions
+------------------
+Now that the construction is complete, lets quickly validate if the
+construction was successful and the database can be used. To retrieve
+some genomic regions, prepare a text file containing genomic
+coordinates. Create the file **regions.txt** and include the following
+for each region: genome number, contig number, start and stop position
+and separate them by a single space
+.. code:: text
+   1 1 200 500
+   2 1 300 700
+   3 1 1 10000
+   3 1 1 10000 -
+   4 1 9999 15000
+   5 1 100000 110000
+Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
+function and include the new text file
+.. code:: bash
+   $ pantools retrieve_regions chloroplast_DB regions.txt
+Take a look at the extracted regions that are written to the
+**chloroplast_DB/retrieval/regions/** directory.
+To retrieve entire genomes, prepare a text file **genome_numbers.txt**
+and include each genome number on a separate line in the file
+.. code:: text
+   1
+   3
+   5
+Use the **retrieve_regions** function again but include the new text
+file
+.. code:: bash
+   $ pantools retrieve_regions chloroplast_DB genome_numbers.txt
+Genome files are written to same directory as before. Take a look at one
+of the three genomes you have just retrieved.
+In :doc:`part 2 <tutorial_part2>` of the tutorial we explore the
+pangenome you just built using the Neo4j browser and the Cypher language.
--- a/docs/source/tutorial/tutorial_part2.rst
+++ b/docs/source/tutorial/tutorial_part2.rst
--- a/docs/source/tutorial/tutorial_part3.rst
+++ b/docs/source/tutorial/tutorial_part3.rst
-Part 3. Explore the pangenome using the Neo4j browser
+Part 3. Phylogeny
-=====================================================
+=================
-Did you skip :doc:`part 2 <tutorial_part2>` of the tutorial or were you
+Part 3 preparation
-unable to build the chloroplast pangenome? Download the pre-constructed
+------------------
-pangenome
-`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
+If you did not follow part 2 of the tutorial, download the
-or via wget.
+pre-constructed pangenome
+`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
 .. code:: bash
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
-   $ tar -xvzf chloroplast_DB.tar.gz
+   $ tar xvzf pecto_dickeya_DB.tar.gz
 --------------
-Configuring Neo4j
+Adding phenotype/metadata to the pangenome
-----------------
+------------------------------------------
-Set the full path to the chloroplast pangenome database by opening
+Before we construct the trees, we will add some phenotype data to the
-neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the
+pangenome. Once the we have a phylogeny, the information can be included
-following line in the config file. Please make sure there is always only
+or be used to color parts of the tree. Below is a textfile with data for
-a single uncommented line with 'dbms.directories.data'.
+three phenotypes. The third phenotype, *low_temperature*, is in this
+case a made up example! It states whether the strain is capable of
+growing on (extreme) low temperatures. The phenotype file can be found
+inside the database directory, add the information to the pangenome by
+using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
 .. code:: text
-   #dbms.directories.data=/YOUR_PATH/any_other_database
+   Genome, species, strain_name, low_temperature
-   dbms.directories.data=/YOUR_PATH/chloroplast_DB
+   1,P. odoriferum,P. odoriferum Q166, false
+   2,P. fontis, P. fontis M022, true
-| **Allowing non-local connections**
+   3,P. polaris,P. polaris S4.16.03.2B, false
-| To be able to run Neo4j on a server and have access to it from
+   4,P. brasiliense, P. brasiliense S2, true
-  anywhere, some additional lines in the config file must be changed.
+   5,P. brasiliense, P. brasiliense Y49, false
+   6,D. dadantii, D. dadantii 3937,?
-  **Uncomment** the four following lines in
-   neo4j-community-3.5.30/conf/neo4j.conf.
-  Replace 7686, 7474, and 7473 by three different numbers that are not
-   in use by other people on your server. In this way, everyone can have
-   their own database running at the same time.
-.. code:: text
-   #dbms.connectors.default_listen_address=0.0.0.0
-   #dbms.connector.bolt.listen_address=:7687
-   #dbms.connector.http.listen_address=:7474
-   #dbms.connector.https.listen_address=:7473
-Lets start up the Neo4j server!
-.. code:: bash
-   $ neo4j start
-Start Firefox (or a web browser of your own preference) and let it run
-on the background.
 .. code:: bash
-   $ firefox &
+   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt
-In case you did not change the config to allow non-local connections,
-browse to *http://localhost:7474*. Whenever you did change the config
-file, go to *server_address:7474*, where 7474 should be replaced with the
-number you chose earlier.
-If the database startup was successful, a login terminal will appear in
-the webpage. Use '*neo4j*' both as username and password. After logging
-in, you are requested to set a new password.
 --------------
-Exploring nodes and edges in Neo4j
+Constructing a phylogeny
----------------------------------
+------------------------
-Go through the following steps to become proficient in using the Neo4j
-browser and the underlying PanTools data structure. If you have any
-difficulty trouble finding a node, relationship or any type of
-information, download and use `this visual guide
-<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
-1.  Click on the database icon on the left. A menu with all node types
-    and relationship types will appear.
-2.  Click on the '*gene*' button in the node label section. This
-    automatically generated a query. Execute the query.
-3.  The **LIMIT** clause prevents large numbers of nodes popping up to
-    avoid your web browser from crashing. Set LIMIT to 10 and execute the
-    query.
-4.  Hover over the nodes, click on them and take a look at the values
-    stored in the nodes. All these features (except ID) were extracted
-    from the GFF annotation files. ID is an unique number automatically
-    assigned to nodes and relationships by Neo4j.
-5.  Double-click on the **matK** gene node, all nodes with a connection
-    to this gene node will appear. The nodes have distinct colors as
-    these are different node types, such as **mRNA**, **CDS**,
-    **nucleotide**. Take a look at the node properties to observe that
-    most values and information is specific to a certain node type.
-6.  Double-click on the *matK* mRNA node, a **homology_group** node
-    should appear. These type of nodes connect homologous genes in the
-    graph. However, you can see this gene did not cluster with any other
-    gene.
-7.  Hover over the **start** relation of the *matK* gene node. As you
-    can see information is not only stored in nodes, but also in
-    relationships! A relationship always has a certain direction, in
-    this case the relation starts at the gene node and points to a
-    nucleotide node. Offset marks the location within the node.
-8.  Double-click on the **nucleotide** node at the end of the 'start'
-    relationship. An in- and outgoing relation appear that connect to
-    other nucleotide nodes. Hover over both the relations and compare
-    them. The relations holds the genomic coordinates and shows this
-    path only occurs in contig/sequence 1 of genome 1.
-9.  Follow the outgoing **FF**-relationship to the next nucleotide node
-    and expand this node by double-clicking. Three nodes will pop up
-    this time. If you hover over the relations you see the coordinates
-    belong to other genomes as well. You may also notice the
-    relationships between nucleotide nodes is always a two letter
-    combination of F (forward) and R (reverse) which state if a sequence
-    is reverse complemented or not. The first letter corresponds to the
-    sequence of the node at the start of the relation where the second
-    letters refers to the sequence of the end node.
-10. Finally, execute the following query to call the database scheme to
-    see how all node types are connected to each other: *CALL
-    db.schema()*. The schema will be useful when designing your own
-    queries!
--------------
-Query the pangenome database using CYPHER
-----------------------------------------
-Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
-represent patterns. Nodes are represented by circles and relationships
-by arrows.
-  The **MATCH** clause allows you to specify the patterns Neo4j will
-   search for in the database.
-  With **WHERE** you can add constraints to the patterns described.
-  In the **RETURN** clause you define which parts of the pattern to
-   display.
-Cypher queries
+In this tutorial we will construct three phylogenies, each based on a
-~~~~~~~~~~~~~~
+different type of variation: SNPs, genes and k-mers. Take a look at the
+phylogeny manuals to get an understanding how the three methods work and
+how they differ from each other.
-**Match and return 100 nucleotide nodes**
+1. :ref:`analysis/phylogeny:core phylogeny`
+2. :ref:`analysis/phylogeny:gene distance tree`
+3. :ref:`analysis/phylogeny:k-mer distance tree`
-.. code:: text
+Core SNP phylogeny
+------------------
-   MATCH (n:nucleotide) RETURN n LIMIT 100
-**Find all the genome nodes**
+The core SNP phylogeny will run various Maximum Likelihood models on
+parsimony informative sites of single-copy orthologous sequences. A site
+is parsimony-informative when there are at least two types of
+nucleotides that occur with a minimum frequency of two. The informative
+sites are automatically identified by aligning the sequences; however,
+it does not know which sequences are single-copy orthologous. You can
+identify these conserved sequences by running
+:ref:`gene_classification <analysis/classification:gene classification>`.
-.. code:: text
+.. code:: bash
-   MATCH (n:genome) RETURN n
+   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
-**Find the pangenome node**
+Open **gene_classification_overview.txt** and take a look at statistics.
+As you can see there are 2134 single-copy ortholog groups. Normally, all
+of these groups are aligned to identify SNPs but for this tutorial we’ll
+make a selection of only a few groups to accelerate the steps. You can
+do this in two different ways:
-.. code:: text
+Option 1: Open **single_copy_orthologs.csv** and remove all node
+identifiers after the first 20 homology groups and save the file.
-   MATCH (n:pangenome) RETURN n
+.. code:: bash
-**Match and return 100 genes**
+   $ pantools core_phylogeny --mode=ML pecto_dickeya_DB
-.. code:: text
+Option 2: Open **single_copy_orthologs.csv** and select the first 20
+homology_group node identifiers. Place them in a new file sco_groups.txt
+and include this file to the function.
-   MATCH (g:gene) RETURN g LIMIT 100
+.. code:: bash
-**Match and return 100 genes and order them by length**
+   $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB
-.. code:: text
+The sequences of the homology groups are being aligned two consecutive
+times. After the initial alignment, input sequences are trimmed based on
+the longest start and end gap of the alignment. The parsimony
+informative positions are taken from the second alignment and
+concatenated into a sequence. When opening **informative.fasta** you can
+find 6 sequences, the length of the sequences being the number of
+parsimony-informative sites.
-   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
+.. code:: bash
-**The same query as before but results are now returned in a table**
+   $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000
-.. code:: text
+IQ-tree generates several files, the tree that we later on in the
+tutorial will continue with is called **informative.fasta.treefile**.
+When examining the **informative.fasta.iqtree** file you can find the
+best fit model of the data. This file also shows the number of sites
+that were used, as sites with gaps (which IQ-tree does not allow) were
+changed into singleton or constant sites.
-   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
+Gene distance tree
+~~~~~~~~~~~~~~~~~~
-**Return genes which are longer as 100 but shorter than 250 bp** (this
+To create a phylogeny based on gene distances (absence/presence), we can
-can also be applied to other features such as exons introns or CDS)
+simply execute the Rscript that was created by
+:ref:`gene_classification <analysis/classification:gene classification>`.
-.. code:: text
+.. code:: bash
-   MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100
+   $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R
-**Find genes located on first genome**
+The resulting tree is called **gene_distance.tree**.
-.. code:: text
+K-mer distance tree
+~~~~~~~~~~~~~~~~~~~
-   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
+To obtain a k-mer distance phylogeny, the k-mers must first be counted
+with the :ref:`kmer_classification <analysis/classification:k-mer
+classification>`
+function. Afterwards, the tree can be constructed by executing the
+Rscript.
-**Find genes located on first genome and first sequence**
+.. code:: bash
-.. code:: text
+   $ pantools kmer_classification pecto_dickeya_DB
+   $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
-   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
+The resulting tree is written to **genome_kmer_distance.tree**.
 --------------
-Homology group queries
+Renaming tree nodes
-~~~~~~~~~~~~~~~~~~~~~~
+-------------------
-**Return 100 homology groups**
+So far, we used three different types of distances (SNPs, genes,
+k-mers), and two different methods (ML, NJ) to create three phylogenetic
-.. code:: text
+trees. First, lets take a look at the text files. The
+**informative.fasta.treefile** only contain genome numbers, bootstrap
+values and branch lengths but is lacking the metadata. Examining
+**gene_distance.tree** file also shows this information but the species
+names as well, because we included this as a phenotype during
+:ref:`gene_classification <analysis/classification:gene classification>`.
-   MATCH (h:homology_group) RETURN h LIMIT 100
+Let’s include the strain identifiers to the core snp tree to make the
+final figure more informative. Use the
+:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
+rename the tree nodes.
-**Match homology groups which contain two members**
+.. code:: bash
-.. code:: text
-   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
-**Match homology groups and 'walk' to the genes and corresponding start
-and end node**
-.. code:: text
-   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-Turn off autocomplete by clicking on the button on the bottom right. The
-graph falls apart because relations were not assigned to variables.
-**The same query as before but now the relations do have variables**
-.. code:: text
-   MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-When you turn off autocomplete again only the '*is_similar_to*' relation
-disappears since we did not call it
-**Find homology group that belong to the rpoC1 gene**
-.. code:: text
-   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
-**Find genes on genome 1 which don't show homology**
-.. code:: text
+   $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
-   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
+Take a look at **informative.fasta_RENAMED.treefile**, strain
+identifiers have been added to the tree.
 --------------
-Structural variant detection
+Visualizing the tree in iTOL
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+----------------------------
-**Find SNP bubbles (for simplification we only use the FF relation)**
+Go to https://itol.embl.de and click on “Upload a tree” under the
+**ANNOTATE** box. On this page you can paste the tree directly into the
-.. code:: text
+**tree text:** textbox or can click the button to upload the .newick
+file.
-   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
+.. figure:: /figures/tutorial_phylogeny1.png
-**The same query but returning the results in a table**
+   :width: 600
+   :align: center
-.. code:: text
+Basic controls iTOL
+-------------------
+-  The default way of visualizing a tree is the rectangular view.
+   Depending on the number of genomes, the circular view can be easier
+   to interpret. You can the view by clicking on the “Display Mode”
+   buttons.
+-  Increase the font size and branch width to improve readability
+-  When visualizing a Maximum likelihood (ML) tree, bootstrap values can
+   be displayed by clicking the “Display” button next to
+   **Bootstrap/metadata** in the Advanced tab of the Control window.
+   This enables you to visualize the values as text or symbol on the
+   branch. or by coloring the branch or adjusting the width.
+.. figure:: /figures/tutorial_phylogeny2.png
+   :width: 600
+   :align: center
+-  When you have a known out group or one of the genomes is a clear
+   outlier in the tree, you should reroot the tree. Hover over the name,
+   click it so a pop-up menu appears. Click “tree structure” followed by
+   “Reroot the tree here”.
+.. figure:: /figures/tutorial_phylogeny3.png
+   :width: 600
+   :align: center
+-  Clicking on the name of a node in the tree allows you to color the
+   name, branch, or background of that specific node.
+-  When you’re happy the way your tree looks, go to the Export tab of
+   the Control window. Select the desired output format, click on the
+   “Full image” button and export the file to a figure.
+-  Refresh the webpage to go back to the default view of your tree.
+Create iTOL templates
+---------------------
+In iTOL it is possible to add colors to the tree by coloring the
+terminal nodes or adding an outer ring. The PanTools function
+:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
+to create templates that allows for easy coloring (with maximum of 20
+possible colors). If the function is run without any additional
+argument, templates are created for trees that only contain genome
+numbers (e.g. k-mer distance tree). Here we want to color the (renamed)
+core SNP tree with the ‘low_temperature’ phenotype. Therefore, the
+``--phenotype`` strain_name must be included to the function.
-   MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
+.. code:: bash
-Functions such as **count()**, **sum()** and **stDev()** can be used in
-a query.
-**The same SNP query but count the hits instead of displaying them**
-.. code:: text
-   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
+   $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
+   $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
--------------
+Copy the two low_temperature.txt files from the label/strain_name/ and
+ring/strain_name/ directories to your personal computer. Click and move
+the ring template file into the tree visualization webpage.
-Hopefully you know have some feeling with the Neo4j browser and cypher
+.. figure:: /figures/tutorial_phylogeny4.png
-and you're inspired to create your own queries!
+   :width: 600
+   :align: center
-When you're done working in the browser, close the database (by using
+The resulting tree should look this when: the tree is rooted with the
-the command line again).
+*Dickeya* genome, bootstrap values are displayed as text and the ring
+color template was included.
-.. code:: text
+.. figure:: /figures/tutorial_phylogeny_color.png
+   :width: 600
+   :align: center
-   $ neo4j stop
+Tree coloring is especially useful for large datasets. An example is
+shown in the figure below, where members of the same species share a
+color.
-| More information on Neo4j and the cypher language:
+.. figure:: /figures/tutorial_phylogeny_tree.png
-| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
+   :width: 600
-| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
+   :align: center
-| `Neo4j API <https://neo4j.com/developer/>`_
 --------------
-In :doc:`part 4 <tutorial_part4>` of the tutorial we explore some of the
-functionalities to analyze the pangenome.
--- a/docs/source/tutorial/tutorial_part4.rst
+++ b/docs/source/tutorial/tutorial_part4.rst
--- a/docs/source/tutorial/tutorial_part5.rst
+++ b/docs/source/tutorial/tutorial_part5.rst
-Part 5. Phylogeny
+Part 5. Explore the pangenome using the Neo4j browser
-=================
+=====================================================
-Part 5 preparation
+Did you skip :doc:`part 1 <tutorial_part1>` of the tutorial or were you
------------------
+unable to build the chloroplast pangenome? Download the pre-constructed
+pangenome
-Pantools v3 is required to follow this part of the tutorial. In
+`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
-addition, MAFFT, FastTree, IQ-tree, R (and the ape R package) need to be
+or via wget.
-installed and set to your $PATH. Validate if the tools are executable by
-using the following commands.
 .. code:: bash
-   pantools --version
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
-   Rscript --help
+   $ tar xvzf chloroplast_DB.tar.gz
-   mafft -h
-   iqtree -h
-   fasttree -h
-If you did not follow part 4 of the tutorial, download the
+--------------
-pre-constructed pangenome
-`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
-.. code:: bash
+Configuring Neo4j
+-----------------
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
+Set the full path to the chloroplast pangenome database by opening
-   $ tar -xvzf pecto_dickeya_DB.tar.gz
+neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the
+following line in the config file. Please make sure there is always only
+a single uncommented line with 'dbms.directories.data'.
--------------
+.. code:: text
-Adding phenotype/metadata to the pangenome
+   #dbms.directories.data=/YOUR_PATH/any_other_database
------------------------------------------
+   dbms.directories.data=/YOUR_PATH/chloroplast_DB
-Before we construct the trees, we will add some phenotype data to the
+| **Allowing non-local connections**
-pangenome. Once the we have a phylogeny, the information can be included
+| To be able to run Neo4j on a server and have access to it from
-or be used to color parts of the tree. Below is a textfile with data for
+  anywhere, some additional lines in the config file must be changed.
-three phenotypes. The third phenotype, *low_temperature*, is in this
-case a made up example! It states whether the strain is capable of
+-  **Uncomment** the four following lines in
-growing on (extreme) low temperatures. The phenotype file can be found
+   neo4j-community-3.5.30/conf/neo4j.conf.
-inside the database directory, add the information to the pangenome by
+-  Replace 7686, 7474 and 7473 by any port number not in use on your system.
-using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
+   This way everyone can have their own database running at the same time.
 .. code:: text
-   Genome, species, strain_name, low_temperature
+   #dbms.connectors.default_listen_address=0.0.0.0
-   1,P. odoriferum,P. odoriferum Q166, false
+   #dbms.connector.bolt.listen_address=:7687
-   2,P. fontis, P. fontis M022, true
+   #dbms.connector.http.listen_address=:7474
-   3,P. polaris,P. polaris S4.16.03.2B, false
+   #dbms.connector.https.listen_address=:7473
-   4,P. brasiliense, P. brasiliense S2, true
-   5,P. brasiliense, P. brasiliense Y49, false
+Lets start up the Neo4j server!
-   6,D. dadantii, D. dadantii 3937,?
+.. code:: bash
+   $ neo4j start
+Start Firefox (or a web browser of your own preference) and let it run
+on the background.
 .. code:: bash
-   $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt
+   $ firefox &
+In case you did not change the config to allow non-local connections,
+browse to *http://localhost:7474*. Whenever you did change the config
+file, go to *server_address:7474*, where 7474 should be replaced with the
+number you chose earlier, and *server_address* is the address set in the config.
+If the database startup was successful, a login terminal will appear in
+the webpage. Use '*neo4j*' both as username and password. After logging
+in, you are requested to set a new password.
 --------------
-Constructing a phylogeny
+Exploring nodes and edges in Neo4j
------------------------
+----------------------------------
+Go through the following steps to become proficient in using the Neo4j
+browser and the underlying PanTools data structure. If you have any
+difficulty trouble finding a node, relationship or any type of
+information, download and use `this visual guide
+<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
+1.  Click on the database icon on the left. A menu with all node types
+    and relationship types will appear.
+2.  Click on the '*gene*' button in the node label section. This
+    automatically generated a query. Execute the query.
+3.  The **LIMIT** clause prevents large numbers of nodes popping up to
+    avoid your web browser from crashing. Set LIMIT to 10 and execute the
+    query.
+4.  Hover over the nodes, click on them and take a look at the values
+    stored in the nodes. All these features (except ID) were extracted
+    from the GFF annotation files. ID is an unique number automatically
+    assigned to nodes and relationships by Neo4j.
+5.  Double-click on the **matK** gene node, all nodes with a connection
+    to this gene node will appear. The nodes have distinct colors as
+    these are different node types, such as **mRNA**, **CDS**,
+    **nucleotide**. Take a look at the node properties to observe that
+    most values and information is specific to a certain node type.
+6.  Double-click on the *matK* mRNA node, a **homology_group** node
+    should appear. These type of nodes connect homologous genes in the
+    graph. However, you can see this gene did not cluster with any other
+    gene.
+7.  Hover over the **start** relation of the *matK* gene node. As you
+    can see information is not only stored in nodes, but also in
+    relationships! A relationship always has a certain direction, in
+    this case the relation starts at the gene node and points to a
+    nucleotide node. Offset marks the location within the node.
+8.  Double-click on the **nucleotide** node at the end of the 'start'
+    relationship. An in- and outgoing relation appear that connect to
+    other nucleotide nodes. Hover over both the relations and compare
+    them. The relations holds the genomic coordinates and shows this
+    path only occurs in contig/sequence 1 of genome 1.
+9.  Follow the outgoing **FF**-relationship to the next nucleotide node
+    and expand this node by double-clicking. Three nodes will pop up
+    this time. If you hover over the relations you see the coordinates
+    belong to other genomes as well. You may also notice the
+    relationships between nucleotide nodes is always a two letter
+    combination of F (forward) and R (reverse) which state if a sequence
+    is reverse complemented or not. The first letter corresponds to the
+    sequence of the node at the start of the relation where the second
+    letters refers to the sequence of the end node.
+10. Finally, execute the following query to call the database scheme to
+    see how all node types are connected to each other: *CALL
+    db.schema()*. The schema will be useful when designing your own
+    queries!
-In this tutorial we will construct three phylogenies, each based on a
+--------------
-different type of variation: SNPs, genes and k-mers. Take a look at the
-phylogeny manuals to get an understanding how the three methods work and
-how they differ from each other.
-1. :ref:`analysis/phylogeny:core phylogeny`
+Query the pangenome database using CYPHER
-2. :ref:`analysis/phylogeny:gene distance tree`
+-----------------------------------------
-3. :ref:`analysis/phylogeny:k-mer distance tree`
-Core SNP phylogeny
+Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
------------------
+represent patterns. Nodes are represented by circles and relationships
+by arrows.
-The core SNP phylogeny will run various Maximum Likelihood models on
+-  The **MATCH** clause allows you to specify the patterns Neo4j will
-parsimony informative sites of single-copy orthologous sequences. A site
+   search for in the database.
-is parsimony-informative when there are at least two types of
+-  With **WHERE** you can add constraints to the patterns described.
-nucleotides that occur with a minimum frequency of two. The informative
+-  In the **RETURN** clause you define which parts of the pattern to
-sites are automatically identified by aligning the sequences; however,
+   display.
-it does not know which sequences are single-copy orthologous. You can
-identify these conserved sequences by running
-:ref:`gene_classification <analysis/classification:gene classification>`.
-.. code:: bash
+Cypher queries
+~~~~~~~~~~~~~~
-   $ pantools gene_classification --phenotype=species pecto_dickeya_DB
+**Match and return 100 nucleotide nodes**
-Open **gene_classification_overview.txt** and take a look at statistics.
+.. code:: text
-As you can see there are 2134 single-copy ortholog groups. Normally, all
-of these groups are aligned to identify SNPs but for this tutorial we’ll
-make a selection of only a few groups to accelerate the steps. You can
-do this in two different ways:
-Option 1: Open **single_copy_orthologs.csv** and remove all node
+   MATCH (n:nucleotide) RETURN n LIMIT 100
-identifiers after the first 20 homology groups and save the file.
-.. code:: bash
+**Find all the genome nodes**
-   $ pantools core_phylogeny --mode=ML pecto_dickeya_DB
+.. code:: text
-Option 2: Open **single_copy_orthologs.csv** and select the first 20
+   MATCH (n:genome) RETURN n
-homology_group node identifiers. Place them in a new file sco_groups.txt
-and include this file to the function.
-.. code:: bash
+**Find the pangenome node**
-   $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB
+.. code:: text
-The sequences of the homology groups are being aligned two consecutive
+   MATCH (n:pangenome) RETURN n
-times. After the initial alignment, input sequences are trimmed based on
-the longest start and end gap of the alignment. The parsimony
-informative positions are taken from the second alignment and
-concatenated into a sequence. When opening **informative.fasta** you can
-find 6 sequences, the length of the sequences being the number of
-parsimony-informative sites.
-.. code:: bash
+**Match and return 100 genes**
-   $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000
+.. code:: text
-IQ-tree generates several files, the tree that we later on in the
+   MATCH (g:gene) RETURN g LIMIT 100
-tutorial will continue with is called **informative.fasta.treefile**.
-When examining the **informative.fasta.iqtree** file you can find the
-best fit model of the data. This file also shows the number of sites
-that were used, as sites with gaps (which IQ-tree does not allow) were
-changed into singleton or constant sites.
-Gene distance tree
+**Match and return 100 genes and order them by length**
-~~~~~~~~~~~~~~~~~~
-To create a phylogeny based on gene distances (absence/presence), we can
+.. code:: text
-simply execute the Rscript that was created by
-:ref:`gene_classification <analysis/classification:gene classification>`.
-.. code:: bash
+   MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
-   $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R
+**The same query as before but results are now returned in a table**
-The resulting tree is called **gene_distance.tree**.
+.. code:: text
-K-mer distance tree
+   MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
-~~~~~~~~~~~~~~~~~~~
-To obtain a k-mer distance phylogeny, the k-mers must first be counted
+**Return genes which are longer as 100 but shorter than 250 bp** (this
-with the :ref:`kmer_classification <analysis/classification:k-mer
+can also be applied to other features such as exons introns or CDS)
-classification>`
-function. Afterwards, the tree can be constructed by executing the
-Rscript.
-.. code:: bash
+.. code:: text
-   $ pantools kmer_classification pecto_dickeya_DB
+   MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100
-   $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
-The resulting tree is written to **genome_kmer_distance.tree**.
+**Find genes located on first genome**
+.. code:: text
+   MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
+**Find genes located on first genome and first sequence**
+.. code:: text
+   MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
 --------------
-Renaming tree nodes
+Homology group queries
-------------------
+~~~~~~~~~~~~~~~~~~~~~~
+**Return 100 homology groups**
-So far, we used three different types of distances (SNPs, genes,
+.. code:: text
-k-mers), and two different methods (ML, NJ) to create three phylogenetic
-trees. First, lets take a look at the text files. The
-**informative.fasta.treefile** only contain genome numbers, bootstrap
-values and branch lengths but is lacking the metadata. Examining
-**gene_distance.tree** file also shows this information but the species
-names as well, because we included this as a phenotype during
-:ref:`gene_classification <analysis/classification:gene classification>`.
-Let’s include the strain identifiers to the core snp tree to make the
+   MATCH (h:homology_group) RETURN h LIMIT 100
-final figure more informative. Use the
-:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
-rename the tree nodes.
-.. code:: bash
+**Match homology groups which contain two members**
+.. code:: text
+   MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
+**Match homology groups and 'walk' to the genes and corresponding start
+and end node**
+.. code:: text
+   MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
+Turn off autocomplete by clicking on the button on the bottom right. The
+graph falls apart because relations were not assigned to variables.
+**The same query as before but now the relations do have variables**
+.. code:: text
+   MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
+When you turn off autocomplete again only the '*is_similar_to*' relation
+disappears since we did not call it
+**Find homology group that belong to the rpoC1 gene**
+.. code:: text
-   $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
+   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
-Take a look at **informative.fasta_RENAMED.treefile**, strain
+**Find genes on genome 1 which don't show homology**
-identifiers have been added to the tree.
+.. code:: text
+   MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
 --------------
-Visualizing the tree in iTOL
+Structural variant detection
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Go to https://itol.embl.de and click on “Upload a tree” under the
-**ANNOTATE** box. On this page you can paste the tree directly into the
-**tree text:** textbox or can click the button to upload the .newick
-file.
-.. figure:: /figures/tutorial_phylogeny1.png
-   :width: 600
-   :align: center
-Basic controls ITOL
-------------------
-  The default way of visualizing a tree is the rectangular view.
-   Depending on the number of genomes, the circular view can be easier
-   to interpret. You can the view by clicking on the “Display Mode”
-   buttons.
-  Increase the font size and branch width to improve readability
-  When visualizing a Maximum likelihood (ML) tree, bootstrap values can
-   be displayed by clicking the “Display” button next to
-   **Bootstrap/metadata** in the Advanced tab of the Control window.
-   This enables you to visualize the values as text or symbol on the
-   branch. or by coloring the branch or adjusting the width.
-.. figure:: /figures/tutorial_phylogeny2.png
-   :width: 600
-   :align: center
-  When you have a known out group or one of the genomes is a clear
-   outlier in the tree, you should reroot the tree. Hover over the name,
-   click it so a pop-up menu appears. Click “tree structure” followed by
-   “Reroot the tree here”.
-.. figure:: /figures/tutorial_phylogeny3.png
-   :width: 600
-   :align: center
-  Clicking on the name of a node in the tree allows you to color the
-   name, branch, or background of that specific node.
-  When you’re happy the way your tree looks, go to the Export tab of
-   the Control window. Select the desired output format, click on the
-   “Full image” button and export the file to a figure.
-  Refresh the webpage to go back to the default view of your tree.
-Create iTOL templates
---------------------
-In iTOL it is possible to add colors to the tree by coloring the
-terminal nodes or adding an outer ring. The PanTools function
-:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
-to create templates that allows for easy coloring (with maximum of 20
-possible colors). If the function is run without any additional
-argument, templates are created for trees that only contain genome
-numbers (e.g. k-mer distance tree). Here we want to color the (renamed)
-core SNP tree with the ‘low_temperature’ phenotype. Therefore, the
-``--phenotype`` strain_name must be included to the function.
-.. code:: bash
+**Find SNP bubbles (for simplification we only use the FF relation)**
-   $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
+.. code:: text
-   $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
+   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
+**The same query but returning the results in a table**
+.. code:: text
-Copy the two low_temperature.txt files from the label/strain_name/ and
+   MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
-ring/strain_name/ directories to your personal computer. Click and move
-the ring template file into the tree visualization webpage.
-.. figure:: /figures/tutorial_phylogeny4.png
+Functions such as **count()**, **sum()** and **stDev()** can be used in
-   :width: 600
+a query.
-   :align: center
-The resulting tree should look this when: the tree is rooted with the
+**The same SNP query but count the hits instead of displaying them**
-*Dickeya* genome, bootstrap values are displayed as text and the ring
-color template was included.
-.. figure:: /figures/tutorial_phylogeny_color.png
+.. code:: text
-   :width: 600
-   :align: center
+   MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
+--------------
+Hopefully you know have some feeling with the Neo4j browser and cypher
+and you're inspired to create your own queries!
+When you're done working in the browser, close the database (by using
+the command line again).
+.. code:: text
-Tree coloring is especially useful for large datasets. An example is
+   $ neo4j stop
-shown in the figure below, where members of the same species share a
-color.
-.. figure:: /figures/tutorial_phylogeny_tree.png
+| More information on Neo4j and the cypher language:
-   :width: 600
+| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
-   :align: center
+| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
+| `Neo4j API <https://neo4j.com/developer/>`_
 --------------
-..
+In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the
-   In :doc:`part 6 <tutorial_part6>` of the tutorial we go through a couple of use cases with the pangenomic read mapping functionality implemented in PanTools.
+functionalities to analyze the pangenome.