Skip to content
Snippets Groups Projects
Commit 934cbb5f authored by Workum, Dirk-Jan van's avatar Workum, Dirk-Jan van
Browse files

Merge branch 'update_tutorial' into 'develop'

Update tutorial

See merge request !228
parents 90ea877a d901debc
No related branches found
No related tags found
2 merge requests!251Create release v4.3.2,!228Update tutorial
Pipeline #110364 passed
Showing
with 1395 additions and 963 deletions
...@@ -10,6 +10,7 @@ All notable changes to Pantools will be documented in this file. ...@@ -10,6 +10,7 @@ All notable changes to Pantools will be documented in this file.
### Changed ### Changed
- No longer accept tab characters in fasta input files (!232). - No longer accept tab characters in fasta input files (!232).
- Updated and reordered the tutorial section of the documentation (!228).
### Fixed ### Fixed
- `pantools blast` no longer crashes if not all mrna nodes contain protein IDs (!226). - `pantools blast` no longer crashes if not all mrna nodes contain protein IDs (!226).
......
...@@ -243,7 +243,7 @@ simulation. ...@@ -243,7 +243,7 @@ simulation.
Heaps' law (a power law) can be fitted to the number of new genes Heaps' law (a power law) can be fitted to the number of new genes
observed when increasing the pangenome by one random genome. The formula observed when increasing the pangenome by one random genome. The formula
for the power law model is :math:`n = k * N^{-a}`, where *n* is the for the power law model is :math:`\Delta n = k * N^{-a}`, where *Δn* is the
newly discovered genes, *N* is the total number of genomes, and *k* and newly discovered genes, *N* is the total number of genomes, and *k* and
*a* are the fitting parameters. A pangenome can be considered open when *a* are the fitting parameters. A pangenome can be considered open when
*a* < 1 and closed if *a* > 1. *a* < 1 and closed if *a* > 1.
......
...@@ -41,7 +41,7 @@ members. ...@@ -41,7 +41,7 @@ members.
* - <databaseDirectory> * - <databaseDirectory>
- Path to the database root directory. - Path to the database root directory.
* - <homologyFile> * - <homologyFile>
- A text file with homology group node identifiers, seperated by a - A text file with homology group node identifiers, separated by a
comma. comma.
**Options** **Options**
...@@ -180,10 +180,10 @@ annotation node and extract the nucleotide and protein sequence. ...@@ -180,10 +180,10 @@ annotation node and extract the nucleotide and protein sequence.
* - ``--functions`` * - ``--functions``
- One or multiple function identifiers (GO, InterPro, PFAM, - One or multiple function identifiers (GO, InterPro, PFAM,
TIGRFAM), seperated by a comma. TIGRFAM), separated by a comma.
* - ``--nodes``/``-n`` * - ``--nodes``/``-n``
- One or multiple identifiers of function nodes (GO, InterPro, - One or multiple identifiers of function nodes (GO, InterPro,
PFAM, TIGRFAM), seperated by a comma. PFAM, TIGRFAM), separated by a comma.
* - ``--include``/``-i`` * - ``--include``/``-i``
- Only include a selection of genomes. - Only include a selection of genomes.
* - ``--exclude``/``-e`` * - ``--exclude``/``-e``
...@@ -283,9 +283,9 @@ specific GO terms of the hierarchy to a sequence. ...@@ -283,9 +283,9 @@ specific GO terms of the hierarchy to a sequence.
:widths: 30 70 :widths: 30 70
* - ``--functions`` * - ``--functions``
- One or multiple GO term identifiers, seperated by a comma. - One or multiple GO term identifiers, separated by a comma.
* - ``--nodes``/``-n`` * - ``--nodes``/``-n``
- One or multiple identifiers of 'GO' nodes, seperated by a comma. - One or multiple identifiers of 'GO' nodes, separated by a comma.
**Example commands** **Example commands**
.. code:: bash .. code:: bash
...@@ -322,9 +322,9 @@ their location in the hierarchy is reported. ...@@ -322,9 +322,9 @@ their location in the hierarchy is reported.
:widths: 30 70 :widths: 30 70
* - ``--functions`` * - ``--functions``
- One or multiple GO term identifiers, seperated by a comma. - One or multiple GO term identifiers, separated by a comma.
* - ``--nodes``/``-n`` * - ``--nodes``/``-n``
- One or multiple identifiers of 'GO' nodes, seperated by a comma. - One or multiple identifiers of 'GO' nodes, separated by a comma.
* - ``--include``/``-i`` * - ``--include``/``-i``
- Only include a selection of genomes. - Only include a selection of genomes.
* - ``--exclude``/``-e`` * - ``--exclude``/``-e``
...@@ -378,7 +378,7 @@ Report all available information of one or multiple homology groups. ...@@ -378,7 +378,7 @@ Report all available information of one or multiple homology groups.
To find Phobius (P) or SignalP (S) annotations, include: To find Phobius (P) or SignalP (S) annotations, include:
'secreted' (P/S), 'receptor' (P/S), or 'transmembrane' (P). 'secreted' (P/S), 'receptor' (P/S), or 'transmembrane' (P).
* - ``--genes``/``-g`` * - ``--genes``/``-g``
- One or multiple gene names, seperated by a comma. - One or multiple gene names, separated by a comma.
* - ``--node`` * - ``--node``
- Retrieve the nucleotide nodes belonging to genes in homology groups - Retrieve the nucleotide nodes belonging to genes in homology groups
......
...@@ -69,9 +69,9 @@ the second and third rank this will be 0.0010 and 0.0015, respectively. ...@@ -69,9 +69,9 @@ the second and third rank this will be 0.0010 and 0.0015, respectively.
:widths: 30 70 :widths: 30 70
* - ``--homology-file``/``-H`` * - ``--homology-file``/``-H``
- A text file with homology group node identifiers, seperated by a comma. - A text file with homology group node identifiers, separated by a comma.
* - ``--nodes``/``-n`` * - ``--nodes``/``-n``
- mRNA node identifiers, seperated by a comma on the command line. - mRNA node identifiers, separated by a comma on the command line.
* - ``--include``/``-i`` * - ``--include``/``-i``
- Only include a selection of genomes. - Only include a selection of genomes.
* - ``--exclude``/``-e`` * - ``--exclude``/``-e``
......
...@@ -695,7 +695,7 @@ for the visualization of phylogenies in iTOL. Phenotypes must already be ...@@ -695,7 +695,7 @@ for the visualization of phylogenies in iTOL. Phenotypes must already be
included in the pangenome with the included in the pangenome with the
:ref:`add_phenotypes <construction/annotate:add phenotypes>` functionality. How :ref:`add_phenotypes <construction/annotate:add phenotypes>` functionality. How
to use the template files in iTOL can be found in one of the to use the template files in iTOL can be found in one of the
:ref:`tutorials <tutorial/tutorial_part5:create itol templates>`. :ref:`tutorials <tutorial/tutorial_part3:create itol templates>`.
If you run this function without a ``--phenotype`` argument, templates If you run this function without a ``--phenotype`` argument, templates
are created for trees that contain only genome numbers as node labels. are created for trees that contain only genome numbers as node labels.
......
...@@ -40,6 +40,9 @@ selected and the second annotation of genome 2 and 3. ...@@ -40,6 +40,9 @@ selected and the second annotation of genome 2 and 3.
if the GFF3 file contains a trans-spliced gene that has alternative if the GFF3 file contains a trans-spliced gene that has alternative
splicing, it will not be able to handle it (it will only annotate one splicing, it will not be able to handle it (it will only annotate one
mRNA). mRNA).
The PanUtils project contains a :ref:`data filtering pipeline
<getting_started/diy_pangenomics:Quality control pipeline>` that fixes most
common issues encountered by the parser.
**Parameters** **Parameters**
.. list-table:: .. list-table::
......
...@@ -148,7 +148,7 @@ argument to the command. ...@@ -148,7 +148,7 @@ argument to the command.
| **What BUSCO benchmark set to use** | **What BUSCO benchmark set to use**
- When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9 - When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9
set, and untar it with ``tar -xvzf``. Include the entire directory in set, and untar it with ``tar xvzf``. Include the entire directory in
the command using the ``--input-file`` argument. the command using the ``--input-file`` argument.
- For **BUSCO v4** or **v5**, you only have to provide the odb10 database - For **BUSCO v4** or **v5**, you only have to provide the odb10 database
name with the ``--input-file`` argument, the database is downloaded name with the ``--input-file`` argument, the database is downloaded
...@@ -493,5 +493,5 @@ settings that were used. ...@@ -493,5 +493,5 @@ settings that were used.
- **current_pantools_homology_groups.txt**, overview of the active - **current_pantools_homology_groups.txt**, overview of the active
homology groups. Each line represents one homology group. The line homology groups. Each line represents one homology group. The line
starts with the homology group (database) identifier followed by a starts with the homology group (database) identifier followed by a
colon and the rest are mRNA IDs (from gff/genbank) seperated by a colon and the rest are mRNA IDs (from gff/genbank) separated by a
space. space.
...@@ -56,7 +56,7 @@ downloaded from https://github.com/refresh-bio/KMC/releases. ...@@ -56,7 +56,7 @@ downloaded from https://github.com/refresh-bio/KMC/releases.
.. code:: bash .. code:: bash
$ tar -xvzf KMC* #uncompress the KMC binaries $ tar xvzf KMC* #uncompress the KMC binaries
# Edit your ~/.bashrc to include KMC to your PATH # Edit your ~/.bashrc to include KMC to your PATH
$ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer $ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer
...@@ -76,7 +76,7 @@ https://micans.org/mcl under License & software. ...@@ -76,7 +76,7 @@ https://micans.org/mcl under License & software.
.. code:: bash .. code:: bash
$ wget https://micans.org/mcl/src/mcl-14-137.tar.gz $ wget https://micans.org/mcl/src/mcl-14-137.tar.gz
$ tar -xvzf mcl-* $ tar xvzf mcl-*
$ cd mcl-14-137 $ cd mcl-14-137
$ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer $ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer
$ make install $ make install
...@@ -405,7 +405,7 @@ The following commands can be used to test and build the documentation: ...@@ -405,7 +405,7 @@ The following commands can be used to test and build the documentation:
.. code:: bash .. code:: bash
# Test the documentation # Test the documentation
sphinx-lint sphinx-lint -e=all --max-line-length=80 sphinx-lint -e=all --max-line-length=80
# Build the documentation # Build the documentation
sphinx-build -W docs/source output sphinx-build -W docs/source output
......
DIY Pangenomics
===============
Likely you want to do pangenomics using your own data, on your own system.
To support this, we offer several tools:
* `PanTools`_ : core application to construct and analyze pangenomes
* `PanUtils`_: utilities to support pangenomics (e.g. QC)
* `PanBrowse`_: applications for pangenome visualization (e.g. PanVA)
When working with these software tools, consider the following tips:
* Separate issues in the (technical) setup from issues in the data, so try
the tool on known (test) data first.
* Garbage in, garbage out! Therefore, perform rigorous quality control on the
input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline.
* Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before
scaling up.
PanTools
--------
PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package
for pangenome construction and analysis. It works across the tree of life,
from small viruses to large plant, animal, or human genomes. Because it is not
alignment based, but employs a compacted De Bruijn Graph (cDBG) it works over
large(r) evolutionary distances, e.g. at species, genus or family level.
PanUtils
--------
We have been working on several pangenome utilities, collected in
https://github.com/PanUtils. The repository currently contains three pipelines:
one for quality control, one for pangenome construction and one for analysis
and preparing a PanVA instance.
Quality control pipeline
^^^^^^^^^^^^^^^^^^^^^^^^
This is a Snakemake pipeline that can be used for running quality control on
the data. We highly recommend this pipeline for checking your files, as
annotation files can be very tricky to work with (and therefore sometimes
cause issues for tools like PanTools). This pipeline can be found at
https://github.com/PanUtils/pantools-qc-pipeline which includes a README on
how to install and run it. In summary, one needs to create a conda environment
with the most recent version of Snakemake (at the moment of writing this is
8.25.5). Please follow the instructions on which configuration to set and
where all files should be placed.
The pipeline has four different workflows:
* raw_statistics: creates an overview of the statistics of the input data
* filter: can be used for filtering genome and annotation files
* proteins: creates protein files including statistics
* functions: creates functional annotations for all proteins in the protein
files
PanTools v4 pipeline
^^^^^^^^^^^^^^^^^^^^
The PanTools Snakemake pipeline runs through all major
PanTools functionalities. You can use this pipeline for inspiration on what
can be done with PanTools or to actually run PanTools.
Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code,
including README that summarizes how to install, configure and run it.
.. warning::
It is not recommended to use this pipeline with new experimental data;
adjustments to data and PanTools command settings might have to be made
during the process, for which this pipeline lacks the flexibility.
The pipeline has three different workflows:
* all_panproteome: runs through all major PanTools functionalities available
for a panproteome
* all_pangenome: runs through all major PanTools functionalities available for
a pangenome
* panva: runs all PanTools functions necessary to set up a PanVA instance
PanVA conversion
^^^^^^^^^^^^^^^^
If one has already created a pangenome database with PanTools that needs to be
converted to the data structure necessary for setting up a PanVA instance,
please use this repository: https://github.com/PanUtils/export-to-panva. It
contains a README on how to install and run it. In summary, one needs to
create an environment using mamba from a predefined YAML file, configure
settings in a config.ini file and then run the python script which takes the
config.ini as input.
PanBrowse
---------
We work on visualization methods for (interactive) visualization of
complex pangenomes. PanBrowse (https://github.com/PanBrowse) is where
we host the visualization applications.
PanVA
^^^^^
`PanVA <https://github.com/PanBrowse/PanVA>`_ is a web application
allowing users to visually and interatively explore sequence variants
in pangenomes (generated by PanTools). It provides context for these
variants by displaying their corresponding annotations, phylogenetic
and phenotypic information.
Technical setup validation
==========================
Performance is important when running PanTools on larger datasets. Therefore,
it is recommended to test your PanTools installation on one or more of the
test data sets below, before applying it to your own data.
PanTools pangenome construction commands make a large amount of
transactions to the database and should be built on a location that can
handle this for larger pangenomes. We recommend using an SSD or a RAM-disk
(/dev/shm/, available on Linux machines). If pangenome construction takes
too much time, the location of the database might be the issue. For reference,
the building time of a small yeast data set of 10 genomes on our RAM-disk takes
a little over two minutes, while it takes nearly 30 minutes on a network disk!
Larger datasets exacerbate this gap in runtime. Disk size needed for
construction depends on input data (genome size and number of genomes),
for 8 arabidopsis genomes ±20GB should suffice, for 3 lettuce genomes ±250GB.
Java heap space needed depends on the same parameters: in general it is
recommendable to set this to larger than expected because java running out
of heap space makes the tool crash (e.g. for lettuce we used
-Xmx200g -Xms200g).
.. warning::
If you want to do everything in a single directory, make sure to download the
data on the disk where you want to write the output. If the input data is on
a network disk, adjust the commands to write the pangenome database on a
local disk (RAM or SSD).
Ten yeast genomes
-----------------
Download the data from:
https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
.. code:: bash
$ wget https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz
$ tar zxvf yeast_10strains.tgz
Build the pangenome:
.. code:: bash
$ pantools -Xms20g -Xmx50g build_pangenome -t16 yeast_db yeast_10strains/metadata/genome_locations.txt
Add annotations:
.. code:: bash
$ pantools -Xms20g -Xmx50g add_annotations yeast_db yeast_10strains/metadata/annotation_locations.txt
Group proteins:
.. code:: bash
$ pantools -Xms20g -Xmx50g group yeast_db -t16 --relaxation=3
Run time statistics (16 threads on RAM-disk):
.. csv-table::
:file: /tables/runtime_statistics_yeast.csv
:header-rows: 1
:delim: ,
Two *Arabidopsis* genomes
-------------------------
Start this use case with a bash script to download the data. Download the
script from here:
https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh
To download the *Arabidopsis* data, run the script:
.. code:: bash
$ bash download_2_ara.sh <threads>
Build the pangenome:
.. code:: bash
$ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=17 ara_db a_thaliana_2genomes/metadata/genome_locations.txt
Add annotations:
.. code:: bash
$ pantools -Xms40g -Xmx80g add_annotations ara_db a_thaliana_2genomes/metadata/annotation_locations.txt
Group proteins:
.. code:: bash
$ pantools -Xms40g -Xmx80g group ara_db -t50 --relaxation=3
Run time statistics (50 threads on RAM-disk)
.. csv-table::
:file: /tables/runtime_statistics_arabidopsis.csv
:header-rows: 1
:delim: ,
Three tomato genomes
--------------------
Download the script from here:
https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz
Extract the tomato data:
.. code:: bash
$ tar xvzf tomato_3genomes.tar.gz
Build the pangenome:
.. code:: bash
$ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=13 tomato_db tomato_3genomes/metadata/genome_locations.txt
Add annotations:
.. code:: bash
$ pantools -Xms40g -Xmx80g add_annotations tomato_db tomato_3genomes/metadata/annotation_locations.txt
Group proteins:
.. code:: bash
$ pantools -Xms40g -Xmx80g group tomato_db -t50 --relaxation=3
Run time statistics (50 threads on RAM-disk)
.. csv-table::
:file: /tables/runtime_statistics_tomato.csv
:header-rows: 1
:delim: ,
...@@ -91,12 +91,24 @@ Contents ...@@ -91,12 +91,24 @@ Contents
:caption: Getting Started :caption: Getting Started
:maxdepth: 1 :maxdepth: 1
getting_started/diy_pangenomics
Install <getting_started/install> Install <getting_started/install>
getting_started/technical_setup
getting_started/differences getting_started/differences
getting_started/workflows getting_started/workflows
getting_started/selection getting_started/selection
getting_started/query getting_started/query
.. toctree::
:caption: Tutorial PanTools
:maxdepth: 1
Tutorial 1 - Pangenome construction <tutorial/tutorial_part1>
Tutorial 2 - Pangenome analysis <tutorial/tutorial_part2>
Tutorial 3 - Phylogeny <tutorial/tutorial_part3>
Tutorial 4 - Visualization (PanVA) <tutorial/tutorial_part4>
Tutorial 5 - Neo4j browser <tutorial/tutorial_part5>
.. toctree:: .. toctree::
:caption: Pangenome Construction :caption: Pangenome Construction
:maxdepth: 1 :maxdepth: 1
...@@ -123,16 +135,6 @@ Contents ...@@ -123,16 +135,6 @@ Contents
analysis/support analysis/support
analysis/dn_ds analysis/dn_ds
.. toctree::
:caption: Tutorial
:maxdepth: 1
Tutorial 1 - Install PanTools <tutorial/tutorial_part1>
Tutorial 2 - Construct pangenome <tutorial/tutorial_part2>
Tutorial 3 - Neo4j browser <tutorial/tutorial_part3>
Tutorial 4 - Characterization <tutorial/tutorial_part4>
Tutorial 5 - Phylogeny <tutorial/tutorial_part5>
.. toctree:: .. toctree::
:caption: Developer guide :caption: Developer guide
:maxdepth: 1 :maxdepth: 1
...@@ -140,4 +142,3 @@ Contents ...@@ -140,4 +142,3 @@ Contents
Installation <developer_guide/install> Installation <developer_guide/install>
Release steps <developer_guide/release> Release steps <developer_guide/release>
Database structure <developer_guide/database> Database structure <developer_guide/database>
command,running time (h:m:s),CPU time (s)
build_pangenome,0:27:38,3102.41
add_annotations,0:00:52,93.56
group,0:02:25,507.60
\ No newline at end of file
command,running time (h:m:s),CPU time (s)
build_pangenome,20:28:33,257610.72
add_annotations,0:01:35,181.19
group,0:03:02,5861.83
\ No newline at end of file
command,running time (h:m:s),CPU time (s)
build_pangenome,0:02:23,276.50
add_annotations,0:00:35,51.82
group,0:01:58,1351.73
\ No newline at end of file
Part 1. Install PanTools Part 1. Build your own pangenome using PanTools
======================== ===============================================
For instructions on how to install PanTools, Install PanTools from bioconda following the
see :doc:`/getting_started/install`. :ref:`install <getting_started/install:Install from bioconda>`
instructions.
To demonstrate the main functionalities of PanTools we use a small
chloroplasts dataset to avoid long construction times.
.. csv-table::
:file: /tables/chloroplast_datasets.csv
:header-rows: 1
:delim: ;
Download the chloroplast fasta and gff files
`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_
or via wget.
.. code:: bash
$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz
$ tar xvzf chloroplasts.tar.gz #unpack the archive
--------------
BUILD, ANNOTATE and GROUP
-------------------------
We start with building a pangenome using four of the five chloroplast
genomes. For this you need a text file which directs PanTools to the
FASTA files. Call your text file **genome_locations.txt** and include
the following lines:
.. code:: text
YOUR_PATH/C_sativus.fasta
YOUR_PATH/O_sativa.fasta
YOUR_PATH/S_lycopersicum.fasta
YOUR_PATH/S_tuberosum.fasta
Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The text
file should only contain full paths to FASTA files, no additional spaces or
empty lines. Then run PanTools with the
:ref:`build_pangenome <construction/build:build pangenome>` function and
include the text file
.. code:: bash
$ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt
Did the program run without any error messages? Congratulations, you’ve
built your first pangenome!
Adding additional genomes
~~~~~~~~~~~~~~~~~~~~~~~~~
PanTools has the ability to add additional genomes to an already
existing pangenome. To test the function of PanTools, prepare a text
file containing the path to the Maize chloroplast genome. Call your
text file **fifth_genome_location.txt** and include the following
line to the file:
.. code:: text
YOUR_PATH/Z_mays.fasta
Run PanTools on the new text file and using the
:ref:`add_genomes <construction/build:add genomes>` function
.. code:: bash
$ pantools add_genomes chloroplast_DB fifth_genome_location.txt
Adding annotations
~~~~~~~~~~~~~~~~~~
To include gene annotations to the pangenome, prepare a text file
containing paths to the GFF files. Call your text file
**annotation_locations.txt** and include the following lines into the
file:
.. code:: text
1 YOUR_PATH/C_sativus.gff3
2 YOUR_PATH/O_sativa.gff3
3 YOUR_PATH/S_lycopersicum.gff3
4 YOUR_PATH/S_tuberosum.gff3
5 YOUR_PATH/Z_mays.gff3
Run PanTools using the :ref:`add_annotations <construction/annotate:add
annotations>` function and include the new text file. Also add the
``--connect`` flag to attach the annotations to the nucleotide nodes;
this is useful for exploring the pangenome in the Neo4j browser later
in the tutorial.
.. code:: bash
$ pantools add_annotations --connect chloroplast_DB annotation_locations.txt
Homology grouping
~~~~~~~~~~~~~~~~~
PanTools can infer homology between the protein sequences of a pangenome
and cluster them into homology groups with the
:ref:`group <construction/group:group>` function. The ``--relaxation``
parameter must be set to influence the sensitivity, If you don't know the
best setting for the grouping relaxation, you can use the
:ref:`busco_protein <construction/group:busco protein>` and
:ref:`optimal grouping <construction/group:optimal grouping>` functions
instead, but for now we will use a relaxation of 4.
.. code:: bash
$ pantools group --relaxation=4 chloroplast_DB
--------------
Adding phenotypes
-----------------
Phenotype values can be Integers, Double, String or Boolean values.
Create a text file **phenotypes.txt**.
.. code:: text
Genome,Solanum
1,false
2,false
3,true
4,true
5,false
And use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add the
information to the pangenome.
.. code:: bash
$ pantools add_phenotypes chloroplast_DB phenotypes.txt
--------------
RETRIEVE functions
------------------
Now that the construction is complete, lets quickly validate if the
construction was successful and the database can be used. To retrieve
some genomic regions, prepare a text file containing genomic
coordinates. Create the file **regions.txt** and include the following
for each region: genome number, contig number, start and stop position
and separate them by a single space
.. code:: text
1 1 200 500
2 1 300 700
3 1 1 10000
3 1 1 10000 -
4 1 9999 15000
5 1 100000 110000
Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>`
function and include the new text file
.. code:: bash
$ pantools retrieve_regions chloroplast_DB regions.txt
Take a look at the extracted regions that are written to the
**chloroplast_DB/retrieval/regions/** directory.
To retrieve entire genomes, prepare a text file **genome_numbers.txt**
and include each genome number on a separate line in the file
.. code:: text
1
3
5
Use the **retrieve_regions** function again but include the new text
file
.. code:: bash
$ pantools retrieve_regions chloroplast_DB genome_numbers.txt
Genome files are written to same directory as before. Take a look at one
of the three genomes you have just retrieved.
In :doc:`part 2 <tutorial_part2>` of the tutorial we explore the
pangenome you just built using the Neo4j browser and the Cypher language.
This diff is collapsed.
Part 3. Explore the pangenome using the Neo4j browser Part 3. Phylogeny
===================================================== =================
Did you skip :doc:`part 2 <tutorial_part2>` of the tutorial or were you Part 3 preparation
unable to build the chloroplast pangenome? Download the pre-constructed ------------------
pangenome
`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_ If you did not follow part 2 of the tutorial, download the
or via wget. pre-constructed pangenome
`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
.. code:: bash .. code:: bash
$ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
$ tar -xvzf chloroplast_DB.tar.gz $ tar xvzf pecto_dickeya_DB.tar.gz
-------------- --------------
Configuring Neo4j Adding phenotype/metadata to the pangenome
----------------- ------------------------------------------
Set the full path to the chloroplast pangenome database by opening Before we construct the trees, we will add some phenotype data to the
neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the pangenome. Once the we have a phylogeny, the information can be included
following line in the config file. Please make sure there is always only or be used to color parts of the tree. Below is a textfile with data for
a single uncommented line with 'dbms.directories.data'. three phenotypes. The third phenotype, *low_temperature*, is in this
case a made up example! It states whether the strain is capable of
growing on (extreme) low temperatures. The phenotype file can be found
inside the database directory, add the information to the pangenome by
using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
.. code:: text .. code:: text
#dbms.directories.data=/YOUR_PATH/any_other_database Genome, species, strain_name, low_temperature
dbms.directories.data=/YOUR_PATH/chloroplast_DB 1,P. odoriferum,P. odoriferum Q166, false
2,P. fontis, P. fontis M022, true
| **Allowing non-local connections** 3,P. polaris,P. polaris S4.16.03.2B, false
| To be able to run Neo4j on a server and have access to it from 4,P. brasiliense, P. brasiliense S2, true
anywhere, some additional lines in the config file must be changed. 5,P. brasiliense, P. brasiliense Y49, false
6,D. dadantii, D. dadantii 3937,?
- **Uncomment** the four following lines in
neo4j-community-3.5.30/conf/neo4j.conf.
- Replace 7686, 7474, and 7473 by three different numbers that are not
in use by other people on your server. In this way, everyone can have
their own database running at the same time.
.. code:: text
#dbms.connectors.default_listen_address=0.0.0.0
#dbms.connector.bolt.listen_address=:7687
#dbms.connector.http.listen_address=:7474
#dbms.connector.https.listen_address=:7473
Lets start up the Neo4j server!
.. code:: bash
$ neo4j start
Start Firefox (or a web browser of your own preference) and let it run
on the background.
.. code:: bash .. code:: bash
$ firefox & $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt
In case you did not change the config to allow non-local connections,
browse to *http://localhost:7474*. Whenever you did change the config
file, go to *server_address:7474*, where 7474 should be replaced with the
number you chose earlier.
If the database startup was successful, a login terminal will appear in
the webpage. Use '*neo4j*' both as username and password. After logging
in, you are requested to set a new password.
-------------- --------------
Exploring nodes and edges in Neo4j Constructing a phylogeny
---------------------------------- ------------------------
Go through the following steps to become proficient in using the Neo4j
browser and the underlying PanTools data structure. If you have any
difficulty trouble finding a node, relationship or any type of
information, download and use `this visual guide
<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
1. Click on the database icon on the left. A menu with all node types
and relationship types will appear.
2. Click on the '*gene*' button in the node label section. This
automatically generated a query. Execute the query.
3. The **LIMIT** clause prevents large numbers of nodes popping up to
avoid your web browser from crashing. Set LIMIT to 10 and execute the
query.
4. Hover over the nodes, click on them and take a look at the values
stored in the nodes. All these features (except ID) were extracted
from the GFF annotation files. ID is an unique number automatically
assigned to nodes and relationships by Neo4j.
5. Double-click on the **matK** gene node, all nodes with a connection
to this gene node will appear. The nodes have distinct colors as
these are different node types, such as **mRNA**, **CDS**,
**nucleotide**. Take a look at the node properties to observe that
most values and information is specific to a certain node type.
6. Double-click on the *matK* mRNA node, a **homology_group** node
should appear. These type of nodes connect homologous genes in the
graph. However, you can see this gene did not cluster with any other
gene.
7. Hover over the **start** relation of the *matK* gene node. As you
can see information is not only stored in nodes, but also in
relationships! A relationship always has a certain direction, in
this case the relation starts at the gene node and points to a
nucleotide node. Offset marks the location within the node.
8. Double-click on the **nucleotide** node at the end of the 'start'
relationship. An in- and outgoing relation appear that connect to
other nucleotide nodes. Hover over both the relations and compare
them. The relations holds the genomic coordinates and shows this
path only occurs in contig/sequence 1 of genome 1.
9. Follow the outgoing **FF**-relationship to the next nucleotide node
and expand this node by double-clicking. Three nodes will pop up
this time. If you hover over the relations you see the coordinates
belong to other genomes as well. You may also notice the
relationships between nucleotide nodes is always a two letter
combination of F (forward) and R (reverse) which state if a sequence
is reverse complemented or not. The first letter corresponds to the
sequence of the node at the start of the relation where the second
letters refers to the sequence of the end node.
10. Finally, execute the following query to call the database scheme to
see how all node types are connected to each other: *CALL
db.schema()*. The schema will be useful when designing your own
queries!
--------------
Query the pangenome database using CYPHER
-----------------------------------------
Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
represent patterns. Nodes are represented by circles and relationships
by arrows.
- The **MATCH** clause allows you to specify the patterns Neo4j will
search for in the database.
- With **WHERE** you can add constraints to the patterns described.
- In the **RETURN** clause you define which parts of the pattern to
display.
Cypher queries In this tutorial we will construct three phylogenies, each based on a
~~~~~~~~~~~~~~ different type of variation: SNPs, genes and k-mers. Take a look at the
phylogeny manuals to get an understanding how the three methods work and
how they differ from each other.
**Match and return 100 nucleotide nodes** 1. :ref:`analysis/phylogeny:core phylogeny`
2. :ref:`analysis/phylogeny:gene distance tree`
3. :ref:`analysis/phylogeny:k-mer distance tree`
.. code:: text Core SNP phylogeny
------------------
MATCH (n:nucleotide) RETURN n LIMIT 100
**Find all the genome nodes** The core SNP phylogeny will run various Maximum Likelihood models on
parsimony informative sites of single-copy orthologous sequences. A site
is parsimony-informative when there are at least two types of
nucleotides that occur with a minimum frequency of two. The informative
sites are automatically identified by aligning the sequences; however,
it does not know which sequences are single-copy orthologous. You can
identify these conserved sequences by running
:ref:`gene_classification <analysis/classification:gene classification>`.
.. code:: text .. code:: bash
MATCH (n:genome) RETURN n $ pantools gene_classification --phenotype=species pecto_dickeya_DB
**Find the pangenome node** Open **gene_classification_overview.txt** and take a look at statistics.
As you can see there are 2134 single-copy ortholog groups. Normally, all
of these groups are aligned to identify SNPs but for this tutorial we’ll
make a selection of only a few groups to accelerate the steps. You can
do this in two different ways:
.. code:: text Option 1: Open **single_copy_orthologs.csv** and remove all node
identifiers after the first 20 homology groups and save the file.
MATCH (n:pangenome) RETURN n .. code:: bash
**Match and return 100 genes** $ pantools core_phylogeny --mode=ML pecto_dickeya_DB
.. code:: text Option 2: Open **single_copy_orthologs.csv** and select the first 20
homology_group node identifiers. Place them in a new file sco_groups.txt
and include this file to the function.
MATCH (g:gene) RETURN g LIMIT 100 .. code:: bash
**Match and return 100 genes and order them by length** $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB
.. code:: text The sequences of the homology groups are being aligned two consecutive
times. After the initial alignment, input sequences are trimmed based on
the longest start and end gap of the alignment. The parsimony
informative positions are taken from the second alignment and
concatenated into a sequence. When opening **informative.fasta** you can
find 6 sequences, the length of the sequences being the number of
parsimony-informative sites.
MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100 .. code:: bash
**The same query as before but results are now returned in a table** $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000
.. code:: text IQ-tree generates several files, the tree that we later on in the
tutorial will continue with is called **informative.fasta.treefile**.
When examining the **informative.fasta.iqtree** file you can find the
best fit model of the data. This file also shows the number of sites
that were used, as sites with gaps (which IQ-tree does not allow) were
changed into singleton or constant sites.
MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100 Gene distance tree
~~~~~~~~~~~~~~~~~~
**Return genes which are longer as 100 but shorter than 250 bp** (this To create a phylogeny based on gene distances (absence/presence), we can
can also be applied to other features such as exons introns or CDS) simply execute the Rscript that was created by
:ref:`gene_classification <analysis/classification:gene classification>`.
.. code:: text .. code:: bash
MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100 $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R
**Find genes located on first genome** The resulting tree is called **gene_distance.tree**.
.. code:: text K-mer distance tree
~~~~~~~~~~~~~~~~~~~
MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100 To obtain a k-mer distance phylogeny, the k-mers must first be counted
with the :ref:`kmer_classification <analysis/classification:k-mer
classification>`
function. Afterwards, the tree can be constructed by executing the
Rscript.
**Find genes located on first genome and first sequence** .. code:: bash
.. code:: text $ pantools kmer_classification pecto_dickeya_DB
$ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100 The resulting tree is written to **genome_kmer_distance.tree**.
-------------- --------------
Homology group queries Renaming tree nodes
~~~~~~~~~~~~~~~~~~~~~~ -------------------
**Return 100 homology groups** So far, we used three different types of distances (SNPs, genes,
k-mers), and two different methods (ML, NJ) to create three phylogenetic
.. code:: text trees. First, lets take a look at the text files. The
**informative.fasta.treefile** only contain genome numbers, bootstrap
values and branch lengths but is lacking the metadata. Examining
**gene_distance.tree** file also shows this information but the species
names as well, because we included this as a phenotype during
:ref:`gene_classification <analysis/classification:gene classification>`.
MATCH (h:homology_group) RETURN h LIMIT 100 Let’s include the strain identifiers to the core snp tree to make the
final figure more informative. Use the
:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
rename the tree nodes.
**Match homology groups which contain two members** .. code:: bash
.. code:: text
MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
**Match homology groups and 'walk' to the genes and corresponding start
and end node**
.. code:: text
MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
Turn off autocomplete by clicking on the button on the bottom right. The
graph falls apart because relations were not assigned to variables.
**The same query as before but now the relations do have variables**
.. code:: text
MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
When you turn off autocomplete again only the '*is_similar_to*' relation
disappears since we did not call it
**Find homology group that belong to the rpoC1 gene**
.. code:: text
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
**Find genes on genome 1 which don't show homology**
.. code:: text $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN * Take a look at **informative.fasta_RENAMED.treefile**, strain
identifiers have been added to the tree.
-------------- --------------
Structural variant detection Visualizing the tree in iTOL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ----------------------------
**Find SNP bubbles (for simplification we only use the FF relation)** Go to https://itol.embl.de and click on “Upload a tree” under the
**ANNOTATE** box. On this page you can paste the tree directly into the
.. code:: text **tree text:** textbox or can click the button to upload the .newick
file.
MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
.. figure:: /figures/tutorial_phylogeny1.png
**The same query but returning the results in a table** :width: 600
:align: center
.. code:: text
Basic controls iTOL
-------------------
- The default way of visualizing a tree is the rectangular view.
Depending on the number of genomes, the circular view can be easier
to interpret. You can the view by clicking on the “Display Mode”
buttons.
- Increase the font size and branch width to improve readability
- When visualizing a Maximum likelihood (ML) tree, bootstrap values can
be displayed by clicking the “Display” button next to
**Bootstrap/metadata** in the Advanced tab of the Control window.
This enables you to visualize the values as text or symbol on the
branch. or by coloring the branch or adjusting the width.
.. figure:: /figures/tutorial_phylogeny2.png
:width: 600
:align: center
- When you have a known out group or one of the genomes is a clear
outlier in the tree, you should reroot the tree. Hover over the name,
click it so a pop-up menu appears. Click “tree structure” followed by
“Reroot the tree here”.
.. figure:: /figures/tutorial_phylogeny3.png
:width: 600
:align: center
- Clicking on the name of a node in the tree allows you to color the
name, branch, or background of that specific node.
- When you’re happy the way your tree looks, go to the Export tab of
the Control window. Select the desired output format, click on the
“Full image” button and export the file to a figure.
- Refresh the webpage to go back to the default view of your tree.
Create iTOL templates
---------------------
In iTOL it is possible to add colors to the tree by coloring the
terminal nodes or adding an outer ring. The PanTools function
:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
to create templates that allows for easy coloring (with maximum of 20
possible colors). If the function is run without any additional
argument, templates are created for trees that only contain genome
numbers (e.g. k-mer distance tree). Here we want to color the (renamed)
core SNP tree with the ‘low_temperature’ phenotype. Therefore, the
``--phenotype`` strain_name must be included to the function.
MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50 .. code:: bash
Functions such as **count()**, **sum()** and **stDev()** can be used in
a query.
**The same SNP query but count the hits instead of displaying them**
.. code:: text
MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p) $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
$ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
-------------- Copy the two low_temperature.txt files from the label/strain_name/ and
ring/strain_name/ directories to your personal computer. Click and move
the ring template file into the tree visualization webpage.
Hopefully you know have some feeling with the Neo4j browser and cypher .. figure:: /figures/tutorial_phylogeny4.png
and you're inspired to create your own queries! :width: 600
:align: center
When you're done working in the browser, close the database (by using The resulting tree should look this when: the tree is rooted with the
the command line again). *Dickeya* genome, bootstrap values are displayed as text and the ring
color template was included.
.. code:: text .. figure:: /figures/tutorial_phylogeny_color.png
:width: 600
:align: center
$ neo4j stop Tree coloring is especially useful for large datasets. An example is
shown in the figure below, where members of the same species share a
color.
| More information on Neo4j and the cypher language: .. figure:: /figures/tutorial_phylogeny_tree.png
| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_ :width: 600
| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_ :align: center
| `Neo4j API <https://neo4j.com/developer/>`_
-------------- --------------
In :doc:`part 4 <tutorial_part4>` of the tutorial we explore some of the
functionalities to analyze the pangenome.
This diff is collapsed.
Part 5. Phylogeny Part 5. Explore the pangenome using the Neo4j browser
================= =====================================================
Part 5 preparation Did you skip :doc:`part 1 <tutorial_part1>` of the tutorial or were you
------------------ unable to build the chloroplast pangenome? Download the pre-constructed
pangenome
Pantools v3 is required to follow this part of the tutorial. In `here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
addition, MAFFT, FastTree, IQ-tree, R (and the ape R package) need to be or via wget.
installed and set to your $PATH. Validate if the tools are executable by
using the following commands.
.. code:: bash .. code:: bash
pantools --version $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
Rscript --help $ tar xvzf chloroplast_DB.tar.gz
mafft -h
iqtree -h
fasttree -h
If you did not follow part 4 of the tutorial, download the --------------
pre-constructed pangenome
`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
.. code:: bash Configuring Neo4j
-----------------
$ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz Set the full path to the chloroplast pangenome database by opening
$ tar -xvzf pecto_dickeya_DB.tar.gz neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the
following line in the config file. Please make sure there is always only
a single uncommented line with 'dbms.directories.data'.
-------------- .. code:: text
Adding phenotype/metadata to the pangenome #dbms.directories.data=/YOUR_PATH/any_other_database
------------------------------------------ dbms.directories.data=/YOUR_PATH/chloroplast_DB
Before we construct the trees, we will add some phenotype data to the | **Allowing non-local connections**
pangenome. Once the we have a phylogeny, the information can be included | To be able to run Neo4j on a server and have access to it from
or be used to color parts of the tree. Below is a textfile with data for anywhere, some additional lines in the config file must be changed.
three phenotypes. The third phenotype, *low_temperature*, is in this
case a made up example! It states whether the strain is capable of - **Uncomment** the four following lines in
growing on (extreme) low temperatures. The phenotype file can be found neo4j-community-3.5.30/conf/neo4j.conf.
inside the database directory, add the information to the pangenome by - Replace 7686, 7474 and 7473 by any port number not in use on your system.
using :ref:`add_phenotypes <construction/annotate:add phenotypes>`. This way everyone can have their own database running at the same time.
.. code:: text .. code:: text
Genome, species, strain_name, low_temperature #dbms.connectors.default_listen_address=0.0.0.0
1,P. odoriferum,P. odoriferum Q166, false #dbms.connector.bolt.listen_address=:7687
2,P. fontis, P. fontis M022, true #dbms.connector.http.listen_address=:7474
3,P. polaris,P. polaris S4.16.03.2B, false #dbms.connector.https.listen_address=:7473
4,P. brasiliense, P. brasiliense S2, true
5,P. brasiliense, P. brasiliense Y49, false Lets start up the Neo4j server!
6,D. dadantii, D. dadantii 3937,?
.. code:: bash
$ neo4j start
Start Firefox (or a web browser of your own preference) and let it run
on the background.
.. code:: bash .. code:: bash
$ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt $ firefox &
In case you did not change the config to allow non-local connections,
browse to *http://localhost:7474*. Whenever you did change the config
file, go to *server_address:7474*, where 7474 should be replaced with the
number you chose earlier, and *server_address* is the address set in the config.
If the database startup was successful, a login terminal will appear in
the webpage. Use '*neo4j*' both as username and password. After logging
in, you are requested to set a new password.
-------------- --------------
Constructing a phylogeny Exploring nodes and edges in Neo4j
------------------------ ----------------------------------
Go through the following steps to become proficient in using the Neo4j
browser and the underlying PanTools data structure. If you have any
difficulty trouble finding a node, relationship or any type of
information, download and use `this visual guide
<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_.
1. Click on the database icon on the left. A menu with all node types
and relationship types will appear.
2. Click on the '*gene*' button in the node label section. This
automatically generated a query. Execute the query.
3. The **LIMIT** clause prevents large numbers of nodes popping up to
avoid your web browser from crashing. Set LIMIT to 10 and execute the
query.
4. Hover over the nodes, click on them and take a look at the values
stored in the nodes. All these features (except ID) were extracted
from the GFF annotation files. ID is an unique number automatically
assigned to nodes and relationships by Neo4j.
5. Double-click on the **matK** gene node, all nodes with a connection
to this gene node will appear. The nodes have distinct colors as
these are different node types, such as **mRNA**, **CDS**,
**nucleotide**. Take a look at the node properties to observe that
most values and information is specific to a certain node type.
6. Double-click on the *matK* mRNA node, a **homology_group** node
should appear. These type of nodes connect homologous genes in the
graph. However, you can see this gene did not cluster with any other
gene.
7. Hover over the **start** relation of the *matK* gene node. As you
can see information is not only stored in nodes, but also in
relationships! A relationship always has a certain direction, in
this case the relation starts at the gene node and points to a
nucleotide node. Offset marks the location within the node.
8. Double-click on the **nucleotide** node at the end of the 'start'
relationship. An in- and outgoing relation appear that connect to
other nucleotide nodes. Hover over both the relations and compare
them. The relations holds the genomic coordinates and shows this
path only occurs in contig/sequence 1 of genome 1.
9. Follow the outgoing **FF**-relationship to the next nucleotide node
and expand this node by double-clicking. Three nodes will pop up
this time. If you hover over the relations you see the coordinates
belong to other genomes as well. You may also notice the
relationships between nucleotide nodes is always a two letter
combination of F (forward) and R (reverse) which state if a sequence
is reverse complemented or not. The first letter corresponds to the
sequence of the node at the start of the relation where the second
letters refers to the sequence of the end node.
10. Finally, execute the following query to call the database scheme to
see how all node types are connected to each other: *CALL
db.schema()*. The schema will be useful when designing your own
queries!
In this tutorial we will construct three phylogenies, each based on a --------------
different type of variation: SNPs, genes and k-mers. Take a look at the
phylogeny manuals to get an understanding how the three methods work and
how they differ from each other.
1. :ref:`analysis/phylogeny:core phylogeny` Query the pangenome database using CYPHER
2. :ref:`analysis/phylogeny:gene distance tree` -----------------------------------------
3. :ref:`analysis/phylogeny:k-mer distance tree`
Core SNP phylogeny Cypher is a declarative, SQL-inspired language and uses ASCII-Art to
------------------ represent patterns. Nodes are represented by circles and relationships
by arrows.
The core SNP phylogeny will run various Maximum Likelihood models on - The **MATCH** clause allows you to specify the patterns Neo4j will
parsimony informative sites of single-copy orthologous sequences. A site search for in the database.
is parsimony-informative when there are at least two types of - With **WHERE** you can add constraints to the patterns described.
nucleotides that occur with a minimum frequency of two. The informative - In the **RETURN** clause you define which parts of the pattern to
sites are automatically identified by aligning the sequences; however, display.
it does not know which sequences are single-copy orthologous. You can
identify these conserved sequences by running
:ref:`gene_classification <analysis/classification:gene classification>`.
.. code:: bash Cypher queries
~~~~~~~~~~~~~~
$ pantools gene_classification --phenotype=species pecto_dickeya_DB **Match and return 100 nucleotide nodes**
Open **gene_classification_overview.txt** and take a look at statistics. .. code:: text
As you can see there are 2134 single-copy ortholog groups. Normally, all
of these groups are aligned to identify SNPs but for this tutorial we’ll
make a selection of only a few groups to accelerate the steps. You can
do this in two different ways:
Option 1: Open **single_copy_orthologs.csv** and remove all node MATCH (n:nucleotide) RETURN n LIMIT 100
identifiers after the first 20 homology groups and save the file.
.. code:: bash **Find all the genome nodes**
$ pantools core_phylogeny --mode=ML pecto_dickeya_DB .. code:: text
Option 2: Open **single_copy_orthologs.csv** and select the first 20 MATCH (n:genome) RETURN n
homology_group node identifiers. Place them in a new file sco_groups.txt
and include this file to the function.
.. code:: bash **Find the pangenome node**
$ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB .. code:: text
The sequences of the homology groups are being aligned two consecutive MATCH (n:pangenome) RETURN n
times. After the initial alignment, input sequences are trimmed based on
the longest start and end gap of the alignment. The parsimony
informative positions are taken from the second alignment and
concatenated into a sequence. When opening **informative.fasta** you can
find 6 sequences, the length of the sequences being the number of
parsimony-informative sites.
.. code:: bash **Match and return 100 genes**
$ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000 .. code:: text
IQ-tree generates several files, the tree that we later on in the MATCH (g:gene) RETURN g LIMIT 100
tutorial will continue with is called **informative.fasta.treefile**.
When examining the **informative.fasta.iqtree** file you can find the
best fit model of the data. This file also shows the number of sites
that were used, as sites with gaps (which IQ-tree does not allow) were
changed into singleton or constant sites.
Gene distance tree **Match and return 100 genes and order them by length**
~~~~~~~~~~~~~~~~~~
To create a phylogeny based on gene distances (absence/presence), we can .. code:: text
simply execute the Rscript that was created by
:ref:`gene_classification <analysis/classification:gene classification>`.
.. code:: bash MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100
$ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R **The same query as before but results are now returned in a table**
The resulting tree is called **gene_distance.tree**. .. code:: text
K-mer distance tree MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100
~~~~~~~~~~~~~~~~~~~
To obtain a k-mer distance phylogeny, the k-mers must first be counted **Return genes which are longer as 100 but shorter than 250 bp** (this
with the :ref:`kmer_classification <analysis/classification:k-mer can also be applied to other features such as exons introns or CDS)
classification>`
function. Afterwards, the tree can be constructed by executing the
Rscript.
.. code:: bash .. code:: text
$ pantools kmer_classification pecto_dickeya_DB MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100
$ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R
The resulting tree is written to **genome_kmer_distance.tree**. **Find genes located on first genome**
.. code:: text
MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100
**Find genes located on first genome and first sequence**
.. code:: text
MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100
-------------- --------------
Renaming tree nodes Homology group queries
------------------- ~~~~~~~~~~~~~~~~~~~~~~
**Return 100 homology groups**
So far, we used three different types of distances (SNPs, genes, .. code:: text
k-mers), and two different methods (ML, NJ) to create three phylogenetic
trees. First, lets take a look at the text files. The
**informative.fasta.treefile** only contain genome numbers, bootstrap
values and branch lengths but is lacking the metadata. Examining
**gene_distance.tree** file also shows this information but the species
names as well, because we included this as a phenotype during
:ref:`gene_classification <analysis/classification:gene classification>`.
Let’s include the strain identifiers to the core snp tree to make the MATCH (h:homology_group) RETURN h LIMIT 100
final figure more informative. Use the
:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to
rename the tree nodes.
.. code:: bash **Match homology groups which contain two members**
.. code:: text
MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h
**Match homology groups and 'walk' to the genes and corresponding start
and end node**
.. code:: text
MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
Turn off autocomplete by clicking on the button on the bottom right. The
graph falls apart because relations were not assigned to variables.
**The same query as before but now the relations do have variables**
.. code:: text
MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
When you turn off autocomplete again only the '*is_similar_to*' relation
disappears since we did not call it
**Find homology group that belong to the rpoC1 gene**
.. code:: text
$ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
Take a look at **informative.fasta_RENAMED.treefile**, strain **Find genes on genome 1 which don't show homology**
identifiers have been added to the tree.
.. code:: text
MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
-------------- --------------
Visualizing the tree in iTOL Structural variant detection
---------------------------- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Go to https://itol.embl.de and click on “Upload a tree” under the
**ANNOTATE** box. On this page you can paste the tree directly into the
**tree text:** textbox or can click the button to upload the .newick
file.
.. figure:: /figures/tutorial_phylogeny1.png
:width: 600
:align: center
Basic controls ITOL
-------------------
- The default way of visualizing a tree is the rectangular view.
Depending on the number of genomes, the circular view can be easier
to interpret. You can the view by clicking on the “Display Mode”
buttons.
- Increase the font size and branch width to improve readability
- When visualizing a Maximum likelihood (ML) tree, bootstrap values can
be displayed by clicking the “Display” button next to
**Bootstrap/metadata** in the Advanced tab of the Control window.
This enables you to visualize the values as text or symbol on the
branch. or by coloring the branch or adjusting the width.
.. figure:: /figures/tutorial_phylogeny2.png
:width: 600
:align: center
- When you have a known out group or one of the genomes is a clear
outlier in the tree, you should reroot the tree. Hover over the name,
click it so a pop-up menu appears. Click “tree structure” followed by
“Reroot the tree here”.
.. figure:: /figures/tutorial_phylogeny3.png
:width: 600
:align: center
- Clicking on the name of a node in the tree allows you to color the
name, branch, or background of that specific node.
- When you’re happy the way your tree looks, go to the Export tab of
the Control window. Select the desired output format, click on the
“Full image” button and export the file to a figure.
- Refresh the webpage to go back to the default view of your tree.
Create iTOL templates
---------------------
In iTOL it is possible to add colors to the tree by coloring the
terminal nodes or adding an outer ring. The PanTools function
:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
to create templates that allows for easy coloring (with maximum of 20
possible colors). If the function is run without any additional
argument, templates are created for trees that only contain genome
numbers (e.g. k-mer distance tree). Here we want to color the (renamed)
core SNP tree with the ‘low_temperature’ phenotype. Therefore, the
``--phenotype`` strain_name must be included to the function.
.. code:: bash **Find SNP bubbles (for simplification we only use the FF relation)**
$ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only .. code:: text
$ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
**The same query but returning the results in a table**
.. code:: text
Copy the two low_temperature.txt files from the label/strain_name/ and MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
ring/strain_name/ directories to your personal computer. Click and move
the ring template file into the tree visualization webpage.
.. figure:: /figures/tutorial_phylogeny4.png Functions such as **count()**, **sum()** and **stDev()** can be used in
:width: 600 a query.
:align: center
The resulting tree should look this when: the tree is rooted with the **The same SNP query but count the hits instead of displaying them**
*Dickeya* genome, bootstrap values are displayed as text and the ring
color template was included.
.. figure:: /figures/tutorial_phylogeny_color.png .. code:: text
:width: 600
:align: center MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
--------------
Hopefully you know have some feeling with the Neo4j browser and cypher
and you're inspired to create your own queries!
When you're done working in the browser, close the database (by using
the command line again).
.. code:: text
Tree coloring is especially useful for large datasets. An example is $ neo4j stop
shown in the figure below, where members of the same species share a
color.
.. figure:: /figures/tutorial_phylogeny_tree.png | More information on Neo4j and the cypher language:
:width: 600 | `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
:align: center | `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
| `Neo4j API <https://neo4j.com/developer/>`_
-------------- --------------
.. In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the
In :doc:`part 6 <tutorial_part6>` of the tutorial we go through a couple of use cases with the pangenomic read mapping functionality implemented in PanTools. functionalities to analyze the pangenome.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment