diff --git a/CHANGELOG.md b/CHANGELOG.md index a6a8837cce0d79a0f6b4aec0cf7eac7c65af5033..14d2e35060869104dd1c62e746075df3cb3416eb 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,7 @@ All notable changes to Pantools will be documented in this file. ### Changed - No longer accept tab characters in fasta input files (!232). +- Updated and reordered the tutorial section of the documentation (!228). ### Fixed - `pantools blast` no longer crashes if not all mrna nodes contain protein IDs (!226). diff --git a/docs/source/analysis/classification.rst b/docs/source/analysis/classification.rst index 235bd21c77852d62f6e204b633feb4f775bc667e..18138fe4faca9ee304bfd2b46e4d374c9b20e49f 100644 --- a/docs/source/analysis/classification.rst +++ b/docs/source/analysis/classification.rst @@ -243,7 +243,7 @@ simulation. Heaps' law (a power law) can be fitted to the number of new genes observed when increasing the pangenome by one random genome. The formula -for the power law model is :math:`n = k * N^{-a}`, where *n* is the +for the power law model is :math:`\Delta n = k * N^{-a}`, where *Δn* is the newly discovered genes, *N* is the total number of genomes, and *k* and *a* are the fitting parameters. A pangenome can be considered open when *a* < 1 and closed if *a* > 1. diff --git a/docs/source/analysis/explore.rst b/docs/source/analysis/explore.rst index 9cbe53d65f09eba75c1f91f5396bd31866a222fe..5924ac7d1e48901ffe387588077e00cd7f856fab 100644 --- a/docs/source/analysis/explore.rst +++ b/docs/source/analysis/explore.rst @@ -41,7 +41,7 @@ members. * - <databaseDirectory> - Path to the database root directory. * - <homologyFile> - - A text file with homology group node identifiers, seperated by a + - A text file with homology group node identifiers, separated by a comma. **Options** @@ -180,10 +180,10 @@ annotation node and extract the nucleotide and protein sequence. * - ``--functions`` - One or multiple function identifiers (GO, InterPro, PFAM, - TIGRFAM), seperated by a comma. + TIGRFAM), separated by a comma. * - ``--nodes``/``-n`` - One or multiple identifiers of function nodes (GO, InterPro, - PFAM, TIGRFAM), seperated by a comma. + PFAM, TIGRFAM), separated by a comma. * - ``--include``/``-i`` - Only include a selection of genomes. * - ``--exclude``/``-e`` @@ -283,9 +283,9 @@ specific GO terms of the hierarchy to a sequence. :widths: 30 70 * - ``--functions`` - - One or multiple GO term identifiers, seperated by a comma. + - One or multiple GO term identifiers, separated by a comma. * - ``--nodes``/``-n`` - - One or multiple identifiers of 'GO' nodes, seperated by a comma. + - One or multiple identifiers of 'GO' nodes, separated by a comma. **Example commands** .. code:: bash @@ -322,9 +322,9 @@ their location in the hierarchy is reported. :widths: 30 70 * - ``--functions`` - - One or multiple GO term identifiers, seperated by a comma. + - One or multiple GO term identifiers, separated by a comma. * - ``--nodes``/``-n`` - - One or multiple identifiers of 'GO' nodes, seperated by a comma. + - One or multiple identifiers of 'GO' nodes, separated by a comma. * - ``--include``/``-i`` - Only include a selection of genomes. * - ``--exclude``/``-e`` @@ -378,7 +378,7 @@ Report all available information of one or multiple homology groups. To find Phobius (P) or SignalP (S) annotations, include: 'secreted' (P/S), 'receptor' (P/S), or 'transmembrane' (P). * - ``--genes``/``-g`` - - One or multiple gene names, seperated by a comma. + - One or multiple gene names, separated by a comma. 
* - ``--node`` - Retrieve the nucleotide nodes belonging to genes in homology groups diff --git a/docs/source/analysis/go_enrichment.rst b/docs/source/analysis/go_enrichment.rst index 71d2c7b1e4f5407d413ad4d6fc6fa8f0a1dd391c..00ea8c6cc66bec16a130cc3d3d2b854cff7e6629 100644 --- a/docs/source/analysis/go_enrichment.rst +++ b/docs/source/analysis/go_enrichment.rst @@ -69,9 +69,9 @@ the second and third rank this will be 0.0010 and 0.0015, respectively. :widths: 30 70 * - ``--homology-file``/``-H`` - - A text file with homology group node identifiers, seperated by a comma. + - A text file with homology group node identifiers, separated by a comma. * - ``--nodes``/``-n`` - - mRNA node identifiers, seperated by a comma on the command line. + - mRNA node identifiers, separated by a comma on the command line. * - ``--include``/``-i`` - Only include a selection of genomes. * - ``--exclude``/``-e`` diff --git a/docs/source/analysis/phylogeny.rst b/docs/source/analysis/phylogeny.rst index 4df7d04dd188f8c614d0349d31a258994dca558c..f90c3885be3a0d53d082b7b56958e2db521c36fa 100644 --- a/docs/source/analysis/phylogeny.rst +++ b/docs/source/analysis/phylogeny.rst @@ -695,7 +695,7 @@ for the visualization of phylogenies in iTOL. Phenotypes must already be included in the pangenome with the :ref:`add_phenotypes <construction/annotate:add phenotypes>` functionality. How to use the template files in iTOL can be found in one of the -:ref:`tutorials <tutorial/tutorial_part5:create itol templates>`. +:ref:`tutorials <tutorial/tutorial_part3:create itol templates>`. If you run this function without a ``--phenotype`` argument, templates are created for trees that contain only genome numbers as node labels. diff --git a/docs/source/construction/annotate.rst b/docs/source/construction/annotate.rst index 76786703b3a2e143f8110ffb7bcc8ca58035aa41..249f9ecb8523c50876508deea24ac96b7a57cf22 100644 --- a/docs/source/construction/annotate.rst +++ b/docs/source/construction/annotate.rst @@ -40,6 +40,9 @@ selected and the second annotation of genome 2 and 3. if the GFF3 file contains a trans-spliced gene that has alternative splicing, it will not be able to handle it (it will only annotate one mRNA). + The PanUtils project contains a :ref:`data filtering pipeline + <getting_started/diy_pangenomics:Quality control pipeline>` that fixes most + common issues encountered by the parser. **Parameters** .. list-table:: diff --git a/docs/source/construction/group.rst b/docs/source/construction/group.rst index e0c80dccb7ff90e47841f24756e274fe3c12967a..5d87e4b2711e2d2c6b5bbdef12eb6321d29c34cc 100644 --- a/docs/source/construction/group.rst +++ b/docs/source/construction/group.rst @@ -148,7 +148,7 @@ argument to the command. | **What BUSCO benchmark set to use** - When using **BUSCO v3**, go to https://busco.ezlab.org, download a odb9 - set, and untar it with ``tar -xvzf``. Include the entire directory in + set, and untar it with ``tar xvzf``. Include the entire directory in the command using the ``--input-file`` argument. - For **BUSCO v4** or **v5**, you only have to provide the odb10 database name with the ``--input-file`` argument, the database is downloaded @@ -493,5 +493,5 @@ settings that were used. - **current_pantools_homology_groups.txt**, overview of the active homology groups. Each line represents one homology group. 
The line starts with the homology group (database) identifier followed by a - colon and the rest are mRNA IDs (from gff/genbank) seperated by a + colon and the rest are mRNA IDs (from gff/genbank) separated by a space. diff --git a/docs/source/developer_guide/install.rst b/docs/source/developer_guide/install.rst index 0144964c79d2182127f533db0b6fcdd6b7f24a6e..c0820be5fff059198bbd3a4d0e133f18bc907f2a 100644 --- a/docs/source/developer_guide/install.rst +++ b/docs/source/developer_guide/install.rst @@ -56,7 +56,7 @@ downloaded from https://github.com/refresh-bio/KMC/releases. .. code:: bash - $ tar -xvzf KMC* #uncompress the KMC binaries + $ tar xvzf KMC* #uncompress the KMC binaries # Edit your ~/.bashrc to include KMC to your PATH $ echo "export PATH=/path/to/KMC/:\$PATH" >> ~/.bashrc #replace /path/to with the correct path on your computer @@ -76,7 +76,7 @@ https://micans.org/mcl under License & software. .. code:: bash $ wget https://micans.org/mcl/src/mcl-14-137.tar.gz - $ tar -xvzf mcl-* + $ tar xvzf mcl-* $ cd mcl-14-137 $ ./configure --prefix=/path/to/mcl-14-137/shared #replace /path/to with the correct path on your computer $ make install @@ -405,7 +405,7 @@ The following commands can be used to test and build the documentation: .. code:: bash # Test the documentation - sphinx-lint sphinx-lint -e=all --max-line-length=80 + sphinx-lint -e=all --max-line-length=80 # Build the documentation sphinx-build -W docs/source output diff --git a/docs/source/getting_started/diy_pangenomics.rst b/docs/source/getting_started/diy_pangenomics.rst new file mode 100644 index 0000000000000000000000000000000000000000..89dba3ba9bdfa3f7faf76c9d7878e986cf86a5b9 --- /dev/null +++ b/docs/source/getting_started/diy_pangenomics.rst @@ -0,0 +1,114 @@ +DIY Pangenomics +=============== + +Likely you want to do pangenomics using your own data, on your own system. +To support this, we offer several tools: + +* `PanTools`_ : core application to construct and analyze pangenomes + +* `PanUtils`_: utilities to support pangenomics (e.g. QC) + +* `PanBrowse`_: applications for pangenome visualization (e.g. PanVA) + +When working with these software tools, consider the following tips: + +* Separate issues in the (technical) setup from issues in the data, so try + the tool on known (test) data first. + +* Garbage in, garbage out! Therefore, perform rigorous quality control on the + input data, e.g. using https://github.com/PanUtils/pantools-qc-pipeline. + +* Start small, e.g. 10-50 bacterial genomes or 3-10 eukaryotic genomes, before + scaling up. + +PanTools +-------- + +PanTools (https://git.wur.nl/bioinformatics/pantools) is a software package +for pangenome construction and analysis. It works across the tree of life, +from small viruses to large plant, animal, or human genomes. Because it is not +alignment based, but employs a compacted De Bruijn Graph (cDBG) it works over +large(r) evolutionary distances, e.g. at species, genus or family level. + + +PanUtils +-------- + +We have been working on several pangenome utilities, collected in +https://github.com/PanUtils. The repository currently contains three pipelines: +one for quality control, one for pangenome construction and one for analysis +and preparing a PanVA instance. + +Quality control pipeline +^^^^^^^^^^^^^^^^^^^^^^^^ + +This is a Snakemake pipeline that can be used for running quality control on +the data. 
We highly recommend this pipeline for checking your files, as +annotation files can be very tricky to work with (and therefore sometimes +cause issues for tools like PanTools). This pipeline can be found at +https://github.com/PanUtils/pantools-qc-pipeline, which includes a README on +how to install and run it. In summary, one needs to create a conda environment +with the most recent version of Snakemake (at the moment of writing this is +8.25.5). Please follow the instructions on which configuration to set and +where all files should be placed. + +The pipeline has four different workflows: + +* raw_statistics: creates an overview of the statistics of the input data + +* filter: can be used for filtering genome and annotation files + +* proteins: creates protein files including statistics + +* functions: creates functional annotations for all proteins in the protein + files + + +PanTools v4 pipeline +^^^^^^^^^^^^^^^^^^^^ + +The PanTools Snakemake pipeline runs through all major +PanTools functionalities. You can use this pipeline for inspiration on what +can be done with PanTools or to actually run PanTools. +Please see https://github.com/PanUtils/pantools-pipeline-v4 for the code, +including a README that summarizes how to install, configure and run it. + +.. warning:: + It is not recommended to use this pipeline with new experimental data; + adjustments to data and PanTools command settings might have to be made + during the process, for which this pipeline lacks the flexibility. + +The pipeline has three different workflows: + +* all_panproteome: runs through all major PanTools functionalities available + for a panproteome +* all_pangenome: runs through all major PanTools functionalities available for + a pangenome +* panva: runs all PanTools functions necessary to set up a PanVA instance + +PanVA conversion +^^^^^^^^^^^^^^^^ + +If one has already created a pangenome database with PanTools that needs to be +converted to the data structure necessary for setting up a PanVA instance, +please use this repository: https://github.com/PanUtils/export-to-panva. It +contains a README on how to install and run it. In summary, one needs to +create an environment using mamba from a predefined YAML file, configure +settings in a config.ini file and then run the Python script which takes the +config.ini as input. + +PanBrowse +--------- + +We work on methods for the (interactive) visualization of +complex pangenomes. PanBrowse (https://github.com/PanBrowse) is where +we host the visualization applications. + +PanVA +^^^^^ + +`PanVA <https://github.com/PanBrowse/PanVA>`_ is a web application +allowing users to visually and interactively explore sequence variants +in pangenomes (generated by PanTools). It provides context for these +variants by displaying their corresponding annotations, phylogenetic +and phenotypic information. diff --git a/docs/source/getting_started/technical_setup.rst b/docs/source/getting_started/technical_setup.rst new file mode 100644 index 0000000000000000000000000000000000000000..4010e0a97fe7b5b7859d0332ec0c61442f7c2642 --- /dev/null +++ b/docs/source/getting_started/technical_setup.rst @@ -0,0 +1,137 @@ +Technical setup validation +========================== + +Performance is important when running PanTools on larger datasets. Therefore, +it is recommended to test your PanTools installation on one or more of the +test data sets below, before applying it to your own data. 
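+As a quick sanity check before running the test sets below, you can verify
+that the required tools and a fast disk location are available. This is only
+a sketch of typical checks on a Linux machine; adjust the paths and numbers
+to your own system.
+
+.. code:: bash
+
+   $ java -version     # PanTools requires a working Java runtime
+   $ command -v kmc    # KMC should be available on your PATH
+   $ df -h /dev/shm    # free space on the RAM-disk (Linux only)
+   $ nproc             # number of threads you can pass with -t
+   $ free -g           # memory available for the Java heap (-Xms/-Xmx)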
+ +PanTools pangenome construction commands perform a large number of +transactions on the database, so for larger pangenomes the database should be +built on a location that can handle this. We recommend using an SSD or a RAM-disk +(/dev/shm/, available on Linux machines). If pangenome construction takes +too much time, the location of the database might be the issue. For reference, +building a small yeast data set of 10 genomes takes a little over two minutes +on our RAM-disk, while it takes nearly 30 minutes on a network disk! +Larger datasets exacerbate this gap in runtime. Disk size needed for +construction depends on the input data (genome size and number of genomes): +for 8 *Arabidopsis* genomes ±20GB should suffice, for 3 lettuce genomes ±250GB. +Java heap space needed depends on the same parameters: in general it is +advisable to set this larger than expected, because Java running out +of heap space makes the tool crash (e.g. for lettuce we used +-Xmx200g -Xms200g). + +.. warning:: + If you want to do everything in a single directory, make sure to download the + data to the disk where you want to write the output. If the input data is on + a network disk, adjust the commands to write the pangenome database on a + local disk (RAM or SSD). + +Ten yeast genomes +----------------- + +Download the data from: +https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz + +.. code:: bash + + $ wget https://www.bioinformatics.nl/pangenomics/data/yeast_10strains.tgz + $ tar zxvf yeast_10strains.tgz + +Build the pangenome: + +.. code:: bash + + $ pantools -Xms20g -Xmx50g build_pangenome -t16 yeast_db yeast_10strains/metadata/genome_locations.txt + +Add annotations: + +.. code:: bash + + $ pantools -Xms20g -Xmx50g add_annotations yeast_db yeast_10strains/metadata/annotation_locations.txt + +Group proteins: + +.. code:: bash + + $ pantools -Xms20g -Xmx50g group yeast_db -t16 --relaxation=3 + +Run time statistics (16 threads on RAM-disk): + +.. csv-table:: + :file: /tables/runtime_statistics_yeast.csv + :header-rows: 1 + :delim: , + +Two *Arabidopsis* genomes +------------------------- + +Start this use case with a bash script to download the data. Download the +script from here: +https://www.bioinformatics.nl/pangenomics/data/download_2_ara.sh +To download the *Arabidopsis* data, run the script: + +.. code:: bash + + $ bash download_2_ara.sh <threads> + +Build the pangenome: + +.. code:: bash + + $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=17 ara_db a_thaliana_2genomes/metadata/genome_locations.txt + +Add annotations: + +.. code:: bash + + $ pantools -Xms40g -Xmx80g add_annotations ara_db a_thaliana_2genomes/metadata/annotation_locations.txt + +Group proteins: + +.. code:: bash + + $ pantools -Xms40g -Xmx80g group ara_db -t50 --relaxation=3 + +Run time statistics (50 threads on RAM-disk): + +.. csv-table:: + :file: /tables/runtime_statistics_arabidopsis.csv + :header-rows: 1 + :delim: , + +Three tomato genomes +-------------------- + +Download the data from here: +https://www.bioinformatics.nl/pangenomics/data/tomato_3genomes.tar.gz + +Extract the tomato data: + +.. code:: bash + + $ tar xvzf tomato_3genomes.tar.gz + +Build the pangenome: + +.. code:: bash + + $ pantools -Xms40g -Xmx80g build_pangenome -t50 --kmer-size=13 tomato_db tomato_3genomes/metadata/genome_locations.txt + +Add annotations: + +.. code:: bash + + $ pantools -Xms40g -Xmx80g add_annotations tomato_db tomato_3genomes/metadata/annotation_locations.txt + +Group proteins: + +.. 
code:: bash + + $ pantools -Xms40g -Xmx80g group tomato_db -t50 --relaxation=3 + +Run time statistics (50 threads on RAM-disk) + +.. csv-table:: + :file: /tables/runtime_statistics_tomato.csv + :header-rows: 1 + :delim: , diff --git a/docs/source/index.rst b/docs/source/index.rst index 6ec402e36fad3a3b49ae974a5684fee6d23fdc61..023b48b8d6efd81b7958b4951fa174b43b807237 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -91,12 +91,24 @@ Contents :caption: Getting Started :maxdepth: 1 + getting_started/diy_pangenomics Install <getting_started/install> + getting_started/technical_setup getting_started/differences getting_started/workflows getting_started/selection getting_started/query +.. toctree:: + :caption: Tutorial PanTools + :maxdepth: 1 + + Tutorial 1 - Pangenome construction <tutorial/tutorial_part1> + Tutorial 2 - Pangenome analysis <tutorial/tutorial_part2> + Tutorial 3 - Phylogeny <tutorial/tutorial_part3> + Tutorial 4 - Visualization (PanVA) <tutorial/tutorial_part4> + Tutorial 5 - Neo4j browser <tutorial/tutorial_part5> + .. toctree:: :caption: Pangenome Construction :maxdepth: 1 @@ -123,16 +135,6 @@ Contents analysis/support analysis/dn_ds -.. toctree:: - :caption: Tutorial - :maxdepth: 1 - - Tutorial 1 - Install PanTools <tutorial/tutorial_part1> - Tutorial 2 - Construct pangenome <tutorial/tutorial_part2> - Tutorial 3 - Neo4j browser <tutorial/tutorial_part3> - Tutorial 4 - Characterization <tutorial/tutorial_part4> - Tutorial 5 - Phylogeny <tutorial/tutorial_part5> - .. toctree:: :caption: Developer guide :maxdepth: 1 @@ -140,4 +142,3 @@ Contents Installation <developer_guide/install> Release steps <developer_guide/release> Database structure <developer_guide/database> - diff --git a/docs/source/tables/runtime_statistics_arabidopsis.csv b/docs/source/tables/runtime_statistics_arabidopsis.csv new file mode 100644 index 0000000000000000000000000000000000000000..e0a849e7f4f4d1a1572959e9fb7f33dae24d0e9b --- /dev/null +++ b/docs/source/tables/runtime_statistics_arabidopsis.csv @@ -0,0 +1,4 @@ +command,running time (h:m:s),CPU time (s) +build_pangenome,0:27:38,3102.41 +add_annotations,0:00:52,93.56 +group,0:02:25,507.60 \ No newline at end of file diff --git a/docs/source/tables/runtime_statistics_tomato.csv b/docs/source/tables/runtime_statistics_tomato.csv new file mode 100644 index 0000000000000000000000000000000000000000..3eb9c21b6f0cdcecacdab2561f22ab8955177fba --- /dev/null +++ b/docs/source/tables/runtime_statistics_tomato.csv @@ -0,0 +1,4 @@ +command,running time (h:m:s),CPU time (s) +build_pangenome,20:28:33,257610.72 +add_annotations,0:01:35,181.19 +group,0:03:02,5861.83 \ No newline at end of file diff --git a/docs/source/tables/runtime_statistics_yeast.csv b/docs/source/tables/runtime_statistics_yeast.csv new file mode 100644 index 0000000000000000000000000000000000000000..d478542cdbc2981b1a3f54c112ebcf479add7fe4 --- /dev/null +++ b/docs/source/tables/runtime_statistics_yeast.csv @@ -0,0 +1,4 @@ +command,running time (h:m:s),CPU time (s) +build_pangenome,0:02:23,276.50 +add_annotations,0:00:35,51.82 +group,0:01:58,1351.73 \ No newline at end of file diff --git a/docs/source/tutorial/tutorial_part1.rst b/docs/source/tutorial/tutorial_part1.rst index 939f06fa013c18acf3a1a50f929bb3a664ed13dc..68d89890b5492c472684216699c4434322cf0440 100644 --- a/docs/source/tutorial/tutorial_part1.rst +++ b/docs/source/tutorial/tutorial_part1.rst @@ -1,5 +1,191 @@ -Part 1. Install PanTools -======================== +Part 1. 
Build your own pangenome using PanTools =============================================== -For instructions on how to install PanTools, -see :doc:`/getting_started/install`. +Install PanTools from bioconda following the +:ref:`install <getting_started/install:Install from bioconda>` +instructions. + +To demonstrate the main functionalities of PanTools we use a small +chloroplast dataset to avoid long construction times. + +.. csv-table:: + :file: /tables/chloroplast_datasets.csv + :header-rows: 1 + :delim: ; + +Download the chloroplast fasta and gff files +`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_ +or via wget. + +.. code:: bash + + $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz + $ tar xvzf chloroplasts.tar.gz #unpack the archive + +-------------- + +BUILD, ANNOTATE and GROUP +------------------------- + +We start with building a pangenome using four of the five chloroplast +genomes. For this you need a text file which directs PanTools to the +FASTA files. Call your text file **genome_locations.txt** and include +the following lines: + +.. code:: text + + YOUR_PATH/C_sativus.fasta + YOUR_PATH/O_sativa.fasta + YOUR_PATH/S_lycopersicum.fasta + YOUR_PATH/S_tuberosum.fasta + +Make sure that ‘*YOUR_PATH*’ is the full path to the input files! The text +file should only contain full paths to FASTA files, no additional spaces or +empty lines. Then run PanTools with the +:ref:`build_pangenome <construction/build:build pangenome>` function and +include the text file: + +.. code:: bash + + $ pantools -Xmx5g build_pangenome chloroplast_DB genome_locations.txt + +Did the program run without any error messages? Congratulations, you’ve +built your first pangenome! + +Adding additional genomes +~~~~~~~~~~~~~~~~~~~~~~~~~ +PanTools has the ability to add additional genomes to an already +existing pangenome. To test this function, prepare a text +file containing the path to the Maize chloroplast genome. Call your +text file **fifth_genome_location.txt** and include the following +line to the file: + +.. code:: text + + YOUR_PATH/Z_mays.fasta + +Run PanTools on the new text file using the +:ref:`add_genomes <construction/build:add genomes>` function: + +.. code:: bash + + $ pantools add_genomes chloroplast_DB fifth_genome_location.txt + +Adding annotations +~~~~~~~~~~~~~~~~~~ + +To include gene annotations in the pangenome, prepare a text file +containing paths to the GFF files. Call your text file +**annotation_locations.txt** and include the following lines in the +file: + +.. code:: text + + 1 YOUR_PATH/C_sativus.gff3 + 2 YOUR_PATH/O_sativa.gff3 + 3 YOUR_PATH/S_lycopersicum.gff3 + 4 YOUR_PATH/S_tuberosum.gff3 + 5 YOUR_PATH/Z_mays.gff3 + +Run PanTools using the :ref:`add_annotations <construction/annotate:add +annotations>` function and include the new text file. Also add the +``--connect`` flag to attach the annotations to the nucleotide nodes; +this is useful for exploring the pangenome in the Neo4j browser later +in the tutorial. + +.. code:: bash + + $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt + +Homology grouping +~~~~~~~~~~~~~~~~~ + +PanTools can infer homology between the protein sequences of a pangenome +and cluster them into homology groups with the +:ref:`group <construction/group:group>` function. 
The ``--relaxation`` +parameter must be set to influence the sensitivity. If you don't know the +best setting for the grouping relaxation, you can use the +:ref:`busco_protein <construction/group:busco protein>` and +:ref:`optimal grouping <construction/group:optimal grouping>` functions +instead, but for now we will use a relaxation of 4. + +.. code:: bash + + $ pantools group --relaxation=4 chloroplast_DB + +-------------- + +Adding phenotypes +----------------- + +Phenotype values can be Integer, Double, String or Boolean values. +Create a text file **phenotypes.txt**. + +.. code:: text + + Genome,Solanum + 1,false + 2,false + 3,true + 4,true + 5,false + +Then use :ref:`add_phenotypes <construction/annotate:add phenotypes>` to add the +information to the pangenome. + +.. code:: bash + + $ pantools add_phenotypes chloroplast_DB phenotypes.txt + +-------------- + +RETRIEVE functions +------------------ + +Now that the construction is complete, let's quickly validate that the +construction was successful and the database can be used. To retrieve +some genomic regions, prepare a text file containing genomic +coordinates. Create the file **regions.txt** and include the following +for each region: genome number, contig number, start and stop position, +separated by a single space: + +.. code:: text + + 1 1 200 500 + 2 1 300 700 + 3 1 1 10000 + 3 1 1 10000 - + 4 1 9999 15000 + 5 1 100000 110000 + +Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>` +function and include the new text file: + +.. code:: bash + + $ pantools retrieve_regions chloroplast_DB regions.txt + +Take a look at the extracted regions that are written to the +**chloroplast_DB/retrieval/regions/** directory. + +To retrieve entire genomes, prepare a text file **genome_numbers.txt** +and include each genome number on a separate line in the file: + +.. code:: text + + 1 + 3 + 5 + +Use the **retrieve_regions** function again but include the new text +file: + +.. code:: bash + + $ pantools retrieve_regions chloroplast_DB genome_numbers.txt + +Genome files are written to the same directory as before. Take a look at one +of the three genomes you have just retrieved. + +In :doc:`part 2 <tutorial_part2>` of the tutorial we analyze a pangenome +of six bacterial genomes using the main PanTools functionalities. diff --git a/docs/source/tutorial/tutorial_part2.rst b/docs/source/tutorial/tutorial_part2.rst index f8fa9079322847b0ae859a3e1898956f6a53f8f5..d9331df7d2e3d7bd6ae92ec4f048606e02b7d4be 100644 --- a/docs/source/tutorial/tutorial_part2.rst +++ b/docs/source/tutorial/tutorial_part2.rst @@ -1,188 +1,435 @@ -Part 2. Build your own pangenome using PanTools -=============================================== +Part 2. Pangenome Analysis +========================== -To demonstrate the main functionalities of PanTools we use a small -chloroplasts dataset to avoid long construction times. +Input data +---------- .. csv-table:: - :file: /tables/chloroplast_datasets.csv + :file: /tables/bacteria_datasets.csv :header-rows: 1 :delim: ; -Download the chloroplast fasta and gff files -`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz>`_ -or via wget. +To demonstrate how to use the PanTools functionalities we use a small +dataset of six bacteria to avoid long runtimes. Download a +pre-constructed pangenome or test your new skills and construct a +pangenome yourself using the fasta and gff files. + +Option 1: Download separate genome and annotation files .. 
code:: bash - $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplasts.tar.gz - $ tar -xvzf chloroplasts.tar.gz #unpack the archive + $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz + $ tar xvzf pecto_dickeya_input.tar.gz + $ gzip -d pecto_dickeya_input/annotations/* + $ gzip -d pecto_dickeya_input/genomes/* + $ gzip -d pecto_dickeya_input/functions/* + + $ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt + $ pantools add_annotations pecto_dickeya_DB pecto_dickeya_input/annotations.txt + $ pantools group --relaxation=4 pecto_dickeya_DB + +Option 2: Download the pre-constructed pangenome + +.. code:: bash -We assume a PanTools alias was set during the -:ref:`installation <getting_started/install:set pantools alias>`. This allows -PanTools to be executed with ``pantools`` rather than the full path to the jar -file. If you don’t have an alias, either set one or replace the pantools -command with the full path to the .jar file in the tutorials. + $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz + $ tar xvzf pecto_dickeya_DB.tar.gz -------------- -BUILD, ANNOTATE and GROUP -------------------------- +Adding phenotype/metadata to the pangenome +------------------------------------------ -We start with building a pangenome using four of the five chloroplast -genomes. For this you need a text file which directs PanTools to the -FASTA files. Call your text file **genome_locations.txt** and include -the following lines: +Before starting with the analysis, we will add some phenotype data to +the pangenome. Phenotypes allow you to find similarities for a group of +genomes sharing a phenotype as well as identifying variation between +different phenotypes. Below is a textfile with data for three +phenotypes. The third phenotype, *low_temperature*, is in this case a +made up example! It states whether the strain is capable of growing on +(extreme) low temperatures. The phenotype file can be found inside the +database directory or create a new file using the text from the box +below. Add the phenotype information to the pangenome using +:ref:`add_phenotypes <construction/annotate:add phenotypes>`. .. code:: text - YOUR_PATH/C_sativus.fasta - YOUR_PATH/O_sativa.fasta - YOUR_PATH/S_lycopersicum.fasta - YOUR_PATH/S_tuberosum.fasta + Genome, species, strain_name, low_temperature + 1,P. odoriferum,P. odoriferum Q166, false + 2,P. fontis, P. fontis M022, true + 3,P. polaris,P. polaris S4.16.03.2B, false + 4,P. brasiliense, P. brasiliense S2, true + 5,P. brasiliense, P. brasiliense Y49, false + 6,D. dadantii, D. dadantii 3937,? -Make sure that ‘*YOUR_PATH*’ is the full path to the input files! Then -run PanTools with the :ref:`build_pangenome <construction/build:build -pangenome>` function and include the text file +.. code:: bash + + $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt + +-------------- + +Metrics and general statistics +------------------------------ + +After building or uncompressing the pangenome, run the +:ref:`metrics <analysis/metrics:metrics>` functionality to produce +various statistics that should verify an errorless construction. .. code:: bash - $ pantools build_pangenome chloroplast_DB genome_locations.txt + $ pantools metrics pecto_dickeya_DB -Did the program run without any error messages? Congratulations, you’ve -built your first pangenome! If not? Make sure your Java version is up to -date and kmc is executable. 
The text file should only contain full paths -to FASTA files, no additional spaces or empty lines. +Open **metrics_per_genome.csv** with a spreadsheet tool (Excel, +Libreoffice, Google sheets) and make sure the columns are split on +commas. You may easily notice the many empty columns in this table as +these types of annotations or features are not included in the database +(yet). Functional annotations are incorporated later in this tutorial. +Columns for features like exon and intron will remain empty as bacterial +coding sequences are not interrupted. -Adding additional genomes -~~~~~~~~~~~~~~~~~~~~~~~~~ -PanTools has the ability to add additional genomes to an already -existing pangenome. To test the function of PanTools, prepare a text -file containing the path to the Maize chloroplast genome. Call your -text file **fifth_genome_location.txt** and include the following -line to the file: +-------------- -.. code:: text +Gene classification +------------------- - YOUR_PATH/Z_mays.fasta +With the :ref:`gene_classification <analysis/classification:gene +classification>` functionality you are able to organize the gene repertoire +into the core, accessory or unique part of the pangenome. -Run PanTools on the new text file and use the -:ref:`add_genomes <construction/build:add genomes>` function +- **Core**, a gene is present in all genomes +- **Unique**, a gene is present in a single genome +- **Accessory**, a gene is present in some but not all genomes .. code:: bash - $ pantools add_genomes chloroplast_DB fifth_genome_location.txt + $ pantools gene_classification pecto_dickeya_DB -Adding annotations -To include gene annotations to the pangenome, prepare a text file -containing paths to the GFF files. Call your text file -**annotation_locations.txt** and include the following lines into the -file: +.. figure:: /figures/classify_table.png + :width: 600 + :align: center -.. code:: text +Take a look in **gene_classification_overview.txt**. Here you can find +the number of classified homology groups and genes on a pangenome level +but also for individual genomes. + +Open **additional_copies.csv** with a spreadsheet tool. This file can be +useful to identify duplicated genes in relation to other genomes. + +The default criterion to call a group core is presence in all genomes, +whereas a unique group is only allowed to be present in one genome. These two +categories are highly influenced by annotation quality, especially in +large pangenomes. Luckily, the threshold for core and unique groups can +easily be adjusted. Let’s consider genes to be core when present in only +five of the six genomes by setting the ``--core-threshold`` argument. + +.. code:: bash + + $ pantools gene_classification --core-threshold=85 pecto_dickeya_DB + +Look in **gene_classification_overview.txt** again to observe the +increase of core groups/genes at the cost of accessory groups. + +For this pangenome, the *Dickeya* genome is considered an outgroup to +the five *Pectobacterium* genomes. While this outgroup is needed to root +and analyze phylogenetic trees (tutorial part 3), it affects the number of +classified groups for all other genomes. Use ``--include`` or +``--exclude`` to exclude the *Dickeya* genome. + +.. code:: bash + + $ pantools gene_classification --include=1-5 pecto_dickeya_DB + $ pantools gene_classification --exclude=6 pecto_dickeya_DB + +Take a look in **gene_classification_overview.txt** one more time to +examine the effect of excluding this genome. 
The total number of groups +in the analysis is lower now but the number of core and unique genes +has increased for the five remaining genomes. + +When phenotype information is used in the analysis, three additional +categories can be assigned to a group: + +- **Shared**, a gene present in all genomes of a phenotype +- **Exclusive**, a gene is only present in a certain phenotype +- **Specific**, a gene present in all genomes of a phenotype and is + also exclusive + +Include a ``--phenotype`` argument to find genes that are exclusive to +a certain species. + +.. code:: bash + + $ pantools gene_classification --phenotype=species pecto_dickeya_DB + +Open **gene_classification_phenotype_overview.txt** to see the number of +classified groups for the species phenotype. + +Open **phenotype_disrupted.csv** in a spreadsheet tool. This file +explains exactly why a homology group is labeled as phenotype shared +and not specific. + +Open **phenotype_additional_copies.csv** in a spreadsheet tool. +Similarly to *phenotype_additional.csv*, this file shows groups where all +genomes of a certain phenotype have additional gene copies compared to (at least +one of) the other phenotypes. - 1 YOUR_PATH/C_sativus.gff3 - 2 YOUR_PATH/O_sativa.gff3 - 3 YOUR_PATH/S_lycopersicum.gff3 - 4 YOUR_PATH/S_tuberosum.gff3 - 5 YOUR_PATH/Z_mays.gff3 +Each time you run the +:ref:`gene_classification <analysis/classification:gene classification>` +function, multiple files are created that contain node identifiers of a certain +homology group category. These files can be given to other PanTools +functions for a downstream analysis, for example, sequence alignment, +phylogeny, or GO enrichment. We will use one of the files later in this +tutorial. -Run PanTools using the :ref:`add_annotations <construction/annotate:add -annotations>` function and include the new text file +-------------- + +Pangenome structure +------------------- + +With the previous functionality we identified the core, accessory and +unique parts of the pangenome. Now we will use the +:ref:`pangenome_size_genes <analysis/classification:pangenome structure>` +function to observe how these numbers are reached by simulating the growth of +the pangenome. Simulating the growth helps explain whether a pangenome should +be considered open or closed. A pangenome is called open as long as a +significant number of new (unique) genes are added to the total gene +repertoire. The openness of a pangenome is usually tested using Heaps’ +law. .. code:: bash - $ pantools add_annotations --connect chloroplast_DB annotation_locations.txt + $ pantools pangenome_structure pecto_dickeya_DB -PanTools attached the annotations to our nucleotide nodes so now we can -cluster them. +The current test set of six bacterial genomes is not representative of a +full-sized pangenome. Therefore we prepared the results for the +structure simulation on a set of 197 *Pectobacterium* genomes. The +runtime of the analysis using 10,000 loops and 24 threads was 1 minute +and 54 seconds. Download the files here, unpack the archive and take a +look at the files. + +.. code:: bash -Homology grouping -~~~~~~~~~~~~~~~~~ -PanTools can infer homology between the protein sequences of a -pangenome and cluster them into homology groups. 
Multiple parameters -can be set to influence the sensitivity but for now we use the -:ref:`group <construction/group:group>` functionality with default -settings. + $ wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz + $ tar xvzf pectobacterium_structure.tar.gz -.. code:: bash +You still have to run the R scripts to create the output figures +and determine the openness of the pangenome. .. code:: bash - $ pantools group chloroplast_DB + $ cd pectobacterium_structure + $ Rscript pangenome_growth.R + $ Rscript gains_losses_median_and_average.R + $ Rscript heaps_law.R + +Take a look at the plots. In **core_accessory_unique_size.png**, the +number of classified groups is plotted for every genome +combination that occurred during the simulation. For the +**core_accessory_size.png** plots, the number of unique groups is +combined with accessory groups. + +The **gains_losses.png** files display the median and average group gain +and loss between different pangenome sizes. The line of the core starts +below zero, meaning for every random genome added, the core genome +decreases by a number of *X* genes. + +.. raw:: html + + <details> + <summary> + The alpha value of Heaps' law is written to + <b><i>heaps_law_alpha.txt</i></b>. Click here to reveal the result. + </summary> + <p> + <br> + With an alpha of 0.53 the <i>Pectobacterium</i> pangenome has an + open structure, which is typical for many bacterial species due to + their extensive exchange of genetic information. Performing this + analysis on a set of closely related animal or plant genomes usually + results in a closed pangenome, indicating a limited gene pool. + </p> + </details> -------------- -Adding phenotypes (requires PanTools v3) ---------------------------------------- +Functional annotations +---------------------- -Phenotype values can be Integers, Double, String or Boolean values. -Create a text file **phenotypes.txt**. +PanTools is able to incorporate functional annotations into the +pangenome by reading the output of various functional annotation tools. In +this tutorial we only include annotations from InterProScan. Please see +the :ref:`add_functions <construction/annotate:add functions>` +manual to check which other tools are available. To include the annotations, +create a file **functions.txt** using the text from the box below and pass it +as a command-line argument. .. code:: text - Genome,Solanum - 1,false - 2,false - 3,true - 4,true - 5,false + 1 YOUR_PATH/GCF_002904195.1.gff3 + 2 YOUR_PATH/GCF_000803215.1.gff3 + 3 YOUR_PATH/GCF_003595035.1.gff3 + 4 YOUR_PATH/GCF_000808375.1.gff3 + 5 YOUR_PATH/GCF_000808115.1.gff3 + 6 YOUR_PATH/GCA_000147055.1.gff3 + +.. code:: bash + + $ pantools add_functions pecto_dickeya_DB functions.txt + +PanTools will ask you to download the InterPro database. Follow the +steps and execute the program again. + +The complete GO, PFAM, InterPro and TIGRFAM databases are now +integrated into the graph database. Genes with a predicted function +have gained a relationship to that function node. Retrieving a set of +genes that share a function is now possible through a simple Cypher +query. If you run **metrics** again, statistics for these types of +functional annotations are calculated. To create a summary table for +each type of functional annotation, run +:ref:`function_overview <construction/annotate:function overview>`. + +.. code:: bash + + $ pantools function_overview pecto_dickeya_DB + +In **function_overview_per_group.csv** you can navigate to a homology +group or gene to see the connected functions. 
You can also search in the +opposite direction: use one of the created overview files for a type of +functional annotation and quickly navigate to a function of interest to +find which genes are connected. + +GO enrichment +~~~~~~~~~~~~~ + +We go back to the output files from gene classification that only +contain node identifiers. We can retrieve group functions by providing +one of the files to +:ref:`group_info <analysis/explore:group info>` with the +``--homology-groups`` argument. However, interpreting groups by +assessing each one individually is not very practical. A common +approach to discover interesting genes from a large set is +GO-enrichment. This statistical method enables the identification of +genes sharing a function that are significantly over- or +under-represented in a given gene set compared to the rest of the +genome. Let’s perform a :ref:`analysis/go_enrichment:go enrichment` +on the homology groups that are specific to *P. brasiliense*. Place the group +identifiers from the box below in a file called **brasiliense_groups.csv**. + +.. code:: text + + Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes + 1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946 + +.. code:: bash + + $ pantools group_info -H brasiliense_groups.csv pecto_dickeya_DB + $ pantools go_enrichment -H brasiliense_groups.csv pecto_dickeya_DB + +Open **go_enrichment.csv** with a spreadsheet tool. This file holds GO +terms found in at least one of the genomes, the p-value of the +statistical test and whether it is actually enriched after the multiple +testing correction. As this is different for each genome, a function +might be enriched in one genome but not in another. + +A directory with separate output files is created for each genome; open +**go_enrichment.csv** for genome 4 or 5 in a spreadsheet. Also take +a look at the PDF files that visualize part of the Gene ontology +hierarchy. + +Classifying functional annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Similarly to classifying gene content, functional annotations can be +categorized using +:ref:`functional_classification <analysis/classification:functional classification>`. +This tool provides an easy way to identify functions shared by a group +of genomes of a certain phenotype but can also be used to identify +core or unique functions. The functionality uses the same set of +arguments as **gene_classification**. You can go through the same +steps again to see the effect of changing the arguments. .. code:: bash - $ pantools add_phenotypes chloroplast_DB phenotypes.txt + $ pantools functional_classification pecto_dickeya_DB + $ pantools functional_classification --core-threshold=85 pecto_dickeya_DB + $ pantools functional_classification --exclude=6 pecto_dickeya_DB + $ pantools functional_classification --phenotype=species pecto_dickeya_DB -------------- -RETRIEVE functions +Sequence alignment ------------------ -Now that the construction is complete, lets quickly validate if the -construction was successful and the database can be used. To retrieve -some genomic regions, prepare a text file containing genomic -coordinates. Create the file **regions.txt** and include the following -for each region: genome number, contig number, start and stop position -and separate them by a single space +In the final part of this tutorial we will test the alignment function +by aligning homology groups. 
PanTools is able to align genomic regions, +genes and proteins to identify SNPs or amino acid changes with +:doc:`msa </analysis/msa>`. -.. code:: text +Start with the alignment of protein sequences from the 12 *P. +brasiliense* specific homology groups. + +.. code:: bash + + $ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB - 1 1 200 500 - 2 1 300 700 - 3 1 1 10000 - 3 1 1 10000 - - 4 1 9999 15000 - 5 1 100000 110000 +Go to the **pecto_dickeya_DB/alignments/grouping_v1/groups/** directory +and select one of homology groups and check if you can find the +following files -Now run the :ref:`retrieve_regions <analysis/explore:retrieve regions>` -function and include the new text file +- The alignments are written to **prot_trimmed.fasta** and + **prot_trimmed.afa**. +- A gene tree is written to **prot_trimmed.newick** +- **prot_trimmed_variable_positions.csv** located in the + **var_inf_positions** subdirectory. This matrix holds every variable + position of the alignment; the rows are the position in the alignment + and the columns are the 20 amino acids and gaps. +- The identity and similarity (for proteins) is calculated between + sequences and written to tables in the **similarity_identity** + subdirectory. + +Run the function again but include the ``--no-trimming`` argument. .. code:: bash - $ pantools retrieve_regions chloroplast_DB regions.txt + $ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB -Take a look at the extracted regions that are written to the -**chloroplast_DB/retrieval/regions/** directory. +The output files are generated right after the first alignment without +trimming the sequences first. The file names differ from the trimmed +alignments by the ’*_trimmed*’ substring. -To retrieve entire genomes, prepare a text file **genome_numbers.txt** -and include each genome number on a separate line in the file +Run the function again but exclude the ``--mode=protein`` and +``--no-trimming`` arguments. When no additional arguments are included +to the command, both nucleotide and protein sequences are aligned two +consecutive times. -.. code:: text +.. code:: bash + + $ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB - 1 - 3 - 5 +Again, the same type of files are generated but the output files from +nucleotide sequence can be recognized by the ‘*nuc\_*’ substrings. The +matrix in **nuc_trimmed_variable_positions.csv** now only has columns +for the four nucleotides and gaps. -Use the **retrieve_regions** function again but include the new text -file +Finally, run the function one more time but include a phenotype. This +allows you to identify phenotype specific SNPs or amino acid changes. .. code:: bash - $ pantools retrieve_regions chloroplast_DB genome_numbers.txt + $ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB + +Open the **nuc**- or **prot_trimmed_phenotype_specific_changes.info** +file inside one of the homology group output directories. + +-------------- + +Besides the functionalities in this tutorial, PanTools has more useful +functions that may aid you in retrieving more specific information from +the pangenome. -Genome files are written to same directory as before. Take a look at one -of the three genomes you have just retrieved. +- Identify shared k-mers between genomes with + :ref:`kmer_classification <analysis/classification:k-mer classification>`. 
+- Find co-localized genes in a set of homology groups: + :ref:`locate_genes <analysis/explore:locate genes>`. +- Mapping short reads against the pangenome with :ref:`map + <analysis/mapping:map>`. -In :doc:`part 3 <tutorial_part3>` of the tutorial we explore the -pangenome you just built using the Neo4j browser and the Cypher language. +In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the +phylogenetic methods implemented in PanTools. diff --git a/docs/source/tutorial/tutorial_part3.rst b/docs/source/tutorial/tutorial_part3.rst index 93949bde0d367c790d43df68222521b1f098a242..dfb755355e7225df89b0923bfd61d0c3a6b7dcee 100644 --- a/docs/source/tutorial/tutorial_part3.rst +++ b/docs/source/tutorial/tutorial_part3.rst @@ -1,291 +1,259 @@ -Part 3. Explore the pangenome using the Neo4j browser -===================================================== +Part 3. Phylogeny +================= -Did you skip :doc:`part 2 <tutorial_part2>` of the tutorial or were you -unable to build the chloroplast pangenome? Download the pre-constructed -pangenome -`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_ -or via wget. +Part 3 preparation +------------------ + +If you did not follow part 2 of the tutorial, download the +pre-constructed pangenome +`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_. .. code:: bash - $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz - $ tar -xvzf chloroplast_DB.tar.gz + $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz + $ tar xvzf pecto_dickeya_DB.tar.gz -------------- -Configuring Neo4j ------------------ +Adding phenotype/metadata to the pangenome +------------------------------------------ -Set the full path to the chloroplast pangenome database by opening -neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and include the -following line in the config file. Please make sure there is always only -a single uncommented line with 'dbms.directories.data'. +Before we construct the trees, we will add some phenotype data to the +pangenome. Once the we have a phylogeny, the information can be included +or be used to color parts of the tree. Below is a textfile with data for +three phenotypes. The third phenotype, *low_temperature*, is in this +case a made up example! It states whether the strain is capable of +growing on (extreme) low temperatures. The phenotype file can be found +inside the database directory, add the information to the pangenome by +using :ref:`add_phenotypes <construction/annotate:add phenotypes>`. .. code:: text - #dbms.directories.data=/YOUR_PATH/any_other_database - dbms.directories.data=/YOUR_PATH/chloroplast_DB - -| **Allowing non-local connections** -| To be able to run Neo4j on a server and have access to it from - anywhere, some additional lines in the config file must be changed. - -- **Uncomment** the four following lines in - neo4j-community-3.5.30/conf/neo4j.conf. -- Replace 7686, 7474, and 7473 by three different numbers that are not - in use by other people on your server. In this way, everyone can have - their own database running at the same time. - -.. code:: text - - #dbms.connectors.default_listen_address=0.0.0.0 - #dbms.connector.bolt.listen_address=:7687 - #dbms.connector.http.listen_address=:7474 - #dbms.connector.https.listen_address=:7473 - -Lets start up the Neo4j server! - -.. code:: bash - - $ neo4j start - -Start Firefox (or a web browser of your own preference) and let it run -on the background. 
+ Genome, species, strain_name, low_temperature + 1,P. odoriferum,P. odoriferum Q166, false + 2,P. fontis, P. fontis M022, true + 3,P. polaris,P. polaris S4.16.03.2B, false + 4,P. brasiliense, P. brasiliense S2, true + 5,P. brasiliense, P. brasiliense Y49, false + 6,D. dadantii, D. dadantii 3937,? .. code:: bash - $ firefox & - -In case you did not change the config to allow non-local connections, -browse to *http://localhost:7474*. Whenever you did change the config -file, go to *server_address:7474*, where 7474 should be replaced with the -number you chose earlier. - -If the database startup was successful, a login terminal will appear in -the webpage. Use '*neo4j*' both as username and password. After logging -in, you are requested to set a new password. + $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt -------------- -Exploring nodes and edges in Neo4j ----------------------------------- - -Go through the following steps to become proficient in using the Neo4j -browser and the underlying PanTools data structure. If you have any -difficulty trouble finding a node, relationship or any type of -information, download and use `this visual guide -<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_. - -1. Click on the database icon on the left. A menu with all node types - and relationship types will appear. -2. Click on the '*gene*' button in the node label section. This - automatically generated a query. Execute the query. -3. The **LIMIT** clause prevents large numbers of nodes popping up to - avoid your web browser from crashing. Set LIMIT to 10 and execute the - query. -4. Hover over the nodes, click on them and take a look at the values - stored in the nodes. All these features (except ID) were extracted - from the GFF annotation files. ID is an unique number automatically - assigned to nodes and relationships by Neo4j. -5. Double-click on the **matK** gene node, all nodes with a connection - to this gene node will appear. The nodes have distinct colors as - these are different node types, such as **mRNA**, **CDS**, - **nucleotide**. Take a look at the node properties to observe that - most values and information is specific to a certain node type. -6. Double-click on the *matK* mRNA node, a **homology_group** node - should appear. These type of nodes connect homologous genes in the - graph. However, you can see this gene did not cluster with any other - gene. -7. Hover over the **start** relation of the *matK* gene node. As you - can see information is not only stored in nodes, but also in - relationships! A relationship always has a certain direction, in - this case the relation starts at the gene node and points to a - nucleotide node. Offset marks the location within the node. -8. Double-click on the **nucleotide** node at the end of the 'start' - relationship. An in- and outgoing relation appear that connect to - other nucleotide nodes. Hover over both the relations and compare - them. The relations holds the genomic coordinates and shows this - path only occurs in contig/sequence 1 of genome 1. -9. Follow the outgoing **FF**-relationship to the next nucleotide node - and expand this node by double-clicking. Three nodes will pop up - this time. If you hover over the relations you see the coordinates - belong to other genomes as well. You may also notice the - relationships between nucleotide nodes is always a two letter - combination of F (forward) and R (reverse) which state if a sequence - is reverse complemented or not. 
The first letter corresponds to the - sequence of the node at the start of the relation where the second - letters refers to the sequence of the end node. -10. Finally, execute the following query to call the database scheme to - see how all node types are connected to each other: *CALL - db.schema()*. The schema will be useful when designing your own - queries! - --------------- - -Query the pangenome database using CYPHER ------------------------------------------ - -Cypher is a declarative, SQL-inspired language and uses ASCII-Art to -represent patterns. Nodes are represented by circles and relationships -by arrows. - -- The **MATCH** clause allows you to specify the patterns Neo4j will - search for in the database. -- With **WHERE** you can add constraints to the patterns described. -- In the **RETURN** clause you define which parts of the pattern to - display. +Constructing a phylogeny +------------------------ -Cypher queries -~~~~~~~~~~~~~~ +In this tutorial we will construct three phylogenies, each based on a +different type of variation: SNPs, genes and k-mers. Take a look at the +phylogeny manuals to get an understanding how the three methods work and +how they differ from each other. -**Match and return 100 nucleotide nodes** +1. :ref:`analysis/phylogeny:core phylogeny` +2. :ref:`analysis/phylogeny:gene distance tree` +3. :ref:`analysis/phylogeny:k-mer distance tree` -.. code:: text - - MATCH (n:nucleotide) RETURN n LIMIT 100 +Core SNP phylogeny +------------------ -**Find all the genome nodes** +The core SNP phylogeny will run various Maximum Likelihood models on +parsimony informative sites of single-copy orthologous sequences. A site +is parsimony-informative when there are at least two types of +nucleotides that occur with a minimum frequency of two. The informative +sites are automatically identified by aligning the sequences; however, +it does not know which sequences are single-copy orthologous. You can +identify these conserved sequences by running +:ref:`gene_classification <analysis/classification:gene classification>`. -.. code:: text +.. code:: bash - MATCH (n:genome) RETURN n + $ pantools gene_classification --phenotype=species pecto_dickeya_DB -**Find the pangenome node** +Open **gene_classification_overview.txt** and take a look at statistics. +As you can see there are 2134 single-copy ortholog groups. Normally, all +of these groups are aligned to identify SNPs but for this tutorial we’ll +make a selection of only a few groups to accelerate the steps. You can +do this in two different ways: -.. code:: text +Option 1: Open **single_copy_orthologs.csv** and remove all node +identifiers after the first 20 homology groups and save the file. - MATCH (n:pangenome) RETURN n +.. code:: bash -**Match and return 100 genes** + $ pantools core_phylogeny --mode=ML pecto_dickeya_DB -.. code:: text +Option 2: Open **single_copy_orthologs.csv** and select the first 20 +homology_group node identifiers. Place them in a new file sco_groups.txt +and include this file to the function. - MATCH (g:gene) RETURN g LIMIT 100 +.. code:: bash -**Match and return 100 genes and order them by length** + $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB -.. code:: text +The sequences of the homology groups are being aligned two consecutive +times. After the initial alignment, input sequences are trimmed based on +the longest start and end gap of the alignment. The parsimony +informative positions are taken from the second alignment and +concatenated into a sequence. 
When opening **informative.fasta** you can +find 6 sequences, the length of the sequences being the number of +parsimony-informative sites. - MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100 +.. code:: bash -**The same query as before but results are now returned in a table** + $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000 -.. code:: text +IQ-tree generates several files, the tree that we later on in the +tutorial will continue with is called **informative.fasta.treefile**. +When examining the **informative.fasta.iqtree** file you can find the +best fit model of the data. This file also shows the number of sites +that were used, as sites with gaps (which IQ-tree does not allow) were +changed into singleton or constant sites. - MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100 +Gene distance tree +~~~~~~~~~~~~~~~~~~ -**Return genes which are longer as 100 but shorter than 250 bp** (this -can also be applied to other features such as exons introns or CDS) +To create a phylogeny based on gene distances (absence/presence), we can +simply execute the Rscript that was created by +:ref:`gene_classification <analysis/classification:gene classification>`. -.. code:: text +.. code:: bash - MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100 + $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R -**Find genes located on first genome** +The resulting tree is called **gene_distance.tree**. -.. code:: text +K-mer distance tree +~~~~~~~~~~~~~~~~~~~ - MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100 +To obtain a k-mer distance phylogeny, the k-mers must first be counted +with the :ref:`kmer_classification <analysis/classification:k-mer +classification>` +function. Afterwards, the tree can be constructed by executing the +Rscript. -**Find genes located on first genome and first sequence** +.. code:: bash -.. code:: text + $ pantools kmer_classification pecto_dickeya_DB + $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R - MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100 +The resulting tree is written to **genome_kmer_distance.tree**. -------------- -Homology group queries -~~~~~~~~~~~~~~~~~~~~~~ +Renaming tree nodes +------------------- -**Return 100 homology groups** - -.. code:: text +So far, we used three different types of distances (SNPs, genes, +k-mers), and two different methods (ML, NJ) to create three phylogenetic +trees. First, lets take a look at the text files. The +**informative.fasta.treefile** only contain genome numbers, bootstrap +values and branch lengths but is lacking the metadata. Examining +**gene_distance.tree** file also shows this information but the species +names as well, because we included this as a phenotype during +:ref:`gene_classification <analysis/classification:gene classification>`. - MATCH (h:homology_group) RETURN h LIMIT 100 +Let’s include the strain identifiers to the core snp tree to make the +final figure more informative. Use the +:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to +rename the tree nodes. -**Match homology groups which contain two members** - -.. code:: text - - MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h - -**Match homology groups and 'walk' to the genes and corresponding start -and end node** - -.. 
code:: text
-
-    MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-
-Turn off autocomplete by clicking on the button on the bottom right. The
-graph falls apart because relations were not assigned to variables.
-
-**The same query as before but now the relations do have variables**
-
-.. code:: text
-
-    MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25
-
-When you turn off autocomplete again only the '*is_similar_to*' relation
-disappears since we did not call it
-
-**Find homology group that belong to the rpoC1 gene**
-
-.. code:: text
-
-    MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN *
-
-**Find genes on genome 1 which don't show homology**
+.. code:: bash
 
-.. code:: text
+    $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile
 
-    MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN *
+Take a look at **informative.fasta_RENAMED.treefile**; the strain
+identifiers have been added to the tree.
 
 --------------
 
-Structural variant detection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-**Find SNP bubbles (for simplification we only use the FF relation)**
-
-.. code:: text
-
-    MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50
-
-**The same query but returning the results in a table**
-
-.. code:: text
+Visualizing the tree in iTOL
+----------------------------
+
+Go to https://itol.embl.de and click on “Upload a tree” under the
+**ANNOTATE** box. On this page you can paste the tree directly into the
+**tree text:** textbox or click the button to upload the .newick file.
+
+.. figure:: /figures/tutorial_phylogeny1.png
+    :width: 600
+    :align: center
+
+Basic controls iTOL
+-------------------
+
+- The default way of visualizing a tree is the rectangular view.
+  Depending on the number of genomes, the circular view can be easier
+  to interpret. You can change the view by clicking on the “Display Mode”
+  buttons.
+- Increase the font size and branch width to improve readability.
+- When visualizing a Maximum likelihood (ML) tree, bootstrap values can
+  be displayed by clicking the “Display” button next to
+  **Bootstrap/metadata** in the Advanced tab of the Control window.
+  This enables you to visualize the values as text or a symbol on the
+  branch, or by coloring the branch or adjusting its width.
+
+.. figure:: /figures/tutorial_phylogeny2.png
+    :width: 600
+    :align: center
+
+- When you have a known outgroup or one of the genomes is a clear
+  outlier in the tree, you should reroot the tree. Hover over the name
+  and click it so a pop-up menu appears. Click “tree structure” followed
+  by “Reroot the tree here”.
+
+.. figure:: /figures/tutorial_phylogeny3.png
+    :width: 600
+    :align: center
+
+- Clicking on the name of a node in the tree allows you to color the
+  name, branch, or background of that specific node.
+- When you’re happy with the way your tree looks, go to the Export tab
+  of the Control window. Select the desired output format, click the
+  “Full image” button and export the figure.
+- Refresh the webpage to go back to the default view of your tree.
+
+Create iTOL templates
+---------------------
+
+In iTOL it is possible to add colors to the tree by coloring the
+terminal nodes or adding an outer ring. The PanTools function
+:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able
+to create templates that allow for easy coloring (with a maximum of 20
+colors). If the function is run without any additional argument,
+templates are created for trees that only contain genome numbers
+(e.g. the k-mer distance tree). Here we want to color the (renamed)
+core SNP tree with the ‘low_temperature’ phenotype. Therefore,
+``--phenotype=strain_name`` must be included in the command.
 
-    MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50
-
-Functions such as **count()**, **sum()** and **stDev()** can be used in
-a query.
-
-**The same SNP query but count the hits instead of displaying them**
-
-.. code:: text
+.. code:: bash
 
-    MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p)
+    $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only
+    $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB
 
---------------
+Copy the two low_temperature.txt files from the label/strain_name/ and
+ring/strain_name/ directories to your personal computer. Click and drag
+the ring template file into the tree visualization webpage.
 
-Hopefully you know have some feeling with the Neo4j browser and cypher
-and you're inspired to create your own queries!
+.. figure:: /figures/tutorial_phylogeny4.png
+    :width: 600
+    :align: center
 
-When you're done working in the browser, close the database (by using
-the command line again).
+The resulting tree should look like this when the tree is rooted with
+the *Dickeya* genome, bootstrap values are displayed as text and the
+ring color template is included.
 
-.. code:: text
+.. figure:: /figures/tutorial_phylogeny_color.png
+    :width: 600
+    :align: center
 
-    $ neo4j stop
+Tree coloring is especially useful for large datasets. An example is
+shown in the figure below, where members of the same species share a
+color.
 
-| More information on Neo4j and the cypher language:
-| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_
-| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_
-| `Neo4j API <https://neo4j.com/developer/>`_
+.. figure:: /figures/tutorial_phylogeny_tree.png
+    :width: 600
+    :align: center
 
---------------
-
-In :doc:`part 4 <tutorial_part4>` of the tutorial we explore some of the
-functionalities to analyze the pangenome.
diff --git a/docs/source/tutorial/tutorial_part4.rst b/docs/source/tutorial/tutorial_part4.rst
index fafb5b4680c75697f6c2603f900fc7d5e69dd53f..ea5c20ead32184a25db9716b8db213d928947888 100644
--- a/docs/source/tutorial/tutorial_part4.rst
+++ b/docs/source/tutorial/tutorial_part4.rst
@@ -1,464 +1,212 @@
-Part 4. Characterization
-========================
+Part 4 - Visualization using PanVA
+==================================
 
-Part 4 preparation
-------------------
+This part guides you through the process of getting from raw data to a PanVA
+instance.
 
-PanTools v3 is required to follow this part of the tutorial. In
-addition, MAFFT and R (and a few packages) need to be installed and set
-to your $PATH. Everything should already be correctly installed if you
-use the conda environment. Validate if the tools are executable by using
-the following commands.
-..
code:: bash - - $ Rscript --help - $ mafft -h - -We assume a PanTools alias was set during the -:ref:`installation <getting_started/install:set pantools alias>`. This allows -PanTools to be executed with ``pantools`` rather than the full path to the jar -file. If you don’t have an alias, either set one or replace the pantools -command with the full path to the .jar file in the tutorials. +Data Packages +------------- -Input data ----------- +Currently one small data package is available, which contain all the information +for creating a pangenome using PanTools and creating a PanVA instance from it, +all the way from raw data to pangenome browser. This data package is different +from the one in the performance test, so don't reuse that one here! -.. csv-table:: - :file: /tables/bacteria_datasets.csv - :header-rows: 1 - :delim: ; + .. list-table:: + :widths: 25 25 25 25 + :header-rows: 1 -To demonstrate how to use the PanTools functionalities we use a small -dataset of six bacteria to avoid long runtimes. Download a -pre-constructed pangenome or test your new skills and construct a -pangenome yourself using the fasta and gff files. + * - data type + - link + - contents + - database size + * - Fungi + - `Yeast <https://www.bioinformatics.nl/pangenomics/data/ + yeast_panva.tar.gz>`_ [4.3K] + - 10 genomes + - 4G -Option 1: Download separate genome and annotation files -.. code:: bash - - $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_input.tar.gz - $ tar -xvzf pecto_dickeya_input.tar.gz - $ gzip -d pecto_dickeya_input/annotations/* - $ gzip -d pecto_dickeya_input/genomes/* - $ gzip -d pecto_dickeya_input/functions/* +Steps to generate PanVA input +----------------------------- - $ pantools build_pangenome pecto_dickeya_DB pecto_dickeya_input/genomes.txt - $ pantools add_annotations --connect pecto_dickeya_DB pecto_dickeya_input/annotations.txt - $ pantools group --relaxation=4 pecto_dickeya_DB -Option 2: Download the pre-constructed pangenome +1: Download the data package +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: bash - $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz - $ tar -xvzf pecto_dickeya_DB.tar.gz - --------------- + $ wget https://www.bioinformatics.nl/pangenomics/data/<target-dataset>.tar.gz + $ tar xvzf <target-dataset>.tar.gz -Adding phenotype/metadata to the pangenome ------------------------------------------- +Every package contains a README with all the exact commands, so make sure to +check those if you're stuck. +For all packages, those can be found at the root of the decompressed TAR-files. +The Snakemake pipelines used in this workflow create conda environments +in the data package directory. If you want to re-use these pipelines for +different data, use the ``--conda-prefix`` Snakemake command to set a directory +where the conda environments will be stored. +All commands should be run from the root directory of the package. +Run the whole package from RAM-disk or SSD, or set the path of the results to +RAM-disk/SSD in the configs. -Before starting with the analysis, we will add some phenotype data to -the pangenome. Phenotypes allow you to find similarities for a group of -genomes sharing a phenotype as well as identifying variation between -different phenotypes. Below is a textfile with data for three -phenotypes. The third phenotype, *low_temperature*, is in this case a -made up example! It states whether the strain is capable of growing on -(extreme) low temperatures. 
The phenotype file can be found inside the
-database directory or create a new file using the text from the box
-below. Add the phenotype information to the pangenome using
-:ref:`add_phenotypes <construction/annotate:add phenotypes>`.
+2: Downloading data for the pangenome
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: text
+Goal:
+    * Acquire genome and structural annotation data
 
-    Genome, species, strain_name, low_temperature
-    1,P. odoriferum,P. odoriferum Q166, false
-    2,P. fontis, P. fontis M022, true
-    3,P. polaris,P. polaris S4.16.03.2B, false
-    4,P. brasiliense, P. brasiliense S2, true
-    5,P. brasiliense, P. brasiliense Y49, false
-    6,D. dadantii, D. dadantii 3937,?
+To download the corresponding raw data for each of the above-linked packages,
+follow the first steps of the instructions laid out in the README file.
+The command looks like this:
 
 .. code:: bash
 
-    $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_input/phenotypes.txt
+    $ ./<target-dataset>_download.sh
 
---------------
+Make sure the ``gunzip`` command is available on your system.
 
-Metrics and general statistics
-------------------------------
-After building or uncompressing the pangenome, run the
-:ref:`metrics <analysis/metrics:metrics>` functionality to produce
-various statistics that should verify an errorless construction.
+3: Preprocessing the data for PanTools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. code:: bash
-
-    $ pantools metrics pecto_dickeya_DB
-
-Open **metrics_per_genome.csv** with a spreadsheet tool (Excel,
-Libreoffice, Google sheets) and make sure the columns are split on
-commas. You may easily notice the many empty columns in this table as
-these type of annotations or features are not included in the database
-(yet). Functional annotations are incorporated later in this tutorial.
-Columns for features like exon and intron will remain empty as bacterial
-coding sequences are not interrupted.
+Goal:
+    * Filter the genome FASTA files on a minimum sequence size
+    * Filter CDS features in the annotation on a minimum ORF size
+    * Extract protein sequences by matching CDS features to genomic sequences
+    * Create functional annotations for the extracted protein sequences (optional)
+    * Generate statistics for raw and filtered data
 
--------------- 
+This requires the following three steps:
 
-Gene classification
--------------------
+**Step 1**: Clone the PanUtils data filtering pipeline
 
-With the :ref:`gene_classification <analysis/classification:gene
-classification>` functionality you are able to organize the gene repertoire
-into the core, accessory or unique part of the pangenome.
-
-- **Core**, a gene is present in all genomes
-- **Unique**, a gene is present in a single genome
-- **Accessory**, a gene is present in some but not all genomes
+The
+:ref:`data filtering pipeline <getting_started/diy_pangenomics:Quality control pipeline>`
+filters out small sequences, matches FASTA with GFF contents and removes CDS
+features with an ORF below the cutoff value. It also extracts protein sequences
+and creates functional annotations from them.
 
 .. code:: bash
 
-    $ pantools gene_classification pecto_dickeya_DB
-
-.. figure:: /figures/classify_table.png
-    :width: 600
-    :align: center
+    $ git clone https://github.com/PanUtils/pantools-qc-pipeline.git
 
-Take a look in **gene_classification_overview.txt**. Here you can find
-the number of classified homology groups and genes on a pangenome level
-but also for individual genomes.
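+
+The QC and pangenome pipelines used in the following steps create their conda
+environments inside the data package directory by default. If you plan to reuse
+the pipelines for other datasets, Snakemake's standard ``--conda-prefix`` option
+(mentioned above) stores the environments in a shared location instead. A
+minimal sketch, assuming a hypothetical ``~/snakemake-envs`` directory and the
+same filtering command that is run in step 3:
+
+.. code:: bash
+
+    # Hypothetical shared directory for the pipelines' conda environments.
+    $ mkdir -p ~/snakemake-envs
+    $ snakemake --use-conda --conda-prefix ~/snakemake-envs \
+          --snakefile pantools-qc-pipeline/workflow/Snakefile \
+          --configfile config/<target-dataset>_qc.yaml --cores <threads>
+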
+**Step 2**: Activate or create Snakemake -Open **additional_copies.csv** with a spreadsheet tool. This file can be -useful to identify duplicated genes in relation to other genomes. - -The default criteria to call a group core is presence in all genomes -where unique is only allowed to be present in one genome. These two -categories are highly influenced by annotation quality, especially in -large pangenomes. Luckily, the threshold for core and unique groups can -easily be adjusted. Let’s consider genes to be core when present in only -five of the six genomes by setting the ``--core-threshold`` argument. +Activate or create a snakemake environment (works in python +versions <= 3.11). .. code:: bash - $ pantools gene_classification --core-threshold=85 pecto_dickeya_DB + $ mamba create -c conda-forge -c bioconda -n snakemake snakemake=8 -Look in **gene_classification_overview.txt** again to observe the -increase of core groups/genes at the cost of accessory groups. +.. note:: -For this pangenome, the *Dickeya* genome is considered an outgroup to -the five *Pectobacterium* genomes. While this outgroup is needed to root -and analyze phylogenetic trees (tutorial part 5), it affects the number -classified groups for the all other genomes. Use ``--include`` or -``--exclude`` to exclude the *Dickeya* genome. + If you are using an ARM-based machine (such as an M4-based Mac), make sure + to make the new environment compatible with Intel-based packages. + Many dependencies in conda are not yet compatible with ARM systems. + Consider for example installing *Rosetta 2*. -.. code:: bash + .. code:: bash - $ pantools gene_classification --include=1-5 pecto_dickeya_DB - $ pantools gene_classification --exclude=6 pecto_dickeya_DB + $ softwareupdate --install-rosetta -Take a look in **gene_classification_overview.txt** one more time to -examine the effect of excluding this genome. The total number of groups -in the analysis is lower now but the number of core and unique genes -have increased for the five remaining genomes. + Please use this command to set up your environment: -When phenotype information is used in the analysis, three additional -categories can be assigned to a group: + .. code:: bash -- **Shared**, a gene present in all genomes of a phenotype -- **Exclusive**, a gene is only present in a certain phenotype -- **Specific**, a gene present in all genomes of a phenotype and is - also exclusive + $ CONDA_SUBDIR=osx-64 mamba create -c conda-forge -c bioconda -n snakemake snakemake=8 -Include a ``--phenotype`` argument to find genes that are exclusive for -a certain species. + This command ensures that packages are downloaded for an Intel architecture. + Afterwards, restart your shell with the "Open using Rosetta"-setting + enabled. For this, go to "Applications"/Utilities/Terminal" and click on + "Get Info". Select the option to start the terminal with Rosetta! -.. code:: bash +**Step 3**: Filter raw data and create functional annotations for protein +sequences - $ pantools gene_classification --phenotype=species pecto_dickeya_DB - -Open **gene_classification_phenotype_overview.txt** to see the number of -classified groups for the species phenotype. - -Open **phenotype_disrupted.csv** in a spreadsheet tool. This file -explains exactly why a homology groups is labeled as phenotype shared -and not specific. - -Open **phenotype_additional_copies.csv** in a spreadsheet tool. 
-Similarly to *phenotype_additional.csv* this file shows groups where all -genomes of a certain phenotype have additional gene copies to (at least -one of) the other phenotypes. - -Each time you run the -:ref:`gene_classification <analysis/classification:gene classification>` -function, multiple files are created that contain node identifiers of a certain -homology group category. These files can be given to other PanTools -functions for a downstream analysis, for example, sequence alignment, -phylogeny, or GO enrichment. We will use one of the files later in this -tutorial. - --------------- - -Pangenome structure -------------------- - -With the previous functionality we identified the core, accessory and -unique parts of the pangenome. Now we will use the -:ref:`pangenome_size_genes <analysis/classification:pangenome structure>` -function to observe how these numbers are reached by simulating the growth of -the pangenome. Simulating the growth helps explaining if a pangenome should -be considered open or closed. An pangenome is called open as long as a -significant number of new (unique) genes are added to the total gene -repertoire. The openness of a pangenome is usually tested using Heap’s -law. Heaps’ law (a power law) can be fitted to the number of new genes -observed when increasing the pangenome by one random genome. The formula -for the power law model is n = k x N-a, where n is the newly discovered -genes, N is the total number of genomes, and k and a are the fitting -parameters. A pangenome can be considered open when a < 1 and closed if -a > 1. - -The outcome of the function can again be controlled through command line -arguments. Genomes can be excluded from the analysis with ``--exclude``. -You can set the number of iterations with ``--loops``. +Filter the raw data and create protein sequences: .. code:: bash - $ pantools pangenome_structure pecto_dickeya_DB - -The current test set of six bacterial genomes is not representative of a -full-sized pangenome. Therefore we prepared the results for the -structure simulation on a set of 197 *Pectobacterium* genomes. The -runtime of the analysis using 10.000 loops and 24 threads was 1 minute -and 54 seconds. Download the files here, unpack the archive and take a -look at the files. - -.. code:: bash + $ snakemake --use-conda --snakefile pantools-qc-pipeline/workflow/Snakefile --configfile config/<target-dataset>_qc.yaml --cores <threads> - $ wget wget http://bioinformatics.nl/pangenomics/tutorial/pectobacterium_structure.tar.gz - $ tar -xvf pectobacterium_structure.tar.gz +4: Running all pangenome analyses using PanTools +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Normally you still have to run the R scripts to create the output figures -and determine the openness of the pangenome. +Goal: + * Build the pangenome + * Add structural annotations, functional annotations and phenotypes + * Add VCF information or phasing information if available + * Create homology groups + * Run the necessary analysis steps (gene classification, k-mer + classification, multiple sequence alignment, group info) in order to create + a PanVA instance -.. code:: bash +**Step 1**: Clone the PanTools pipeline v4 - cd pectobacterium_structure - $ Rscript pangenome_growth.R - $ Rscript gains_losses_median_and_average.R - $ Rscript heaps_law.R - -Take a look at the plot. In **core_accessory_unique_size.png**, the -number of classified groups are plotted for any of the genome -combination that occurred during the simulation. 
For the
-**core_accessory_size.png** plots, the number of unique groups is
-combined with accessory groups.
-
-The **gains_losses.png** files display the average and mean group gain
-and loss between different pangenome sizes. The line of the core starts
-below zero, meaning for every random genome added, the core genome
-decreases by a number of *X* genes.
-
-.. raw:: html
-
-    <details>
-    <summary>
-    The alpha value of heaps' law is written to
-    <b><i>heaps_law_alpha.txt</i></b>. Click here to reveal the result.
-    </summary>
-    <p>
-    <br>
-    With an alpha of 0.53 the <i>Pectobacterium</i> pangenome has an
-    open structure, which is typical for many bacterial species due
-    their extensive exchange of genetic information. Performing this
-    analysis on a set of closely related animal or plant genomes usually
-    results in a closed pangenome, indicating a limited gene pool.
-    </p>
-    </details>
-
---------------
-
-Functional annotations
-----------------------
-
-PanTools is able to incorporate functional annotations into the
-pangenome by reading output of various functional annotation tools. In
-this tutorial we only include annotations from InterProScan. Please see
-the :ref:`add_functions <construction/annotate:add functions>`
-manual to check which other tools are available. To include the annotations,
-create a file **functions.txt** using text from the box below and add it
-to the command line argument.
-
-.. code:: text
-
-    1 YOUR_PATH/GCF_002904195.1.gff3
-    2 YOUR_PATH/GCF_000803215.1.gff3
-    3 YOUR_PATH/GCF_003595035.1.gff3
-    4 YOUR_PATH/GCF_000808375.1.gff3
-    5 YOUR_PATH/GCF_000808115.1.gff3
-    6 YOUR_PATH/GCA_000147055.1.gff3
+The :ref:`PanTools pipeline
+<getting_started/diy_pangenomics:PanTools v4 pipeline>` contains all PanTools
+functions. We will use it here to create a pangenome and run all steps discussed
+above.
 
 .. code:: bash
 
-    $ pantools add_functions pecto_dickeya_DB functions.txt
+    $ git clone https://github.com/PanUtils/pantools-pipeline-v4.git
 
-PanTools will ask you to download the InterPro database. Follow the
-steps and execute the program again.
+**Step 2**: Run PanTools for the PanVA-specific analyses
 
-The complete GO, PFAM, Interpro and TIGRFAM, databases are now
-integrated in the graph database after. Genes with a predicted function
-have gained a relationship to that function node. Retrieving a set of
-genes that share a function is now possible through a simple cypher
-query. If you would run **metrics** again, statistics for these type
-functional annotations are calculated. To create a summary table for
-each type of functional annotation, run
-:ref:`function_overview <construction/annotate:function overview>`.
+The Snakemake rule ``panva`` takes care of creating a pangenome and running all
+functions necessary to create a PanVA instance. These are all started with a
+single command, outlined below.
 
 .. code:: bash
 
-    $ pantools function_overview pecto_dickeya_DB
-
-In **function_overview_per_group.csv** you can navigate to a homology
-group or gene to see the connected functions. You can also search in the
-opposite direction, use one of the created overview files for a type of
-functional annotation and quickly navigate to a function of interest to
-find which genes are connected.
+    $ snakemake panva --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml --cores <threads>
 
-GO enrichment
-~~~~~~~~~~~~~
+This will create a pangenome database from which PanVA files can be generated.
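+
+If you want to see which PanTools steps the ``panva`` rule triggers without
+actually running them, Snakemake's standard dry-run flag (``-n``) prints the
+planned jobs and exits. A minimal sketch, reusing the exact command from above:
+
+.. code:: bash
+
+    # Dry run: list the jobs the panva rule would execute, without running them.
+    $ snakemake panva -n --use-conda --snakefile pantools-pipeline-v4/workflow/Snakefile --configfile config/<target-dataset>_pantools.yaml
+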
-We go back to the output files from gene classification that only -contain node identifiers. We can retrieve group functions by providing -one the files to -:ref:`group_info <analysis/explore:group info>` with the -``--homology-groups`` argument. However, interpreting groups by -assessing each one individually is not very practical. A common -approach to discover interesting genes from a large set is -GO-enrichment. This statistical method enables the identification of -genes sharing a function that are significantly over or -under-represented in a given gene set compared to the rest of the -genome. Let’s perform a :ref:`analysis/go_enrichment:go enrichment` -on homology groups of the core genome. +5: Generate input for PanVA instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. code:: text +Goal: + * Preprocessing data for PanVA - Phenotype: P._brasiliense, 2 genomes, threshold of 2 genomes - 1278537,1282642,1283856,1283861,1283862,1283869,1283906,1283921,1283934,1283941,1283945,1283946 +**Step 1**: Clone the export-to-panva python script .. code:: bash - $ pantools group_info -H brasiliense_groups.csv pecto_dickeya_DB - $ pantools go_enrichment -H brasiliense_groups.csv pecto_dickeya_DB - -Open **go_enrichment.csv** with a spreadsheet tool. This file holds GO -terms found in at least one of the genomes, the p-value of the -statistical test and whether it is actually enriched after the multiple -testing correction. as this is different for each genome a function -might enriched in one genome but not in another. - -A directory with seperate output files is created for each genome, open -**go_enrichment.csv** for the genome 4 or 5 in a spreedsheet. Also take -a look at the PDF files that visualize part of the Gene ontology -hierarchy. + $ git clone https://github.com/PanUtils/export-to-panva.git -Classifying functional annotations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Similarly to classifying gene content, functional annotations can be -categorized using -:ref:`functional_classification <analysis/classification:functional classification>`. -This tool provides an easy way to identify functions shared by a group -of genomes of a certain phenotype but can also be used to identify -core or unique functions. The functionality uses the same set of -arguments as **gene_classification**. You can go through the same -steps again to see the effect of changing the arguments. - -.. code:: bash - - $ pantools functional_classification pecto_dickeya_DB - $ pantools functional_classification --core-threshold=85 pecto_dickeya_DB - $ pantools functional_classification --exclude=6 pecto_dickeya_DB - $ pantools functional_classification --phenotype=species pecto_dickeya_DB - --------------- - -Sequence alignment ------------------- - -In the final part of this tutorial we will test the alignment function -by aligning homology groups. PanTools is able to align genomic regions, -genes and proteins to identify SNPs or amino acid changes with -:doc:`msa </analysis/msa>`. - -Start with the alignment of protein sequences from the 12 *P. -brasiliense* specific homology groups. - -.. code:: bash - - $ pantools msa --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB - -Go to the **pecto_dickeya_DB/alignments/grouping_v1/groups/** directory -and select one of homology groups and check if you can find the -following files - -- The alignments are written to **prot_trimmed.fasta** and - **prot_trimmed.afa**. 
-- A gene tree is written to **prot_trimmed.newick** -- **prot_trimmed_variable_positions.csv** located in the - **var_inf_positions** subdirectory. This matrix holds every variable - position of the alignment; the rows are the position in the alignment - and the columns are the 20 amino acids and gaps. -- The identity and similarity (for proteins) is calculated between - sequences and written to tables in the **similarity_identity** - subdirectory. - -Run the function again but include the ``--no-trimming`` argument. +**Step 2**: Create a conda environment for the script .. code:: bash - $ pantools msa --no-trimming --mode=protein --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB - -The output files are generated right after the first alignment without -trimming the sequences first. The file names differ from the trimmed -alignments by the ’*_trimmed*’ substring. + $ mamba env create -n export-to-panva -f export-to-panva/envs/pantova.yaml + $ conda activate export-to-panva -Run the function again but exclude the ``--mode=protein`` and -``--no-trimming`` arguments. When no additional arguments are included -to the command, both nucleotide and protein sequences are aligned two -consecutive times. +.. note:: + Make sure to create an environment that can deal with + Intel-based dependencies if you are on a silicon-based Mac. -.. code:: bash - - $ pantools msa --method=per-group -H=brasiliense_groups.csv pecto_dickeya_DB - -Again, the same type of files are generated but the output files from -nucleotide sequence can be recognized by the ‘*nuc\_*’ substrings. The -matrix in **nuc_trimmed_variable_positions.csv** now only has columns -for the four nucleotides and gaps. +**Step 3**: Run the export script -Finally, run the function one more time but include a phenotype. This -allows you to identify phenotype specific SNPs or amino acid changes. +The export script reads data from the pangenome database and converts it to +the proper format for PanVA. Run the following command from the root of the data +package, to create the inputs for PanVA: .. code:: bash - $ pantools msa --no-trimming --phenotype=low_temperature -H=brasiliense_groups.csv pecto_dickeya_DB - -Open the **nuc**- or **prot_trimmed_phenotype_specific_changes.info** -file inside one of the homology group output directories. - --------------- + $ python3 export-to-panva/scripts/pan_to_va.py config/<target-dataset>_panva.ini -Besides the functionalities in this tutorial, PanTools has more useful -functions that may aid you in retrieving more specific information from -the pangenome. +6: Create a PanVA instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- Identify shared k-mers between genomes with - :ref:`kmer_classification <analysis/classification:k-mer classification>`. -- Find co-localized genes in a set of homology groups: - :ref:`locate_genes <analysis/explore:locate genes>`. -- Mapping short reads against the pangenome with :ref:`map - <analysis/mapping:map>`. +Goal: + * Set up the PanVA instance -In :doc:`part 5 <tutorial_part5>` of the tutorial we explore some of the -phylogenetic methods implemented in PanTools. +With the output of the export script, you should be able to create a PanVA +instance for your dataset using the instructions from +`PanVA's Technical Test +<https://panbrowse.github.io/PanVA/v0.0.0/technical-test.html>`_. 
diff --git a/docs/source/tutorial/tutorial_part5.rst b/docs/source/tutorial/tutorial_part5.rst
index fcaaee486109df024348ce683c5f8b86f3435012..5dbe08a6c0942422af2fb264f436d1476a47f620 100644
--- a/docs/source/tutorial/tutorial_part5.rst
+++ b/docs/source/tutorial/tutorial_part5.rst
@@ -1,275 +1,290 @@
-Part 5. Phylogeny
-=================
+Part 5. Explore the pangenome using the Neo4j browser
+=====================================================
 
-Part 5 preparation
-------------------
-
-Pantools v3 is required to follow this part of the tutorial. In
-addition, MAFFT, FastTree, IQ-tree, R (and the ape R package) need to be
-installed and set to your $PATH. Validate if the tools are executable by
-using the following commands.
+Did you skip :doc:`part 1 <tutorial_part1>` of the tutorial or were you
+unable to build the chloroplast pangenome? Download the pre-constructed
+pangenome
+`here <http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz>`_
+or via wget.
 
 .. code:: bash
 
-   pantools --version
-   Rscript --help
-   mafft -h
-   iqtree -h
-   fasttree -h
+   $ wget http://bioinformatics.nl/pangenomics/tutorial/chloroplast_DB.tar.gz
+   $ tar xvzf chloroplast_DB.tar.gz
 
-If you did not follow part 4 of the tutorial, download the
-pre-constructed pangenome
-`here <http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz>`_.
+--------------
 
-.. code:: bash
+Configuring Neo4j
+-----------------
 
-   $ wget http://bioinformatics.nl/pangenomics/tutorial/pecto_dickeya_DB.tar.gz
-   $ tar -xvzf pecto_dickeya_DB.tar.gz
+Set the full path to the chloroplast pangenome database by opening
+neo4j.conf ('*neo4j-community-3.5.30/conf/neo4j.conf*') and including the
+following line in the config file. Please make sure there is always only
+a single uncommented line with 'dbms.directories.data'.
 
--------------- 
+.. code:: text
 
-Adding phenotype/metadata to the pangenome
-------------------------------------------
+   #dbms.directories.data=/YOUR_PATH/any_other_database
+   dbms.directories.data=/YOUR_PATH/chloroplast_DB
 
-Before we construct the trees, we will add some phenotype data to the
-pangenome. Once the we have a phylogeny, the information can be included
-or be used to color parts of the tree. Below is a textfile with data for
-three phenotypes. The third phenotype, *low_temperature*, is in this
-case a made up example! It states whether the strain is capable of
-growing on (extreme) low temperatures. The phenotype file can be found
-inside the database directory, add the information to the pangenome by
-using :ref:`add_phenotypes <construction/annotate:add phenotypes>`.
+| **Allowing non-local connections**
+| To be able to run Neo4j on a server and have access to it from
+  anywhere, some additional lines in the config file must be changed.
+
+- **Uncomment** the following four lines in
+  neo4j-community-3.5.30/conf/neo4j.conf.
+- Replace 7687, 7474 and 7473 with port numbers that are not in use on your
+  system. This way everyone can have their own database running at the same time.
 
 .. code:: text
 
-   Genome, species, strain_name, low_temperature
-   1,P. odoriferum,P. odoriferum Q166, false
-   2,P. fontis, P. fontis M022, true
-   3,P. polaris,P. polaris S4.16.03.2B, false
-   4,P. brasiliense, P. brasiliense S2, true
-   5,P. brasiliense, P. brasiliense Y49, false
-   6,D. dadantii, D. dadantii 3937,?
+ #dbms.connectors.default_listen_address=0.0.0.0 + #dbms.connector.bolt.listen_address=:7687 + #dbms.connector.http.listen_address=:7474 + #dbms.connector.https.listen_address=:7473 + +Lets start up the Neo4j server! + +.. code:: bash + + $ neo4j start + +Start Firefox (or a web browser of your own preference) and let it run +on the background. .. code:: bash - $ pantools add_phenotypes pecto_dickeya_DB pecto_dickeya_DB/phenotypes.txt + $ firefox & + +In case you did not change the config to allow non-local connections, +browse to *http://localhost:7474*. Whenever you did change the config +file, go to *server_address:7474*, where 7474 should be replaced with the +number you chose earlier, and *server_address* is the address set in the config. + +If the database startup was successful, a login terminal will appear in +the webpage. Use '*neo4j*' both as username and password. After logging +in, you are requested to set a new password. -------------- -Constructing a phylogeny ------------------------- +Exploring nodes and edges in Neo4j +---------------------------------- + +Go through the following steps to become proficient in using the Neo4j +browser and the underlying PanTools data structure. If you have any +difficulty trouble finding a node, relationship or any type of +information, download and use `this visual guide +<http://www.bioinformatics.nl/pangenomics/tutorial/neo4j_browser.tar.gz>`_. + +1. Click on the database icon on the left. A menu with all node types + and relationship types will appear. +2. Click on the '*gene*' button in the node label section. This + automatically generated a query. Execute the query. +3. The **LIMIT** clause prevents large numbers of nodes popping up to + avoid your web browser from crashing. Set LIMIT to 10 and execute the + query. +4. Hover over the nodes, click on them and take a look at the values + stored in the nodes. All these features (except ID) were extracted + from the GFF annotation files. ID is an unique number automatically + assigned to nodes and relationships by Neo4j. +5. Double-click on the **matK** gene node, all nodes with a connection + to this gene node will appear. The nodes have distinct colors as + these are different node types, such as **mRNA**, **CDS**, + **nucleotide**. Take a look at the node properties to observe that + most values and information is specific to a certain node type. +6. Double-click on the *matK* mRNA node, a **homology_group** node + should appear. These type of nodes connect homologous genes in the + graph. However, you can see this gene did not cluster with any other + gene. +7. Hover over the **start** relation of the *matK* gene node. As you + can see information is not only stored in nodes, but also in + relationships! A relationship always has a certain direction, in + this case the relation starts at the gene node and points to a + nucleotide node. Offset marks the location within the node. +8. Double-click on the **nucleotide** node at the end of the 'start' + relationship. An in- and outgoing relation appear that connect to + other nucleotide nodes. Hover over both the relations and compare + them. The relations holds the genomic coordinates and shows this + path only occurs in contig/sequence 1 of genome 1. +9. Follow the outgoing **FF**-relationship to the next nucleotide node + and expand this node by double-clicking. Three nodes will pop up + this time. If you hover over the relations you see the coordinates + belong to other genomes as well. 
You may also notice the + relationships between nucleotide nodes is always a two letter + combination of F (forward) and R (reverse) which state if a sequence + is reverse complemented or not. The first letter corresponds to the + sequence of the node at the start of the relation where the second + letters refers to the sequence of the end node. +10. Finally, execute the following query to call the database scheme to + see how all node types are connected to each other: *CALL + db.schema()*. The schema will be useful when designing your own + queries! -In this tutorial we will construct three phylogenies, each based on a -different type of variation: SNPs, genes and k-mers. Take a look at the -phylogeny manuals to get an understanding how the three methods work and -how they differ from each other. +-------------- -1. :ref:`analysis/phylogeny:core phylogeny` -2. :ref:`analysis/phylogeny:gene distance tree` -3. :ref:`analysis/phylogeny:k-mer distance tree` +Query the pangenome database using CYPHER +----------------------------------------- -Core SNP phylogeny ------------------- +Cypher is a declarative, SQL-inspired language and uses ASCII-Art to +represent patterns. Nodes are represented by circles and relationships +by arrows. -The core SNP phylogeny will run various Maximum Likelihood models on -parsimony informative sites of single-copy orthologous sequences. A site -is parsimony-informative when there are at least two types of -nucleotides that occur with a minimum frequency of two. The informative -sites are automatically identified by aligning the sequences; however, -it does not know which sequences are single-copy orthologous. You can -identify these conserved sequences by running -:ref:`gene_classification <analysis/classification:gene classification>`. +- The **MATCH** clause allows you to specify the patterns Neo4j will + search for in the database. +- With **WHERE** you can add constraints to the patterns described. +- In the **RETURN** clause you define which parts of the pattern to + display. -.. code:: bash +Cypher queries +~~~~~~~~~~~~~~ - $ pantools gene_classification --phenotype=species pecto_dickeya_DB +**Match and return 100 nucleotide nodes** -Open **gene_classification_overview.txt** and take a look at statistics. -As you can see there are 2134 single-copy ortholog groups. Normally, all -of these groups are aligned to identify SNPs but for this tutorial we’ll -make a selection of only a few groups to accelerate the steps. You can -do this in two different ways: +.. code:: text -Option 1: Open **single_copy_orthologs.csv** and remove all node -identifiers after the first 20 homology groups and save the file. + MATCH (n:nucleotide) RETURN n LIMIT 100 -.. code:: bash +**Find all the genome nodes** - $ pantools core_phylogeny --mode=ML pecto_dickeya_DB +.. code:: text -Option 2: Open **single_copy_orthologs.csv** and select the first 20 -homology_group node identifiers. Place them in a new file sco_groups.txt -and include this file to the function. + MATCH (n:genome) RETURN n -.. code:: bash +**Find the pangenome node** - $ pantools core_phylogeny --mode=ML -H=sco_groups.txt pecto_dickeya_DB +.. code:: text -The sequences of the homology groups are being aligned two consecutive -times. After the initial alignment, input sequences are trimmed based on -the longest start and end gap of the alignment. The parsimony -informative positions are taken from the second alignment and -concatenated into a sequence. 
When opening **informative.fasta** you can -find 6 sequences, the length of the sequences being the number of -parsimony-informative sites. + MATCH (n:pangenome) RETURN n -.. code:: bash +**Match and return 100 genes** - $ iqtree -nt 4 -s pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta -redo -bb 1000 +.. code:: text -IQ-tree generates several files, the tree that we later on in the -tutorial will continue with is called **informative.fasta.treefile**. -When examining the **informative.fasta.iqtree** file you can find the -best fit model of the data. This file also shows the number of sites -that were used, as sites with gaps (which IQ-tree does not allow) were -changed into singleton or constant sites. + MATCH (g:gene) RETURN g LIMIT 100 -Gene distance tree -~~~~~~~~~~~~~~~~~~ +**Match and return 100 genes and order them by length** -To create a phylogeny based on gene distances (absence/presence), we can -simply execute the Rscript that was created by -:ref:`gene_classification <analysis/classification:gene classification>`. +.. code:: text -.. code:: bash + MATCH (g:gene) RETURN g ORDER BY g.length DESC LIMIT 100 - $ Rscript pecto_dickeya_DB/gene_classification/gene_distance_tree.R +**The same query as before but results are now returned in a table** -The resulting tree is called **gene_distance.tree**. +.. code:: text -K-mer distance tree -~~~~~~~~~~~~~~~~~~~ + MATCH (g:gene) RETURN g.name, g.address, g.length ORDER BY g.length DESC LIMIT 100 -To obtain a k-mer distance phylogeny, the k-mers must first be counted -with the :ref:`kmer_classification <analysis/classification:k-mer -classification>` -function. Afterwards, the tree can be constructed by executing the -Rscript. +**Return genes which are longer as 100 but shorter than 250 bp** (this +can also be applied to other features such as exons introns or CDS) -.. code:: bash +.. code:: text - $ pantools kmer_classification pecto_dickeya_DB - $ Rscript pecto_dickeya_DB/kmer_classification/genome_kmer_distance_tree.R + MATCH (g:gene) where g.length > 100 AND g.length < 250 RETURN * LIMIT 100 -The resulting tree is written to **genome_kmer_distance.tree**. +**Find genes located on first genome** + +.. code:: text + + MATCH (g:gene) WHERE g.address[0] = 1 RETURN * LIMIT 100 + +**Find genes located on first genome and first sequence** + +.. code:: text + + MATCH (g:gene) WHERE g.address[0] = 1 AND g.address[1] = 1 RETURN * LIMIT 100 -------------- -Renaming tree nodes -------------------- +Homology group queries +~~~~~~~~~~~~~~~~~~~~~~ + +**Return 100 homology groups** -So far, we used three different types of distances (SNPs, genes, -k-mers), and two different methods (ML, NJ) to create three phylogenetic -trees. First, lets take a look at the text files. The -**informative.fasta.treefile** only contain genome numbers, bootstrap -values and branch lengths but is lacking the metadata. Examining -**gene_distance.tree** file also shows this information but the species -names as well, because we included this as a phenotype during -:ref:`gene_classification <analysis/classification:gene classification>`. +.. code:: text -Let’s include the strain identifiers to the core snp tree to make the -final figure more informative. Use the -:ref:`rename_phylogeny <analysis/phylogeny:rename phylogeny>` function to -rename the tree nodes. + MATCH (h:homology_group) RETURN h LIMIT 100 -.. code:: bash +**Match homology groups which contain two members** + +.. 
code:: text + + MATCH (h:homology_group) WHERE h.num_members = 2 RETURN h + +**Match homology groups and 'walk' to the genes and corresponding start +and end node** + +.. code:: text + + MATCH (h:homology_group)-->(f:feature)<--(g:gene)-->(n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25 + +Turn off autocomplete by clicking on the button on the bottom right. The +graph falls apart because relations were not assigned to variables. + +**The same query as before but now the relations do have variables** + +.. code:: text + + MATCH (h:homology_group)-[r1]-> (f:feature) <-[r2]-(g:gene)-[r3]-> (n:nucleotide) WHERE h.num_members = 2 RETURN * LIMIT 25 + +When you turn off autocomplete again only the '*is_similar_to*' relation +disappears since we did not call it + +**Find homology group that belong to the rpoC1 gene** + +.. code:: text - $ pantools rename_phylogeny --phenotype=strain_name pecto_dickeya_DB pecto_dickeya_DB/alignments/grouping_v1/core_phylogeny/informative.fasta.treefile + MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE g.name = 'rpoC1' RETURN * -Take a look at **informative.fasta_RENAMED.treefile**, strain -identifiers have been added to the tree. +**Find genes on genome 1 which don't show homology** + +.. code:: text + + MATCH (n:homology_group)--(m:mRNA)--(g:gene) WHERE n.num_members = 1 and g.genome = 1 RETURN * -------------- -Visualizing the tree in iTOL ----------------------------- - -Go to https://itol.embl.de and click on “Upload a tree” under the -**ANNOTATE** box. On this page you can paste the tree directly into the -**tree text:** textbox or can click the button to upload the .newick -file. - -.. figure:: /figures/tutorial_phylogeny1.png - :width: 600 - :align: center - -Basic controls ITOL -------------------- - -- The default way of visualizing a tree is the rectangular view. - Depending on the number of genomes, the circular view can be easier - to interpret. You can the view by clicking on the “Display Mode” - buttons. -- Increase the font size and branch width to improve readability -- When visualizing a Maximum likelihood (ML) tree, bootstrap values can - be displayed by clicking the “Display” button next to - **Bootstrap/metadata** in the Advanced tab of the Control window. - This enables you to visualize the values as text or symbol on the - branch. or by coloring the branch or adjusting the width. - -.. figure:: /figures/tutorial_phylogeny2.png - :width: 600 - :align: center - -- When you have a known out group or one of the genomes is a clear - outlier in the tree, you should reroot the tree. Hover over the name, - click it so a pop-up menu appears. Click “tree structure” followed by - “Reroot the tree here”. - -.. figure:: /figures/tutorial_phylogeny3.png - :width: 600 - :align: center - -- Clicking on the name of a node in the tree allows you to color the - name, branch, or background of that specific node. -- When you’re happy the way your tree looks, go to the Export tab of - the Control window. Select the desired output format, click on the - “Full image” button and export the file to a figure. -- Refresh the webpage to go back to the default view of your tree. - -Create iTOL templates ---------------------- - -In iTOL it is possible to add colors to the tree by coloring the -terminal nodes or adding an outer ring. The PanTools function -:ref:`create_tree_template <analysis/phylogeny:create tree template>` is able -to create templates that allows for easy coloring (with maximum of 20 -possible colors). 
If the function is run without any additional -argument, templates are created for trees that only contain genome -numbers (e.g. k-mer distance tree). Here we want to color the (renamed) -core SNP tree with the ‘low_temperature’ phenotype. Therefore, the -``--phenotype`` strain_name must be included to the function. +Structural variant detection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. code:: bash +**Find SNP bubbles (for simplification we only use the FF relation)** - $ pantools create_tree_template pecto_dickeya_DB # Run this command when the tree contains genome numbers only - $ pantools create_tree_template --phenotype=strain_name pecto_dickeya_DB +.. code:: text + + MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return * limit 50 + +**The same query but returning the results in a table** + +.. code:: text -Copy the two low_temperature.txt files from the label/strain_name/ and -ring/strain_name/ directories to your personal computer. Click and move -the ring template file into the tree visualization webpage. + MATCH (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return a1.length,b1.length, a1.sequence, b1.sequence limit 50 -.. figure:: /figures/tutorial_phylogeny4.png - :width: 600 - :align: center +Functions such as **count()**, **sum()** and **stDev()** can be used in +a query. -The resulting tree should look this when: the tree is rooted with the -*Dickeya* genome, bootstrap values are displayed as text and the ring -color template was included. +**The same SNP query but count the hits instead of displaying them** -.. figure:: /figures/tutorial_phylogeny_color.png - :width: 600 - :align: center +.. code:: text + + MATCH p= (n:nucleotide) -[:FF]-> (a1)-[:FF]->(m:nucleotide) <-[:FF]-(b1) <-[:FF]- (n) return count(p) + +-------------- + +Hopefully you know have some feeling with the Neo4j browser and cypher +and you're inspired to create your own queries! + +When you're done working in the browser, close the database (by using +the command line again). + +.. code:: text -Tree coloring is especially useful for large datasets. An example is -shown in the figure below, where members of the same species share a -color. + $ neo4j stop -.. figure:: /figures/tutorial_phylogeny_tree.png - :width: 600 - :align: center +| More information on Neo4j and the cypher language: +| `Neo4j Cypher Manual v3.5 <https://neo4j.com/docs/developer-manual/3.5/cypher/>`_ +| `Neo4j Cypher Refcard <http://neo4j.com/docs/cypher-refcard/3.5/>`_ +| `Neo4j API <https://neo4j.com/developer/>`_ -------------- -.. - In :doc:`part 6 <tutorial_part6>` of the tutorial we go through a couple of use cases with the pangenomic read mapping functionality implemented in PanTools. +In :doc:`part 3 <tutorial_part3>` of the tutorial we explore some of the +functionalities to analyze the pangenome.