@@ -5,8 +5,8 @@ This part guides you through some example cases, from raw data to PanVA
instance.
Data sets
---------
Data Packages
-------------
Multiple data packages are available, which contain all the information for
creating a pangenome using PanTools and creating a PanVA instance from it,
...
...
@@ -44,37 +44,44 @@ Steps to generate PanVA input
These example cases run through the following steps:
* Downloading publicly available data
* Genome and structural annotation data
* Accession data for arabidopsis and tomato
* Preprocessing the data for Pantools
* Filtering the minimum sequence size of the fasta file
1: Downloading publicly available data
* Acquire genome and structural annotation data
* Accession data for arabidopsis, pectobacterium, and yeast
2: Preprocessing the data for PanTools
* Filtering the minimum sequence size of genomes in the FASTA file
* Filtering the minimum ORF size of CDS features in the annotation
* Create functional annotations for extracted protein sequences
* Constructing and annotating a pangenome using Pantools
* Extract protein sequences by matching CDS features to genomic sequences
* Create functional annotations for extracted protein sequences (Optional)
* Generate statistics for raw and filtered data
3: Constructing and annotating a pangenome using PanTools
* Build the pangenome
* Add structural annotations, functional annotations and phenotypes
* Add vcf information or phasing information if available
* Create homology groups
* Running the necessary analysis steps in order to create a PanVA instance
4: Running the necessary analysis steps in order to create a PanVA instance
* Gene classification
* K-mer classification
* Multiple sequence alignment
* Group info
* Create a PanVA instance
5: Create a PanVA instance
* Preprocessing data for PanVA
* Set up the PanVA instance
Instructions to create a pangenome and PanVA instance for *Arabidopsis*.
The other packages follow the same logic. Every package contains a README
with all the exact commands, so make sure to check those.
This tutorial contains instructions to create a pangenome and PanVA instance for different species. Every package contains a README
with all the exact commands, so make sure to check those if you're stuck.
Both Snakemake pipelines used in this workflow create conda environments
in the data package directory. If you want to re-use these pipelines for
different data, use the ``--conda-prefix`` Snakemake option to set a directory
where the conda environments will be stored.
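
As a minimal sketch of how the option above might be passed: the directory
path and core count below are placeholders chosen for illustration, and the
exact invocation for each pipeline is the one given in the package README.

.. code:: bash

   # Assumption: /path/to/shared/conda-envs is a placeholder for a directory
   # of your choice; --cores 4 is an arbitrary example value.
   snakemake --use-conda --conda-prefix /path/to/shared/conda-envs --cores 4

Pointing several runs at the same prefix lets Snakemake reuse environments
that were already created instead of rebuilding them for every data package.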
Download the data package
~~~~~~~~~~~~~~~~~~~~~~~~~
1: Downloading publicly available data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Goal:
* Acquire genome and structural annotation data
* Accession data for arabidopsis, pectobacterium, and yeast
1.1: Download the data package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
...
...
@@ -85,18 +92,29 @@ All commands should be run from the root directory of the package.
Run the whole package from RAM-disk or SSD, or set the path of the results to
RAM-disk/SSD in the configs.
Download the raw data
~~~~~~~~~~~~~~~~~~~~~
1.2: Download the raw data
~~~~~~~~~~~~~~~~~~~~~~~~~~
To download corresponding raw-data for each of the above-linked packages, follow the first steps of the instructions laid out in the respective README file.
For all packages, those can be found at the root of the decompressed TAR-files.
To download corresponding raw-data for each of the above-linked packages, follow the instructions laid out in the respective README file.
For all packages, those can be found at the root of the decompressed tar-files.
Clone the PanUtils data filtering pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2: Preprocessing the data for PanTools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Goal:
* Filtering the minimum sequence size of genomes in the FASTA file
* Filtering the minimum ORF size of CDS features in the annotation
* Extract protein sequences by matching CDS features to genomic sequences
* Create functional annotations for extracted protein sequences (Optional)
* Generate statistics for raw and filtered data
2.1: Clone the PanUtils data filtering pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The
:ref:`data filtering pipeline <getting_started/diy_pangenomics:Quality control pipeline>`
filters out small sequences, matches fasta with gff contents and removes CDS
filters out small sequences, matches FASTA with GFF contents, and removes CDS
features with an ORF below the cutoff value. It also extracts protein
sequences and creates functional annotations from them.
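
To make the idea of a minimum sequence size filter concrete, here is a small
stand-alone sketch in ``awk``. It is not the pipeline's own implementation,
and the file names and the cutoff of 10 bases are hypothetical values chosen
for illustration only.

.. code:: bash

   # Hypothetical working directory, input file and cutoff (not from the pipeline).
   workdir=$(mktemp -d)
   min_len=10
   printf '>seq1\nACGTACGTACGT\n>seq2\nACG\n>seq3\nACGTA\nCGTACG\n' \
       > "$workdir/example.fasta"

   awk -v min_len="$min_len" '
     # Print the buffered record only if its sequence is long enough.
     function flush_record() {
       if (header != "" && length(seq) >= min_len) {
         print header
         print seq
       }
     }
     # A new header closes the previous record and starts a new one.
     /^>/ { flush_record(); header = $0; seq = ""; next }
     # Sequence lines are joined, so wrapped FASTA records are measured whole.
     { seq = seq $0 }
     END { flush_record() }
   ' "$workdir/example.fasta" > "$workdir/example.filtered.fasta"

With this toy input, ``seq1`` (12 bases) and ``seq3`` (11 bases across two
lines) are kept, while ``seq2`` (3 bases) is dropped. Kept sequences are
written on a single line.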
...
...
@@ -104,21 +122,58 @@ creates functional annotations from them.
This command ensures that packages are downloaded for an Intel-based architecture. Afterwards, restart your shell with the "Open using Rosetta" setting enabled.
To do this via the GUI, go to "Applications/Utilities/Terminal" and click on "Get Info". Then select the option to open the terminal using Rosetta.
2.3: Filter the raw data and create functional annotations for extracted protein-sequences