Update README.md

ccf6591b · Tracanna, Vittorio · e0a7652c · ccf6591b
Commit ccf6591b authored 4 years ago by Tracanna, Vittorio
--- a/README.md
+++ b/README.md
 # dom2BGC

-Pipeline for annotation of functional amplicons targeting BGC domains. The tool is designed to transfer annotation of amplicons based on their similarity to in silico amplicons from natural product databases.
-Pre-parsed static version of the databases are provided. Beware: if you want to update the databases to a specific version computational time can be quite high.
+Dom2BGC is a pipeline for annotation of functional amplicons targeting BGC domains. The tool is designed to transfer annotations of amplicons based on their similarity to in silico amplicons from natural product databases.
+Pre-parsed static versions of the databases are provided. Beware: if you want to update the databases to a specific version, computational time can be quite high.

-An example of the command needed to run the pipeline is found in CMD_example.
+An example of the command needed to run the pipeline is found in [CMD_example](https://git.wageningenur.nl/traca001/dom2bgc/-/blob/master/CMD_example.sh).

-In order to create the feature table, I recommend following one of the many tutorial available in the qiime2 tutorial pages for creating [feature tables] [https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#obtaining-and-importing-data ] from raw amplicon reads. We suggest to use only the forward reads as in our experience they contain the majority of the information and adding the reverse non-overlapping reads mostly introduces additional issues such as problems when merging the forward and reverse with an N in between which is not supported by many tools.
-Once the reads are denoised [with DADA2] [example: https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#option-1-dada2 ]. After this step you should be able to export both the feature-table and the feature-data to non-qiime formats. The feature-data, which contains the denoised nucleotide amplicon sequences can be translated to protein sequence with any tool capable of it, here we used transeq from the EMBOSS suite [ftp://emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0.tar.gz].
+To create the feature table, it is recommended to follow one of the many tutorials available in the qiime2 tutorial pages for creating [feature tables](https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#obtaining-and-importing-data) from raw amplicon reads. We suggest using only the forward reads: in our experience they contain the majority of the information, and adding the non-overlapping reverse reads mostly introduces additional issues, such as merged forward and reverse reads joined by an N, which many tools do not support.
+Once the reads are denoised, e.g. [with DADA2](https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#option-1-dada2), you should be able to export both the feature-table and the feature-data to non-qiime formats. The feature-data, which contains the denoised nucleotide amplicon sequences, can be translated to protein sequences using any tool capable of it; here we used transeq from the EMBOSS suite [ftp://emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0.tar.gz].
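
For readers who want to see what the translation step amounts to, the following is a minimal Python sketch of translating a denoised nucleotide amplicon in a single forward reading frame using the standard genetic code. It is not the transeq implementation and not part of dom2BGC; the function name `translate` and the default frame are illustrative assumptions, and the correct frame depends on the primers used.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=0):
    """Translate a DNA sequence in one forward frame (0, 1 or 2).

    Hypothetical helper: codons containing ambiguous bases (e.g. N)
    become 'X', stop codons become '*', and a trailing partial codon
    is dropped.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    # Toy amplicon, not real data.
    print(translate("ATGGCCTAA"))
```

In practice a dedicated tool such as transeq remains the safer choice, since it handles FASTA input and multiple frames directly.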

-If you want to generate in silico amplicons from any paired database or metagenome you may have paired with this data; run antismash to predict the BGCs and extract the protein sequences from the genbank files that contain the domain of interest.
-To generate the amplicons starting from the amplicons protein sequence [either from the natural source or the in silico amplicons], use hmmsearch tool with the HMM profile provided in this repo:
+In order to generate in silico amplicons from any database or metagenome you may have paired with this data, run antiSMASH to predict the BGCs and extract, from the GenBank files, the protein sequences that contain the domain of interest.
+To generate the amplicons starting from the amplicon protein sequence [either from the natural source or the in silico amplicons], use the hmmsearch tool with the HMM profile provided in this repo as follows:

 `hmmsearch -o /path/to/hmmsearch/output/and/filename /path/to/hmm_profile.hmm /path/to/protein/sequences.faa`


-Then run the parse_hmm.py script with the hmmsearch ouput file.
+Then run the parse_hmm.py script with the hmmsearch output file.

 `python hmm_profiles/parse_hmm.py /path/to/hmmsearch/output/and/filename /path/to/parsed/output/and/filename.faa`

-To generate the phylogeny tree you can use any tool capable of creating a newick file output from a MSA. [I used fasttree but you are welcome to use any other tool of your choice http://www.microbesonline.org/fasttree/]
+To generate the phylogenetic tree, you can use any tool capable of creating a Newick file from an MSA. [I used [fasttree](http://www.microbesonline.org/fasttree/), but you are welcome to use any other tool of your choice.] E.g.:

 `fasttree /path/to/parsed/output/and/filename.faa > /path/to/parsed/output/and/filename.tree`

-dom2BGC can also attempt regenerate the physical clustering of domains that is lost during the amplicon creation process using co-occurrence across different samples. 
-This putative clusters should be considered predictions that need further validation with dedicated experiments but can provide additional insight into biological mechanisms associated with their natural products.
-Obviously, multiple samples and biological/technical replicates are needed in order to enable cooccurrence-based putative cluster reconstruction. If active, spearman cooccurrence patterns above a user-set threshold are used to generate a network. 
-Clustering of the network results in "highly cooccurring amplicon hubs". Amplicons within the same hub mapping on multiple domains of the same cluster [from antismash-database] result in a predicted cluster.
-The network can be visualized in cytoscape where you can load annotation and clustering results. Predicted clusters are found in separate cluster files.
+Dom2BGC can also use co-occurrence across different samples to attempt to reconstruct the physical clustering of domains within the same BGCs, which is lost during the amplicon creation process.
+These putative 'clusters' (sets of domain amplicons putatively originating from the same BGC) should be considered predictions that need further validation with dedicated experiments but can provide additional insight into biological mechanisms associated with their natural products.
+Multiple samples and biological/technical replicates are needed to enable cooccurrence-based putative cluster reconstruction. If this option is active, Spearman cooccurrence patterns above a user-set threshold are used to generate a network.
+Clustering of the network results in "highly cooccurring amplicon hubs". Amplicons within the same hub mapping to multiple domains of the same gene cluster [from antismash-DB] result in a predicted cluster.
+The network can be visualized in Cytoscape where you can load annotations and clustering results. Predicted clusters are found in separate cluster files.
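
The thresholded Spearman network described above can be sketched as follows. This is only an illustration under stated assumptions, not dom2BGC's actual code: the function names, the input layout (a dictionary mapping amplicon IDs to per-sample abundances), and the use of connected components as a stand-in for the network clustering step are all hypothetical.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treated as none
    return cov / (var_x * var_y) ** 0.5


def cooccurrence_hubs(counts, threshold):
    """counts: {amplicon_id: [abundance per sample]}. Returns (edges, hubs)."""
    edges = [
        (a, b)
        for a, b in combinations(sorted(counts), 2)
        if spearman(counts[a], counts[b]) >= threshold
    ]
    # Assumption: connected components (union-find) stand in for clustering.
    parent = {a: a for a in counts}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for a in counts:
        groups.setdefault(find(a), set()).add(a)
    hubs = [g for g in groups.values() if len(g) > 1]
    return edges, hubs
```

With only a handful of samples, rank correlations are very coarse, which is one reason the replicates mentioned above matter.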

-The tool generates multiple plots to visualize your data characteristics. You can use the amplicon_counts_swarmplot.pdf [found in the output folder] to look the the diversity of your sample.
+The tool generates multiple plots to visualize your data characteristics. You can use the _amplicon_counts_swarmplot.pdf_ [found in the output folder] to assess the diversity of your sample.

 <a href="https://ibb.co/Th5TKq1"><img src="https://i.ibb.co/88VBz4b/amplicon-counts-swarmplot.jpg" alt="amplicon-counts-swarmplot" border="0"></a>

-In addition, you can see how well your sample replicates and treatment group based on their community characteristics with the beta_diversity_mds.pdf which is based on UniFrac metric if you provided a rooted phylogeny tree for the amplicons in your sample.
+In addition, you can see how well your sample replicates and treatments group together based on their community characteristics with the _beta_diversity_mds.pdf_ file that is generated, which is based on the UniFrac metric if you provided a rooted phylogenetic tree for the amplicons in your sample.

 <a href="https://ibb.co/MghcBfq"><img src="https://i.ibb.co/x2s1FYV/beta-diversity-mds.jpg" alt="beta-diversity-mds" border="0"></a>

-Alternatively to MDS, dom2bgc also generates 3d pcoa plots showing the community characteristics relationships between sample replicates:
+As an alternative to MDS, dom2bgc also generates 3D PCoA plots showing the relationships in community characteristics between sample replicates:

 <a href="https://ibb.co/54ySfgK"><img src="https://i.ibb.co/DWyvSHr/beta-diversity-pcoa.jpg" alt="beta-diversity-pcoa" border="0"></a>

-dom2bgc also provides annotation for the amplicons in a csv format in the "amplicon_annotation.csv" which can be inspected in any text editor or excel. In the toy example we used amplicons from Burkholderia, Pseudomonas, Collimonas and unassigned. Note that in our experience assigned/unassigned rates can vary significantly depending on the natural source.
+Dom2BGC also provides annotations for the amplicons in CSV format in the "_amplicon_annotation.csv_" file, which can be inspected in any text editor or in, e.g., MS Excel. In the toy example we used amplicons from _Burkholderia_, _Pseudomonas_, _Collimonas_ and _unassigned taxa_. Note that in our experience assigned/unassigned rates can vary significantly depending on the natural source of the samples.

 <a href="https://ibb.co/mC8Dm2Z"><img src="https://i.ibb.co/7J4YHZM/Screenshot-2020-06-06-at-13-13-39.png" alt="Screenshot-2020-06-06-at-13-13-39" border="0"></a> 

-When multiple samples/replicates are specified when running the dom2bgc pipeline, it generates also a cooccurrence network [in the form of the {}_corr_network_annot_table.csv] that can be inspected in cytoscape [https://cytoscape.org/]. Clustering of the network can be also added for the visualization by importing the network_clustering.csv file.
+When multiple samples/replicates are specified for a run of the dom2bgc pipeline, it will also generate a cooccurrence network [in the form of the {}_corr_network_annot_table.csv file] that can be inspected in Cytoscape [https://cytoscape.org/]. Clustering of the network can also be added to the visualization by importing the network_clustering.csv file.

 <a href="https://imgbb.com/"><img src="https://i.ibb.co/0FSn7bk/Screenshot-2020-06-06-at-13-38-12.png" alt="Screenshot-2020-06-06-at-13-38-12" border="0"></a>

-Where we can observe amplicons that cooccurr with eachother which may indicate their genetic clustering. Cooccurring amplicons mapping to the same gene cluster [from the antismash-database] are considered putative clusters. For each hub in the network [red, green and blue in the image above], dom2bgc generates 2 separate txt files describing these putative clusters subnetwork_{}_putative_clusters.csv and subnetwork_{}_multiple_domains_putative_clusters.csv.
-The first contains the annotation for all the hits in the database for the amplicons in the network with the number of matched domains, the number of amplicons in the subnetwork matching to it and the taxonomy of the entry. Details of which amplicons match to which entry are also found at the end of the txt file. subnetwork_{}_multiple_domains_putative_clusters.csv consist of only the matches for the database with multiple domains matching as shown in the following figure.
+Here we can observe amplicons that cooccur with each other, which may indicate their genetic clustering. Cooccurring amplicons mapping to the same gene cluster [from the antiSMASH-database] are considered putative clusters. For each hub in the network [red, green and blue in the image above], dom2bgc generates two separate text files describing these putative cluster subnetworks: _subnetwork_{}_putative_clusters.csv_ and _subnetwork_{}_multiple_domains_putative_clusters.csv_.
+ 
+The first contains the annotation for all the database hits for the amplicons in the network, with the number of matched domains, the number of amplicons in the subnetwork matching the entry, and the taxonomy of the entry. Details of which amplicons match which entry are also found at the end of the file. _subnetwork_{}_multiple_domains_putative_clusters.csv_ consists only of the database matches with multiple matching domains, as shown in the following figure.

 <a href="https://ibb.co/QH3zRS1"><img src="https://i.ibb.co/jwPB201/Screenshot-2020-06-06-at-13-52-49.png" alt="Screenshot-2020-06-06-at-13-52-49" border="0"></a>
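
The "multiple domains" filter described above can be illustrated with a short Python sketch. It assumes a hypothetical input layout (a list of amplicon IDs for one hub and a dictionary of database hits per amplicon) and hypothetical identifiers; it does not reproduce dom2BGC's internal data structures or its output file format.

```python
from collections import defaultdict


def putative_clusters(hub_amplicons, hits, min_domains=2):
    """Find database BGCs supported by several domains within one hub.

    hub_amplicons: iterable of amplicon IDs belonging to one hub.
    hits: {amplicon_id: [(bgc_id, domain_id), ...]} database matches
          (hypothetical layout, for illustration only).
    Returns {bgc_id: {domain_id: sorted amplicon IDs}} for BGCs where
    at least `min_domains` distinct domains are matched.
    """
    per_bgc = defaultdict(lambda: defaultdict(set))
    for amplicon in hub_amplicons:
        for bgc_id, domain_id in hits.get(amplicon, []):
            per_bgc[bgc_id][domain_id].add(amplicon)
    return {
        bgc: {dom: sorted(amps) for dom, amps in domains.items()}
        for bgc, domains in per_bgc.items()
        if len(domains) >= min_domains
    }
```

A BGC matched by two amplicons that both hit the same domain is not reported here, because only distinct domains count as independent support.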

-Note that some amplicons may match on multiple BGCs from different entries in antismash-database as some gene clusters are widespread in multiple organisms. 
+Note that some amplicons may match to multiple BGCs from different entries in the antiSMASH-database, as some gene clusters are found across multiple organisms.

 Database availability:
-Antismashdb gbk database is not provided
-Mibig gbk database can be found at https://dl.secondarymetabolites.org/mibig/mibig_gbk_2.0.tar.gz
+The antismashdb GBK database can be found at [LINK](https://dl.secondarymetabolites.org/database/)
+The MIBiG GBK database can be found at [LINK](https://dl.secondarymetabolites.org/mibig/mibig_gbk_2.0.tar.gz)