
Introduction

From a bioinformatics perspective, mining (meta)genomes for Biosynthetic Gene Clusters (BGCs) that encode specialized metabolites entails identifying and annotating the BGCs in each genome, then defining a distance between BGCs so that their diversity can be mapped in similarity networks. These networks summarize BGC diversity graphically and carry annotations that help identify novel compounds, draw ecological correlations, and so on.

Defining a distance

BGCs are essentially collections of genes whose protein products work together to produce a compound. These proteins are likely the main determinant of the compound's final structure, so a good distance metric should use information on how similar the proteins of two BGCs are.

In this project, three indices are combined to define a final distance metric between any given pair of BGCs:

  • The Jaccard index (J): The number of distinct domain types shared by two BGCs, divided by the total number of distinct domain types found in either.
  • The DSS (Domain Sequence Similarity score): Measures the sequence similarity between the domains of both BGCs, for each type of domain. When each BGC contains only one copy of a given domain, the sequence similarity is obtained directly; otherwise, the Hungarian algorithm (munkres.py) is used to pair up the most similar Pfam domain sequences. A special weight can also be given to domains marked as "anchor domains" (anchor_domains.txt).
  • The Adjacency Index (AI): Estimates the similarity in proximal domain content as the ratio between the distinct shared and distinct unshared pairs of adjacent domains (without taking order into account).
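
As a rough illustration of the three indices, the sketch below (Python 3) represents each BGC as an ordered list of Pfam accessions. The function names and toy inputs are assumptions made for this example. Both J and AI are assumed to be computed as shared items over all distinct items, and the matching step uses exhaustive search as a simple stand-in for the Hungarian algorithm that munkres.py provides.

```python
from itertools import permutations


def jaccard_index(domains_a, domains_b):
    """Shared distinct domain types divided by all distinct domain types."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_pairs(domains):
    """Unordered pairs of neighbouring domains (order within a pair ignored)."""
    return {frozenset(pair) for pair in zip(domains, domains[1:])}


def adjacency_index(domains_a, domains_b):
    """Shared adjacent pairs divided by all distinct adjacent pairs (assumed form)."""
    pairs_a, pairs_b = adjacent_pairs(domains_a), adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def best_matching(similarity):
    """Best one-to-one pairing of domain copies, found by exhaustive search.

    similarity[i][j] is the similarity between copy i in one BGC and copy j
    in the other, with len(similarity) <= len(similarity[0]). The Hungarian
    algorithm reaches the same optimum in polynomial time; brute force is
    only practical for a handful of copies.
    """
    n_cols = len(similarity[0])
    best_total, best_cols = -1.0, None
    for cols in permutations(range(n_cols), len(similarity)):
        total = sum(similarity[row][col] for row, col in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total


# Toy BGCs (hypothetical domain content)
bgc_a = ["PF00109", "PF02801", "PF00698", "PF00550"]
bgc_b = ["PF00109", "PF02801", "PF00550", "PF00975"]
print(jaccard_index(bgc_a, bgc_b))    # 0.6 (3 shared types out of 5)
print(adjacency_index(bgc_a, bgc_b))  # 0.2 (1 shared pair out of 5)
```

For a domain present twice in each BGC, best_matching([[0.9, 0.2], [0.8, 0.7]]) pairs copy 0 with copy 0 and copy 1 with copy 1, since that pairing gives the highest total similarity.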

How does it work

BiG-SCAPE tries to (recursively) read all the GenBank files in the input folder; preferably, each file corresponds to a gene cluster identified with a tool like antiSMASH. If the main input directory contains subfolders, these can be treated as separate samples, and BiG-SCAPE can generate sample-specific network files (activate with --samples).
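
The recursive scan and the per-sample grouping could look roughly like the sketch below. The function name, the .gbk extension filter, and the rule that a file's sample is its first-level subfolder are assumptions for illustration, not BiG-SCAPE's actual logic.

```python
import os
from collections import defaultdict


def find_genbank_files(input_dir, extensions=(".gbk",)):
    """Walk input_dir recursively and group GenBank files by sample.

    A file's sample is the first-level subfolder it lives under; files
    placed directly in input_dir are grouped under that folder's own name.
    """
    input_dir = os.path.abspath(input_dir)
    samples = defaultdict(list)
    for root, _dirs, files in os.walk(input_dir):
        rel = os.path.relpath(root, input_dir)
        if rel == os.curdir:
            sample = os.path.basename(input_dir)
        else:
            sample = rel.split(os.sep)[0]
        for name in sorted(files):
            if name.lower().endswith(extensions):
                samples[sample].append(os.path.join(root, name))
    return dict(samples)
```

Calling find_genbank_files("input") on a folder with subfolders soil/ and marine/ would return a dictionary with one list of paths per subfolder, including files found in deeper levels.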

BiG-SCAPE then uses the Pfam database and hmmscan from the HMMER (v3.1b2) suite to predict Pfam domains in each sequence.
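
A minimal sketch of how such an hmmscan call might be assembled from Python is shown below. The options used (--cpu, --domtblout, --cut_tc) are standard hmmscan options, but whether BiG-SCAPE passes exactly these is an assumption, as are the function names and file paths.

```python
import os
import subprocess


def build_hmmscan_command(fasta_path, pfam_hmm, domtable_path, cpus=1):
    """Return the hmmscan argument list for one FASTA file.

    --domtblout writes the per-domain hit table, --cut_tc applies the
    trusted cutoffs stored in the Pfam profiles, and --cpu sets the number
    of worker threads.
    """
    return [
        "hmmscan",
        "--cpu", str(cpus),
        "--domtblout", domtable_path,
        "--cut_tc",
        pfam_hmm,
        fasta_path,
    ]


def run_hmmscan(fasta_path, pfam_hmm, domtable_path, cpus=1):
    """Run hmmscan, discarding its human-readable report on stdout."""
    cmd = build_hmmscan_command(fasta_path, pfam_hmm, domtable_path, cpus)
    with open(os.devnull, "w") as devnull:
        subprocess.check_call(cmd, stdout=devnull)
```

Keeping the command builder separate from the runner makes it easy to inspect or log the exact command before anything is executed.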

For every pair of BGCs in the set(s), the pairwise distance is calculated as the weighted combination of the Jaccard, AI and DSS indices. The generated network files contain the names of the two BGCs, the raw distance between them, and the scores of the three indices. This is done for several distance cutoff values (i.e. only pairs with Raw Distance < cutoff are written to the final .network file).
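
The combination and cutoff filtering can be sketched as follows. Since the three indices are similarities, the sketch assumes the distance is one minus their weighted sum; that formula, the example weights, the column names, and the function names are all assumptions for illustration rather than BiG-SCAPE's exact implementation.

```python
import csv
import io


def raw_distance(jaccard, dss, ai, weights):
    """One minus the weighted sum of the three similarity indices.

    weights is a (w_jaccard, w_dss, w_ai) tuple that should sum to 1.
    """
    w_j, w_dss, w_ai = weights
    return 1.0 - (w_j * jaccard + w_dss * dss + w_ai * ai)


def write_network(pairs, weights, cutoff, handle):
    """Write tab-separated rows for pairs whose raw distance is below cutoff.

    pairs is an iterable of (name_a, name_b, jaccard, dss, ai) tuples.
    Returns the number of pairs written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["BGC 1", "BGC 2", "Raw distance", "Jaccard", "DSS", "AI"])
    kept = 0
    for name_a, name_b, jaccard, dss, ai in pairs:
        distance = raw_distance(jaccard, dss, ai, weights)
        if distance < cutoff:
            writer.writerow([name_a, name_b, "%.4f" % distance, jaccard, dss, ai])
            kept += 1
    return kept


# Hypothetical weights and index values
weights = (0.2, 0.75, 0.05)
pairs = [("BGC_A", "BGC_B", 0.6, 0.8, 0.2), ("BGC_A", "BGC_C", 0.1, 0.1, 0.0)]
buffer = io.StringIO()
written = write_network(pairs, weights, cutoff=0.3, handle=buffer)
print(buffer.getvalue())
```

With these toy values only the first pair (distance of about 0.27) passes the 0.3 cutoff; the second pair (about 0.905) is left out of the file.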

The distances for each cutoff value will be used in a clustering algorithm to try to define 'Gene Cluster Families' (GCFs).
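
To give a feel for how a cutoff turns pairwise distances into families, the sketch below groups BGCs into connected components of the thresholded network. This is a deliberately simpler stand-in and not the clustering algorithm BiG-SCAPE runs; all names and distance values are hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes that are linked directly or indirectly (union-find)."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    families = {}
    for node in nodes:
        families.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in families.values())


def families_at_cutoff(nodes, distances, cutoff):
    """Keep only pairs closer than cutoff, then return the resulting groups."""
    edges = [pair for pair, dist in distances.items() if dist < cutoff]
    return connected_components(nodes, edges)


nodes = ["BGC_A", "BGC_B", "BGC_C", "BGC_D"]
distances = {
    ("BGC_A", "BGC_B"): 0.20,
    ("BGC_B", "BGC_C"): 0.25,
    ("BGC_C", "BGC_D"): 0.80,
}
print(families_at_cutoff(nodes, distances, cutoff=0.3))
```

At a cutoff of 0.3 the first three BGCs end up in one group and BGC_D stays alone; a stricter cutoff of 0.1 leaves every BGC in its own group.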

By default, BiG-SCAPE uses the /product information of antiSMASH-processed GenBank files to separate the analysis into eight BiG-SCAPE classes: PKS Type I, PKS Other types, NRPS, PKS/NRPS Hybrids, Saccharides, Terpenes, RiPPs and Others. Each class has its own (tuned) set of weights for the distance components. You can also combine all BGC classes in one network file (--mix) and deactivate the default classification (--no_classify). Specific BiG-SCAPE classes can be excluded from the analysis with the --banned_classes parameter.

BGCs with more than one predicted product (hybrids) are either put into the PKS/NRPS Hybrids or the Others BiG-SCAPE classes depending on the classification of their subproducts. Use --hybrids to also add them to each of their individual BiG-SCAPE classes (e.g. a PKS/NRPS Hybrids BGC with 'nrps-t1pks' annotation would be put in the NRPS and PKS Type I BiG-SCAPE classes; a 'terpene-nrps' BGC from Others would also be included in the Terpene and NRPS BiG-SCAPE classes, etc.). Note that if this option is activated, it will try to re-classify these hybrid BGCs even if the PKS/NRPS Hybrids or Others classes are 'banned'.
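
The hybrid handling described above can be sketched as follows. The product-to-class lookup is a small hypothetical excerpt, and the rule used to choose between PKS/NRPS Hybrids and Others (all subproducts being PKS or NRPS types) is an assumption; only the two examples from the text are taken from the source.

```python
# Hypothetical excerpt; a real lookup would cover many more antiSMASH products.
PRODUCT_TO_CLASS = {
    "t1pks": "PKS Type I",
    "t2pks": "PKS Other types",
    "nrps": "NRPS",
    "terpene": "Terpenes",
    "lantipeptide": "RiPPs",
}
PKS_NRPS_CLASSES = {"PKS Type I", "PKS Other types", "NRPS"}


def classify(product, hybrids=False):
    """Return the sorted BiG-SCAPE classes for a /product annotation.

    Annotations with several '-'-separated products are treated as hybrids.
    With hybrids=True, a hybrid BGC is also added to the class of each of
    its subproducts (the behaviour described for --hybrids).
    """
    parts = product.lower().split("-")
    subclasses = {PRODUCT_TO_CLASS.get(part, "Others") for part in parts}
    if len(parts) == 1:
        return sorted(subclasses)
    # Assumed rule: PKS/NRPS-only hybrids get their own class, the rest go to Others
    main = "PKS/NRPS Hybrids" if subclasses <= PKS_NRPS_CLASSES else "Others"
    classes = {main}
    if hybrids:
        classes |= subclasses
    return sorted(classes)


print(classify("nrps-t1pks"))
print(classify("nrps-t1pks", hybrids=True))
print(classify("terpene-nrps", hybrids=True))
```

Without the hybrids flag, 'nrps-t1pks' is placed only in PKS/NRPS Hybrids; with it, the same BGC is also listed under NRPS and PKS Type I.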

See the full options with python bigscape.py -h.

How to run BiG-SCAPE

Requirements

Packages can be installed manually, but using a virtual environment is recommended. For a quick guide, see here

  • Python 2
  • The HMMER suite
  • The (processed) Pfam database. For this, download the latest Pfam-A.hmm.gz file from the Pfam website, uncompress it and process it using the hmmpress command.
  • For sequence alignment (DSS score), BiG-SCAPE uses the hmmalign command from the HMMER suite by default, but you can also select MAFFT (activate with --use_mafft).
  • Biopython
  • NumPy
  • SciPy
  • pySAPC (Affinity Propagation clustering algorithm with support for sparse matrices)
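
The Pfam preparation step from the list above (uncompress Pfam-A.hmm.gz, then run hmmpress) could be scripted roughly as shown below. The function names and paths are assumptions; the download itself is left out.

```python
import gzip
import shutil
import subprocess


def decompress_pfam(gz_path, hmm_path):
    """Uncompress Pfam-A.hmm.gz to hmm_path using only the standard library."""
    with gzip.open(gz_path, "rb") as src, open(hmm_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return hmm_path


def hmmpress_command(hmm_path):
    """Argument list for hmmpress, which writes the binary index files
    (.h3f, .h3i, .h3m, .h3p) next to the .hmm file."""
    return ["hmmpress", hmm_path]


def prepare_pfam(gz_path, hmm_path):
    """Uncompress the database and index it (requires HMMER on the PATH)."""
    decompress_pfam(gz_path, hmm_path)
    subprocess.check_call(hmmpress_command(hmm_path))
```

A call such as prepare_pfam("Pfam-A.hmm.gz", "Pfam-A.hmm") would leave the pressed database ready for hmmscan in the current directory.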

Workflow

  • Parses GenBank files (.gbk) and extracts CDS per BGC (fasta/*.fasta)
  • Predicts domains per BGC (.domtable)
  • Writes list of domains per BGC (.pfs)
  • Writes selected information from filtered domtable files per BGC (.pfd)
  • Writes list of sequences per domain (domains/*.fasta)
  • Creates 'Arrower'-like figures for each BGC (.svg)
  • Saves dictionary with list of specific domains per BGC (<output dir>/BGCs.dict)
  • Calculates multiple alignments for each domain sequence file (domains/*.algn)
  • Calculates distance between BGCs
  • Generates network files (.network)
  • Generates Network Annotation files (.tsv) with information about the input data
  • Generates GCF labels from the clustering algorithm (.tsv)
  • Generates JSON files for built-in visualization (.js) (work in progress)
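
To make the domain-prediction steps of the workflow more concrete, the sketch below parses one data line of an hmmscan --domtblout table (the .domtable files). The column positions follow the HMMER domain table layout; the selection of fields and the dictionary keys are assumptions, since the exact content of the .pfd files is not described here, and the sample line is made up.

```python
def parse_domtable_line(line):
    """Parse one data line of an hmmscan --domtblout table.

    The table is whitespace-delimited, with 22 fixed columns followed by a
    free-text description, so the line is split at most 22 times.
    Returns None for comment lines (starting with '#') and blank lines.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split(None, 22)
    return {
        "domain": fields[0],             # target name (Pfam family)
        "pfam_id": fields[1],            # target accession
        "cds": fields[3],                # query name
        "i_evalue": float(fields[12]),   # independent E-value of this domain
        "score": float(fields[13]),      # per-domain bit score
        "env_from": int(fields[19]),     # envelope start on the query
        "env_to": int(fields[20]),       # envelope end on the query
    }


# Made-up example line in domtblout layout
sample = ("Ketoacyl-synt_C PF02801.21 118 BGC0001_ORF1 - 450 1.2e-40 137.5 0.1 "
          "1 1 3.1e-44 1.5e-40 137.2 0.1 2 117 260 375 259 376 0.98 "
          "Beta-ketoacyl synthase, C-terminal domain")
hit = parse_domtable_line(sample)
print(hit["pfam_id"], hit["env_from"], hit["env_to"])
```

Limiting the split to 22 keeps the trailing description intact even though it contains spaces.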

BiG-SCAPE will try to re-use some of these files to continue in case the analysis stops (so take this into account if you e.g. change the version of the Pfam database or the alignment method).
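
One simple way to resume an interrupted run is to skip inputs whose output file already exists, as in the sketch below. The function name and the existence-based check are assumptions for illustration and may differ from what BiG-SCAPE does.

```python
import os


def files_to_process(input_paths, output_dir, output_ext):
    """Return the inputs whose expected output file is missing or empty.

    The expected output is <output_dir>/<input base name><output_ext>.
    """
    pending = []
    for path in input_paths:
        base = os.path.splitext(os.path.basename(path))[0]
        out = os.path.join(output_dir, base + output_ext)
        if not os.path.isfile(out) or os.path.getsize(out) == 0:
            pending.append(path)
    return pending
```

A check like this only looks at whether a file is present, not at how it was produced, which is why stale files from a different Pfam version or alignment method would be picked up silently.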