Lokhorst, Wouter
BiG-SCAPE

Repository



Introduction
Bioinformatically, mining (meta)genomes for Biosynthetic Gene Clusters
(BGCs) encoding specialized metabolites would entail identifying and annotating
BGCs on the genome and taking additional steps to define a distance between BGCs
in order to map the BGC diversity in similarity networks. These similarity
networks would graphically summarize the diversity of the BGCs, as well as
contain multiple annotations to help identify novel compounds, make ecological
correlations and so on.

Defining a distance
BGCs are essentially a collection of genes that code for proteins that work
together to produce a compound. These proteins are most likely the most
important factor when it comes to the final structure of the compound. Thus, a
good distance metric should use information on the similarity of the proteins
between two BGCs.
In this project, three indices are combined to define a final distance metric
between any given pair of BGCs:

The Jaccard index (J): The ratio between the distinct shared and distinct
unshared domain types between two BGCs.
The DSS (Domain Sequence Similarity score): Measures the sequence similarity
between the domains of both BGCs, for each type of domain. When each BGC
contains only one copy of a certain domain, the sequence similarity can be
obtained directly, otherwise, the Hungarian algorithm is used to select the most
similar Pfam domain sequences (munkres.py). A special weight can be also be
given to marked domains annotated as "anchor domains" (anchor_domains.txt).
The Adjacency Index (AI): Estimates the similarity in terms of proximal domain
content by calculating the ratio between the distinct shared and distinct
unshared adjacent domains (without taking order into account)


How does it work
BiG-SCAPE tries to (recursively) read all the GenBank files from the input
folder (which, preferrably, correspond to identified gene clusters with a tool
like antiSMASH). If the user has
different subfolders in the main input directory, these can even be treated as
different samples (and BiG-SCAPE can generate specific network files for this;
activate with --samples).
BiG-SCAPE then uses the Pfam database and hmmscan from the HMMER (v3.1b2)
suite to predict Pfam domains in each sequence.
For every pair of BGCs in the set(s), the pairwise distance between this BGCs is
calculated as the weighted combination of the Jaccard, AI and DSS indices.
Network files are generated containing a number of information: the name of the
BGCs, the raw distance between them, and data from the the three indices'
scores. This is done taking into account different cutoff values for the
distances (i.e. only pairs with Raw Distance < cutoff are written in the
final .network file).
The distances for each cutoff value will be used in a clustering algorithm to
try to define 'Gene Cluster Families' (GCFs).
By default, BiG-SCAPE uses the /product information of antiSMASH-processed
GenBank files to separate the analysis into eight BiG-SCAPE classes: PKS Type I,
PKS Other types, NRPS, PKS/NRPS Hybrids, Saccharides, Terpenes, RiPPs and
Others. Each has different (tuned) sets of weights for the distance components.
You can also choose to combine all BGC classes in one network file (--mix) and
deactivate the default classification (--no_classify). It is also possible to
prevent analysis of BiG-SCAPE classes by using the --banned_classes parameter.
BGCs with more than one predicted product (hybrids) are either put into the
PKS/NRPS Hybrids or the Others BiG-SCAPE classes depending on the classification
of their subproducts. Use --hybrids to also add them to each of their
individual BiG-SCAPE classes (e.g. a PKS/NRPS Hybrids BGC with 'nrps-t1pks'
annotation would be put in the NRPS and PKS Type I BiG-SCAPE classes; a
'terpene-nrps' BGC from Others would also be included in the Terpene and NRPS
BiG-SCAPE classes, etc.). Note that if this option is activated, it will try to
re-classify these hybrid BGCs even if the PKS/NRPS Hybrids or Others classes are
'banned'.
See the full options with python bigscape.py -h.

How to run BiG-SCAPE

Requirements
Packages can be installed manually but using a virtual environment is
recommended. For a quick guide, see here

Python 2
The HMMER suite

The (processed) Pfam database. For this, download the latest Pfam-A.hmm.gz
file from the Pfam website, uncompress it and process
it using the hmmpress command.
For sequence alignment (DSS score), BiG-SCAPE uses the hmmalign command from
the HMMER suite by default, but you can also select
MAFFT (activate with --use_mafft)
Biopython
Numpy
scipy

pySAPC (Affinity Propagation
clustering algorithm with support for sparse matrices)


Workflow

Parses GenBank files (.gbk) and extracts CDS per BGC (fasta/*.fasta)
Predicts domains per BGC (.domtable)
Writes list of domains per BGC (.pfs)
Writes selected information from filtered domtable files per BGC (.pfd)
Writes list of sequences per domain (domains/*.fasta)
Creates 'Arrower'-like figures for each BGC (.svg)
Saves dictionary with list of specific domains per BGC
(<output dir>/BGCs.dict)
Calculates multiple alignments for each domain sequence file (domains/*.algn)
Calculates distance between BGCs
Generates network files (.network)
Generates Network Annotation files (.tsv) with information about the input
data
Generates GCF labels from the clustering algorithm (.tsv)
Generates json files for built-in visualization (.js) (work in progress)

BiG-SCAPE will try to re-use some of these files to continue in case the
analysis stops (so take this into account if you e.g. change the version of the
Pfam database or the alignment method).