In order to run BiG-SCAPE, three files are needed: bigscape.py
, functions.py
and munkres.py
.
Introduction
Bioinformatically, mining (meta)genomes for Biosynthetic Gene Clusters (BGCs) encoding specialized metabolites would entail identifying and annotating BGCs on the genome and taking additional steps to define a distance between BGCs in order to map the BGC diversity in similarity networks. These similarity networks would graphically summarize the diversity of the BGCs, as well as contain multiple annotations to help identify novel compounds, make ecological correlations and so on.
Defining a distance
BGCs are essentially a collection of genes that code for proteins that work together to produce a compound. These proteins are most likely the most important factor when it comes to the final structure of the compound. Thus, a good distance metric should use information on the similarity of the proteins between two BGCs.
In this project, three indices are combined to define a final distance metric between any given pair of BGCs:
- The Jaccard index: The ratio between the distinct shared and distinct unshared domains between two BGCs.
- The DDS (Domain Duplicate Score): Measures the similarity between two sets of
shared domains, for each domain. If the
seqdist
distance method is selected (default), this also takes into account the sequence similarity between shared domains (calculated from multiple alignments from all sequences with the same predicted domain using MAFFT). For cases where more than one copy of a certain domain exists on at least one of the BGCs, the Hungarian algorithm is used to select the most similar Pfam domains (munkres.py
). A special weight can be given to marked domains annotated as "anchor domains". - The Goodman-Kruskal gamma function: Estimates the similarity of the order of
two distinct domain sets accounting for the number of same-order and reversed pairs of domains within a certain domain count (parameter
nbhood
).
How does it work
BiG-SCAPE tries to read all the GenBank files from the input folder (which,
preferrably, correspond to identified gene clusters with a tool like
antiSMASH). If the user has
different subfolders in the main input directory, these can be treated as
different samples (and BiG-SCAPE can generate specific network files for this;
see the parameters with the --help
option).
BiG-SCAPE then uses hmmscan
from the HMMER (v3.1b2) package to predict Pfam
domains in each sequence.
For every pair of BGCs in the set(s), the pairwise distance between this BGCs is calculated as the weighted combination of the Jaccard, GK and DDS indices. Network files are generated containing a number of information: the name of the BGCs, the raw distance, the three indices' score, a -log2 score of the raw distance and more. This is done taking into account different cutoff values for the distances (i.e. only rows with raw distance <= cutoff are written in the file).
How to run/re-run BiG-SCAPE
Before using BiG-SCAPE, make sure that MAFFT
and hmmscan
are available in
your path. hmmscan
also uses the Pfam library (specifically, Pfam-A.hmm formatted with the hmmpress
command).
Run as python bigscape.py <options>
.
There are some options to re-run BiG-SCAPE and skip some of the most time-consuming calculations:
Parameter (nothing)
What it does
- Parses GenBank files (.gbk) and extracts CDS per BGC (.fasta)
- Predicts domains per BGC (.domtable)
- Writes list of sequences per domain (.fasta)
- Writes list of domains per BGC (.pfs)
- Writes selected information from domtable files per BGC (.pfd)
- Saves dictionary with list of specific domains per BGC
(
<output dir>/BGCs.dict
) - When using the
seqdist
distance method, calculates multiple alignments using MAFFT (.algn) for domains represented by more than one sequence. Saves alignment information in<output dir>/DMS.dict
- Calculates distance between BGCs
- Generates network files
When to use it
- First time
- When changing fundamental parameters like
domain_overlap_cutoff
ormin_bgc_size
. Note that BiG-SCAPE only works with files which do not yet have their corresponding processed output (so it's possible to save time when resuming from an unexpected end of the run by just working on the unprocessed files). This means that re-running BiG-SCAPE with different values for these parameters will not have an effect unless you empty your output directory or choose a new one. - When you are changing versions of the pfam database (if you have previous results from past runs, you can override BiG-SCAPE's normal behavior using the
--force_hmmscan
parameter)
What you need
- A list of GenBank files (.gbk). If either the
seqdist
ordomain_dist
methods are selected with the "S" option, BiG-SCAPE will treat subfolders within your input directory as "samples". The name of each sample is taken from each subfolder with .gbk files and, as it is used in the final network files duplicated subfolder/sample names is not allowed.
--skip_mafft
Parameter What it does
Skips domain sequence alignment using MAFFT
. Oherwise:
- Calculates distance between BGCs
- Generates network files
When to use it
- You already ran BiG-SCAPE once
- You are using the
seqdist
distance method - You want to change the
nbhood
parameter for the Goodman-Kruskal index - When you had only previously calculated distance within samples (i.e. not all-vs-all distances)
- When you want to change the distance method from first run (e.g. from
domaindist
toseqdist
) - When you need to recalculate distances (e.g. to change the anchor domains list)
What you need
- The original .gbk files. These are necessary to rebuild the original structure
- The list of domains per BGC (.pfs files). These are necessary for distance calculations
- The output from
hmmscan
(.domtable files). These are necesary to see whether a particular .gbk file had no predicted domains (in which case it'll be taken out of the analysis) - The BGCs dictionary (
<output dir>/BGCs.dict
). This contains a list of domains per BGC (and for each of these domains, a list of all predicted locations within the BGC). Needed for distance calculations - The DMS dictionary (
<output dir>/DMS.dict
). This contains information from the sequence alignment
--skip_all
Parameter What it does
- Skips all calculations
- Generates new network files
When to use it
- When you only want to generate new network files: use different cutoffs, include/exclude singletons, use input file structure to generate network files per sample or change weights for the indices
What you need
- The output from
hmmscan
. - The network file with the appropriate distance method and complete set of
distances. This means either
networkfile_seqdist_all_vs_all_c1.network
for theseqdist
method ornetworkfile_domain_dist_all_vs_all_c1.network
fordomain_dist
. BiG-SCAPE will make use of these pre-calculated distances so these cutoff=1 value is added by default in the list of cutoffs