README.md

# Please note that since 2016-08-01, development has moved to [https://git.wageningenur.nl/medema-group/BiG-SCAPE](https://git.wageningenur.nl/medema-group/BiG-SCAPE)
~~~
~~~

In order to run BiG-SCAPE, three files are needed: `bigscape.py`, `functions.py` and `munkres.py`.

# Introduction

Bioinformatically, mining (meta)genomes for **Biosynthetic Gene Clusters** (BGCs) encoding specialized metabolites would entail identifying and annotating BGCs on the genome and taking additional steps to define a distance between BGCs in order to map the BGC diversity in similarity networks. These similarity networks would graphically summarize the diversity of the BGCs, as well as contain multiple annotations to help identify novel compounds, make ecological correlations and so on.

## Defining a distance

BGCs are essentially a collection of genes that code for proteins that work together to produce a compound. These proteins are most likely the most important factor when it comes to the final structure of the compound. Thus, a good distance metric should use information on the similarity of the proteins between two BGCs.

In this project, three indices are combined to define a final distance metric between any given pair of BGCs:
* The Jaccard index: The ratio between the distinct shared and distinct unshared domains between two BGCs.
* The DDS (Domain Duplicate Score): Measures the similarity between two sets of shared domains, for each domain. If the `seqdist` distance method is selected (default), this also takes into account the sequence similarity between shared domains (calculated from multiple alignments from all sequences with the same predicted domain using MAFFT). For cases where more than one copy of a certain domain exists on at least one of the BGCs, the Hungarian algorithm is used to select the most similar Pfam domains (`munkres.py`). A special weight can be given to marked domains annotated as "anchor domains".
* The Goodman-Kruskal gamma function: Estimates the similarity of the order of two distinct domain sets accounting for the number of same-order and reversed pairs of domains within a certain domain count (parameter `nbhood`).


# How does it work

BiG-SCAPE tries to read all the GenBank files from the input folder (which, preferrably, correspond to identified gene clusters with a tool like [antiSMASH](https://antismash.secondarymetabolites.org/)). If the user has different subfolders in the main input directory, these can be treated as different samples (and BiG-SCAPE can generate specific network files for this; see the parameters with the `--help` option).

BiG-SCAPE then uses `hmmscan` from the HMMER (v3.1b2) package to predict Pfam domains in each sequence.

For every pair of BGCs in the set(s), the pairwise distance between this BGCs is calculated as the weighted combination of the Jaccard, GK and DDS indices. Network files are generated containing a number of information: the name of the BGCs, the raw distance, the three indices' score, a -log2 score of the raw distance and more. This is done taking into account different cutoff values for the distances (i.e. only rows with raw distance <= cutoff are written in the file).


# How to run/re-run BiG-SCAPE

Before using BiG-SCAPE, make sure that `MAFFT` and `hmmscan` are available in your path. `hmmscan` also uses the Pfam library (specifically, Pfam-A.hmm formatted with the `hmmpress` command).

Run as `python bigscape.py <options>`.

There are some options to re-run BiG-SCAPE and skip some of the most time-consuming calculations:

## Parameter (nothing)

### What it does
* Parses GenBank files (.gbk) and extracts CDS per BGC (.fasta)
* Predicts domains per BGC (.domtable)
* Writes list of sequences per domain (.fasta)
* Writes list of domains per BGC (.pfs)
* Writes selected information from domtable files per BGC (.pfd)
* Saves dictionary with list of specific domains per BGC (`<output dir>/BGCs.dict`)
* When using the `seqdist` distance method, calculates multiple alignments using MAFFT (.algn) for domains represented by more than one sequence. Saves alignment information in `<output dir>/DMS.dict`
* Calculates distance between BGCs
* Generates network files

### When to use it
* First time
* When changing fundamental parameters like `domain_overlap_cutoff` or `min_bgc_size`
* When you are changing versions of the pfam database

### What you need
* A list of GenBank files (.gbk). If either the `seqdist` or `domain_dist` methods are selected with the "S" option, BiG-SCAPE will treat subfolders within your input directory as "samples". The name of each sample is taken from each subfolder with .gbk files, so repeating subfolder names is not allowed

## Parameter `--skip_hmmscan`

### What it does
Skips domain prediction using `hmmscan` and extraction of information, but recalculates everything else:
* When using the `seqdist` distance method, calculates multiple alignments using MAFFT (.algn) for domains represented by more than one sequence. Saves alignment information in `<output dir>/DMS.dict`
* Calculates distance between BGCs
* Generates network files

### When to use it
* You already ran BiG-SCAPE once
* You *need* to change parameters in MAFFT
* You are using the `seqdist` method *and* you want to change the sequence domain identity score method (e.g. from BiG-SCAPE's to MAFFT's)

### What you need
* The original .gbk files. These are necessary to rebuild the original structure
* The list of domains per BGC (.pfs files). These are necessary for distance calculations
* The output from `hmmscan` (.domtable files). These are necesary to see whether a particular .gbk file had no predicted domains (in which case it will be taken out of the analysis)
* If using the `seqdist` method, the fasta sequence files per domain (`<output dir>/domains/`). These are necessary for the alignments
* The BGCs dictionary (`<output dir>/BGCs.dict`). This contains a list of domains per BGC (and for each of these domains, a list of all predicted locations within the BGC). Needed for distance calculations

## Parameter `--skip_mafft`

### What it does
Skips domain prediction usnig `hmmscan`, extraction of sequences and domain sequence alignment using `MAFFT`. Oherwise:
* Calculates distance between BGCs
* Generates network files

### When to use it
* You already ran BiG-SCAPE once
* You are using the `seqdist` distance method
* You want to change the `nbhood` parameter for the Goodman-Kruskal index
* When you had only previously calculated distance within samples (i.e. not all-vs-all distances)
* When you want to change the distance method from first run (e.g. from `domaindist` to `seqdist`)
* When you need to recaculate distances (e.g. to change the anchor domains list)

### What you need
* The original .gbk files. These are necessary to rebuild the original structure
* The list of domains per BGC (.pfs files). These are necessary for distance calculations
* The output from `hmmscan` (.domtable files). These are necesary to see whether a particular .gbk file had no predicted domains (in which case will be taken out of the analysis
* The BGCs dictionary (`<output dir>/BGCs.dict`). This contains a list of domains per BGC (and for each of these domains, a list of all predicted locations within the BGC). Needed for distance calculations
* The DMS dictionary (`<output dir>/DMS.dict`). This contains information from the sequence alignment

## Parameter `--skip_all`

### What it does
* Skips all calculations
* Generates new network files

### When to use it
* When you only want to generate new network files: use different cutoffs, include/exclude singletons, use input file structure to generate network files per sample or change weights for the indices

### What you need
* The output from `hmmscan`.
* The network file with the appropriate distance method and complete set of distances. This means either `networkfile_seqdist_all_vs_all_c1.network` for the `seqdist` method or `networkfile_domain_dist_all_vs_all_c1.network` for `domain_dist`. BiG-SCAPE will make use of these pre-calculated distances so these cutoff=1 value is added by default in the list of cutoffs