Commit 35188f2e authored by Jorge Navarro Muñoz

New experimental branch: No DMS.dict

The DMS structure that held the precalculated sequence similarity between all
pairs of aligned domain sequences (for each domain) grew very quickly with the
number of input files, since the number of pairs is quadratic in the number of
sequences. On top of this, the structure seemed to be copied for the
parallelized calculation of pairwise distances.
Emzo proposed (in the direct_align branch) to avoid using MAFFT for domain
sequence alignment, as well as storing the sequence similarity in the DMS
dictionary, and instead calculate both things on-the-fly, at the moment of
doing the distance calculation.
In this branch, I'm keeping the multiple alignment part with MAFFT, but the
sequence similarity is computed on-the-fly. The extra computing time each time
the script is re-run is the tradeoff for getting rid of DMS, which saves both
RAM and disk space (it was also kept as a file).
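
For illustration, an on-the-fly similarity between two rows of the same
alignment could look roughly like the sketch below. The function name
aligned_identity and the exact scoring (identical columns divided by the
columns where at least one sequence has a residue) are assumptions made for
this example, not necessarily what the branch implements.

```python
def aligned_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical columns between two aligned sequences.

    Hypothetical helper: columns where both sequences have a gap are
    ignored; every other column counts towards the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # gap-gap columns carry no information
        compared += 1
        if a == b:
            matches += 1

    # Two all-gap rows have nothing to compare; report zero similarity.
    return matches / compared if compared else 0.0
```

Computing this per pair at distance-calculation time costs one pass over the
alignment columns, but nothing has to be stored between pairs.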

In summary:
* Eliminated DMS usage. Using --skip_mafft still avoids calling MAFFT; either
way, only the aligned domain sequences (.algn) in the domains folder are read
and kept in memory.
* Eliminated the --use_mafft_distout parameter (only the internal sequence
similarity is used; in the future we could also stop generating the .hat2
files).
* Dropped the ">" character from the keys of the dictionary returned by
fasta_parser()
* If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate
the domain fasta files. Note that a user who adds new files to her input
directory should not use this parameter; alternatively, we should track which
domains are affected and redo the domain fasta files + mafft-align only for
those domains.
* Moved the extraction of the gbk_group information (BGC definition + antiSMASH
group annotation) to the first time that the GenBank files are opened (when
collecting the files and doing basic filtering). Also in that routine
(get_gbk_files), each file is only opened once.
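
As a rough illustration of the fasta_parser() change, the sketch below parses
FASTA lines into a dictionary whose keys no longer carry the leading ">". The
function parse_fasta is a hypothetical stand-in written for this example; the
signature of the real fasta_parser() may differ.

```python
def parse_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines.

    Hypothetical stand-in for fasta_parser(): headers are stored
    without the leading ">" character.
    """
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the ">" from the key
            sequences[header] = ""
        elif header is not None:
            # Sequences may span several lines; concatenate them.
            sequences[header] += line
    return sequences
```

With keys stored this way, callers can look up a sequence by its identifier
directly, without prepending ">" first.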
parent d453df5d