New experimental branch: No DMS.dict
The DMS structure, which held the precalculated sequence similarity between all pairs of aligned domain sequences (for each domain), grew very quickly with the number of input files: because it stores every pair, its size is roughly quadratic in the number of sequences per domain. On top of this, the structure seemed to be copied for the parallelized calculation of pairwise distances.

Emzo proposed (in the direct_align branch) to avoid both using MAFFT for domain sequence alignment and storing the sequence similarity in the DMS dictionary, and instead to calculate both on the fly, at the moment of the distance calculation. In this branch, I'm keeping the multiple alignment step with MAFFT, but the sequence similarity is calculated on the fly. The tradeoff is extra computing time every time the script is re-run, in exchange for getting rid of DMS (both in RAM and on disk, since it was also kept as a file).

In summary:

* Eliminated DMS usage. Using --skip_mafft avoids calling MAFFT, but in any case only the aligned domain sequences (.algn) in the domains folder are read and kept in memory.
* Eliminated the --use_mafft_distout parameter (only the internal sequence similarity is used; we could also stop generating the .hat2 files in the future).
* Dropped the ">" character from the keys of the dictionary returned by fasta_parser().
* If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate the domain fasta files. Take into account that if the user is adding new files to her input directory, she should not use this parameter; otherwise we should take care to track which domains are affected and process the domain fasta files + mafft-align only for those domains.
* Moved the extraction of the gbk_group information (BGC definition + antiSMASH group annotation) to the first time that the GenBank files are opened (when collecting the files and doing basic filtering). Also in that routine (get_gbk_files), each file is only opened once.
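As a rough illustration of the on-the-fly approach described above, the sketch below parses aligned sequences into a dictionary whose keys have no leading ">" and computes the similarity of one pair only when it is requested, instead of reading it from a precalculated all-pairs structure. This is not the code from this branch: the names parse_fasta and aligned_identity, the example headers, and the identity definition (matches over columns where at least one sequence has a residue) are all assumptions made for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}.

    Headers are stored WITHOUT the leading '>' (hypothetical stand-in
    for fasta_parser(); not the actual BiG-SCAPE function).
    """
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line
    return sequences


def aligned_identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned sequences.

    Assumption for this sketch: columns where both sequences have a gap
    are ignored; every other column counts toward the length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    length = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue
        length += 1
        if a == b:
            matches += 1
    return matches / length if length else 0.0


# Made-up aligned sequences, standing in for the contents of an .algn file.
alignment = """>BGC1_dom1
MKV-LA
>BGC2_dom1
MRVALA
"""

aligned = parse_fasta(alignment)
# The similarity is computed only at the moment this pair is compared.
similarity = aligned_identity(aligned["BGC1_dom1"], aligned["BGC2_dom1"])
```

With this layout only the aligned sequences stay in memory; each pairwise value is recomputed whenever it is needed, which is the computing-time cost mentioned above.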