Add (sub)graph export (AKA region of interest)
NB: Still under active development (this branch is a small side project of mine).
This merge request will add a subcommand to retrieve the cDBG from PanTools in GFA format (and others?).
TODO:
- Check accuracy of GFA v1 output
- Whole pangenome export
- Decide on subcommand name
- Decide which output formats should be supported (only GFA; which is slow)
- Check speed on large pangenomes
- Add subcommand for building nucleotide layer from existing graph (GFA v1 format) => edit: to be done with !198
- Add subcommand for extracting a subgraph in GFA format, including annotations for Bandage
- Get separate subcommand for regions only
- Define outputs for region (see below for implementation status)
- Write all output formats
  - GFA v1
  - Include Bandage annotation CSV for outputs
  - FASTA for each genome
  - GFF3 for each genome
  - PAV for each homology group
  - PAV for each k-mer/node
  - Collinearity file (/visualization)
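For checking the GFA v1 output, the record layout is small enough to sketch. The following is a minimal sketch and not PanTools code: `write_gfa1`, the input shapes and the node names are hypothetical, and it assumes that adjacent cDBG nodes overlap by k-1 bases, as in a standard compacted de Bruijn graph.

```python
import io


def write_gfa1(nodes, edges, k, out):
    """Write a small graph as GFA v1 (hypothetical helper, not PanTools code).

    nodes: dict mapping node name -> sequence
    edges: iterable of (from_name, from_orient, to_name, to_orient)
    k:     k-mer size; assumes adjacent nodes overlap by k-1 bases
    """
    out.write("H\tVN:Z:1.0\n")  # header line with the format version tag
    for name, seq in nodes.items():
        out.write(f"S\t{name}\t{seq}\n")  # one segment line per node
    overlap = f"{k - 1}M"  # overlap between linked segments as a CIGAR string
    for src, src_orient, dst, dst_orient in edges:
        out.write(f"L\t{src}\t{src_orient}\t{dst}\t{dst_orient}\t{overlap}\n")


buf = io.StringIO()
write_gfa1({"1": "ACGTA", "2": "GTACC"}, [("1", "+", "2", "+")], k=4, out=buf)
gfa_text = buf.getvalue()
```

With k=4 the two example nodes share the 3-base overlap `GTA`, so the link line carries `3M`. A file written this way can be opened in Bandage to eyeball the result.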
TODO after commit c565cb45 (where the 'novel' algorithm, a combination of k-mer and alignment, has been implemented and tested):
- Add parameter for minimal number of k-mers in a block for the 'novel' algorithm
- Make the 'novel' algorithm the default and rename it to something more sensible
- Remove other algorithms
- Create homology-based search using the 'novel' algorithm
- Use simple (NJ?) clustering on k-mer PAV for ordering the output => edit: too difficult, and better left to whichever visualisation tool ends up being used
- Add new parameter --flanking to add additional flanking sequence after the ROI finding algorithm
- Clean up unused code
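The flanking step can be thought of as a simple coordinate extension after the ROI has been found. This is a sketch under assumptions, not the implemented behaviour: `add_flanking` is a hypothetical name, coordinates are taken to be 1-based and inclusive, and the extension is clamped to the sequence boundaries.

```python
def add_flanking(start, end, flanking, seq_length):
    """Extend a 1-based inclusive region by `flanking` bases on both sides.

    Hypothetical helper: the result is clamped to [1, seq_length] so the
    region never runs off either end of the sequence.
    """
    if flanking < 0:
        raise ValueError("flanking must be non-negative")
    return max(1, start - flanking), min(seq_length, end + flanking)


# A region near both sequence ends gets clamped on both sides.
region = add_flanking(start=500, end=900, flanking=1000, seq_length=1500)
```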
TODO after commit 3bf43792 (where code has been discussed with both Sandra and Robin in person):
- Let extract_region find not only unique regions but possibly duplicated regions too, by adding a parameter for the frequency of found k-mers/hmgroups
- Increase speed of GFA writing by switching from the Neo4j database to the k-mer database
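One way to read the frequency parameter is as an upper bound on how often a k-mer (or homology group) may occur within a genome before its hits are ignored. The sketch below illustrates that reading only: `filter_by_frequency`, `max_frequency` and the hit tuples are hypothetical, and a value of 1 is assumed to reproduce the current unique-only behaviour.

```python
from collections import Counter


def filter_by_frequency(hits, max_frequency=1):
    """Keep hits whose k-mer occurs at most `max_frequency` times in its genome.

    hits: list of (genome, kmer, position) tuples (hypothetical shape).
    """
    # Count how often each k-mer was found per genome.
    counts = Counter((genome, kmer) for genome, kmer, _ in hits)
    return [h for h in hits if counts[(h[0], h[1])] <= max_frequency]


hits = [
    ("g1", "ACGT", 10),
    ("g1", "ACGT", 250),  # second copy of the same k-mer in g1
    ("g1", "TTGA", 14),
    ("g2", "ACGT", 12),
]
unique_only = filter_by_frequency(hits, max_frequency=1)
with_duplicates = filter_by_frequency(hits, max_frequency=2)
```

Raising the threshold from 1 to 2 keeps both copies in `g1`, which is what would allow a duplicated region to be reported.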
TODO after commit e486e186 (where code has been discussed with Dick, Eric and Sandra):
- Investigate use of dynamic programming to generalise more => edit: it seems to me that the currently implemented parameters, with the possible combination of k-mer and homology, suffice; maybe DP later
- Implement --include and --exclude parameters
- Add warning when no homology grouping is present and handle that case in the code accordingly
- Make GFA output optional (even for shorter regions this file can be quite confusing)
- Don't assume there are no copies of the original ROI in the starting genome
- Check that end is physically located after start when giving a ROI
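The start/end check from the last item could look roughly like this. It is a sketch with assumptions: `validate_roi` is a hypothetical name, coordinates are taken to be 1-based and inclusive on a single sequence, and an ROI with end equal to start is rejected because the item asks for end to lie after start.

```python
def validate_roi(start, end, seq_length):
    """Validate a 1-based inclusive ROI and return it unchanged.

    Hypothetical helper: raises ValueError with a readable message when the
    ROI falls outside the sequence or when end is not located after start.
    """
    if start < 1 or end > seq_length:
        raise ValueError(
            f"ROI {start}-{end} falls outside the sequence (1-{seq_length})"
        )
    if end <= start:
        raise ValueError(f"ROI end ({end}) must be located after start ({start})")
    return start, end


roi = validate_roi(1200, 4800, seq_length=10000)
```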
TODO after commit 51fc3d7f (where implementations and effects have been discussed with Sandra):
- Implement BLAST wrapper around results to have an optional links file showing BLAST results
- Split the --max-distance parameter into a --max-kmer-distance and a --max-block-distance parameter, with a lower threshold for the former
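To illustrate why two separate thresholds can be useful, here is a sketch of two-level chaining. It is an assumption about how the parameters might interact, not the implemented algorithm: `chain_hits` and `min_kmers_per_block` are hypothetical, hits are plain positions on one sequence, and sparse blocks are dropped before blocks are merged.

```python
def chain_hits(positions, max_kmer_distance, max_block_distance,
               min_kmers_per_block=2):
    """Chain k-mer hit positions into regions in two steps (hypothetical).

    1. Consecutive hits at most `max_kmer_distance` apart form a block.
    2. Blocks with fewer than `min_kmers_per_block` hits are dropped.
    3. Remaining blocks at most `max_block_distance` apart are merged.
    """
    blocks = []  # each entry is [start, end, number_of_hits]
    for pos in sorted(positions):
        if blocks and pos - blocks[-1][1] <= max_kmer_distance:
            blocks[-1][1] = pos
            blocks[-1][2] += 1
        else:
            blocks.append([pos, pos, 1])

    dense = [b for b in blocks if b[2] >= min_kmers_per_block]

    regions = []  # each entry is [start, end]
    for start, end, _ in dense:
        if regions and start - regions[-1][1] <= max_block_distance:
            regions[-1][1] = end
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


positions = [100, 105, 110, 300, 500, 504, 509]
merged = chain_hits(positions, max_kmer_distance=10, max_block_distance=400)
split = chain_hits(positions, max_kmer_distance=10, max_block_distance=200)
```

In this example the isolated hit at position 300 is discarded as a one-hit block, so with a block distance of 200 it cannot bridge the two dense blocks; a single shared distance of 200 would have chained everything into one region.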
Edited by Workum, Dirk-Jan van