Add (sub)graph export (AKA region of interest)

Workum, Dirk-Jan van requested to merge add_gfa_export into develop

NB: Still under active development (this branch is a small side project of mine).

This merge request adds a subcommand to export the compacted de Bruijn graph (cDBG) from PanTools in GFA format (and possibly other formats).

TODO:

  • Check accuracy of GFA v1 output
  • Whole pangenome export
    • Decide on subcommand name
    • Decide which output formats should be supported (currently only GFA, which is slow)
    • Check speed on large pangenomes
  • Add subcommand for building nucleotide layer from existing graph (GFA v1 format)
    • => edit: to be done with !198
  • Add subcommand for extracting a subgraph in GFA format, including annotations for Bandage
    • Get separate subcommand for regions only
    • Define outputs for region (see below for implementation status)
  • Write all output formats
    • GFA v1
    • Include Bandage annotation CSV for outputs
    • FASTA for each genome
    • GFF3 for each genome
    • PAV for each homology group
    • PAV for each k-mer/node
    • Collinearity file (/visualization)
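
For reference while checking the GFA v1 output, here is a minimal sketch of what a writer has to emit: a header line, one `S` (segment) line per node and one `L` (link) line per edge, all tab-separated. The function name `to_gfa1` and the input shapes are hypothetical and are not taken from the PanTools code. The overlap is passed in explicitly because this sketch makes no assumption about how PanTools stores it.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical edge type for illustration only:
# (from_segment, from_orientation, to_segment, to_orientation, overlap_length)
Link = Tuple[str, str, str, str, int]


def to_gfa1(segments: Dict[str, str], links: Iterable[Link]) -> str:
    """Serialise segments and links as GFA v1 text."""
    lines = ["H\tVN:Z:1.0"]  # header with the format version tag
    for name, sequence in segments.items():
        lines.append(f"S\t{name}\t{sequence}")
    for src, src_orient, dst, dst_orient, overlap in links:
        # GFA v1 only allows '+' or '-' as orientation
        if src_orient not in ("+", "-") or dst_orient not in ("+", "-"):
            raise ValueError(f"invalid orientation on link {src} -> {dst}")
        if src not in segments or dst not in segments:
            raise ValueError(f"link {src} -> {dst} references an unknown segment")
        # The overlap field is a CIGAR string; a plain match of N bases is 'NM'
        lines.append(f"L\t{src}\t{src_orient}\t{dst}\t{dst_orient}\t{overlap}M")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_gfa1({"1": "ACGT", "2": "GTTA"}, [("1", "+", "2", "+", 2)]), end="")
```

Rejecting links to unknown segments up front is a design choice of this sketch; it catches a dangling edge before a viewer such as Bandage has to.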

TODO after commit c565cb45 (where the 'novel' algorithm, which combines k-mers and alignment, was implemented and tested):

  • Add parameter for minimal number of kmers in a block for the 'novel' algorithm
  • Make the 'novel' algorithm the default and give it a more sensible name
  • Remove other algorithms
  • Create homology-based search using the 'novel' algorithm
  • Use simple (NJ?) clustering on kmer PAV for ordering the output
    • => edit: too difficult, and better left to whichever visualisation tool is used
  • Add new parameter --flanking to add flanking sequence after the ROI-finding algorithm has run
  • Clean up unused code
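
A minimal sketch of what the --flanking item above could amount to, assuming 1-based inclusive coordinates and that the extension is clamped to the sequence boundaries. The function name `add_flanking` and the `seq_length` argument are hypothetical; this is not the implementation in the branch.

```python
from typing import Tuple


def add_flanking(start: int, end: int, flanking: int, seq_length: int) -> Tuple[int, int]:
    """Extend a found region by `flanking` bases on both sides.

    Coordinates are assumed to be 1-based and inclusive. The result is
    clamped so that it never runs past either end of the sequence.
    """
    if flanking < 0:
        raise ValueError("flanking must be zero or positive")
    new_start = max(1, start - flanking)
    new_end = min(seq_length, end + flanking)
    return new_start, new_end
```

Applying the flanking only after the ROI has been found keeps the search itself independent of how much context the user wants in the output.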

TODO after commit 3bf43792 (where code has been discussed with both Sandra and Robin in person):

  • Let extract_region find not only unique regions but also possibly duplicated regions, by adding a parameter for the frequency of found k-mers/hmgroups
  • Increase speed of GFA writing by switching from the Neo4j database to the k-mer database

TODO after commit e486e186 (where code has been discussed with Dick, Eric and Sandra):

  • Investigate the use of dynamic programming to generalise further
    • => edit: it seems to me that the currently implemented parameters, together with the possible combination of k-mer and homology search, suffice; maybe DP later
  • Implement --include and --exclude parameters
  • Add a warning when no homology grouping is present and handle that case in the code accordingly
  • Make GFA output optional (even for shorter regions this file can be quite confusing)
  • Don't assume there are no copies of the original ROI in the starting genome
  • Check that the end is physically located after the start when an ROI is given
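
The start/end check in the last item could look roughly like this. It is a sketch under assumptions: coordinates are taken to be 1-based and inclusive, "after" is read as strictly greater, and `validate_roi` and `seq_length` are hypothetical names rather than identifiers from the branch.

```python
def validate_roi(start: int, end: int, seq_length: int) -> None:
    """Raise ValueError if the region of interest is not a valid interval.

    Assumes 1-based inclusive coordinates on a sequence of `seq_length` bases.
    """
    if start < 1:
        raise ValueError(f"ROI start {start} is before the first base")
    if end > seq_length:
        raise ValueError(f"ROI end {end} is past the sequence end ({seq_length})")
    if end <= start:
        # Assumption: the end must lie strictly after the start
        raise ValueError(f"ROI end {end} is not located after start {start}")
```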

TODO after commit 51fc3d7f (where implementations and effects have been discussed with Sandra):

  • Implement a BLAST wrapper around the results to produce an optional links file showing the BLAST results
  • Split the --max-distance parameter into --max-kmer-distance and --max-block-distance parameters, with a lower threshold for the former
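
To illustrate why two separate distance thresholds can be useful, the sketch below groups k-mer hit positions into blocks with a small gap limit, drops blocks with too few k-mers, and then chains the surviving blocks with a larger gap limit. This is only one possible reading of the parameters in this TODO list, not the algorithm in the branch; the function names and the `min_kmers_per_block` argument are hypothetical.

```python
from typing import Iterable, List, Tuple

# (first position, last position, number of k-mer hits in the block)
Block = Tuple[int, int, int]


def build_blocks(positions: Iterable[int], max_kmer_distance: int) -> List[Block]:
    """Group hit positions whose consecutive gaps are <= max_kmer_distance."""
    blocks: List[Block] = []
    for pos in sorted(positions):
        if blocks and pos - blocks[-1][1] <= max_kmer_distance:
            first, _, count = blocks[-1]
            blocks[-1] = (first, pos, count + 1)
        else:
            blocks.append((pos, pos, 1))
    return blocks


def chain_blocks(blocks: Iterable[Block], max_block_distance: int) -> List[Tuple[int, int]]:
    """Merge blocks whose gap to the previous region is <= max_block_distance."""
    regions: List[Tuple[int, int]] = []
    for first, last, _ in sorted(blocks):
        if regions and first - regions[-1][1] <= max_block_distance:
            regions[-1] = (regions[-1][0], max(regions[-1][1], last))
        else:
            regions.append((first, last))
    return regions


def find_regions(
    positions: Iterable[int],
    max_kmer_distance: int,
    max_block_distance: int,
    min_kmers_per_block: int = 1,
) -> List[Tuple[int, int]]:
    """Two-stage grouping: tight k-mer blocks first, then looser chaining."""
    blocks = build_blocks(positions, max_kmer_distance)
    kept = [b for b in blocks if b[2] >= min_kmers_per_block]
    return chain_blocks(kept, max_block_distance)
```

Without the filter between the two stages, the result would be the same as a single pass with the larger threshold, so the split only changes the outcome in combination with something like a minimum number of k-mers per block.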
Edited by Workum, Dirk-Jan van
