Add (sub)graph export (AKA region of interest)

Workum, Dirk-Jan van requested to merge add_gfa_export into develop Mar 29, 2023

NB: Still under active development (this branch is a small side project of mine).

This merge request adds a subcommand to retrieve the cDBG (compacted de Bruijn graph) from PanTools in GFA format (and possibly other formats).

TODO:

Check accuracy of GFA v1 output
Whole pangenome export
- Decide on subcommand name
- Decide which output formats should be supported (only GFA, which is slow)
- Check speed on large pangenomes
~~Add subcommand for building nucleotide layer from existing graph (GFA v1 format)~~
- => edit: to be done with !198
Add subcommand for extracting a subgraph in GFA format, including annotations for Bandage
- Get separate subcommand for regions only
- Define outputs for region (see below for implementation status)
Write all output formats
- GFAv1
- Include Bandage annotation CSV for outputs
- FASTA for each genome
- GFF3 for each genome
- PAV for each homology group
- PAV for each kmer/node
- Collinearity file (/visualization)
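
For reference while checking the GFA v1 output, here is a minimal Python sketch of the record types involved (header, segment, link, path). The helper `write_gfa`, the node names and the value k = 4 are hypothetical and not taken from PanTools. The sketch assumes that adjacent cDBG nodes overlap by k-1 bases.

```python
import io


def write_gfa(segments, links, paths, k, out):
    """Write a graph as GFA v1.

    segments: dict of node name -> sequence
    links:    list of (from_name, from_orient, to_name, to_orient)
    paths:    dict of path name -> list of (node name, orient)
    k:        k-mer size; adjacent nodes are ASSUMED to overlap by k-1 bases
    """
    overlap = f"{k - 1}M"  # overlap between linked segments as a CIGAR string
    out.write("H\tVN:Z:1.0\n")
    for name, seq in segments.items():
        out.write(f"S\t{name}\t{seq}\n")
    for src, src_orient, dst, dst_orient in links:
        out.write(f"L\t{src}\t{src_orient}\t{dst}\t{dst_orient}\t{overlap}\n")
    for name, steps in paths.items():
        walk = ",".join(f"{node}{orient}" for node, orient in steps)
        # '*' leaves the per-step overlaps of the path unspecified
        out.write(f"P\t{name}\t{walk}\t*\n")


def example_gfa():
    # Hypothetical toy graph: two genomes sharing node n1, then diverging.
    segments = {"n1": "ACGTA", "n2": "GTAC", "n3": "GTAG"}
    links = [("n1", "+", "n2", "+"), ("n1", "+", "n3", "+")]
    paths = {
        "genome1_seq1": [("n1", "+"), ("n2", "+")],
        "genome2_seq1": [("n1", "+"), ("n3", "+")],
    }
    buf = io.StringIO()
    write_gfa(segments, links, paths, k=4, out=buf)
    return buf.getvalue()
```

Writing one path per genome sequence keeps the genome-to-node relation in the file itself, which is also what a per-node PAV table would be derived from.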

TODO after commit c565cb45 (where the 'novel' algorithm, which is a combination of kmer and alignment, has been implemented and tested):

Add parameter for minimal number of kmers in a block for the 'novel' algorithm
Make 'novel' algorithm default and rename to more sensible name
Remove other algorithms
Create homology based search using the 'novel' algorithm
~~Use simple (NJ?) clustering on kmer PAV for ordering the output~~
- => edit: too difficult, and better left to whichever visualisation tool ends up being used
Add new parameter --flanking to add additional flanking sequence after the ROI finding algorithm
Clean up unused code
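
One possible reading of the --flanking item above, as a Python sketch: extend each found region by a fixed number of bases on both sides and clamp to the sequence boundaries. The function name and the 1-based, inclusive coordinate convention are assumptions, not the implementation in this branch.

```python
def add_flanking(start, end, flanking, seq_length):
    """Extend a region by `flanking` bases on each side.

    Coordinates are ASSUMED to be 1-based and inclusive. The result is
    clamped so that it never runs past either end of the sequence.
    """
    if flanking < 0:
        raise ValueError("flanking must be non-negative")
    if not 1 <= start <= end <= seq_length:
        raise ValueError("region must lie within the sequence")
    return max(1, start - flanking), min(seq_length, end + flanking)
```

For example, `add_flanking(100, 200, 50, 1000)` gives `(50, 250)`, while a region close to either end is cut off at position 1 or at the sequence length.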

TODO after commit 3bf43792 (where code has been discussed with both Sandra and Robin in person):

Let extract_region find not only unique regions but also possibly duplicated regions, by adding a parameter for the allowed frequency of found kmers/homology groups
Increase speed of GFA writing by switching from the neo4j database to the kmer database
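
A rough Python sketch of what such a frequency parameter could look like: ROI k-mers are kept as anchors only when they occur at most `max_frequency` times in the target sequence. All names here are hypothetical, and the sketch ignores reverse complements for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def anchor_kmers(roi_seq, target_seq, k, max_frequency=1):
    """Return the distinct ROI k-mers found 1..max_frequency times in target.

    max_frequency=1 keeps only k-mers that are unique in the target;
    higher values also allow k-mers from duplicated regions.
    """
    counts = Counter(kmers(target_seq, k))
    distinct_roi_kmers = dict.fromkeys(kmers(roi_seq, k))  # keeps order
    return [km for km in distinct_roi_kmers if 1 <= counts[km] <= max_frequency]
```

With `max_frequency=1` a region that is duplicated in the target yields no anchors at all, which is the behaviour the item above wants to relax.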

TODO after commit e486e186 (where code has been discussed with Dick, Eric and Sandra):

~~Investigate use of dynamic programming to generalise more~~
- => edit: it seems to me that the currently implemented parameters, with the possible combination of k-mer and homology, suffice; maybe DP later
Implement --include and --exclude parameters
Add warning when no homology grouping is present and handle that case in the code accordingly
Make GFA output optional (even for shorter regions this file can be quite confusing)
Don't assume there are no copies of the original ROI in the starting genome
Check that end is physically located after start when giving a ROI
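
Two of the checks above are simple enough to sketch in Python. The function names are hypothetical, and treating the include and exclude lists as mutually exclusive is an assumption made for this sketch only.

```python
def validate_roi(start, end):
    """Reject a region whose end is not located after its start."""
    if start < 1:
        raise ValueError(f"ROI start ({start}) must be at least 1")
    if end <= start:
        raise ValueError(f"ROI end ({end}) must be located after start ({start})")


def select_genomes(all_genomes, include=None, exclude=None):
    """Apply an include or an exclude list to the genomes in the pangenome.

    ASSUMPTION: the two lists are mutually exclusive; with neither given,
    all genomes are selected.
    """
    if include and exclude:
        raise ValueError("use either include or exclude, not both")
    if include:
        wanted = set(include)
        return [g for g in all_genomes if g in wanted]
    if exclude:
        unwanted = set(exclude)
        return [g for g in all_genomes if g not in unwanted]
    return list(all_genomes)
```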

TODO after commit 51fc3d7f (where implementations and effects have been discussed with Sandra):

Implement a BLAST wrapper around the results to produce an optional links file showing the BLAST results
Split the --max-distance parameter into a --max-kmer-distance and a --max-block-distance parameter, with a lower threshold for the former
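
To illustrate why two thresholds could help, here is a Python sketch of a two-pass chaining: k-mer hits are first grouped into blocks using a small distance, then blocks are merged into regions using a larger one. This is one way to read the item above, not the algorithm implemented in this branch. The function name and the `min_kmers` argument (loosely modelled on the "minimal number of kmers in a block" item) are hypothetical.

```python
def chain_hits(positions, max_kmer_distance, max_block_distance, min_kmers=1):
    """Chain k-mer hit positions into (start, end) regions in two passes."""
    # Pass 1: consecutive hits at most max_kmer_distance apart form a block.
    blocks = []
    for pos in sorted(positions):
        if blocks and pos - blocks[-1][-1] <= max_kmer_distance:
            blocks[-1].append(pos)
        else:
            blocks.append([pos])

    # Drop blocks supported by too few k-mer hits.
    blocks = [block for block in blocks if len(block) >= min_kmers]

    # Pass 2: blocks at most max_block_distance apart are merged into a region.
    regions = []
    for block in blocks:
        start, end = block[0], block[-1]
        if regions and start - regions[-1][1] <= max_block_distance:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

With hits at `[1, 2, 3, 50, 51, 52, 500, 1000, 1001]`, `max_kmer_distance=5`, `max_block_distance=100` and `min_kmers=2`, the isolated hit at 500 is discarded, the first two blocks merge, and the distant block stays separate.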

Edited Aug 16, 2024 by Workum, Dirk-Jan van
