Add variation (SNPs, InDels and PAVs) to PanTools

Workum, Dirk-Jan van requested to merge include_variation into develop

This merge request describes all changes needed to add the subcommands add_variants, remove_variants, add_pavs and remove_pavs. All of these new subcommands are implemented in the Variation.java class as a form of "variation layer". Importantly, these new functionalities affect downstream subcommands that rely on mRNA nodes: msa, core_phylogeny, consensus_tree, gene_classification and pangenome_structure. Each of these subcommands has gained a --variation/-v flag to make use of the variation information.

The basic idea of this layer is implemented in the constructor of Variation.java: if any variation information has been added, it is described in one "accessionNode" per accession/strain/cultivar/..., attached to the genome against which the variation was called. An accessionNode must have VCF properties, PAV properties, or both. The accessionNodes also contain all other relevant (metadata) information for these accessions. If there is an accessionNode for a given accession, there will also be mRNA nodes for this accession, each linked to the original mRNA node of the genome to which the accessionNode is connected. These new mRNA nodes carry an additional "variant_label" indicating that they belong to an additional accession rather than to a genome.
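
To make the constraint on accessionNodes concrete, here is a minimal in-memory sketch. It is not the code in Variation.java: the class name `AccessionNodeSketch`, its fields and the identifier format returned by `variantMrnaId` are all hypothetical, and the real nodes live in the graph database rather than in plain Java objects.

```java
// Hypothetical stand-in for an "accessionNode"; names and fields are assumptions.
final class AccessionNodeSketch {
    final String accession;  // accession/strain/cultivar identifier
    final int genomeNumber;  // genome against which the variation was called
    final boolean hasVcf;    // SNP/InDel information is present
    final boolean hasPav;    // presence/absence information is present

    AccessionNodeSketch(String accession, int genomeNumber, boolean hasVcf, boolean hasPav) {
        // An accession node must carry VCF properties, PAV properties, or both.
        if (!hasVcf && !hasPav) {
            throw new IllegalArgumentException(
                "accession '" + accession + "' has neither VCF nor PAV information");
        }
        this.accession = accession;
        this.genomeNumber = genomeNumber;
        this.hasVcf = hasVcf;
        this.hasPav = hasPav;
    }

    // Illustrative identifier for the variant mRNA node that is linked to a
    // reference mRNA node; the actual naming scheme is not described here.
    String variantMrnaId(String referenceMrnaId) {
        return referenceMrnaId + "|" + accession;
    }
}
```

Rejecting an empty node in the constructor keeps the "VCF, PAV or both" rule in one place, so downstream code can assume every accession carries at least one kind of variation.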

For SNPs and InDels, variation is added by providing a VCF file to add_variants. The VCF file is processed in parallel to extract a consensus sequence for each feature node, using bcftools. NB: Not all SNPs and InDels are stored in the database, only those located within an annotated feature. The consensus sequence is stored in each "variant_label" mRNA node per accession.
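
The two steps in this paragraph (keeping only variants inside annotated features, then calling bcftools for a consensus) could look roughly like the sketch below. The class and method names are hypothetical, and how PanTools actually invokes bcftools is not shown in this description. Treating "located within a feature" as "the reference allele overlaps the feature" is also an assumption. The options `-f` (reference FASTA), `-s` (sample) and `-o` (output) belong to `bcftools consensus`, which expects a bgzip-compressed, indexed VCF.

```java
import java.util.List;

// Hypothetical helper; not part of Variation.java.
final class ConsensusSketch {

    // True if a variant overlaps a feature. Coordinates are 1-based and
    // inclusive, as in VCF and GFF. Overlap (rather than full containment)
    // is an assumption made for this sketch.
    static boolean overlapsFeature(int variantPos, int refAlleleLength,
                                   int featureStart, int featureEnd) {
        int variantEnd = variantPos + refAlleleLength - 1;
        return variantPos <= featureEnd && variantEnd >= featureStart;
    }

    // Builds (but does not run) a bcftools consensus command for one sample.
    // The result can be passed to new ProcessBuilder(command).
    static List<String> consensusCommand(String referenceFasta, String vcfGz,
                                         String sample, String outFasta) {
        return List.of("bcftools", "consensus",
                       "-f", referenceFasta,
                       "-s", sample,
                       "-o", outFasta,
                       vcfGz);
    }
}
```

Building the command as an argument list instead of a single shell string avoids quoting problems with file paths and makes it straightforward to run one process per accession in parallel.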

For PAVs, the presence or absence is added as a property to all "variant_label" mRNA nodes per accession.

Finally, adding and removing PAVs works for both pangenomes and panproteomes, whereas adding and removing SNP/InDel information is only possible for pangenomes because it relies on a genome sequence.
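
The restriction above can be summarised as a small guard. This is an illustrative sketch only: `DatabaseType` and `SubcommandGuardSketch` are hypothetical names, and the actual check in PanTools may be implemented differently.

```java
// Hypothetical types; only the subcommand names come from this merge request.
enum DatabaseType { PANGENOME, PANPROTEOME }

final class SubcommandGuardSketch {

    // Returns whether a variation subcommand can run on the given database type.
    static boolean isSupported(String subcommand, DatabaseType type) {
        switch (subcommand) {
            case "add_pavs":
            case "remove_pavs":
                // PAVs do not need a genome sequence.
                return true;
            case "add_variants":
            case "remove_variants":
                // SNP/InDel consensus sequences require a genome sequence.
                return type == DatabaseType.PANGENOME;
            default:
                throw new IllegalArgumentException("not a variation subcommand: " + subcommand);
        }
    }
}
```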

TODO:

  • Check and double-check that there are no breaking changes to develop.
  • Add novel functionalities to documentation.
  • Discuss strategy for adding variation (SNP/InDel) that is not part of annotated features to the graph.
  • Restrict bcftools version in conda YAML files based on possible breaking changes in bcftools consensus.
  • Add parameter to keep temporary files.
  • Extensively check all possible downstream functionalities for compatibility.
  • Fix group subcommand if run after add_variants.
Edited by Workum, Dirk-Jan van