functional annotations are not linked with mRNA nodes for `add_functions` command
Background:
- Bacterial genomes annotated with Prokka have CDS as the top-level feature in the annotation GFF file.
- For such features, the PanTools GFF parser creates gene, mRNA and exon nodes and adds the `-gene`, `-mRNA` and `-exon` suffixes, respectively, when storing them in the Neo4j database.
- This ensures a unique identifier for the different features of a protein in the Neo4j database.
Issue: InterProScan works on the original GFF file and uses the feature identifiers from that GFF in its output functional annotation files (stored as GFF). When PanTools `add_functions` is run, it cannot find matching identifiers in the Neo4j database, because suffixes were added to the feature identifiers wherever the top-level feature is not a gene.
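The mismatch can be illustrated with a minimal Python sketch. It is not PanTools code: `current_node_ids` and `find_node` are hypothetical helpers that only mimic the suffixing described above, and a plain dictionary stands in for the Neo4j lookup.

```python
from typing import Dict, Optional

# Suffixes the issue describes for a feature whose top-level type is CDS.
CURRENT_SUFFIXES = {"gene": "-gene", "mRNA": "-mRNA", "exon": "-exon"}


def current_node_ids(feature_id: str) -> Dict[str, str]:
    """Hypothetical helper: identifiers as currently stored for a top-level CDS."""
    return {node: feature_id + suffix for node, suffix in CURRENT_SUFFIXES.items()}


def find_node(stored_ids: Dict[str, str], query_id: str) -> Optional[str]:
    """Return the node type whose stored identifier equals query_id, if any."""
    for node, stored in stored_ids.items():
        if stored == query_id:
            return node
    return None


stored = current_node_ids("C1234")
print(find_node(stored, "C1234"))       # None: the ID InterProScan reports is not stored
print(find_node(stored, "C1234-mRNA"))  # mRNA: only the suffixed ID matches
```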
Proposed fix: Change the code so that the mRNA node always keeps the identifier of the top-level feature exactly as it appears in the GFF file, irrespective of the top-level feature type. The protein identifier then always matches the original annotation file, so any information keyed on that identifier, including functional annotations generated by tools like InterProScan, can be linked to it. For example:
- If CDS “C1234” is the top-level feature: mRNA node: “C1234”; gene node: “C1234-gene”; CDS node: “C1234-cds”; exon node: “C1234-exon”
- If exon “E1234” is the top-level feature: mRNA node: “E1234”; gene node: “E1234-gene”; exon node: “E1234-exon”
- If mRNA “M1234” is the top-level feature: mRNA node: “M1234”; gene node: “M1234-gene”
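
The naming scheme in the list above could be sketched as follows. This is a Python illustration under stated assumptions, not the actual PanTools implementation: `proposed_node_ids` and `PROPOSED_SUFFIXES` are hypothetical names, and top-level types not listed above (such as gene) are rejected because this issue does not specify them.

```python
from typing import Dict

# Suffixes for the nodes derived from each top-level feature type, as listed
# in the proposal. The mRNA node is absent here because it keeps the bare ID.
PROPOSED_SUFFIXES = {
    "CDS": {"gene": "-gene", "CDS": "-cds", "exon": "-exon"},
    "exon": {"gene": "-gene", "exon": "-exon"},
    "mRNA": {"gene": "-gene"},
}


def proposed_node_ids(top_level_type: str, feature_id: str) -> Dict[str, str]:
    """Hypothetical helper: node identifiers under the proposed scheme."""
    if top_level_type not in PROPOSED_SUFFIXES:
        # Other top-level types are outside the cases covered by the proposal.
        raise ValueError(f"unsupported top-level feature type: {top_level_type}")
    ids = {"mRNA": feature_id}  # mRNA node always uses the unmodified GFF ID
    for node, suffix in PROPOSED_SUFFIXES[top_level_type].items():
        ids[node] = feature_id + suffix
    return ids


ids = proposed_node_ids("CDS", "C1234")
print(ids["mRNA"])  # C1234, the same ID InterProScan writes to its output
print(ids["gene"])  # C1234-gene
```

With this scheme, an identifier taken from the InterProScan output matches the mRNA node directly, while the derived nodes remain unique through their suffixes.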
Edited by Pardeshi, Lakhansing