
Fixed inconsistency in homology grouping

Workum, Dirk-Jan van requested to merge fix_inconsistent_homology_grouping into develop

This merge request makes homology grouping consistent. The bug was discovered while working on !221, where we found that the order of proteins in the panproteome affected the final homology grouping. As this is clearly a bug, I had to fix the following things:

  • Don't assume that only proteins with a higher ID might have intersections. Because of this, I had to introduce a hash set of intersections so that duplicates are not counted.
  • Fix the counter by starting the check for intersections at i=1 instead of i=0
  • For proteins of equal length, decide which one is "smaller" lexicographically instead of using the protein order
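
To illustrate the first and third fix, here is a minimal Python sketch (not the actual implementation in this merge request). It assumes intersections are shared k-mers, and that "lexicographically smaller" compares the sequence first and the protein ID second; both are assumptions, and `kmers`, `smaller_first` and `intersecting_pairs` are hypothetical names. Every protein is checked against all others, not only those with a higher ID, and a set of canonical pairs keeps duplicates from being counted.

```python
from typing import Dict, Set, Tuple

Protein = Tuple[str, str]  # (protein ID, amino acid sequence)


def kmers(seq: str, k: int) -> Set[str]:
    """Return all distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def smaller_first(a: Protein, b: Protein) -> Tuple[Protein, Protein]:
    """Order two proteins by length; ties are broken lexicographically
    (sequence first, then ID; an assumption), never by input position."""
    def key(p: Protein) -> Tuple[int, str, str]:
        return (len(p[1]), p[1], p[0])
    return (a, b) if key(a) <= key(b) else (b, a)


def intersecting_pairs(proteins: Dict[str, str], k: int = 3) -> Set[Tuple[str, str]]:
    """Collect every protein pair sharing at least one k-mer, as (smaller, larger)."""
    # Index: k-mer -> IDs of all proteins containing it.
    index: Dict[str, Set[str]] = {}
    for pid, seq in proteins.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(pid)

    pairs: Set[Tuple[str, str]] = set()  # the set drops duplicate pairs
    for pid, seq in proteins.items():
        for kmer in kmers(seq, k):
            # Look at all other proteins, not only those with a higher ID.
            for other in index[kmer]:
                if other == pid:
                    continue
                first, second = smaller_first((pid, seq), (other, proteins[other]))
                pairs.add((first[0], second[0]))
    return pairs
```

Because each pair is stored in its canonical (smaller, larger) form, iterating over the proteins in a different order yields the same set.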

Still to do in this merge request:

  • Fix finding intersections
  • Fix calculating similarities
  • Fix building the homology groups from the similarity queue (see MCL inconsistency below)
  • Test consistency on a number of different orders in different panproteomes
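
The consistency test in the last item could look roughly like the sketch below. It is an assumption about how such a check might be set up, not the test used in this merge request: `group_fn` is a hypothetical stand-in for the real homology grouping, taking an ordered mapping of protein ID to sequence and returning the groups as a set of frozensets of IDs.

```python
import random
from typing import Callable, Dict, FrozenSet, Set

Grouping = Set[FrozenSet[str]]


def is_order_independent(
    proteins: Dict[str, str],
    group_fn: Callable[[Dict[str, str]], Grouping],
    n_orders: int = 20,
    seed: int = 0,
) -> bool:
    """Return True if group_fn gives the same groups for n_orders shuffled inputs."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    ids = list(proteins)
    reference = group_fn(dict(proteins))
    for _ in range(n_orders):
        rng.shuffle(ids)
        # Rebuild the mapping in the shuffled order (dicts keep insertion order).
        shuffled = {pid: proteins[pid] for pid in ids}
        if group_fn(shuffled) != reference:
            return False
    return True
```

Comparing sets of frozensets ignores both the order of the groups and the order of the members within a group, so only a real change in group membership fails the check.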

Later:

  • Add an edge case protein pair based on this merge request to the end-to-end test set
Edited by Workum, Dirk-Jan van
