Fix distinct count kmers in kmer_classification for subsets of genomes
We noticed that pantools kmer_classification
did not report the correct number of distinct kmers for subsets of genomes.
This bug was that the skip_array was completely ignored for distinct kmers and only used for the total.
This merge request fixes that by using the already present counter for each kmer and checks if it is larger than 0 after taking the skip_array into account.
Evidence that the counting is correct now
KMC counting of kmers
This is the kmc
output for a random set of four small sequences:
% kmc -cs127 -k13 -ci1 -fm @genome_locations.txt genome .
Stage 1: 100%
1st stage: 0.255s
2nd stage: 1.07611s
Total : 1.33112s
Tmp size : 0MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 29862
No. of unique counted k-mers : 29862
Total no. of k-mers : 119423
Total no. of sequences : 4
Total no. of super-k-mers : 0
And this is the kmc
output for three small sequences (removing the fourth sequence from the one above):
% kmc -cs127 -k13 -ci1 -fm @subset_locations.txt subset .
Stage 1: 100%
1st stage: 0.69278s
2nd stage: 1.11376s
Total : 1.80654s
Tmp size : 0MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 29836
No. of unique counted k-mers : 29836
Total no. of k-mers : 89583
Total no. of sequences : 3
Total no. of super-k-mers : 0
PanTools kmer counting (current develop)
When running PanTools with pantools kmer_classification tmp_DB
, we get this:
% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored
Total k-mers: 119423
Total degenerate k-mers: 0
Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29701 (99.46%)
Accessory: 110 (0.37%)
And for the same subset as with kmc
, we run PanTools like so: pantools kmer_classification -e tmp_DB
and we get this:
% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored
Total k-mers: 89583
Total degenerate k-mers: 0
Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29727 (99.55%)
Accessory: 84 (0.28%)
PanTools kmer counting (this merge request)
When running PanTools with pantools kmer_classification tmp_DB
, we get this:
% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored
Total k-mers: 119423
Total degenerate k-mers: 0
Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29701 (99.46%)
Accessory: 110 (0.37%)
And for the same subset as with kmc
, we run PanTools like so: pantools kmer_classification -e tmp_DB
and we get this:
% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored
Total k-mers: 89583
Total degenerate k-mers: 0
Distinct k-mers: 29836
Distinct degenerate k-mers: 0
Core: 29727 (99.55%)
Accessory: 84 (0.28%)
These final numbers are identical to the kmc
output.