
Limit KMC memory

According to KMC's help message, its memory limit defaults to 12 GB. This is a problem on smaller machines and in memory-limited Docker containers: PanTools fails during build_pangenome without any helpful message to the user. In the run below (container limited to 8 GB), the output simply stops after the "No kmc index found" line:

$ docker run -e JAVA_ARGS="-Xmx4g" --ulimit nofile=1048576 --ulimit memlock=990456000 --ulimit core=-1 --memory 8g --cpus 2 --mount type=bind,source="$PWD/data",target=/data/ docker-registry.wur.nl/bioinformatics/pantools:75b224d6 build_pangenome /data/databases/yeast-4 /data/genomes/yeast-4/genomes.txt
[picocli WARN] defaults configuration file /opt/pantools/target/classes/.Pantools.properties does not exist or is not readable
12:51:08 [INFO ] Usage: pantools build_pangenome /data/databases/yeast-4 /data/genomes/yeast-4/genomes.txt

Constructing the pangenome graph database

Checking /data/genomes/yeast-4/GCA_000167035.1_ASM16703v1_genomic.fasta ...
Checking /data/genomes/yeast-4/GCA_000256765.1_Saccharomyces_kudriavzevii_strain_FM1066_v1.0_genomic.fasta ...
Checking /data/genomes/yeast-4/GCF_000146045.2_R64_genomic.fasta ...
Checking /data/genomes/yeast-4/GCF_001298625.1_SEUB3.0_genomic.fasta ...
Reading /data/genomes/yeast-4/GCA_000167035.1_ASM16703v1_genomic.fasta ...
Reading /data/genomes/yeast-4/GCA_000256765.1_Saccharomyces_kudriavzevii_strain_FM1066_v1.0_genomic.fasta ...
Reading /data/genomes/yeast-4/GCF_000146045.2_R64_genomic.fasta ...
Reading /data/genomes/yeast-4/GCF_001298625.1_SEUB3.0_genomic.fasta ...
Creating index in /data/databases/yeast-4//databases/index.db/
K = 15
No kmc index found in /data/databases/yeast-4//databases/index.db/

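Running the KMC invocation under strace (here on a smaller chloroplast data set) shows mmap failing with ENOMEM when KMC tries to allocate about 9.5 GB, after which the process aborts: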

$ strace kmc -cs127 -k15 -t1 -ci1 -fm @/data/genomes/chloroplasts/genomes.txt kmers .
...
mmap(NULL, 805310464, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3fa1370000
mmap(NULL, 9543290880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x23b65b000)                        = 0x2915000
mmap(NULL, 9543421952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7f3fd15801f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 270
gettid()                                = 270
tgkill(270, 270, SIGABRT)               = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=270, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)

With the memory limit set to 2 GB (-m2), the same KMC invocation runs successfully:

$ kmc -m2 -cs127 -k15 -t1 -ci1 -fm @/data/genomes/chloroplasts/genomes.txt kmers .
******
Stage 1: 100%
Stage 2: 100%
1st stage: 0.31078s
2nd stage: 0.15981s
Total    : 0.47059s
Tmp size : 1MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :       394168
   No. of unique counted k-mers       :       394168
   Total no. of k-mers                :       740860
   Total no. of sequences             :            5
   Total no. of super-k-mers          :       189852

Proposal: add an option to the build_pangenome command that sets KMC's memory limit, with a sensible default (2 GB?), and document suitable values for larger pangenomes.
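
A minimal sketch of how such an option could be passed through to KMC, assuming the KMC command line is assembled as a list of arguments before being handed to something like ProcessBuilder. The class name KmcCommand, the method build, and the constant DEFAULT_MEMORY_GB are hypothetical and not taken from the PanTools code base. The accepted range of 1 to 1024 GB is an assumption based on KMC's help text and should be checked against the bundled KMC version. The other flags are copied from the invocation shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a KMC command line with an explicit memory limit. */
public final class KmcCommand {
    /** Default suggested in this issue (in GB); still open for discussion. */
    public static final int DEFAULT_MEMORY_GB = 2;

    private KmcCommand() {}

    /**
     * Builds the argument list for KMC.
     *
     * @param memoryGb   value for KMC's -m flag, in GB
     * @param k          k-mer length (-k)
     * @param threads    number of threads (-t)
     * @param genomeList file listing the input FASTA files (passed as @file)
     * @param output     output file name prefix
     * @param workDir    working directory for KMC's temporary files
     */
    public static List<String> build(int memoryGb, int k, int threads,
                                     String genomeList, String output, String workDir) {
        // Assumption: KMC accepts 1 to 1024 GB for -m; verify against `kmc` help output.
        if (memoryGb < 1 || memoryGb > 1024) {
            throw new IllegalArgumentException(
                    "KMC memory limit must be between 1 and 1024 GB, got " + memoryGb);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("kmc");
        cmd.add("-m" + memoryGb); // the new, user-configurable memory limit
        cmd.add("-cs127");
        cmd.add("-k" + k);
        cmd.add("-t" + threads);
        cmd.add("-ci1");
        cmd.add("-fm");
        cmd.add("@" + genomeList);
        cmd.add(output);
        cmd.add(workDir);
        return cmd;
    }

    public static void main(String[] args) {
        // Reproduces the successful -m2 invocation from this issue.
        List<String> cmd = build(DEFAULT_MEMORY_GB, 15, 1,
                "/data/genomes/chloroplasts/genomes.txt", "kmers", ".");
        System.out.println(String.join(" ", cmd));
    }
}
```

Validating the value up front would at least give users a readable error instead of a silent failure. The returned list could be passed to new ProcessBuilder(cmd), and the user-facing option could be exposed through picocli, which build_pangenome already appears to use.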

Edited by Moed, Matthijs