Limit KMC memory
According to KMC's help message, it uses 12 GB of memory by default. This is a problem on smaller machines and in Docker containers with a lower memory limit: PanTools errors out during build_pangenome without any helpful message to the user:
$ docker run -e JAVA_ARGS="-Xmx4g" --ulimit nofile=1048576 --ulimit memlock=990456000 --ulimit core=-1 --memory 8g --cpus 2 --mount type=bind,source="$PWD/data",target=/data/ docker-registry.wur.nl/bioinformatics/pantools:75b224d6 build_pangenome /data/databases/yeast-4 /data/genomes/yeast-4/genomes.txt
[picocli WARN] defaults configuration file /opt/pantools/target/classes/.Pantools.properties does not exist or is not readable
12:51:08 [INFO ] Usage: pantools build_pangenome /data/databases/yeast-4 /data/genomes/yeast-4/genomes.txt
Constructing the pangenome graph database
Checking /data/genomes/yeast-4/GCA_000167035.1_ASM16703v1_genomic.fasta ...
Checking /data/genomes/yeast-4/GCA_000256765.1_Saccharomyces_kudriavzevii_strain_FM1066_v1.0_genomic.fasta ...
Checking /data/genomes/yeast-4/GCF_000146045.2_R64_genomic.fasta ...
Checking /data/genomes/yeast-4/GCF_001298625.1_SEUB3.0_genomic.fasta ...
Reading /data/genomes/yeast-4/GCA_000167035.1_ASM16703v1_genomic.fasta ...
Reading /data/genomes/yeast-4/GCA_000256765.1_Saccharomyces_kudriavzevii_strain_FM1066_v1.0_genomic.fasta ...
Reading /data/genomes/yeast-4/GCF_000146045.2_R64_genomic.fasta ...
Reading /data/genomes/yeast-4/GCF_001298625.1_SEUB3.0_genomic.fasta ...
Creating index in /data/databases/yeast-4//databases/index.db/
K = 15
No kmc index found in /data/databases/yeast-4//databases/index.db/
Running the KMC invocation with strace reveals mmap failing on insufficient memory:
$ strace kmc -cs127 -k15 -t1 -ci1 -fm @/data/genomes/chloroplasts/genomes.txt kmers .
...
mmap(NULL, 805310464, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3fa1370000
mmap(NULL, 9543290880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x23b65b000) = 0x2915000
mmap(NULL, 9543421952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7f3fd15801f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid() = 270
gettid() = 270
tgkill(270, 270, SIGABRT) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=270, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)
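One way a wrapper could turn this abort into a readable message is to inspect the exit status of the KMC child process. The following is a minimal Python sketch, not PanTools code (PanTools itself is written in Java); the names explain_kmc_exit and run_kmc and the message wording are hypothetical. It relies on Python's subprocess module reporting a child killed by signal N as return code -N, so the SIGABRT in the trace above would appear as a negative return code.

```python
import signal
import subprocess
from typing import List, Optional


def explain_kmc_exit(returncode: int) -> Optional[str]:
    """Return a user-facing hint for a failed KMC run, or None on success.

    Hypothetical helper: the name and messages are illustrative only.
    """
    if returncode == 0:
        return None
    # subprocess reports "killed by signal N" as the return code -N.
    if returncode == -signal.SIGABRT:
        return ("KMC was aborted (SIGABRT). This can happen when it cannot "
                "allocate enough memory; try a lower KMC memory limit (-m).")
    return f"KMC failed with exit status {returncode}."


def run_kmc(command: List[str]) -> None:
    """Run KMC and raise an error with a readable hint if it fails."""
    result = subprocess.run(command)
    hint = explain_kmc_exit(result.returncode)
    if hint is not None:
        raise RuntimeError(hint)
```

An abort is not proof of a memory problem, so the hint is worded as a suggestion rather than a diagnosis.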
With 2 GB of memory (-m2), KMC runs successfully:
$ kmc -m2 -cs127 -k15 -t1 -ci1 -fm @/data/genomes/chloroplasts/genomes.txt kmers .
******
Stage 1: 100%
Stage 2: 100%
1st stage: 0.31078s
2nd stage: 0.15981s
Total : 0.47059s
Tmp size : 1MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 394168
No. of unique counted k-mers : 394168
Total no. of k-mers : 740860
Total no. of sequences : 5
Total no. of super-k-mers : 189852
Add an option to the build_pangenome command that sets KMC's memory limit, with a sensible default (2 GB?), and document suitable values for larger pangenomes.
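As a rough illustration of the proposal, the sketch below builds the KMC command line shown earlier in this issue with the memory limit as a parameter. It is written in Python for brevity and is not PanTools code; the function name build_kmc_command, its parameters, and the 2 GB default are assumptions taken from the suggestion above. Only flags that already appear in this issue (-m, -cs, -k, -t, -ci, -fm) are used, and the lower bound of 1 GB is an assumption to check against the help message of the installed KMC version.

```python
import shlex
from typing import List

# Hypothetical default: the 2 GB value floated in this issue.
DEFAULT_KMC_MEMORY_GB = 2


def build_kmc_command(genomes_file: str, output_prefix: str, work_dir: str,
                      k: int = 15, threads: int = 1,
                      memory_gb: int = DEFAULT_KMC_MEMORY_GB) -> List[str]:
    """Build the KMC argument list with an explicit memory limit (-m, in GB)."""
    # Assumed lower bound; verify the accepted range with `kmc` help output.
    if memory_gb < 1:
        raise ValueError("memory_gb must be at least 1")
    return [
        "kmc",
        f"-m{memory_gb}",     # memory limit in GB
        "-cs127",
        f"-k{k}",             # k-mer length
        f"-t{threads}",       # number of threads
        "-ci1",
        "-fm",                # multi-FASTA input
        f"@{genomes_file}",   # file listing the input genomes
        output_prefix,
        work_dir,
    ]


if __name__ == "__main__":
    cmd = build_kmc_command("/data/genomes/chloroplasts/genomes.txt", "kmers", ".")
    print(shlex.join(cmd))
```

With the defaults, this reproduces the successful -m2 invocation above; a larger pangenome would pass a higher memory_gb value.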
Edited by Moed, Matthijs