Compare revisions

ecd7b999 · d721060e · 2ca75bfc · 8ba840cb · 9d6a2200 · 1036f675
--- a/.gitignore
+++ b/.gitignore
@@ -117,6 +117,9 @@ fabric.properties
 # https://plugins.jetbrains.com/plugin/12206-codestream
 .idea/codestream.xml

+# Other
+.idea/copilot
+
 ### Maven ###
 target/
 pom.xml.tag
@@ -155,3 +158,6 @@ output

 # MacOS specific
 .DS_store
+
+# Development
+*.jfr
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
 All notable changes to Pantools will be documented in this file.

+## [UNRELEASED]
+
+### Added
+- Available resources are now added to the log files and validated for some processes (!224).
+- The possibility to analyse a specific region of interest across the pangenome (!147).
+- A GFA export function (not speed optimised, so not recommended for large pangenomes) (!147).
+
+### Changed
+- Renamed the developer function `export_pangenome` to `export_csv` (!147).
+
+## [4.3.1] - 08-12-2023
+
+### Changed
+- `pantools pangenome_structure` received a huge speed improvement and restored randomization (!216).
+
+### Fixed
+- Fixed a bug where variable/informative positions were incomplete, affecting PanVA instances (!212).
+- Fixed a bug that caused the pangenome growth curves to not show any randomization (!216).
+
 ## [4.3.0] - 23-11-2023

 ### Added

--- a/docs/source/analysis/phylogeny.rst
+++ b/docs/source/analysis/phylogeny.rst
@@ -620,8 +620,9 @@ PanTools.
     * - ``--phenotype``/``-p``
       - The phenotype used to rename the terminal nodes (leaves) of
         the selected tree.
-     * - ``--genome``/``--sequence``
-       - The phylogenetic tree contains genome numbers or sequence identifiers.
+     * - ``--genome``/``--sequence``/``--gene-tree``
+       - The phylogenetic tree contains genome numbers, sequence identifiers
+         or gene identifiers.
     * - ``--phasing``
       - Include phasing information to the terminal nodes (leaves) of the tree.


--- a/docs/source/analysis/region.rst
+++ b/docs/source/analysis/region.rst
+Region of interest
+==================
+
+After constructing a pangenome, a common question is "what does region X of
+genome A look like in genome B?" This question can be answered by obtaining the
+region of interest from the pangenome database using the functions described
+here.
+
+--------------
+
+Extract region
+--------------
+
+The function extract_region extracts a region of interest from the pangenome
+database. The underlying algorithm ...
+
+**TODO**: decide which algorithm to keep
+
+**Parameters**
+  .. list-table::
+     :widths: 30 70
+
+     * - <databaseDirectory>
+       - Path to the database root directory.
+
+**Options**
+  .. list-table::
+     :widths: 30 70
+
+     * - ``--genome-range``
+       - Required (unless ``--gene`` is used). The region of interest. Format:
+         <genome>#<sequence>:<start>-<end>. E.g.: "1#chromosome2:123000-234000".
+     * - ``--gene``
+       - Required (unless ``--genome-range`` is used). The name of a gene to
+         obtain the region for in other genomes.
+     * - ``--threads``/``-t``
+       - Number of parallel working threads, default is the number of cores
+         or 8, whichever is lower.
+     * - ``--output``/``-o``
+       - Directory for all output files. Has to be empty or non-existent.
+     * - ``--extraction-method``
+       - The method to use for graph extraction. Can currently be 'homology',
+         'alignment', 'novel' or 'kmer'. Default: kmer **TODO**
+     * - ``--num-kmer-samples``
+       - The number of kmer samples to use for the 'alignment' or 'novel'
+         method. Default: 100
+     * - ``--gfa``
+       - **TODO**
+
+**Example commands**
+  .. code:: bash
+
+     $ pantools extract_region --genome-range "1#chr1:100-1000" tomato_DB
+     $ pantools extract_region --output "roi_analysis_gene1" --gene "gene1" tomato_DB
+     $ pantools extract_region --output "roi_analysis" --gfa --extraction-method novel --num-kmer-samples 50 --genome-range "1#chr1:100-1000" tomato_DB
+
+**Output**
+  Output files are written to the specified output directory.
+
+  - **all.collinearity.bedpe**: A BEDPE file with collinear regions between the
+    extracted region in all genomes.
+  - **all.collinearity.m8**: A BLAST-like file (m8) with collinear regions
+    between the extracted region in all genomes.
+  - **all.collinearity.json**: A JSON file with collinear regions between the
+    extracted region in all genomes.
+  - **all.fai**: A FASTA index file for the extracted region in all genomes.
+  - **all.gff3**: A GFF3 file for the extracted region in all genomes.
+  - **hm_presence_absence.tsv**: A tab-separated file with the presence/absence
+    variation of the homology groups of all mRNAs in the extracted region.
+  - **kmer_presence_absence.tsv**: A tab-separated file with the
+    presence/absence variation of all k-mers in the extracted region.
+
+  - **TODO**: GFA (+CSV)
+
+  For each genome and sequence in the found region, the following files are
+  written:
+
+  - **<genomeNr>.fasta**: A FASTA file with the extracted region. The <genomeNr>
+    is the number of the genome in the database.
+  - **<genomeNr>.gff**: A GFF file with the extracted region. The <genomeNr> is
+    the number of the genome in the database.
+
+**Further analysis**
+  For visualization, the following example script can be used to plot the
+  extracted region of interest in all genomes. Please note that it requires
+  gggenomes, dplyr and stringr R packages.
+
+  .. code:: R
+
+      require(gggenomes)
+      require(dplyr)
+      require(stringr)
+
+      seqs <- read_seqs("all.fai") %>%
+        mutate(genome = str_split(seq_id, "_", simplify = TRUE)[, 1]) %>%
+        arrange(match(genome, c(1:9)))  # sort the first 9 genomes in order
+      genes <- read_gff3("all.gff3")
+      links <- read_links("all.collinearity.m8")
+
+      p <- gggenomes(
+        genes = genes[which(endsWith(genes$type, "RNA")),],  # only RNAs
+        seqs = seqs,
+        links = links,
+        infer_bin_id = genome
+      ) +
+        geom_seq() +
+        geom_gene(aes(fill=type)) +
+        geom_link() +
+        geom_seq_label() +
+        geom_gene_label(aes(label=feat_id), size=1)
+
+      ggsave("all.collinearity.pdf", plot = p,
+             width = 20,
+             height = 15)
+
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -10,8 +10,8 @@ project = 'PanTools'
 copyright = '2016, the PanTools team'
 author = 'Sandra Smit'

-release = '4.3.0'
-version = '4.3.0'
+release = '4.3.1'
+version = '4.3.1'

 # -- General configuration


--- a/docs/source/getting_started/install.rst
+++ b/docs/source/getting_started/install.rst
@@ -42,6 +42,15 @@ the tools.

   $ mamba create -n pantools pantools

+.. note::
+
+  If this doesn't work, you can try to install PanTools using the
+  following way:
+
+  .. substitution-code-block:: bash
+
+     $ mamba create -n pantools --strict-channel-priority -c conda-forge -c bioconda pantools=|ProjectVersion|
+
 Please make sure to activate the environment before using PanTools:

 .. substitution-code-block:: bash

--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -44,38 +44,22 @@ PanTools currently provides these functionalities:
 -  Gene classification
 -  Phylogenetic methods

-Requirements
------------
-
-  \ **Java Virtual Machine**\  version 1.8 or higher, Add path to the
-   java executable to your OS path environment variable.
-  \ **KMC**\ : A disk-based k-mer counter, After downloading the
-   appropriate version (linux, macos or windows), add path to the *kmc*
-   and *kmc_tools* executables to your OS path environment variable.
-  \ **MCL**\ : The Markov Clustering Algorithm, After downloading and
-   compiling the software, add path to the *mcl* executable to your OS
-   path environment variable.
-
-For installing and configuring all required software, please see our
-:doc:`/getting_started/install` page.
-
 Running the program
 -------------------

-Make sure you can run java on your command line. Then you can run PanTools from
-the command line by:
+Install PanTools as described in the :doc:`/getting_started/install` page.
+You can run PanTools from the command line using:

 .. substitution-code-block:: bash

-   $ java <JVM options> -jar pantools-|ProjectVersion|.jar <subcommand> <arguments>
+   $ pantools <JVM options> <subcommand> <arguments>

-| **Useful JVM options**
-| - **-server** : To optimize JIT compilations for higher performance
-| - **-Xmn(a number followed by m/g)** : Minimum heap size in mega/giga
-  bytes
-| - **-Xmx(a number followed by m/g)** : Maximum heap size in mega/giga
-  bytes
+Useful JVM options:

+-  **-Xms<heap size>[unit]** : Initial heap size in MB ('m') or GB ('g')
+-  **-Xmx<heap size>[unit]** : Maximum heap size in MB ('m') or GB ('g')
+-  **-XX:StartFlightRecording=filename=<name>.jfr,disk=true** : Enable
+   the Flight Recorder (for developers)

 Options
 -------
@@ -135,6 +119,7 @@ Contents
   analysis/msa
   analysis/phylogeny
   analysis/mapping
+   analysis/region
   analysis/sequence_visualization
   analysis/support
   analysis/dn_ds

--- a/pom.xml
+++ b/pom.xml
@@ -6,7 +6,7 @@

    <groupId>nl.wur.bif</groupId>
    <artifactId>pantools</artifactId>
-    <version>4.3.0</version>
+    <version>4.3.1</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>

--- a/src/main/java/nl/wur/bif/pantools/Pantools.java
+++ b/src/main/java/nl/wur/bif/pantools/Pantools.java
@@ -8,10 +8,11 @@ package nl.wur.bif.pantools;


 import nl.wur.bif.pantools.analysis.classification.*;
+import nl.wur.bif.pantools.analysis.region_analysis.ExtractRegionCLI;
 import nl.wur.bif.pantools.analysis.function_analysis.*;
 import nl.wur.bif.pantools.analysis.phylogeny.*;
 import nl.wur.bif.pantools.construction.build_panproteome.BuildPanproteomeCLI;
-import nl.wur.bif.pantools.construction.conversion.PangenomeConversionCLI;
+import nl.wur.bif.pantools.analysis.export_pangenome.ExportGfaCLI;
 import nl.wur.bif.pantools.construction.phenotypes.add_phenotypes.AddPhenotypesCLI;
 import nl.wur.bif.pantools.construction.phenotypes.remove_phenotypes.RemovePhenotypesCLI;
 import nl.wur.bif.pantools.construction.annotations.RemoveAnnotationsCLI;
@@ -25,7 +26,7 @@ import nl.wur.bif.pantools.analysis.calculate_dn_ds.CalculateDnDsCLI;
 import nl.wur.bif.pantools.construction.functions.add_antismash.AddAntismashCLI;
 import nl.wur.bif.pantools.construction.functions.add_functions.AddFunctionsCLI;
 import nl.wur.bif.pantools.analysis.map_reads.MapReadsCLI;
-import nl.wur.bif.pantools.construction.export_pangenome.ExportPangenomeCLI;
+import nl.wur.bif.pantools.analysis.export_pangenome.ExportCsvCLI;
 import nl.wur.bif.pantools.analysis.gene_classification.GeneClassificationCLI;
 import nl.wur.bif.pantools.analysis.gene_retention.GeneRetentionCLI;
 import nl.wur.bif.pantools.construction.build_pangenome.AddGenomesCLI;
@@ -59,10 +60,7 @@ import picocli.CommandLine.*;
 import picocli.CommandLine.Model.CommandSpec;

 import java.awt.*;
-import java.io.BufferedReader;
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.InputStreamReader;
+import java.io.*;
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.net.URL;
@@ -74,6 +72,7 @@ import java.time.Duration;
 import java.time.Instant;
 import java.util.List;
 import java.util.*;
+import java.util.stream.Collectors;

 import static java.util.ResourceBundle.getBundle;
 import static nl.wur.bif.pantools.utils.Globals.setGlobals;
@@ -154,9 +153,10 @@ import static picocli.CommandLine.ScopeType.INHERIT;
        AddSyntenyCLI.class,
        AddPhasingCLI.class,
        SyntenyStatisticsCLI.class,
-        ExportPangenomeCLI.class,
+        ExportCsvCLI.class,
        VariationOverviewCLI.class,
-        PangenomeConversionCLI.class,
+        ExportGfaCLI.class,
+        ExtractRegionCLI.class,
 })
 public class Pantools {
    @Parameters(descriptionKey = "database-path", index = "0", scope = INHERIT)
@@ -272,7 +272,38 @@ public class Pantools {

        // create logger
        logger = LogManager.getLogger(this);
+
+        // log PanTools version
+        final String[] versionInfo = new GitVersionProvider().getVersion();
+        logger.info("PanTools {}", Arrays.stream(versionInfo)
+                .map(Objects::toString)
+                .collect(Collectors.joining(", ")));
+
+        // log usage
        logger.info("Usage: pantools {}", String.join(" ", spec.commandLine().getParseResult().originalArgs()));
+
+        // log number of processors/cores available to the JVM
+        logger.debug("Available processors (cores): {}",
+                Runtime.getRuntime().availableProcessors());
+
+        // log total free memory available to the JVM
+        logger.debug("Free memory (MiB): {}",
+                Runtime.getRuntime().freeMemory()/1048576);
+
+        // log max amount of memory the JVM will attempt to use
+        final long maxMemory = Runtime.getRuntime().maxMemory();
+        logger.debug("Maximum memory (MiB): {}",
+                (maxMemory == Long.MAX_VALUE ? "no limit" : maxMemory/1048576));
+
+        // Log currently allocated memory
+        logger.debug("Currently allocated memory (MiB): {}",
+                Runtime.getRuntime().totalMemory()/1048576);
+
+        // Log database info
+        final File database = databaseDirectory.toFile();
+        logger.debug("Total space (MiB): {}", database.getTotalSpace()/1048576);
+        logger.debug("Free space (MiB): {}", database.getFreeSpace()/1048576);
+        logger.debug("Usable space (MiB): {}", database.getUsableSpace()/1048576);
    }

    /**

--- a/src/main/java/nl/wur/bif/pantools/analysis/classification/Classification.java
+++ b/src/main/java/nl/wur/bif/pantools/analysis/classification/Classification.java
@@ -23,7 +23,6 @@ import java.util.*;
 import java.util.concurrent.*;
 import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.atomic.AtomicLong;
-import java.util.stream.Collectors;

 import static nl.wur.bif.pantools.utils.local_sequence_alignment.BoundedLocalSequenceAlignment.match;
 import static nl.wur.bif.pantools.utils.Globals.*;
@@ -3125,7 +3124,7 @@ public class Classification {
                allHmGroupArray = new int[total_hmgroups][genome_list.size()];
                hmCounter = 0;
                while (hm_nodes.hasNext()) {
-                    Pantools.logger.info("Getting copy number for homology group {}/{}", hmCounter + 1, total_hmgroups);
+                    Pantools.logger.debug("Getting copy number for homology group {}/{}", hmCounter + 1, total_hmgroups); // make progress bar
                    Node hmNode = hm_nodes.next();
                    allHmGroupArray[hmCounter] = getCopyNumberArray(hmNode, genome_list, pavs);
                    hmCounter++;
@@ -3240,9 +3239,17 @@ public class Classification {
                                               int[][] accessoryGroupsPerGenome,
                                               int[][] uniqueGroupsPerGenome,
                                               long seed) {
-        Pantools.logger.info("Calculating pangenome size for {} in {} loops.", genomeList, total_loops);
+        Pantools.logger.info("Calculating pangenome size for {} genomes in {} loops.", genomeList.size(), total_loops);
+        Pantools.logger.debug("Genome list: {}", genomeList);

-        int[][] newGroupsPerGenome = new int[total_loops][genomeList.size()]; // key = loop number, value = array with new genes per genome
+        // create an index using a map for the genomeList
+        final Map<String, Integer> genomeToIndex = new HashMap<>(); // key = genome name, value = index in genomeList
+        for (int i = 0; i < genomeList.size(); i++) {
+            genomeToIndex.put(genomeList.get(i), i);
+        }
+
+        // create a random generator to create seeds for all loops and store in array to make threadsafe
+        Random masterGenerator = new Random(seed);

        // create threadpool for parallel processing
        ExecutorService es = Executors.newFixedThreadPool(THREADS);
@@ -3250,15 +3257,18 @@ public class Classification {
        Pantools.logger.debug("Created threadpool with {} threads.", THREADS);

        // perform the calculations as often as total_loops indicates in parallel
+        int[][] newGroupsPerGenome = new int[total_loops][genomeList.size()]; // key = loop number, value = array with new genes per genome
        for (int loopNr = 0; loopNr < total_loops; loopNr++) {
+            final long currentSeed = masterGenerator.nextLong();
            Future<?> f = es.submit(new callCalculatePangenomeSizeGenes(genomeList,
+                    genomeToIndex,
                    allHmGroupArray,
                    coreGroupsPerGenome,
                    accessoryGroupsPerGenome,
                    uniqueGroupsPerGenome,
                    newGroupsPerGenome,
                    loopNr,
-                    seed));
+                    currentSeed));
            futures.add(f);
        }

@@ -3292,6 +3302,7 @@ public class Classification {

    private class callCalculatePangenomeSizeGenes implements Callable<Integer> {
        private final ArrayList<String> genomeList;
+        private final Map<String, Integer> genomeToIndex;
        private final int[][] allHmGroupArray;
        private final int[][] coreGroupsPerGenome;
        private final int[][] accessoryGroupsPerGenome;
@@ -3301,6 +3312,7 @@ public class Classification {
        private final long seed;

        public callCalculatePangenomeSizeGenes(ArrayList<String> genomeList,
+                                               Map<String, Integer> genomeToIndex,
                                               int[][] allHmGroupArray,
                                               int[][] coreGroupsPerGenome,
                                               int[][] accessoryGroupsPerGenome,
@@ -3309,6 +3321,7 @@ public class Classification {
                                               int loopNr,
                                               long seed) {
            this.genomeList = genomeList;
+            this.genomeToIndex = genomeToIndex;
            this.allHmGroupArray = allHmGroupArray;
            this.coreGroupsPerGenome = coreGroupsPerGenome;
            this.accessoryGroupsPerGenome = accessoryGroupsPerGenome;
@@ -3320,29 +3333,40 @@ public class Classification {

        public Integer call() {
            // create a random generator
-            Random generator = new Random(seed);
+            final Random generator = new Random(seed);
+
+            // log what we are doing
+            Pantools.logger.debug("Calculating pangenome size for {} genomes in loop {}.", genomeList.size(), loopNr);
+            Pantools.logger.trace("Genome list: {}", genomeList);
+            Pantools.logger.trace("Seed: {}", seed);
+
+            // create a list that contains the indices of the genomes in genomeList for shuffling
+            final List<Integer> genomeIndexList = new ArrayList<>(); // list with indices of genomes that are in genomeList
+            for (int i = 0; i < genomeList.size(); i++) {
+                genomeIndexList.add(i);
+            }
+
+            // shuffle the genomeIndexList
+            Collections.shuffle(genomeIndexList, generator);

            // create a new list for genomes and a list for picking a random number
-            ArrayList<String> copyGenomeList = new ArrayList<>(genomeList);
-            ArrayList<String> newGenomeList = new ArrayList<>();
-            ArrayList<Integer> seenHmGroups = new ArrayList<>();
+            final Set<String> genomeSubset = new HashSet<>();
+            final Set<Integer> seenHmGroups = new HashSet<>();

            // add one genome at a time
-            int totalGenomes = genomeList.size();
-            for (int genomeNr = 0; genomeNr < totalGenomes; genomeNr++) {
-                int randomGenome = generator.nextInt(copyGenomeList.size());
-                String genome = copyGenomeList.get(randomGenome);
-                copyGenomeList.remove(randomGenome);
-                newGenomeList.add(genome);
-                Pantools.logger.debug("Adding genome {} to the set of genomes.", genome);
+            for (int i = 0; i < genomeIndexList.size(); i++) {
+                int genomeNr = genomeIndexList.get(i);
+                final String genome = genomeList.get(genomeNr);
+                genomeSubset.add(genome);
+                Pantools.logger.trace("Adding genome {} to the set of genomes.", genome);

                // calculate the core, accessory, unique and new genes for the new genome list
-                int[] coreAccessoryUniqueNew = calculateCoreAccessoryUniqueNewGroups(newGenomeList, genomeList, allHmGroupArray, seenHmGroups);
-                Pantools.logger.debug("[core, accessory, unique, new] = {}.", Arrays.toString(coreAccessoryUniqueNew));
-                coreGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[0];
-                accessoryGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[1];
-                uniqueGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[2];
-                newGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[3];
+                final int[] coreAccessoryUniqueNew = calculateCoreAccessoryUniqueNewGroups(genomeSubset, genomeToIndex, allHmGroupArray, seenHmGroups);
+                Pantools.logger.trace("[core, accessory, unique, new] = {}.", Arrays.toString(coreAccessoryUniqueNew));
+                coreGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[0];
+                accessoryGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[1];
+                uniqueGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[2];
+                newGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[3];
            }

            return loopNr;
@@ -3354,18 +3378,18 @@ public class Classification {
     * present in all genomes, accessory means that the group is present in at least one genome but not in all
     * genomes and unique means that the group is present in only one genome. New means that the homology group was not
     * seen in a provided previous list of indices corresponding to allHmGroups.
-     * @param genomeList a list with all genome names to use for the calculation
-     * @param allGenomeList a list with all genome names for getting the indices of genomeList in allHmGroupsArray
+     * @param genomeSubset a HashSet with all genome names to use for the calculation
+     * @param genomeToIndex a map with genome names as keys and indices as values for fast access
     * @param allHmGroupArray a 2D array with the copy numbers per genome (inner array) per hm group (outer array)
     *                        this array is not updated in this function
-     * @param seenHmGroups a list of homology groups that have already been seen (for calculating the new groups)
+     * @param seenHmGroups a set of homology groups that have already been seen (for calculating the new groups)
     * @return an array with the number of core, accessory, unique and new genes
     */
-    public int[] calculateCoreAccessoryUniqueNewGroups(ArrayList<String> genomeList,
-                                                         ArrayList<String> allGenomeList,
-                                                         int[][] allHmGroupArray,
-                                                         ArrayList<Integer> seenHmGroups) {
-        Pantools.logger.debug("Calculating core, accessory, unique and new genes for {} genomes.", genomeList.size());
+    public int[] calculateCoreAccessoryUniqueNewGroups(Set<String> genomeSubset,
+                                                       Map<String, Integer> genomeToIndex,
+                                                       int[][] allHmGroupArray,
+                                                       Set<Integer> seenHmGroups) {
+        Pantools.logger.trace("Calculating core, accessory, unique and new genes for {} genomes.", genomeSubset.size());

        int coreGroups = 0;
        int accessoryGroups = 0;
@@ -3378,28 +3402,25 @@ public class Classification {
            int genomes = 0;

            // loop over all genomes
-            for (String genome : genomeList) {
-                int genomeNr = allGenomeList.indexOf(genome);
+            for (String genome : genomeSubset) {
+                int genomeNr = genomeToIndex.get(genome);
                if (hmGroup[genomeNr] != 0) {
                    genomes++;
                }
            }

            // check if the homology group is core, accessory or unique
-            if (genomes == genomeList.size()) {
+            if (genomes == genomeSubset.size()) {
                coreGroups++;
-            } else if (genomes == 1) {
-                uniqueGroups++;
-            } else if (genomes > 1 && genomes < genomeList.size()) {
-                accessoryGroups++;
            } else if (genomes == 0) {
                continue;
+            } else if (genomes == 1) {
+                uniqueGroups++;
            } else {
-                Pantools.logger.error("Something went wrong with the calculation of core, accessory and unique genes.");
-                System.exit(1);
+                accessoryGroups++;
            }

-            // if we get here, the gene is definitely present
+            // if we get here, the gene is present (genomes > 0)
            if (!seenHmGroups.contains(hmGroupNr)) {
                newGroups++;
                seenHmGroups.add(hmGroupNr);
@@ -4111,7 +4132,7 @@ public class Classification {
        StringBuilder gene_nr_builder = new StringBuilder();

        for (int[] loop : newGroupsPerGenome) {
-            for (int index = 0; index < loop.length; index++) {
+            for (int index = 1; index < loop.length; index++) {
                int genomeNr = index + 1;
                new_genes_builder.append(loop[index]).append(",");
                gene_nr_builder.append(genomeNr).append(",");
@@ -4322,6 +4343,9 @@ public class Classification {
                                                 int[][] accessoryGenesPerGenome,
                                                 int[][] uniqueGenesPerGenome) {
        Pantools.logger.info("Writing pangenome size gene output.");
+        Pantools.logger.trace("coreGenesPerGenome: {}", Arrays.deepToString(coreGenesPerGenome));
+        Pantools.logger.trace("accessoryGenesPerGenome: {}", Arrays.deepToString(accessoryGenesPerGenome));
+        Pantools.logger.trace("uniqueGenesPerGenome: {}", Arrays.deepToString(uniqueGenesPerGenome));

        // define output builder
        StringBuilder output_builder = new StringBuilder();
@@ -6016,6 +6040,14 @@ public class Classification {
            Path output_path,
            String trim) {

+        Pantools.logger.debug("Creating matrix with variable and informative positions.");
+        Pantools.logger.trace("variable_positions: {}", variable_positions.toString());
+        Pantools.logger.trace("uninformative_positions: {}", uninformative_positions.toString());
+        Pantools.logger.trace("position_map: {}", position_map.toString());
+        Pantools.logger.trace("alignmentTypeShort: {}", alignmentTypeShort);
+        Pantools.logger.trace("output_path: {}", output_path.toString());
+        Pantools.logger.trace("trim: {}", trim);
+
        if (output_path.resolve("mlsa").resolve("input").toFile().exists()) { // do not create this file when running the mlsa_concatenate function
            return;
        }
@@ -6030,20 +6062,24 @@ public class Classification {
            matrix_builder.append("Position,Informative,A,T,C,G,gap,other\n");
        }

-        String prev_position = "";
+        int prev_position = -1;
        for (String position_character_key : position_map.keySet()) {
            Set<Integer> sequence_numbers = position_map.get(position_character_key);

-            String position = position_character_key.replaceFirst(".$",""); // remove last character
-            String character = position_character_key.replace(position, "");
+            String positionStr = position_character_key.replaceFirst(".$",""); // remove last character
+            int position = Integer.parseInt(positionStr);
+            String character = position_character_key.replace(positionStr, "");
+
+            if (prev_position != position && prev_position != -1) {
+                final boolean isInformative = !uninformative_positions.contains(prev_position);
+                matrix_builder
+                        .append((prev_position+1))
+                        .append(",")
+                        .append(isInformative)
+                        .append(",")
+                        .append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ",""))
+                        .append("\n");

-            if (!prev_position.equals(position) && !prev_position.equals("")) {
-                int pos_int = Integer.parseInt(prev_position);
-                if (uninformative_positions.contains(prev_position)) {
-                    matrix_builder.append((pos_int+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
-                } else {
-                    matrix_builder.append((pos_int+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
-                }
                if (alignmentTypeShort.equals("prot")) {
                    counts_per_position = new int[22]; // A R N D C Q E G H I L K M F P S T W Y V - other
                } else {
@@ -6120,15 +6156,14 @@ public class Classification {
            prev_position = position;
        }

-        if (!prev_position.equals("")) { // there are variable positions in the alignment
-            int pos_int = Integer.parseInt(prev_position);
+        if (prev_position != -1) { // there are variable positions in the alignment
            if (uninformative_positions.contains(prev_position)) {
-                matrix_builder.append((pos_int+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
+                matrix_builder.append((prev_position+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
            } else {
-                matrix_builder.append((pos_int+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
+                matrix_builder.append((prev_position+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
            }
        } else { // no variable positions
-            matrix_builder = new StringBuilder("Alignment does not contain variable positions");
+            matrix_builder = new StringBuilder("Alignment does not contain variable positions\n");
        }
        output_path.resolve("var_inf_positions").toFile().mkdirs(); // create directory
        write_SB_to_file_full_path(matrix_builder, output_path.resolve("var_inf_positions").resolve(alignmentTypeShort + trim + "_variable_positions.csv"));

--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/ExportPangenomeCLI.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/ExportPangenomeCLI.java
-package nl.wur.bif.pantools.construction.export_pangenome;
+package nl.wur.bif.pantools.analysis.export_pangenome;

 import nl.wur.bif.pantools.Pantools;
 import nl.wur.bif.pantools.utils.BeanUtils;
@@ -17,8 +17,8 @@ import static picocli.CommandLine.*;
 *
 * @author Robin van Esch, Wageningen University, the Netherlands.
 */
-@CommandLine.Command(name = "export_pangenome", sortOptions = false, abbreviateSynopsis = true)
-public class ExportPangenomeCLI implements Callable<Integer> {
+@CommandLine.Command(name = "export_csv", sortOptions = false, abbreviateSynopsis = true)
+public class ExportCsvCLI implements Callable<Integer> {
    @Spec CommandSpec spec;

    @ParentCommand
@@ -39,7 +39,7 @@ public class ExportPangenomeCLI implements Callable<Integer> {
        pantools.createLogger(spec);
        BeanUtils.argValidation(spec, this);
        pantools.setPangenomeGraph();
-        new PangenomeExporter(nodePropertiesFile, relationshipPropertiesFile, sequenceNodeAnchorsFile);
+        new PangenomeExporterCsv(nodePropertiesFile, relationshipPropertiesFile, sequenceNodeAnchorsFile);
        return 0;
    }
 }
--- a/src/main/java/nl/wur/bif/pantools/analysis/export_pangenome/ExportGfaCLI.java
+++ b/src/main/java/nl/wur/bif/pantools/analysis/export_pangenome/ExportGfaCLI.java
+package nl.wur.bif.pantools.analysis.export_pangenome;
+
+import jakarta.validation.constraints.Pattern;
+import nl.wur.bif.pantools.Pantools;
+import nl.wur.bif.pantools.utils.BeanUtils;
+import picocli.CommandLine.Model.CommandSpec;
+
+import java.nio.file.Path;
+import java.util.concurrent.Callable;
+
+import static picocli.CommandLine.*;
+
+/**
+ * Convert a pangenome graph to a different format.
+ * @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
+ */
+@Command(name = "export_gfa", sortOptions = false)
+public class ExportGfaCLI implements Callable<Integer> {
+    @Spec CommandSpec spec;
+
+    @ParentCommand
+    private Pantools pantools;
+
+    @Option(names = {"-o", "--output"}, required = true)
+    private Path outputPath;
+
+    @Option(names = "--format")
+    @Pattern(regexp = "GFAv1|GFAv2", message = "{pattern.graph-format}")
+    private String outputFormat;
+
+    @Override
+    public Integer call() throws Exception {
+        pantools.createLogger(spec);
+        BeanUtils.argValidation(spec, this);
+
+        pantools.setPangenomeGraph("pangenome");
+
+        PangenomeExporterGfa pangenomeExporterGfa = new PangenomeExporterGfa(outputPath);
+        pangenomeExporterGfa.convert(outputFormat);
+
+        return 0;
+    }
+}
--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/NodeUUIDAndLabel.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/NodeUUIDAndLabel.java
-package nl.wur.bif.pantools.construction.export_pangenome;
+package nl.wur.bif.pantools.analysis.export_pangenome;

 import org.neo4j.graphdb.Label;


--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/PangenomeExporter.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/PangenomeExporter.java
-package nl.wur.bif.pantools.construction.export_pangenome;
+package nl.wur.bif.pantools.analysis.export_pangenome;

 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.ImmutableSet;
 import com.google.common.collect.Streams;
-import nl.wur.bif.pantools.construction.export_pangenome.records.NodeProperty;
-import nl.wur.bif.pantools.construction.export_pangenome.records.Record;
-import nl.wur.bif.pantools.construction.export_pangenome.records.RelationshipProperty;
-import nl.wur.bif.pantools.construction.export_pangenome.records.SequenceAnchor;
+import nl.wur.bif.pantools.analysis.export_pangenome.records.NodeProperty;
+import nl.wur.bif.pantools.analysis.export_pangenome.records.Record;
+import nl.wur.bif.pantools.analysis.export_pangenome.records.RelationshipProperty;
+import nl.wur.bif.pantools.analysis.export_pangenome.records.SequenceAnchor;
 import org.apache.commons.csv.CSVFormat;
 import org.apache.commons.csv.CSVPrinter;
 import org.neo4j.graphdb.*;
@@ -31,7 +31,7 @@ import static nl.wur.bif.pantools.utils.Utils.registerShutdownHook;
 * contents as three CSV files for comparison purposes.
 *
 */
-public class PangenomeExporter {
+public class PangenomeExporterCsv {
    // TODO: we're using names because RelTypes.x contains ordinal properties, and relationship types don't
    private static final Set<String> VALID_RELATIONSHIP_TYPE_NAMES = new HashSet<>(Arrays.asList(
        RelTypes.has.name(),
@@ -62,7 +62,7 @@ public class PangenomeExporter {
     * @param sequenceAnchorsOutputPath path to sequence anchors output file.
     * @throws IOException in case of Neo4j or output file IO error.
     */
-    public PangenomeExporter(
+    public PangenomeExporterCsv(
        Path nodePropertiesOutputPath,
        Path relationshipPropertiesOutputPath,
        Path sequenceAnchorsOutputPath) throws IOException {

--- a/src/main/java/nl/wur/bif/pantools/analysis/export_pangenome/PangenomeExporterGfa.java
+++ b/src/main/java/nl/wur/bif/pantools/analysis/export_pangenome/PangenomeExporterGfa.java
+package nl.wur.bif.pantools.analysis.export_pangenome;
+
+import nl.wur.bif.pantools.Pantools;
+import nl.wur.bif.pantools.utils.Globals;
+import nl.wur.bif.pantools.utils.graph.Graph;
+import nl.wur.bif.pantools.utils.graph.Neo4jGraphExtractor;
+import nl.wur.bif.pantools.utils.writer.Gfa1Writer;
+import nl.wur.bif.pantools.utils.writer.Gfa2Writer;
+import nl.wur.bif.pantools.utils.writer.GfaWriter;
+
+import java.io.IOException;
+import java.nio.file.Path;
+
+/**
+ * Convert a pangenome graph to a different format.
+ * NB: Currently only GFA is supported.
+ * @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
+ */
+public class PangenomeExporterGfa {
+    private final Path outputPath;
+
+    /**
+     * Constructor.
+     * NB: Currently only GFA is supported.
+     * @param outputPath The path to the output file.
+     */
+    public PangenomeExporterGfa(Path outputPath) {
+        this.outputPath = outputPath;
+    }
+
+    /**
+     * Convert the pangenome graph to a different format.
+     * @param format The format to convert to.
+     * @throws IOException If an IO error occurs during conversion.
+     */
+    public void convert(String format) throws IOException {
+        Pantools.logger.debug("Converting pangenome for {}", this);
+
+        // start graph extractor
+        Neo4jGraphExtractor neo4jGraphExtractor = new Neo4jGraphExtractor(Globals.GRAPH_DB);
+        Graph graph = neo4jGraphExtractor.extractNucleotideGraph();
+
+        // write the graph
+        GfaWriter gfaWriter;
+        switch (format) {
+            case "GFAv1":
+                gfaWriter = new Gfa1Writer();
+                break;
+            case "GFAv2":
+                gfaWriter = new Gfa2Writer();
+                break;
+            default:
+                Pantools.logger.error("Unsupported format: {}", format);
+                throw new IllegalArgumentException("Unsupported format");
+        }
+        Pantools.logger.trace("Graph: {}", graph);
+        gfaWriter.write(graph, outputPath);
+    }
+
+    /**
+     * toString method.
+     * @return A string representation of this object.
+     */
+    @Override
+    public String toString() {
+        return "";
+    }
+}
--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/NodeProperty.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/NodeProperty.java
-package nl.wur.bif.pantools.construction.export_pangenome.records;
+package nl.wur.bif.pantools.analysis.export_pangenome.records;

 import com.google.common.collect.ImmutableList;


--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/Record.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/Record.java
-package nl.wur.bif.pantools.construction.export_pangenome.records;
+package nl.wur.bif.pantools.analysis.export_pangenome.records;

 import java.util.List;


--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/RelationshipProperty.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/RelationshipProperty.java
-package nl.wur.bif.pantools.construction.export_pangenome.records;
+package nl.wur.bif.pantools.analysis.export_pangenome.records;

 import com.google.common.collect.ImmutableList;


--- a/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/SequenceAnchor.java
+++ b/src/main/java/nl/wur/bif/pantools/construction/export_pangenome/records/SequenceAnchor.java
-package nl.wur.bif.pantools.construction.export_pangenome.records;
+package nl.wur.bif.pantools.analysis.export_pangenome.records;

 import com.google.common.collect.ImmutableList;


--- a/src/main/java/nl/wur/bif/pantools/analysis/region_analysis/ExtractRegionCLI.java
+++ b/src/main/java/nl/wur/bif/pantools/analysis/region_analysis/ExtractRegionCLI.java
+package nl.wur.bif.pantools.analysis.region_analysis;
+
+import jakarta.validation.constraints.NotNull;
+import jakarta.validation.constraints.Pattern;
+import nl.wur.bif.pantools.Pantools;
+import nl.wur.bif.pantools.utils.BeanUtils;
+import nl.wur.bif.pantools.utils.Globals;
+import nl.wur.bif.pantools.utils.cli.mixins.ThreadNumber;
+import nl.wur.bif.pantools.utils.cli.validation.Constraints.OutputDirectory;
+import nl.wur.bif.pantools.utils.graph.GenomeRange;
+import picocli.CommandLine.*;
+
+import java.nio.file.Path;
+import java.util.concurrent.Callable;
+
+/**
+ * Convert a pangenome graph to a different format.
+ *
+ * @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
+ */
+@Command(name = "extract_region", sortOptions = false)
+@OutputDirectory(directory = "outputPath")
+public class ExtractRegionCLI implements Callable<Integer> {
+    @Spec Model.CommandSpec spec;
+    @Mixin private ThreadNumber threadNumber;
+
+    @ParentCommand
+    private Pantools pantools;
+
+    @Option(names = {"-o", "--output"})
+    private Path outputPath;
+
+    @Option(names = {"--extraction-method"},
+            defaultValue = "kmer",
+            description = "The method to use for graph extraction (can be 'homology', 'alignment', 'novel' or 'kmer'). " +
+                    "THIS OPTION IS FOR TESTING PURPOSES ONLY AND WILL BE REMOVED IN THE FUTURE. " +
+                    "Default: kmer"
+    )
+    @Pattern(regexp = "^(homology|alignment|kmer|novel)$", message = "Please provide a valid extraction method.")
+    private String method;
+
+    @Option(names = {"--num-kmer-samples"},
+            defaultValue = "100",
+            description = "The number of kmer samples to use for the kmer method. Default: 100"
+    )
+    private int numKmerSamples;
+
+    @Option(names = {"--gfa"},
+            defaultValue = "false",
+            description = "Output the graph in GFAv1 format. Default: false"
+    )
+    private boolean writeGfa;
+
+    @NotNull(message = "Must define either --genome-range or --gene")
+    @ArgGroup
+    TargetPosition targetPosition;
+
+    static class TargetPosition {
+        @Option(names = "--genome-range")
+        @Pattern(regexp = "^[0-9]+#[^#:\\s]+:[0-9]+-[0-9]+$", message = "{pattern.genome-range}")
+        String genomeRangeStr;
+
+        @Option(names = "--gene")
+        String gene;
+    }
+
+    @Override
+    public Integer call() throws Exception {
+        pantools.createLogger(spec);
+        BeanUtils.argValidation(spec, this, threadNumber, targetPosition);
+
+        pantools.setPangenomeGraph("pangenome");
+        crossValidate();
+        setGlobalParameters();
+
+        RegionExtractor regionExtractor = new RegionExtractor(outputPath);
+        if (targetPosition.genomeRangeStr != null) {
+            regionExtractor.setGenomeRange(
+                    GenomeRange.parseGenomeRange(targetPosition.genomeRangeStr)
+            );
+        } else if (targetPosition.gene != null) {
+            regionExtractor.setGene(
+                    targetPosition.gene
+            );
+        }
+        regionExtractor.extractRegion(method, writeGfa);
+
+        return 0;
+    }
+
+    private void setGlobalParameters() {
+        Globals.THREADS = Math.min(threadNumber.getnThreads(), numKmerSamples);
+        Globals.NUM_KMER_SAMPLES = numKmerSamples;
+    }
+
+    private void crossValidate() {
+        // Assert that --num-kmer-samples is only used for alignment and novel method
+        if (!method.equals("alignment") && !method.equals("novel")) {
+            if (numKmerSamples != 100) {
+                throw new ParameterException(spec.commandLine(), "--num-kmer-samples can only be used with --extraction-method=alignment or --extraction-method=novel.");
+            }
+        }
+    }
+
+    public Pantools getPantools() {
+        return pantools;
+    }
+    public Path getOutputPath() {
+        return outputPath;
+    }
+}
No results found