Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found
Select Git revision
  • add-logging
  • add-phenotype-to-consensus-tree
  • add-progress-bars
  • add-scoring-matrices-to-resources
  • add_RG_option
  • add_gfa_build_pangenome
  • add_gfa_export
  • add_loading_bars
  • add_panva_queries
  • add_promoter_sequence_analysis
  • add_wgs_roi
  • automatic_skip_core_phylogeny_fix
  • develop
  • docker
  • eggnog_fix
  • fasta-validation
  • feature/fix-msa-with-threshold
  • feature/map_kmers_to_coords
  • feature/update-msa-with-aa-presence-threshold
  • fix-core-phylogeny-mode
  • fix-indexing-error-add_annotations
  • fix_blast_for_panproteomes
  • fix_genomelayer_large_pangenomes
  • fix_inconsistent_homology_grouping
  • fix_integer_overflow_kmer_structure
  • fix_kmer_pangenome_structure
  • get_hm_groups_from_genes_of_interest
  • hotfix/kmer_classification_with_phenotypes_crashes
  • java-11
  • optimized-build-pangenome
  • optimized-map
  • pantools_v1
  • pantools_v2
  • pantools_v3
  • pantools_v3.1
  • pantools_v3.2
  • pantools_v4
  • phased_pangenomics_NW_blosum
  • phased_pangenomics_restructure
  • playing_around_with_cicd
  • refactor-check-if-nucleotide-sequence
  • refactor-retrieve-features
  • release
  • release_v4.0.0
  • remove-url-validation
  • remove_genomes
  • remove_skiparrays
  • remove_unused_phylogeny_functions
  • separate_connect_pangenome
  • single-seqalign-class
  • split_classification_busco
  • test_mcl_files
  • try_create_export_json_entire_pangenome
  • update_dependencies_pantools
  • update_retrieve_function
  • validate_feature_strands
  • v1.3.0
  • v2.0.0
  • v3.0.0
  • v3.1.0
  • v3.2.0
  • v3.4.0
  • v4.0.0
  • v4.1.0
  • v4.1.1
  • v4.2.0
  • v4.2.1
  • v4.2.2
  • v4.2.3
  • v4.3.0
  • v4.3.1
  • v4.3.2
  • v4.3.3
73 results

Target

Select target project
  • tan0531/pantools
  • he063/pantools
  • sakel001/pantools
  • bioinformatics/pantools
4 results
Select Git revision
  • add-logging
  • add-phenotype-to-consensus-tree
  • add-progress-bars
  • add-scoring-matrices-to-resources
  • add_RG_option
  • add_gfa_build_pangenome
  • add_gfa_export
  • add_loading_bars
  • add_panva_queries
  • add_promoter_sequence_analysis
  • add_wgs_roi
  • automatic_skip_core_phylogeny_fix
  • develop
  • docker
  • eggnog_fix
  • fasta-validation
  • feature/fix-msa-with-threshold
  • feature/map_kmers_to_coords
  • feature/update-msa-with-aa-presence-threshold
  • fix-core-phylogeny-mode
  • fix-indexing-error-add_annotations
  • fix_blast_for_panproteomes
  • fix_genomelayer_large_pangenomes
  • fix_inconsistent_homology_grouping
  • fix_integer_overflow_kmer_structure
  • fix_kmer_pangenome_structure
  • get_hm_groups_from_genes_of_interest
  • hotfix/kmer_classification_with_phenotypes_crashes
  • java-11
  • optimized-build-pangenome
  • optimized-map
  • pantools_v1
  • pantools_v2
  • pantools_v3
  • pantools_v3.1
  • pantools_v3.2
  • pantools_v4
  • phased_pangenomics_NW_blosum
  • phased_pangenomics_restructure
  • playing_around_with_cicd
  • refactor-check-if-nucleotide-sequence
  • refactor-retrieve-features
  • release
  • release_v4.0.0
  • remove-url-validation
  • remove_genomes
  • remove_skiparrays
  • remove_unused_phylogeny_functions
  • separate_connect_pangenome
  • single-seqalign-class
  • split_classification_busco
  • test_mcl_files
  • try_create_export_json_entire_pangenome
  • update_dependencies_pantools
  • update_retrieve_function
  • validate_feature_strands
  • v1.3.0
  • v2.0.0
  • v3.0.0
  • v3.1.0
  • v3.2.0
  • v3.4.0
  • v4.0.0
  • v4.1.0
  • v4.1.1
  • v4.2.0
  • v4.2.1
  • v4.2.2
  • v4.2.3
  • v4.3.0
  • v4.3.1
  • v4.3.2
  • v4.3.3
73 results
Show changes
Commits on Source (114)
Showing
with 532 additions and 109 deletions
......@@ -117,6 +117,9 @@ fabric.properties
# https://plugins.jetbrains.com/plugin/12206-codestream
.idea/codestream.xml
# Other
.idea/copilot
### Maven ###
target/
pom.xml.tag
......@@ -155,3 +158,6 @@ output
# MacOS specific
.DS_store
# Development
*.jfr
All notable changes to Pantools will be documented in this file.
## [UNRELEASED]
### Added
- Available resources are now added to the log files and validated for some processes (!224).
- The possibility to analyse a specific region of interest across the pangenome (!147).
- A GFA export function (not speed optimised, so not recommended for large pangenomes) (!147).
### Changed
- Renamed the developer function `export_pangenome` to `export_csv` (!147).
## [4.3.1] - 08-12-2023
### Changed
- `pantools pangenome_structure` received a huge speed improvement and restored randomization (!216).
### Fixed
- Fixed a bug where variable/informative positions were incomplete, affecting PanVA instances (!212).
- Fixed a bug that caused the pangenome growth curves to not show any randomization (!216).
## [4.3.0] - 23-11-2023
### Added
......
......@@ -620,8 +620,9 @@ PanTools.
* - ``--phenotype``/``-p``
- The phenotype used to rename the terminal nodes (leaves) of
the selected tree.
* - ``--genome``/``--sequence``
- The phylogenetic tree contains genome numbers or sequence identifiers.
* - ``--genome``/``--sequence``/``--gene-tree``
- The phylogenetic tree contains genome numbers, sequence identifiers
or gene identifiers.
* - ``--phasing``
- Include phasing information to the terminal nodes (leaves) of the tree.
......
Region of interest
==================
After constructing a pangenome, a common question is "what does region X of
genome A look like in genome B?" This question can be answered by obtaining the
region of interest from the pangenome database using the functions described
here.
--------------
Extract region
--------------
The function extract_region extracts a region of interest from the pangenome
database. The underlying algorithm ...
**TODO**: decide which algorithm to keep
**Parameters**
.. list-table::
:widths: 30 70
* - <databaseDirectory>
- Path to the database root directory.
**Options**
.. list-table::
:widths: 30 70
* - ``--genome-range``
- Required (unless ``--gene`` is used). The region of interest. Format:
<genome>#<sequence>:<start>-<end>. E.g.: "1#chromosome2:123000-234000".
* - ``--gene``
- Required (unless ``--genome-range`` is used). The name of a gene to
obtain the region for in other genomes.
* - ``--threads``/``-t``
- Number of parallel working threads, default is the number of cores
or 8, whichever is lower.
* - ``--output``/``-o``
- Directory for all output files. Has to be empty or non-existent.
* - ``--extraction-method``
- The method to use for graph extraction. Can currently be 'homology',
'alignment', 'novel' or 'kmer'. Default: kmer **TODO**
* - ``--num-kmer-samples``
- The number of kmer samples to use for the 'alignment' or 'novel'
method. Default: 100
* - ``--gfa``
- **TODO**
**Example commands**
.. code:: bash
$ pantools extract_region --genome-range "1#chr1:100-1000" tomato_DB
$ pantools extract_region --output "roi_analysis_gene1" --gene "gene1" tomato_DB
$ pantools extract_region --output "roi_analysis" --gfa --extraction-method novel --num-kmer-samples 50 --genome-range "1#chr1:100-1000" tomato_DB
**Output**
Output files are written to the specified output directory.
- **all.collinearity.bedpe**: A BEDPE file with collinear regions between the
extracted region in all genomes.
- **all.collinearity.m8**: A BLAST-like file (m8) with collinear regions
between the extracted region in all genomes.
- **all.collinearity.json**: A JSON file with collinear regions between the
extracted region in all genomes.
- **all.fai**: A FASTA index file for the extracted region in all genomes.
- **all.gff3**: A GFF3 file for the extracted region in all genomes.
- **hm_presence_absence.tsv**: A tab-separated file with the presence/absence
variation of the homology groups of all mRNAs in the extracted region.
- **kmer_presence_absence.tsv**: A tab-separated file with the
presence/absence variation of all k-mers in the extracted region.
- **TODO**: GFA (+CSV)
For each genome and sequence in the found region, the following files are
written:
- **<genomeNr>.fasta**: A FASTA file with the extracted region. The <genomeNr>
is the number of the genome in the database.
- **<genomeNr>.gff**: A GFF file with the extracted region. The <genomeNr> is
the number of the genome in the database.
**Further analysis**
For visualization, the following example script can be used to plot the
extracted region of interest in all genomes. Please note that it requires
gggenomes, dplyr and stringr R packages.
.. code:: R
require(gggenomes)
require(dplyr)
require(stringr)
seqs <- read_seqs("all.fai") %>%
mutate(genome = str_split(seq_id, "_", simplify = TRUE)[, 1]) %>%
arrange(match(genome, c(1:9))) # sort the first 9 genomes in order
genes <- read_gff3("all.gff3")
links <- read_links("all.collinearity.m8")
p <- gggenomes(
genes = genes[which(endsWith(genes$type, "RNA")),], # only RNAs
seqs = seqs,
links = links,
infer_bin_id = genome
) +
geom_seq() +
geom_gene(aes(fill=type)) +
geom_link() +
geom_seq_label() +
geom_gene_label(aes(label=feat_id), size=1)
ggsave("all.collinearity.pdf", plot = p,
width = 20,
height = 15)
......@@ -10,8 +10,8 @@ project = 'PanTools'
copyright = '2016, the PanTools team'
author = 'Sandra Smit'
release = '4.3.0'
version = '4.3.0'
release = '4.3.1'
version = '4.3.1'
# -- General configuration
......
......@@ -42,6 +42,15 @@ the tools.
$ mamba create -n pantools pantools
.. note::
If this doesn't work, you can try to install PanTools using the
following way:
.. substitution-code-block:: bash
$ mamba create -n pantools --strict-channel-priority -c conda-forge -c bioconda pantools=|ProjectVersion|
Please make sure to activate the environment before using PanTools:
.. substitution-code-block:: bash
......
......@@ -44,38 +44,22 @@ PanTools currently provides these functionalities:
- Gene classification
- Phylogenetic methods
Requirements
------------
- \ **Java Virtual Machine**\ version 1.8 or higher, Add path to the
java executable to your OS path environment variable.
- \ **KMC**\ : A disk-based k-mer counter, After downloading the
appropriate version (linux, macos or windows), add path to the *kmc*
and *kmc_tools* executables to your OS path environment variable.
- \ **MCL**\ : The Markov Clustering Algorithm, After downloading and
compiling the software, add path to the *mcl* executable to your OS
path environment variable.
For installing and configuring all required software, please see our
:doc:`/getting_started/install` page.
Running the program
-------------------
Make sure you can run java on your command line. Then you can run PanTools from
the command line by:
Install PanTools as described in the :doc:`/getting_started/install` page.
You can run PanTools from the command line using:
.. substitution-code-block:: bash
$ java <JVM options> -jar pantools-|ProjectVersion|.jar <subcommand> <arguments>
$ pantools <JVM options> <subcommand> <arguments>
| **Useful JVM options**
| - **-server** : To optimize JIT compilations for higher performance
| - **-Xmn(a number followed by m/g)** : Minimum heap size in mega/giga
bytes
| - **-Xmx(a number followed by m/g)** : Maximum heap size in mega/giga
bytes
Useful JVM options:
- **-Xms<heap size>[unit]** : Initial heap size in MB ('m') or GB ('g')
- **-Xmx<heap size>[unit]** : Maximum heap size in MB ('m') or GB ('g')
- **-XX:StartFlightRecording=filename=<name>.jfr,disk=true** : Enable
the Flight Recorder (for developers)
Options
-------
......@@ -135,6 +119,7 @@ Contents
analysis/msa
analysis/phylogeny
analysis/mapping
analysis/region
analysis/sequence_visualization
analysis/support
analysis/dn_ds
......
......@@ -6,7 +6,7 @@
<groupId>nl.wur.bif</groupId>
<artifactId>pantools</artifactId>
<version>4.3.0</version>
<version>4.3.1</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
......
......@@ -8,10 +8,11 @@ package nl.wur.bif.pantools;
import nl.wur.bif.pantools.analysis.classification.*;
import nl.wur.bif.pantools.analysis.region_analysis.ExtractRegionCLI;
import nl.wur.bif.pantools.analysis.function_analysis.*;
import nl.wur.bif.pantools.analysis.phylogeny.*;
import nl.wur.bif.pantools.construction.build_panproteome.BuildPanproteomeCLI;
import nl.wur.bif.pantools.construction.conversion.PangenomeConversionCLI;
import nl.wur.bif.pantools.analysis.export_pangenome.ExportGfaCLI;
import nl.wur.bif.pantools.construction.phenotypes.add_phenotypes.AddPhenotypesCLI;
import nl.wur.bif.pantools.construction.phenotypes.remove_phenotypes.RemovePhenotypesCLI;
import nl.wur.bif.pantools.construction.annotations.RemoveAnnotationsCLI;
......@@ -25,7 +26,7 @@ import nl.wur.bif.pantools.analysis.calculate_dn_ds.CalculateDnDsCLI;
import nl.wur.bif.pantools.construction.functions.add_antismash.AddAntismashCLI;
import nl.wur.bif.pantools.construction.functions.add_functions.AddFunctionsCLI;
import nl.wur.bif.pantools.analysis.map_reads.MapReadsCLI;
import nl.wur.bif.pantools.construction.export_pangenome.ExportPangenomeCLI;
import nl.wur.bif.pantools.analysis.export_pangenome.ExportCsvCLI;
import nl.wur.bif.pantools.analysis.gene_classification.GeneClassificationCLI;
import nl.wur.bif.pantools.analysis.gene_retention.GeneRetentionCLI;
import nl.wur.bif.pantools.construction.build_pangenome.AddGenomesCLI;
......@@ -59,10 +60,7 @@ import picocli.CommandLine.*;
import picocli.CommandLine.Model.CommandSpec;
import java.awt.*;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
......@@ -74,6 +72,7 @@ import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.*;
import java.util.stream.Collectors;
import static java.util.ResourceBundle.getBundle;
import static nl.wur.bif.pantools.utils.Globals.setGlobals;
......@@ -154,9 +153,10 @@ import static picocli.CommandLine.ScopeType.INHERIT;
AddSyntenyCLI.class,
AddPhasingCLI.class,
SyntenyStatisticsCLI.class,
ExportPangenomeCLI.class,
ExportCsvCLI.class,
VariationOverviewCLI.class,
PangenomeConversionCLI.class,
ExportGfaCLI.class,
ExtractRegionCLI.class,
})
public class Pantools {
@Parameters(descriptionKey = "database-path", index = "0", scope = INHERIT)
......@@ -272,7 +272,38 @@ public class Pantools {
// create logger
logger = LogManager.getLogger(this);
// log PanTools version
final String[] versionInfo = new GitVersionProvider().getVersion();
logger.info("PanTools {}", Arrays.stream(versionInfo)
.map(Objects::toString)
.collect(Collectors.joining(", ")));
// log usage
logger.info("Usage: pantools {}", String.join(" ", spec.commandLine().getParseResult().originalArgs()));
// log number of processors/cores available to the JVM
logger.debug("Available processors (cores): {}",
Runtime.getRuntime().availableProcessors());
// log total free memory available to the JVM
logger.debug("Free memory (MiB): {}",
Runtime.getRuntime().freeMemory()/1048576);
// log max amount of memory the JVM will attempt to use
final long maxMemory = Runtime.getRuntime().maxMemory();
logger.debug("Maximum memory (MiB): {}",
(maxMemory == Long.MAX_VALUE ? "no limit" : maxMemory/1048576));
// Log currently allocated memory
logger.debug("Currently allocated memory (MiB): {}",
Runtime.getRuntime().totalMemory()/1048576);
// Log database info
final File database = databaseDirectory.toFile();
logger.debug("Total space (MiB): {}", database.getTotalSpace()/1048576);
logger.debug("Free space (MiB): {}", database.getFreeSpace()/1048576);
logger.debug("Usable space (MiB): {}", database.getUsableSpace()/1048576);
}
/**
......
......@@ -23,7 +23,6 @@ import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import static nl.wur.bif.pantools.utils.local_sequence_alignment.BoundedLocalSequenceAlignment.match;
import static nl.wur.bif.pantools.utils.Globals.*;
......@@ -3125,7 +3124,7 @@ public class Classification {
allHmGroupArray = new int[total_hmgroups][genome_list.size()];
hmCounter = 0;
while (hm_nodes.hasNext()) {
Pantools.logger.info("Getting copy number for homology group {}/{}", hmCounter + 1, total_hmgroups);
Pantools.logger.debug("Getting copy number for homology group {}/{}", hmCounter + 1, total_hmgroups); // make progress bar
Node hmNode = hm_nodes.next();
allHmGroupArray[hmCounter] = getCopyNumberArray(hmNode, genome_list, pavs);
hmCounter++;
......@@ -3240,9 +3239,17 @@ public class Classification {
int[][] accessoryGroupsPerGenome,
int[][] uniqueGroupsPerGenome,
long seed) {
Pantools.logger.info("Calculating pangenome size for {} in {} loops.", genomeList, total_loops);
Pantools.logger.info("Calculating pangenome size for {} genomes in {} loops.", genomeList.size(), total_loops);
Pantools.logger.debug("Genome list: {}", genomeList);
int[][] newGroupsPerGenome = new int[total_loops][genomeList.size()]; // key = loop number, value = array with new genes per genome
// create an index using a map for the genomeList
final Map<String, Integer> genomeToIndex = new HashMap<>(); // key = genome name, value = index in genomeList
for (int i = 0; i < genomeList.size(); i++) {
genomeToIndex.put(genomeList.get(i), i);
}
// create a random generator to create seeds for all loops and store in array to make threadsafe
Random masterGenerator = new Random(seed);
// create threadpool for parallel processing
ExecutorService es = Executors.newFixedThreadPool(THREADS);
......@@ -3250,15 +3257,18 @@ public class Classification {
Pantools.logger.debug("Created threadpool with {} threads.", THREADS);
// perform the calculations as often as total_loops indicates in parallel
int[][] newGroupsPerGenome = new int[total_loops][genomeList.size()]; // key = loop number, value = array with new genes per genome
for (int loopNr = 0; loopNr < total_loops; loopNr++) {
final long currentSeed = masterGenerator.nextLong();
Future<?> f = es.submit(new callCalculatePangenomeSizeGenes(genomeList,
genomeToIndex,
allHmGroupArray,
coreGroupsPerGenome,
accessoryGroupsPerGenome,
uniqueGroupsPerGenome,
newGroupsPerGenome,
loopNr,
seed));
currentSeed));
futures.add(f);
}
......@@ -3292,6 +3302,7 @@ public class Classification {
private class callCalculatePangenomeSizeGenes implements Callable<Integer> {
private final ArrayList<String> genomeList;
private final Map<String, Integer> genomeToIndex;
private final int[][] allHmGroupArray;
private final int[][] coreGroupsPerGenome;
private final int[][] accessoryGroupsPerGenome;
......@@ -3301,6 +3312,7 @@ public class Classification {
private final long seed;
public callCalculatePangenomeSizeGenes(ArrayList<String> genomeList,
Map<String, Integer> genomeToIndex,
int[][] allHmGroupArray,
int[][] coreGroupsPerGenome,
int[][] accessoryGroupsPerGenome,
......@@ -3309,6 +3321,7 @@ public class Classification {
int loopNr,
long seed) {
this.genomeList = genomeList;
this.genomeToIndex = genomeToIndex;
this.allHmGroupArray = allHmGroupArray;
this.coreGroupsPerGenome = coreGroupsPerGenome;
this.accessoryGroupsPerGenome = accessoryGroupsPerGenome;
......@@ -3320,29 +3333,40 @@ public class Classification {
public Integer call() {
// create a random generator
Random generator = new Random(seed);
final Random generator = new Random(seed);
// log what we are doing
Pantools.logger.debug("Calculating pangenome size for {} genomes in loop {}.", genomeList.size(), loopNr);
Pantools.logger.trace("Genome list: {}", genomeList);
Pantools.logger.trace("Seed: {}", seed);
// create a list that contains the indices of the genomes in genomeList for shuffling
final List<Integer> genomeIndexList = new ArrayList<>(); // list with indices of genomes that are in genomeList
for (int i = 0; i < genomeList.size(); i++) {
genomeIndexList.add(i);
}
// shuffle the genomeIndexList
Collections.shuffle(genomeIndexList, generator);
// create a new list for genomes and a list for picking a random number
ArrayList<String> copyGenomeList = new ArrayList<>(genomeList);
ArrayList<String> newGenomeList = new ArrayList<>();
ArrayList<Integer> seenHmGroups = new ArrayList<>();
final Set<String> genomeSubset = new HashSet<>();
final Set<Integer> seenHmGroups = new HashSet<>();
// add one genome at a time
int totalGenomes = genomeList.size();
for (int genomeNr = 0; genomeNr < totalGenomes; genomeNr++) {
int randomGenome = generator.nextInt(copyGenomeList.size());
String genome = copyGenomeList.get(randomGenome);
copyGenomeList.remove(randomGenome);
newGenomeList.add(genome);
Pantools.logger.debug("Adding genome {} to the set of genomes.", genome);
for (int i = 0; i < genomeIndexList.size(); i++) {
int genomeNr = genomeIndexList.get(i);
final String genome = genomeList.get(genomeNr);
genomeSubset.add(genome);
Pantools.logger.trace("Adding genome {} to the set of genomes.", genome);
// calculate the core, accessory, unique and new genes for the new genome list
int[] coreAccessoryUniqueNew = calculateCoreAccessoryUniqueNewGroups(newGenomeList, genomeList, allHmGroupArray, seenHmGroups);
Pantools.logger.debug("[core, accessory, unique, new] = {}.", Arrays.toString(coreAccessoryUniqueNew));
coreGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[0];
accessoryGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[1];
uniqueGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[2];
newGroupsPerGenome[loopNr][genomeNr] = coreAccessoryUniqueNew[3];
final int[] coreAccessoryUniqueNew = calculateCoreAccessoryUniqueNewGroups(genomeSubset, genomeToIndex, allHmGroupArray, seenHmGroups);
Pantools.logger.trace("[core, accessory, unique, new] = {}.", Arrays.toString(coreAccessoryUniqueNew));
coreGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[0];
accessoryGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[1];
uniqueGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[2];
newGroupsPerGenome[loopNr][i] = coreAccessoryUniqueNew[3];
}
return loopNr;
......@@ -3354,18 +3378,18 @@ public class Classification {
* present in all genomes, accessory means that the group is present in at least one genome but not in all
* genomes and unique means that the group is present in only one genome. New means that the homology group was not
* seen in a provided previous list of indices corresponding to allHmGroups.
* @param genomeList a list with all genome names to use for the calculation
* @param allGenomeList a list with all genome names for getting the indices of genomeList in allHmGroupsArray
* @param genomeSubset a HashSet with all genome names to use for the calculation
* @param genomeToIndex a map with genome names as keys and indices as values for fast access
* @param allHmGroupArray a 2D array with the copy numbers per genome (inner array) per hm group (outer array)
* this array is not updated in this function
* @param seenHmGroups a list of homology groups that have already been seen (for calculating the new groups)
* @param seenHmGroups a set of homology groups that have already been seen (for calculating the new groups)
* @return an array with the number of core, accessory, unique and new genes
*/
public int[] calculateCoreAccessoryUniqueNewGroups(ArrayList<String> genomeList,
ArrayList<String> allGenomeList,
int[][] allHmGroupArray,
ArrayList<Integer> seenHmGroups) {
Pantools.logger.debug("Calculating core, accessory, unique and new genes for {} genomes.", genomeList.size());
public int[] calculateCoreAccessoryUniqueNewGroups(Set<String> genomeSubset,
Map<String, Integer> genomeToIndex,
int[][] allHmGroupArray,
Set<Integer> seenHmGroups) {
Pantools.logger.trace("Calculating core, accessory, unique and new genes for {} genomes.", genomeSubset.size());
int coreGroups = 0;
int accessoryGroups = 0;
......@@ -3378,28 +3402,25 @@ public class Classification {
int genomes = 0;
// loop over all genomes
for (String genome : genomeList) {
int genomeNr = allGenomeList.indexOf(genome);
for (String genome : genomeSubset) {
int genomeNr = genomeToIndex.get(genome);
if (hmGroup[genomeNr] != 0) {
genomes++;
}
}
// check if the homology group is core, accessory or unique
if (genomes == genomeList.size()) {
if (genomes == genomeSubset.size()) {
coreGroups++;
} else if (genomes == 1) {
uniqueGroups++;
} else if (genomes > 1 && genomes < genomeList.size()) {
accessoryGroups++;
} else if (genomes == 0) {
continue;
} else if (genomes == 1) {
uniqueGroups++;
} else {
Pantools.logger.error("Something went wrong with the calculation of core, accessory and unique genes.");
System.exit(1);
accessoryGroups++;
}
// if we get here, the gene is definitely present
// if we get here, the gene is present (genomes > 0)
if (!seenHmGroups.contains(hmGroupNr)) {
newGroups++;
seenHmGroups.add(hmGroupNr);
......@@ -4111,7 +4132,7 @@ public class Classification {
StringBuilder gene_nr_builder = new StringBuilder();
for (int[] loop : newGroupsPerGenome) {
for (int index = 0; index < loop.length; index++) {
for (int index = 1; index < loop.length; index++) {
int genomeNr = index + 1;
new_genes_builder.append(loop[index]).append(",");
gene_nr_builder.append(genomeNr).append(",");
......@@ -4322,6 +4343,9 @@ public class Classification {
int[][] accessoryGenesPerGenome,
int[][] uniqueGenesPerGenome) {
Pantools.logger.info("Writing pangenome size gene output.");
Pantools.logger.trace("coreGenesPerGenome: {}", Arrays.deepToString(coreGenesPerGenome));
Pantools.logger.trace("accessoryGenesPerGenome: {}", Arrays.deepToString(accessoryGenesPerGenome));
Pantools.logger.trace("uniqueGenesPerGenome: {}", Arrays.deepToString(uniqueGenesPerGenome));
// define output builder
StringBuilder output_builder = new StringBuilder();
......@@ -6016,6 +6040,14 @@ public class Classification {
Path output_path,
String trim) {
Pantools.logger.debug("Creating matrix with variable and informative positions.");
Pantools.logger.trace("variable_positions: {}", variable_positions.toString());
Pantools.logger.trace("uninformative_positions: {}", uninformative_positions.toString());
Pantools.logger.trace("position_map: {}", position_map.toString());
Pantools.logger.trace("alignmentTypeShort: {}", alignmentTypeShort);
Pantools.logger.trace("output_path: {}", output_path.toString());
Pantools.logger.trace("trim: {}", trim);
if (output_path.resolve("mlsa").resolve("input").toFile().exists()) { // do not create this file when running the mlsa_concatenate function
return;
}
......@@ -6030,20 +6062,24 @@ public class Classification {
matrix_builder.append("Position,Informative,A,T,C,G,gap,other\n");
}
String prev_position = "";
int prev_position = -1;
for (String position_character_key : position_map.keySet()) {
Set<Integer> sequence_numbers = position_map.get(position_character_key);
String position = position_character_key.replaceFirst(".$",""); // remove last character
String character = position_character_key.replace(position, "");
String positionStr = position_character_key.replaceFirst(".$",""); // remove last character
int position = Integer.parseInt(positionStr);
String character = position_character_key.replace(positionStr, "");
if (prev_position != position && prev_position != -1) {
final boolean isInformative = !uninformative_positions.contains(prev_position);
matrix_builder
.append((prev_position+1))
.append(",")
.append(isInformative)
.append(",")
.append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ",""))
.append("\n");
if (!prev_position.equals(position) && !prev_position.equals("")) {
int pos_int = Integer.parseInt(prev_position);
if (uninformative_positions.contains(prev_position)) {
matrix_builder.append((pos_int+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
} else {
matrix_builder.append((pos_int+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
}
if (alignmentTypeShort.equals("prot")) {
counts_per_position = new int[22]; // A R N D C Q E G H I L K M F P S T W Y V - other
} else {
......@@ -6120,15 +6156,14 @@ public class Classification {
prev_position = position;
}
if (!prev_position.equals("")) { // there are variable positions in the alignment
int pos_int = Integer.parseInt(prev_position);
if (prev_position != -1) { // there are variable positions in the alignment
if (uninformative_positions.contains(prev_position)) {
matrix_builder.append((pos_int+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
matrix_builder.append((prev_position+1)).append(",,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
} else {
matrix_builder.append((pos_int+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
matrix_builder.append((prev_position+1)).append(",true,").append(Arrays.toString(counts_per_position).replace("[","") .replace("]","").replace(" ","")).append("\n");
}
} else { // no variable positions
matrix_builder = new StringBuilder("Alignment does not contain variable positions");
matrix_builder = new StringBuilder("Alignment does not contain variable positions\n");
}
output_path.resolve("var_inf_positions").toFile().mkdirs(); // create directory
write_SB_to_file_full_path(matrix_builder, output_path.resolve("var_inf_positions").resolve(alignmentTypeShort + trim + "_variable_positions.csv"));
......
package nl.wur.bif.pantools.construction.export_pangenome;
package nl.wur.bif.pantools.analysis.export_pangenome;
import nl.wur.bif.pantools.Pantools;
import nl.wur.bif.pantools.utils.BeanUtils;
......@@ -17,8 +17,8 @@ import static picocli.CommandLine.*;
*
* @author Robin van Esch, Wageningen University, the Netherlands.
*/
@CommandLine.Command(name = "export_pangenome", sortOptions = false, abbreviateSynopsis = true)
public class ExportPangenomeCLI implements Callable<Integer> {
@CommandLine.Command(name = "export_csv", sortOptions = false, abbreviateSynopsis = true)
public class ExportCsvCLI implements Callable<Integer> {
@Spec CommandSpec spec;
@ParentCommand
......@@ -39,7 +39,7 @@ public class ExportPangenomeCLI implements Callable<Integer> {
pantools.createLogger(spec);
BeanUtils.argValidation(spec, this);
pantools.setPangenomeGraph();
new PangenomeExporter(nodePropertiesFile, relationshipPropertiesFile, sequenceNodeAnchorsFile);
new PangenomeExporterCsv(nodePropertiesFile, relationshipPropertiesFile, sequenceNodeAnchorsFile);
return 0;
}
}
package nl.wur.bif.pantools.analysis.export_pangenome;
import jakarta.validation.constraints.Pattern;
import nl.wur.bif.pantools.Pantools;
import nl.wur.bif.pantools.utils.BeanUtils;
import picocli.CommandLine.Model.CommandSpec;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import static picocli.CommandLine.*;
/**
* Convert a pangenome graph to a different format.
* @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
*/
@Command(name = "export_gfa", sortOptions = false)
public class ExportGfaCLI implements Callable<Integer> {
@Spec CommandSpec spec;
@ParentCommand
private Pantools pantools;
@Option(names = {"-o", "--output"}, required = true)
private Path outputPath;
@Option(names = "--format")
@Pattern(regexp = "GFAv1|GFAv2", message = "{pattern.graph-format}")
private String outputFormat;
@Override
public Integer call() throws Exception {
pantools.createLogger(spec);
BeanUtils.argValidation(spec, this);
pantools.setPangenomeGraph("pangenome");
PangenomeExporterGfa pangenomeExporterGfa = new PangenomeExporterGfa(outputPath);
pangenomeExporterGfa.convert(outputFormat);
return 0;
}
}
package nl.wur.bif.pantools.construction.export_pangenome;
package nl.wur.bif.pantools.analysis.export_pangenome;
import org.neo4j.graphdb.Label;
......
package nl.wur.bif.pantools.construction.export_pangenome;
package nl.wur.bif.pantools.analysis.export_pangenome;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Streams;
import nl.wur.bif.pantools.construction.export_pangenome.records.NodeProperty;
import nl.wur.bif.pantools.construction.export_pangenome.records.Record;
import nl.wur.bif.pantools.construction.export_pangenome.records.RelationshipProperty;
import nl.wur.bif.pantools.construction.export_pangenome.records.SequenceAnchor;
import nl.wur.bif.pantools.analysis.export_pangenome.records.NodeProperty;
import nl.wur.bif.pantools.analysis.export_pangenome.records.Record;
import nl.wur.bif.pantools.analysis.export_pangenome.records.RelationshipProperty;
import nl.wur.bif.pantools.analysis.export_pangenome.records.SequenceAnchor;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.neo4j.graphdb.*;
......@@ -31,7 +31,7 @@ import static nl.wur.bif.pantools.utils.Utils.registerShutdownHook;
* contents as three CSV files for comparison purposes.
*
*/
public class PangenomeExporter {
public class PangenomeExporterCsv {
// TODO: we're using names because RelTypes.x contains ordinal properties, and relationship types don't
private static final Set<String> VALID_RELATIONSHIP_TYPE_NAMES = new HashSet<>(Arrays.asList(
RelTypes.has.name(),
......@@ -62,7 +62,7 @@ public class PangenomeExporter {
* @param sequenceAnchorsOutputPath path to sequence anchors output file.
* @throws IOException in case of Neo4j or output file IO error.
*/
public PangenomeExporter(
public PangenomeExporterCsv(
Path nodePropertiesOutputPath,
Path relationshipPropertiesOutputPath,
Path sequenceAnchorsOutputPath) throws IOException {
......
package nl.wur.bif.pantools.analysis.export_pangenome;
import nl.wur.bif.pantools.Pantools;
import nl.wur.bif.pantools.utils.Globals;
import nl.wur.bif.pantools.utils.graph.Graph;
import nl.wur.bif.pantools.utils.graph.Neo4jGraphExtractor;
import nl.wur.bif.pantools.utils.writer.Gfa1Writer;
import nl.wur.bif.pantools.utils.writer.Gfa2Writer;
import nl.wur.bif.pantools.utils.writer.GfaWriter;
import java.io.IOException;
import java.nio.file.Path;
/**
* Convert a pangenome graph to a different format.
* NB: Currently only GFA is supported.
* @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
*/
public class PangenomeExporterGfa {
private final Path outputPath;
/**
* Constructor.
* NB: Currently only GFA is supported.
* @param outputPath The path to the output file.
*/
public PangenomeExporterGfa(Path outputPath) {
this.outputPath = outputPath;
}
/**
* Convert the pangenome graph to a different format.
* @param format The format to convert to.
* @throws IOException If an IO error occurs during conversion.
*/
public void convert(String format) throws IOException {
Pantools.logger.debug("Converting pangenome for {}", this);
// start graph extractor
Neo4jGraphExtractor neo4jGraphExtractor = new Neo4jGraphExtractor(Globals.GRAPH_DB);
Graph graph = neo4jGraphExtractor.extractNucleotideGraph();
// write the graph
GfaWriter gfaWriter;
switch (format) {
case "GFAv1":
gfaWriter = new Gfa1Writer();
break;
case "GFAv2":
gfaWriter = new Gfa2Writer();
break;
default:
Pantools.logger.error("Unsupported format: {}", format);
throw new IllegalArgumentException("Unsupported format");
}
Pantools.logger.trace("Graph: {}", graph);
gfaWriter.write(graph, outputPath);
}
/**
* toString method.
* @return A string representation of this object.
*/
@Override
public String toString() {
return "";
}
}
package nl.wur.bif.pantools.construction.export_pangenome.records;
package nl.wur.bif.pantools.analysis.export_pangenome.records;
import com.google.common.collect.ImmutableList;
......
package nl.wur.bif.pantools.construction.export_pangenome.records;
package nl.wur.bif.pantools.analysis.export_pangenome.records;
import java.util.List;
......
package nl.wur.bif.pantools.construction.export_pangenome.records;
package nl.wur.bif.pantools.analysis.export_pangenome.records;
import com.google.common.collect.ImmutableList;
......
package nl.wur.bif.pantools.construction.export_pangenome.records;
package nl.wur.bif.pantools.analysis.export_pangenome.records;
import com.google.common.collect.ImmutableList;
......
package nl.wur.bif.pantools.analysis.region_analysis;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Pattern;
import nl.wur.bif.pantools.Pantools;
import nl.wur.bif.pantools.utils.BeanUtils;
import nl.wur.bif.pantools.utils.Globals;
import nl.wur.bif.pantools.utils.cli.mixins.ThreadNumber;
import nl.wur.bif.pantools.utils.cli.validation.Constraints.OutputDirectory;
import nl.wur.bif.pantools.utils.graph.GenomeRange;
import picocli.CommandLine.*;
import java.nio.file.Path;
import java.util.concurrent.Callable;
/**
* Convert a pangenome graph to a different format.
*
* @author Dirk-Jan van Workum, Wageningen University, the Netherlands.
*/
@Command(name = "extract_region", sortOptions = false)
@OutputDirectory(directory = "outputPath")
public class ExtractRegionCLI implements Callable<Integer> {
@Spec Model.CommandSpec spec;
@Mixin private ThreadNumber threadNumber;
@ParentCommand
private Pantools pantools;
@Option(names = {"-o", "--output"})
private Path outputPath;
@Option(names = {"--extraction-method"},
defaultValue = "kmer",
description = "The method to use for graph extraction (can be 'homology', 'alignment', 'novel' or 'kmer'). " +
"THIS OPTION IS FOR TESTING PURPOSES ONLY AND WILL BE REMOVED IN THE FUTURE. " +
"Default: kmer"
)
@Pattern(regexp = "^(homology|alignment|kmer|novel)$", message = "Please provide a valid extraction method.")
private String method;
@Option(names = {"--num-kmer-samples"},
defaultValue = "100",
description = "The number of kmer samples to use for the kmer method. Default: 100"
)
private int numKmerSamples;
@Option(names = {"--gfa"},
defaultValue = "false",
description = "Output the graph in GFAv1 format. Default: false"
)
private boolean writeGfa;
@NotNull(message = "Must define either --genome-range or --gene")
@ArgGroup
TargetPosition targetPosition;
static class TargetPosition {
@Option(names = "--genome-range")
@Pattern(regexp = "^[0-9]+#[^#:\\s]+:[0-9]+-[0-9]+$", message = "{pattern.genome-range}")
String genomeRangeStr;
@Option(names = "--gene")
String gene;
}
@Override
public Integer call() throws Exception {
pantools.createLogger(spec);
BeanUtils.argValidation(spec, this, threadNumber, targetPosition);
pantools.setPangenomeGraph("pangenome");
crossValidate();
setGlobalParameters();
RegionExtractor regionExtractor = new RegionExtractor(outputPath);
if (targetPosition.genomeRangeStr != null) {
regionExtractor.setGenomeRange(
GenomeRange.parseGenomeRange(targetPosition.genomeRangeStr)
);
} else if (targetPosition.gene != null) {
regionExtractor.setGene(
targetPosition.gene
);
}
regionExtractor.extractRegion(method, writeGfa);
return 0;
}
private void setGlobalParameters() {
Globals.THREADS = Math.min(threadNumber.getnThreads(), numKmerSamples);
Globals.NUM_KMER_SAMPLES = numKmerSamples;
}
private void crossValidate() {
// Assert that --num-kmer-samples is only used for alignment and novel method
if (!method.equals("alignment") && !method.equals("novel")) {
if (numKmerSamples != 100) {
throw new ParameterException(spec.commandLine(), "--num-kmer-samples can only be used with --extraction-method=alignment or --extraction-method=novel.");
}
}
}
public Pantools getPantools() {
return pantools;
}
public Path getOutputPath() {
return outputPath;
}
}