PanTools version 1
PanTools is a Java application for computational pangenomics. It is developed at the Bioinformatics Group, Wageningen University, the Netherlands.
If you use PanTools please cite:
Code repository:
https://git.wur.nl/bioinformatics/pantools
Requirements
PanTools is built on the Neo4j graph database, Community Edition 3.3.1.
-
KMC: A disk-based program for counting k-mers from (possibly gzipped) FASTQ/FASTA files (http://sun.aei.polsl.pl/kmc). Download it and add the path to the kmc and kmc_tools executables for your platform (Linux, macOS or Windows) to your OS path environment variable.
-
Java Virtual Machine version 1.8 or higher: Add the path to the java executable to your OS path environment variable.
-
MCL: The Markov Cluster Algorithm, a fast and scalable unsupervised clustering algorithm for graphs (http://micans.org/mcl), which the group functionality of PanTools requires. Download, unzip and compile it (see its README), and add the path to the mcl executable to your OS path environment variable.
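Before running PanTools, it can help to confirm that the executables listed above are reachable through the path environment variable. The following is a minimal sketch, not part of PanTools; the function name find_missing_tools is hypothetical, and the executable names are the ones mentioned in the requirements.

```python
import shutil

# Executable names taken from the requirements listed above.
REQUIRED_TOOLS = ("java", "kmc", "kmc_tools", "mcl")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

An empty result only means the executables were found; it does not check their versions.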
Running the program
Add the path to the java archive of PanTools, located in the /dist subdirectory of the PanTools project, to the OS path environment variable. Then run PanTools from the command line by:
java <JVM options> -jar pantools.jar <subcommand> <arguments>
The arguments are a list of key-value pairs separated by whitespace.
JVM options
- -server : Optimizes JIT compilation for higher performance.
- -XX:+UseConcMarkSweepGC : Selects the Concurrent Mark Sweep garbage collector.
- -Xmx(a number followed by m/g) : Maximum heap size in megabytes or gigabytes, for example -Xmx4g.
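The command template above can be assembled programmatically, for example when scripting several PanTools runs. This is a sketch under assumptions: the helper build_pantools_command is hypothetical, the database and file paths are placeholders, and the 4 GB heap size is only an illustrative value.

```python
def build_pantools_command(subcommand, arguments, jvm_options=(), jar="pantools.jar"):
    """Assemble: java <JVM options> -jar pantools.jar <subcommand> <arguments>."""
    command = ["java", *jvm_options, "-jar", jar, subcommand]
    # Arguments are key-value pairs separated by whitespace.
    for key, value in arguments.items():
        command += [key, str(value)]
    return command


command = build_pantools_command(
    "build_pangenome",
    {"-dp": "/data/hiv_pangenome", "-gf": "genomes.txt"},  # placeholder paths
    jvm_options=("-server", "-XX:+UseConcMarkSweepGC", "-Xmx4g"),
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run to start PanTools, or printed and pasted into a terminal.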
PanTools subcommands
-
build_pangenome or bg: To build a pangenome out of a set of genomes.
arguments:
- --database_path or -dp: Path to the pangenome database.
- --genomes-file or -gf: a text file containing paths to FASTA files of genomes; each on a separate line.
- --kmer-size or -ks: the size of k-mers; if not given or out of range (6 <= K_SIZE <= 255), an optimal value is calculated automatically.
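As an illustration of the inputs to build_pangenome, the sketch below formats the content of a genomes file (one FASTA path per line) and checks a k-mer size against the documented range. The function names and the FASTA paths are hypothetical.

```python
def genomes_file_text(fasta_paths):
    """Return the genomes-file content: one FASTA path per line."""
    return "".join(f"{path}\n" for path in fasta_paths)


def kmer_size_in_range(k):
    """True if k lies in the documented range 6 <= K_SIZE <= 255."""
    return 6 <= k <= 255


# Placeholder paths; replace them with your own FASTA files.
text = genomes_file_text(["hiv/strain1.fasta", "hiv/strain2.fasta"])
print(text, end="")
print(kmer_size_in_range(17))
```

Save the text to a file and pass its path with -gf. A k-mer size outside the range is not an error, since PanTools then calculates a value itself.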
-
build_panproteome or bp: To build a pan-proteome out of a set of proteins.
arguments:
- --database_path or -dp : Path to the pangenome database.
- --proteomes_file or -pf : A text file containing paths to FASTA files of proteomes; each on a separate line.
-
add_genomes or ag: To add new genomes to an existing pan-genome.
arguments:
- --database_path or -dp: Path to the pangenome database.
- --genomes-file or -gf: a text file containing paths to FASTA files of genomes; each on a separate line.
-
add_annotations or aa: To add new annotations to an existing pan-genome.
arguments:
- --database_path or -dp: Path to the pangenome database.
- --annotations-file or -af: a text file in which each line contains a genome number and the path to the corresponding GFF file, separated by one space. Genomes are numbered in the order in which they were added to the pangenome. The protein sequences of the annotated genes will also be stored in the folder "proteins" in the same path as the pangenome.
- --connect_annotations or -ca: connect the annotated genomic features to the nodes of the generalized De Bruijn graph (gDBG).
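The annotations file format described above can be produced with a few lines of code. This is a minimal sketch; the function name annotations_file_text and the GFF paths are hypothetical.

```python
def annotations_file_text(gff_by_genome):
    """Return one line per genome: '<genome number> <GFF path>'.

    The two fields are separated by exactly one space, as the
    --annotations-file description requires.
    """
    lines = []
    for genome_number, gff_path in sorted(gff_by_genome.items()):
        lines.append(f"{genome_number} {gff_path}\n")
    return "".join(lines)


# Placeholder GFF paths, keyed by genome number.
text = annotations_file_text({2: "hiv/strain2.gff", 1: "hiv/strain1.gff"})
print(text, end="")
```

The genome numbers must match the order in which the genomes were added to the pangenome. The sketch sorts the lines by genome number for readability only.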
-
retrieve_features or rf : To retrieve the sequence of annotated features from the pan-genome. For each genome a FASTA file containing the retrieved features will be stored in the output path. For example, genes.1.fasta contains all the genes annotated in genome 1.
arguments:
- --database_path or -dp : Path to the pangenome database.
- --output-path or -op (default value: Database path determined by -dp) : Path to the output files.
- --genome-numbers or -gn : A text file containing genome_numbers for which the features will be retrieved.
- --feature-type or -ft (default value: gene) : The feature name; for example gene, mRNA, exon, tRNA, etc.
-
retrieve_regions or rr: To retrieve the sequence of some genomic regions from the pan-genome. The resulting FASTA files will be stored in the output path.
arguments:
- --database_path or -dp: Path to the pangenome database.
- --regions-file or -rf: a text file containing one record per region, each consisting of genome_number, sequence_number, begin and end positions separated by one space. The resulting FASTA file has the same name, with an additional .fasta extension.
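A regions file can be generated from structured records, as sketched below. The Region type and the function regions_file_text are hypothetical names, and the coordinates are made-up values; this document does not state whether positions are 0-based or 1-based, so the sketch writes them unchanged.

```python
from typing import NamedTuple


class Region(NamedTuple):
    genome_number: int
    sequence_number: int
    begin: int
    end: int


def regions_file_text(regions):
    """Return one record per line with fields separated by one space."""
    return "".join(
        f"{r.genome_number} {r.sequence_number} {r.begin} {r.end}\n"
        for r in regions
    )


# Made-up example regions.
text = regions_file_text([Region(1, 1, 100, 500), Region(2, 1, 250, 900)])
print(text, end="")
```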
-
retrieve_genomes or rg: To retrieve the full sequence of some genomes. The resulting FASTA files will be stored in the output path.
arguments:
- --database_path or -dp : Path to the pangenome database.
- --genome-numbers or -gn: a text file containing the numbers of the genomes to be retrieved, one per line. The resulting FASTA files are named like Genome_x.fasta.
-
group or g : To create homology groups in the protein space of the pangenome (panproteome). The resulting homology groups will be stored in the output path.
arguments:
- --database_path or -dp : Path to the pangenome database.
- --intersection-rate or -ir (default value: 0.09, valid range: [0.001..0.1]) : The fraction of k-mers that two intersecting proteins must share.
- --min-protein-identity or -mpi (default value: 95, valid range: [1..99]) : The minimum similarity score.
- --mcl-inflation or -mi (default value: 9.6, valid range: (1..19)): The MCL inflation.
- --contrast or -ct (default value: 8, valid range: (0..10)) : The contrast factor.
- --relaxation or -rn (default value: 1, valid range: [1..8]) : The relaxation in homology calls.
- --threads-number or -tn (default value: 1) : The number of parallel working threads.
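The valid ranges listed for the group subcommand can be checked before a long run is started. The sketch below assumes the usual interval notation, in which square brackets include the end points and parentheses exclude them; the table GROUP_RANGES and the function out_of_range are hypothetical and not part of PanTools.

```python
# option: (low, high, low_inclusive, high_inclusive)
# Assumption: [a..b] includes the end points, (a..b) excludes them.
GROUP_RANGES = {
    "--intersection-rate": (0.001, 0.1, True, True),
    "--min-protein-identity": (1, 99, True, True),
    "--mcl-inflation": (1, 19, False, False),
    "--contrast": (0, 10, False, False),
    "--relaxation": (1, 8, True, True),
}

# Default values as listed in the argument descriptions.
GROUP_DEFAULTS = {
    "--intersection-rate": 0.09,
    "--min-protein-identity": 95,
    "--mcl-inflation": 9.6,
    "--contrast": 8,
    "--relaxation": 1,
}


def out_of_range(options):
    """Return the names of the options whose values fall outside their range."""
    problems = []
    for name, value in options.items():
        low, high, low_inclusive, high_inclusive = GROUP_RANGES[name]
        above_low = value >= low if low_inclusive else value > low
        below_high = value <= high if high_inclusive else value < high
        if not (above_low and below_high):
            problems.append(name)
    return problems


print(out_of_range(GROUP_DEFAULTS))
print(out_of_range({"--mcl-inflation": 1, "--relaxation": 8}))
```

The number of threads is not checked here because no valid range is given for it.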
-
version or v: To show the versions of PanTools and Neo4j.
-
help or h: To show the manual of the tool.
Visualization in the Neo4j browser
The Neo4j browser lets you run Cypher queries and view the results as a table or as a graph. You need to download the appropriate version of Neo4j. To visualize the pangenome of the two HIV strains provided as sample data in the PanTools repository, follow these steps on a Linux machine. Windows users can instead download the Neo4j Desktop application to start and stop a server without using the command line.
- Add the path to the Neo4j /bin directory to the path environment variable.
- Set the path to your pangenome in the configuration file, NEO4J-DIRECTORY/conf/neo4j.conf, by adding the line:
dbms.directories.data = PATH_TO_THE_PANGENOME_DATABASE
- Start the Neo4j database server by: neo4j start
- Open a web browser and go to the URL
http://localhost:7474
- To visualize the whole pangenome of the two HIV strains, type this simple Cypher command:
MATCH (n) RETURN n
- To stop the Neo4j server type:
neo4j stop
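If you switch between several pangenome databases, editing neo4j.conf by hand becomes repetitive. The following sketch rewrites the configuration text so that dbms.directories.data points to a given database. The function set_data_directory is hypothetical, the paths are placeholders, and commented-out lines are left untouched.

```python
def set_data_directory(conf_text, database_path):
    """Return conf_text with dbms.directories.data set to database_path.

    An existing active (uncommented) setting is replaced; otherwise the
    setting is appended at the end.
    """
    key = "dbms.directories.data"
    new_line = f"{key}={database_path}"
    lines = conf_text.splitlines()
    for index, line in enumerate(lines):
        is_comment = line.lstrip().startswith("#")
        if not is_comment and line.split("=", 1)[0].strip() == key:
            lines[index] = new_line
            break
    else:
        # No active setting was found, so add one.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Placeholder configuration content and database path.
example = "#dbms.directories.data=data\ndbms.directories.data = /old/path\n"
print(set_data_directory(example, "/data/hiv_pangenome"), end="")
```

Read the content of NEO4J-DIRECTORY/conf/neo4j.conf, pass it through the function, write the result back, and then restart the server with neo4j stop and neo4j start.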