Commit 5f8671bc authored by Workum, Dirk-Jan van

Merge branch 'release' into 'pantools_v3'

Release v3.4.0

Closes #1, #2, #4, #5, #16, and #9

See merge request !98
parents 3dcdaf34 db71492e
addons/interpro.xml
addons/java_libraries/*
build/built-jar.properties
!dist
dist/*
......@@ -6,3 +7,143 @@ dist/*
*.class
*.jar
# Created by https://www.toptal.com/developers/gitignore/api/intellij,maven,netbeans
# Edit at https://www.toptal.com/developers/gitignore?templates=intellij,maven,netbeans
### Intellij ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# AWS User-specific
.idea/**/aws.xml
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based Rest Client
.idea/httpRequests
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
### Intellij Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721
# *.iml
# modules.xml
# .idea/misc.xml
# *.ipr
# Sonarlint plugin
# https://plugins.jetbrains.com/plugin/7973-sonarlint
.idea/**/sonarlint/
# SonarQube Plugin
# https://plugins.jetbrains.com/plugin/7238-sonarqube-community-plugin
.idea/**/sonarIssues.xml
# Markdown Navigator plugin
# https://plugins.jetbrains.com/plugin/7896-markdown-navigator-enhanced
.idea/**/markdown-navigator.xml
.idea/**/markdown-navigator-enh.xml
.idea/**/markdown-navigator/
# Cache file creation bug
# See https://youtrack.jetbrains.com/issue/JBR-2257
.idea/$CACHE_FILE$
# CodeStream plugin
# https://plugins.jetbrains.com/plugin/12206-codestream
.idea/codestream.xml
### Maven ###
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties
# https://github.com/takari/maven-wrapper#usage-without-binary-jar
.mvn/wrapper/maven-wrapper.jar
### Maven Patch ###
# Eclipse m2e generated files
# Eclipse Core
.project
# JDT-specific (Eclipse Java Development Tools)
.classpath
### NetBeans ###
**/nbproject/private/
**/nbproject/Makefile-*.mk
**/nbproject/Package-*.bash
build/
nbbuild/
dist/
nbdist/
.nb-gradle/
# End of https://www.toptal.com/developers/gitignore/api/intellij,maven,netbeans
[submodule "ASTER"]
path = ASTER
url = https://github.com/chaoszhang/ASTER.git
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="CompilerConfiguration">
<annotationProcessing>
<profile name="Maven default annotation processors profile" enabled="true">
<sourceOutputDir name="target/generated-sources/annotations" />
<sourceTestOutputDir name="target/generated-test-sources/test-annotations" />
<outputRelativeToContentRoot value="true" />
<module name="pantools" />
</profile>
</annotationProcessing>
</component>
</project>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="Encoding">
<file url="file://$PROJECT_DIR$/src/main/java" charset="UTF-8" />
<file url="file://$PROJECT_DIR$/src/main/resources" charset="UTF-8" />
</component>
</project>
\ No newline at end of file
<component name="InspectionProjectProfileManager">
<profile version="1.0">
<option name="myName" value="Project Default" />
<inspection_tool class="GrazieInspection" enabled="false" level="TYPO" enabled_by_default="false" />
</profile>
</component>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="RemoteRepositoriesConfiguration">
<remote-repository>
<option name="id" value="central" />
<option name="name" value="Central Repository" />
<option name="url" value="https://repo.maven.apache.org/maven2" />
</remote-repository>
<remote-repository>
<option name="id" value="central" />
<option name="name" value="Maven Central repository" />
<option name="url" value="https://repo1.maven.org/maven2" />
</remote-repository>
<remote-repository>
<option name="id" value="jboss.community" />
<option name="name" value="JBoss Community repository" />
<option name="url" value="https://repository.jboss.org/nexus/content/repositories/public/" />
</remote-repository>
</component>
</project>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="ExternalStorageConfigurationManager" enabled="true" />
<component name="MavenProjectsManager">
<option name="originalFiles">
<list>
<option value="$PROJECT_DIR$/pom.xml" />
</list>
</option>
</component>
<component name="ProjectRootManager" version="2" languageLevel="JDK_11" project-jdk-name="11" project-jdk-type="JavaSDK" />
</project>
\ No newline at end of file
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="VcsDirectoryMappings">
<mapping directory="" vcs="Git" />
<mapping directory="$PROJECT_DIR$/ASTER" vcs="Git" />
</component>
</project>
\ No newline at end of file
repos:
- repo: local
hooks:
- id: maven-compile
name: Compile with Maven
description: Compile all code with mvn compile
entry: mvn
args:
- compile
language: system
pass_filenames: false
types: [java]
Subproject commit ce55e0c5871c47843009116710c00f6b0535645c
All notable changes to PanTools will be documented in this file.
## [3.4.0] - 2022-05-04
### Added
- Version and commit ID are reported when PanTools is initialized.
- `--allow-polytomies` argument for `consensus_tree`.
- Included option to bin numerical values in `add_phenotypes`.
### Changed
- `msa_group`, `msa_of_multiple_groups` and `msa_of_regions` are reorganised into the new function `msa`.
- `pangenome_structure` uses a colorblind-friendly palette.
- `create_tree_templates` uses a colorblind-friendly palette when using 8 phenotypes or fewer.
- `add_antismash` now only works with antiSMASH versions >= 6.0.
### Fixed
- Changed the orientation of the 'has_busco' relationship
## [3.3.0] - 2021-12-23
### Changed
- Migrate to Maven
- Executable .jar file moved from `pantools/dist` to `pantools/target`
### Fixed
- Reading gzip-compressed input files
## [3.2.0] - 2021-11-25
### Added
......
# Contributing to _PanTools_
## Get in touch
There are many ways to get in touch with the _PanTools_ team!
- GitLab [issues][pantools-issues] and [merge requests][pantools-mergerequests]
- Join a discussion, collaborate on an ongoing task and exchange your thoughts with others. You will have to request a guest role to start contributing.
- Can't find your idea being discussed anywhere?
[Open a new issue](https://git.wur.nl/bioinformatics/pantools/-/issues/new)! (See our [Where to start: issues](#where-to-start-issues) section below.)
- Contact the Project Lead of the _PanTools_ project - Sandra Smit - by email at [sandra.smit@wur.nl](mailto:sandra.smit@wur.nl).
## Contributing through GitLab
[Git][git] is a really useful tool for version control.
[GitLab][gitlab] sits on top of Git and supports collaborative and distributed working.
We know that it can be daunting to start using Git and GitLab if you haven't worked with them in the past, but the _PanTools_ maintainers are here to help you figure out any of the jargon or confusing instructions you encounter.
GitLab has a helpful page on [getting started with GitLab][gitlab-gettingstarted].
In order to contribute via GitLab, you'll need to [set up an account][gitlab-signup] and [sign in][gitlab-signin].
Remember that you can ask us any questions you need to along the way.
## Writing in Markdown
Most of the writing that you'll do will be in [Markdown][markdown].
You can think of Markdown as a few little symbols around your text that will allow GitLab to render the text with a little bit of formatting.
For example, you could write words as **bold** (`**bold**`), or in _italics_ (`_italics_`), or as a [link][rick-roll] (`[link](https://youtu.be/dQw4w9WgXcQ)`) to another webpage.
Also, when writing in Markdown, please start each new sentence on a new line.
Putting each sentence on its own line makes no difference to how the text is displayed (there will still be paragraphs), but it makes the diffs produced during the merge request review easier to read!
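To see why this helps, here is a minimal sketch that compares two versions of a file written with one sentence per line.
The file names and sentences are made up for illustration; only the standard `diff` tool is assumed.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Version 1: one sentence per line (hypothetical example text).
printf '%s\n' \
  'This is the first sentence.' \
  'This is the second sentence.' \
  'This is the third sentence.' > "$workdir/old.md"

# Version 2: only the second sentence is reworded.
printf '%s\n' \
  'This is the first sentence.' \
  'This is the reworded second sentence.' \
  'This is the third sentence.' > "$workdir/new.md"

# diff exits with status 1 when the files differ, so guard it with "|| true".
changes=$(diff "$workdir/old.md" "$workdir/new.md" || true)
echo "$changes"

# Count the lines that diff marks as removed ("<") or added (">").
changed_lines=$(printf '%s\n' "$changes" | grep -c '^[<>]')
echo "changed lines: $changed_lines"
```

Only the reworded sentence shows up in the diff; had the whole paragraph been on a single line, the reviewer would have to reread all three sentences to find the change.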
## Where to start: issues
Before you open a new issue, please check if any of our [open issues](https://git.wur.nl/bioinformatics/pantools/-/issues?scope=all&state=opened) cover your idea already.
## Making a change with a merge request
We appreciate all contributions to _PanTools_.
**THANK YOU** for helping us build this useful resource.
All project management, conversations and questions related to the _PanTools_ project happen here in the _PanTools_ repository.
The following steps are a guide to help you contribute in a way that will be easy for everyone to review and accept.
### 1. Comment on an [existing issue][pantools-issues] or open a new issue referencing your addition
This allows other members of the _PanTools_ team to confirm that you aren't overlapping with work that's currently underway and that everyone is on the same page with the goal of the work you're going to carry out.
[This blog](https://www.igvita.com/2011/12/19/dont-push-your-pull-requests/) is a nice explanation of why putting this work in upfront is so useful to everyone involved.
You will need a guest role to create new issues using the _PanTools_ repository website. You can request a guest role by [getting in touch](#get-in-touch).
Alternatively, you can send an [email][pantools-email] to create a new issue. The title of the email will be used as the issue title and the email body will be put in the issue description.
Note that we are in the process of setting up this email functionality; when this is done, we will update the email link here.
### 2. [Fork][gitlab-fork] the [_PanTools_ repository][pantools-repo]
This is now your own unique copy of _PanTools_.
Changes here won't affect anyone else's work, so it's a safe space to explore edits to the code!
Make sure to [keep your fork up to date][gitlab-syncfork] with the main repository, otherwise, you can end up with lots of dreaded [merge conflicts][gitlab-mergeconflicts].
The repository website only provides functionality for forking inside the same server (git.wur.nl). You can fork the _PanTools_ project onto another GitLab server. Forking onto a GitHub server is currently not possible, but this is being worked on.
First create a blank repository on GitLab; let's assume its URL is <https://gitlab.com/johndoe/my-pantools>.
Then run the following commands in the directory where you wish to create your fork of _PanTools_.
~~~bash
mkdir my-pantools
cd my-pantools
git init
git remote add origin https://gitlab.com/johndoe/my-pantools
git remote add upstream https://git.wur.nl/bioinformatics/pantools
# There are now two remotes: origin (your remote fork) and upstream (the PanTools repository).
git pull upstream master # Get the PanTools content.
git push --set-upstream origin master # Push this content to origin (your fork).
~~~
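Keeping such a fork up to date comes down to fetching from `upstream` and pushing the result to `origin`.
The following is a minimal sketch of that cycle, not an official _PanTools_ procedure: two local bare repositories stand in for the real remotes so that it runs offline, and the identity `Jane Doe` and all paths are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sandbox: local bare repositories replace the real remotes.
sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/upstream.git"
git init --quiet --bare "$sandbox/origin.git"

# Commit with a throwaway identity (assumption: none is configured).
commit() { git -c user.name="Jane Doe" -c user.email="jane@example.org" commit --quiet "$@"; }

# Seed "upstream" with one commit on master.
git init --quiet "$sandbox/seed"
cd "$sandbox/seed"
git symbolic-ref HEAD refs/heads/master
echo "v1" > README.md
git add README.md
commit -m "Initial commit"
git push --quiet "$sandbox/upstream.git" master

# Create the fork with the same remote layout as in the commands above.
git init --quiet "$sandbox/my-pantools"
cd "$sandbox/my-pantools"
git symbolic-ref HEAD refs/heads/master
git remote add origin "$sandbox/origin.git"
git remote add upstream "$sandbox/upstream.git"
git pull --quiet upstream master
git push --quiet --set-upstream origin master

# Meanwhile, upstream moves ahead by one commit.
cd "$sandbox/seed"
echo "v2" > README.md
commit -am "Update README"
git push --quiet "$sandbox/upstream.git" master

# Bring the fork up to date: fetch, fast-forward, push.
cd "$sandbox/my-pantools"
git fetch --quiet upstream
git merge --quiet --ff-only upstream/master
git push --quiet origin master

fork_head=$(git --git-dir="$sandbox/origin.git" rev-parse master)
upstream_head=$(git --git-dir="$sandbox/upstream.git" rev-parse master)
echo "fork:     $fork_head"
echo "upstream: $upstream_head"
```

`--ff-only` makes the merge stop instead of creating a merge commit when your `master` has diverged from upstream, which is a cautious default for a branch that is only meant to mirror the main repository.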
### 3. Make the changes you've discussed
Try to keep the changes focused.
If you submit a large amount of work all in one go it will be much more work for whoever is reviewing your merge request.
While making your changes, commit often and write good, detailed commit messages.
[This blog](https://chris.beams.io/posts/git-commit/) explains how to write a good Git commit message and why it matters.
It is also perfectly fine to have a lot of commits - including ones that break code.
A good rule of thumb is to push to GitLab when you _do_ have passing tests, so that the continuous integration (CI) has a good chance of passing everything.
If you feel tempted to "branch out" then please make a [new branch][gitlab-branches] and a [new issue](https://git.wur.nl/bioinformatics/pantools/-/issues/new) to go with it. [This blog](https://nvie.com/posts/a-successful-git-branching-model/) details the different Git branching models.
Please do not re-write history!
That is, please do not use the [rebase](https://docs.gitlab.com/ee/topics/git/git_rebase.html) command to edit previous commit messages, combine multiple commits into one, or delete or revert commits that are no longer necessary.
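One way to undo a commit without rewriting history is `git revert`, which adds a new commit instead of removing an old one.
This is a minimal sketch of that alternative, run in a throwaway repository; it is a suggestion on our part rather than a rule stated in these guidelines, and the file name and identity are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository in a temporary directory.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Throwaway identity (assumption: none is configured).
as_jane() { git -c user.name="Jane Doe" -c user.email="jane@example.org" "$@"; }

echo "good" > notes.txt
git add notes.txt
as_jane commit --quiet -m "Add notes"

echo "mistake" >> notes.txt
as_jane commit --quiet -am "Add a line that turns out to be wrong"

# Undo the last commit by adding a new commit; earlier commits stay untouched.
as_jane revert --no-edit HEAD > /dev/null

commit_count=$(git rev-list --count HEAD)
content=$(cat notes.txt)
echo "commits: $commit_count, content: $content"
```

The history now holds three commits (the original, the mistake and its revert), so reviewers can still see what happened and why.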
Are you new to Git and GitLab or just want a detailed guide on getting started with version control? Check out the [Version Control chapter](https://the-turing-way.netlify.com/version_control/version_control.html) in _The Turing Way_ Book!
### 4. Submit a [merge request][gitlab-mergerequest]
We encourage you to open a merge request as early in your contributing process as possible.
This allows everyone to see what is currently being worked on.
It also provides you, the contributor, feedback in real-time from both the community and the continuous integration as you make commits (which will help prevent stuff from breaking).
When you are ready to submit a merge request, please describe in the merge request body:
- The problem you're trying to fix in the merge request. Reference any related issues and, if pertinent, use keywords such as fixes/closes to close them automatically.
- A list of changes proposed in the merge request.
- What the reviewer should concentrate their feedback on.
By providing as much detail as possible, you will make it really easy for someone to review your contribution!
If you have opened the merge request early and know that its contents are not ready for review or to be merged, add "[WIP]" at the start of the merge request title, which stands for "Work in Progress".
When you are happy with it and are happy for it to be merged into the main repository, change the "[WIP]" in the title of the merge request to "[Ready for review]".
A member of the _PanTools_ team will then review your changes to confirm that they can be merged into the main repository.
A [review][gitlab-review] will probably consist of a few questions to help clarify the work you've done.
Keep an eye on your GitLab notifications and be prepared to join in that conversation.
You can update your [fork][gitlab-fork] of the _PanTools_ [repository][pantools-repo] and the merge request will automatically update with those changes.
You don't need to submit a new merge request when you make a change in response to a review.
You can also submit merge requests to other contributors' branches!
Do you see an [open merge request](https://git.wur.nl/bioinformatics/pantools/-/merge_requests?scope=all&state=opened) that you find interesting and want to contribute to?
Simply make your edits on their files and open a merge request to their branch!
GitLab has a [nice introduction][gitlab-flow] to the merge request workflow, but please [get in touch](#get-in-touch) if you have any questions.
## Local development
You can build and run _PanTools_ locally. Please refer to the [manual][pantools-manual] for instructions on how to build, and run _PanTools_.
## Recognizing Contributions
We welcome and recognise all kinds of contributions, from fixing small errors, to developing documentation, maintaining the project infrastructure, writing code or reviewing existing resources.
### Current Contributors
The _PanTools_ team wants to graciously thank the following people for their contributions to the _PanTools_ project.
- Astrid van der Brandt
- Dirk-Jan van Workum
- Eef Jonkheer
- Matthijs Moed
- Sandra Smit
- Siavash Sheikhizadeh
- Thijs van Lankveld
---
_These Contributing Guidelines have been adapted from the [Contributing Guidelines](https://github.com/bids-standard/bids-starter-kit/blob/master/CONTRIBUTING.md) of the [BIDS Starter Kit](https://github.com/bids-standard/bids-starter-kit)! (License: CC-BY)_
[pantools-repo]: https://git.wur.nl/bioinformatics/pantools
[pantools-issues]: https://git.wur.nl/bioinformatics/pantools/-/issues
[pantools-mergerequests]: https://git.wur.nl/bioinformatics/pantools/-/merge_requests
[pantools-manual]: https://www.bioinformatics.nl/pangenomics/manual/
[pantools-email]: mailto:broken_link
[git]: https://git-scm.com
[gitlab]: https://gitlab.com
[gitlab-signup]: https://gitlab.com/users/sign_up
[gitlab-signin]: https://gitlab.com/users/sign_in
[gitlab-gettingstarted]: https://about.gitlab.com/get-started/
[gitlab-fork]: https://docs.gitlab.com/ee/user/project/repository/forking_workflow.html
[gitlab-syncfork]: https://about.gitlab.com/blog/2016/12/01/how-to-keep-your-fork-up-to-date-with-its-origin/
[gitlab-mergeconflicts]: https://docs.gitlab.com/ee/topics/git/merge_conflicts.html
[gitlab-branches]: https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-new-branch
[gitlab-rebase]: https://docs.gitlab.com/ee/topics/git/git_rebase.html
[gitlab-mergerequest]: https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html
[gitlab-review]: https://docs.gitlab.com/ee/development/code_review.html
[gitlab-flow]: https://docs.gitlab.com/ee/topics/gitlab_flow.html
[markdown]: https://daringfireball.net/projects/markdown
[rick-roll]: https://www.youtube.com/watch?v=dQw4w9WgXcQ
\ No newline at end of file
......@@ -25,44 +25,48 @@ PanTools currently provides these functionalities:
- Phylogenetic methods
- Optimal homology grouping using BUSCO
## Compiling PanTools
**NB: still under construction, end goal is to easily create a fat jar using the commandline**
## Cloning this git
To clone this repository, including its submodules, run:
```bash
git clone --recursive https://git.wur.nl/bioinformatics/pantools.git
```
To compile PanTools, there are two options:
- Create a slim jar using the following commands:
```
git clone https://git.wur.nl/bioinformatics/pantools.git
cd pantools/
rm nbproject/private/private.*
ant -Dplatforms.JDK_1.8.home=/path/to/<openjdk 8 home> clean jar
```
Next, check whether PanTools is compiled correctly by running:
```
java -cp $(echo addons/java_libraries/*.jar dist/pantools.jar | sed 's/ /:/g') pantools.Pantools --help
```
This should display the help of PanTools
For easily running PanTools, add the following line to ~/.bashrc:
```
alias pantools="java -cp $(echo /path/to/pantools/addons/java_libraries/*.jar /path/to/pantools/dist/pantools.jar | sed 's/ /:/g') pantools.Pantools"
```
And check out a desired version (e.g. `v3.4.0`):
```bash
git checkout v3.4.0
```
## Creating a bundled jar
To create a runnable, standalone jar with all dependencies included, run the `package-for-store` with Ant:
## Building a runnable jar
To build a standalone jar that can be run on any machine with a compatible JVM without any dependencies, install [Maven](https://maven.apache.org) and JDK version 8.
Then run `mvn package` in the PanTools root directory:
```bash
ant package-for-store
mvn package
```
This will create a file called `pantools.jar` in the `store/` directory, which can be copied to any location without the need for also copying the dependent libraries of PanTools.
<!--
Note: tests are broken at the moment, which is why we're skipping the Maven test phase with `-DskipTests=true`.
-->
`mvn package` will generate two jar files in the `target/` directory.
The standalone jar file is named `pantools-<VERSION>.jar`, with `<VERSION>` being the PanTools version you have checked out.
To run it, use (we'll take version `v3.4.0` as an example):
```bash
java -jar target/pantools-3.4.0.jar
```
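Note that the Git tag carries a leading `v` while the jar file name does not.
A small sketch of deriving the jar path from the tag, assuming this naming holds for the release you have checked out:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Git tag of the release that is checked out (example value from above).
tag="v3.4.0"

# Assumption: the jar name uses the tag without its leading "v".
version="${tag#v}"
jar="target/pantools-${version}.jar"

# Print the command instead of running it, so the sketch works without a build.
echo "java -jar $jar"
```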
## Requirements, Installation, Usage
An extensive manual is available at [http://www.bioinformatics.nl/pangenomics/manual/](http://www.bioinformatics.nl/pangenomics/manual/)
An extensive manual is available at [pantools.readthedocs.io](https://pantools.readthedocs.io/).
When running `pantools add_functional_annotations`, an InterPro database needs to be present in the `addons/` directory.
Calling the following bash script will download an InterPro database:
~~~bash
addons/wget_interpro.sh
~~~
The `consensus_tree` functionality depends on ASTER, which is included as a Git submodule. To compile this submodule, run:
```bash
cd ASTER
g++ -std=gnu++11 -march=native -Ofast -pthread astral.cpp -o astral
g++ -std=gnu++11 -march=native -Ofast -pthread astral-pro.cpp -o astral-pro
```
If the directory `pantools/ASTER` is empty, run the following to retrieve its contents:
```bash
git submodule update --init --recursive
```
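The emptiness check can be scripted.
The sketch below is only an illustration: `dir_is_empty` is a hypothetical helper, and it is demonstrated on temporary directories rather than on a real checkout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when the directory is missing or contains no entries.
dir_is_empty() {
  [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Demonstrate the check on throwaway directories.
sandbox=$(mktemp -d)
mkdir "$sandbox/empty" "$sandbox/filled"
touch "$sandbox/filled/astral.cpp"

if dir_is_empty "$sandbox/empty"; then empty_result="empty"; else empty_result="not empty"; fi
if dir_is_empty "$sandbox/filled"; then filled_result="empty"; else filled_result="not empty"; fi
echo "empty/: $empty_result, filled/: $filled_result"

# In a PanTools checkout the same check would guard the submodule update:
#   if dir_is_empty ASTER; then git submodule update --init --recursive; fi
```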
# Accurate Species Tree EstimatoR
A family of ASTRAL-like algorithms
# ASTERISK
Accurate Species Tree Estimation by diRectly Inferring from Site Kernels
# Compile for Linux/Unix
`g++ -std=gnu++11 -march=native -Ofast -pthread asterisk.cpp -o asterisk`
# Run
asterisk [-o oFilePath -r nRound -s nSample -p probability -t nThread -y] inputList
-o path to output file (default: stdout)
-r number of total rounds of placements (default: 4)
-s number of total rounds of subsampling (default: 4)
-p subsampling probability of keeping each taxon (default: 0.5)
-t number of threads (default: 1)
-y take one input in PHYLIP format instead of a list of inputs in FASTA format
inputList: the path to a file containing a list of paths to input aligned gene files, one file per line
Gene files must be in FASTA format. The header line should be ">Species_Name".
Example run:
`./asterisk example/list.txt`
`./asterisk -y example/example.phylip`
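As an illustration of the expected input layout, the sketch below writes two toy gene alignments and the list file that points to them.
The species names, sequences and paths are hypothetical and carry no biological meaning.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
mkdir "$workdir/genes"

# Two toy aligned gene files in FASTA format; each header is ">Species_Name".
printf '%s\n' \
  '>Species_A' 'ACGTACGT' \
  '>Species_B' 'ACGTACGA' \
  '>Species_C' 'ACGAACGA' > "$workdir/genes/gene1.fasta"
printf '%s\n' \
  '>Species_A' 'TTGACCAA' \
  '>Species_B' 'TTGACCAT' \
  '>Species_C' 'TTGTCCAT' > "$workdir/genes/gene2.fasta"

# inputList: one path to a gene file per line.
printf '%s\n' "$workdir"/genes/*.fasta > "$workdir/list.txt"

n_files=$(grep -c '' "$workdir/list.txt")
n_headers=$(grep -c '^>' "$workdir/genes/gene1.fasta")
echo "listed gene files: $n_files, sequences in gene1: $n_headers"

# With a compiled binary, the run would then be:
#   ./asterisk "$workdir/list.txt"
```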
# Assumptions for Statistical Consistency
## The multi-species coalescent model
1. The gene trees are generated independently, and as the number of genes goes to infinity, ASTERISK is statistically consistent.
2. The coalescent units do not need to be the same across branches.
## Gene tree and sequence model
1. The mutation rates (per unit time) do not need to be the same across branches, and within each branch the mutation rate does not need to scale over time the same way as the coalescent does, as long as it is reasonable (e.g. infimum/minimum above zero and capped).
2. The length of each gene can be arbitrary and may depend on the parameters above, as long as it is reasonable (e.g. infimum/minimum above zero and capped).
## Felsenstein 1981 model-like
1. Base frequencies are provided and allowed to vary from 0.25, but the rate matrix must be F81-like.
2. The sum of the top 2 base frequencies must be less than 1. In other words, the number of categories must be at least 3, which unfortunately excludes binary inputs (e.g. major or minor alleles) but allows nucleotides (4) and amino acids (20). (Base positions with no more than 2 effective categories will neither contribute to nor bias the inferred species tree.)
3. Different base positions (or genes) are allowed to have different base frequencies, which may depend on the parameters above, as long as they are reasonable (e.g. non-zero for at least 3 categories) and provided.
# astral(-pro)
Optimizing ASTRAL(-pro) objective function using ASTER method
# Compile for Linux/Unix
`g++ -std=gnu++11 -march=native -Ofast -pthread astral.cpp -o astral`
`g++ -std=gnu++11 -march=native -Ofast -pthread astral-pro.cpp -o astral-pro`
# Run
astral(-pro) [-o oFilePath -r nRound -s nSample -p probability -t nThread -a taxonNameMaps] inputGeneTrees
-o path to output file (default: stdout)
-r number of total rounds of placements (default: 4)
-s number of total rounds of subsampling (default: 4)
-p subsampling probability of keeping each taxon (default: 0.5)
-t number of threads (default: 1)
-a a file mapping gene names to taxon names; each line contains one gene name followed by one taxon name, separated by a space or tab
inputGeneTrees: the path to a file containing all gene trees in Newick format
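A minimal sketch of a mapping file for `-a`, with a check that every line holds exactly two fields.
The gene and taxon names are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# One gene name followed by one taxon name per line, separated by a tab.
printf '%s\t%s\n' \
  geneA_1 taxonA \
  geneA_2 taxonA \
  geneB_1 taxonB > "$workdir/taxon_map.txt"

# Count lines that do not consist of exactly two whitespace-separated fields.
bad_lines=$(awk 'NF != 2 { n++ } END { print n + 0 }' "$workdir/taxon_map.txt")
n_lines=$(grep -c '' "$workdir/taxon_map.txt")
echo "lines: $n_lines, malformed: $bad_lines"

# With a compiled binary and a gene tree file (not created here), a run would be:
#   ./astral-pro -a "$workdir/taxon_map.txt" genetrees.nwk
```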
#include<iostream>
#include<fstream>
#include<unordered_map>
#include<cstdio>
#include<cstdlib>
#include<cstring>
//#define LARGE_DATA
#ifdef LARGE_DATA
typedef long double score_t;
typedef long long count_t;
#else
typedef double score_t;
typedef int count_t;
#endif
#include "binary.hpp"
#include "algorithms.hpp"
using namespace std;
TripartitionInitializer tripInit;
vector<string> names;
unordered_map<string, int> name2id;
string formatName(const string name){
string res;
for (char c: name){