Whole genome alignment pipeline
dependencies:
MUMmer 4.x (https://github.com/mummer4/mummer); make sure its binaries are on your PATH
samtools
SLURM (the pipeline is written for a cluster that uses SLURM as its queue manager)
usage:
your working directory needs the fasta files of your genomes, a fasta index for each of them (for f in *.fa; do samtools faidx "$f"; done), and a seqfile in this format:
*species_name fastafile.fa
species_name2 fastafile2.fa
species_name3 fastafile3.fa
the star marks the reference species.
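
A quick sanity check of the seqfile before submitting can save a failed run. The sketch below is not part of the pipeline; parse_seqfile is a hypothetical helper, and it assumes that the two columns are separated by whitespace and that exactly one species is marked with a star.

```python
from pathlib import Path


def parse_seqfile(path):
    """Return (reference_name, {species_name: fasta_file}) for a seqfile.

    Assumptions (not confirmed by the pipeline itself): columns are
    whitespace-separated and exactly one species carries the leading '*'.
    """
    reference = None
    genomes = {}
    lines = Path(path).read_text().splitlines()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 'species_name fastafile', got {line!r}"
            )
        name, fasta = fields
        if name.startswith("*"):
            if reference is not None:
                raise ValueError(f"line {line_number}: second reference species")
            name = name[1:]  # drop the star, keep the species name
            reference = name
        if name in genomes:
            raise ValueError(f"line {line_number}: duplicate species {name!r}")
        genomes[name] = fasta
    if reference is None:
        raise ValueError("no reference species marked with '*'")
    return reference, genomes
```

Calling parse_seqfile("your_seqfile.txt") returns the reference name and a mapping from species to fasta file, or raises ValueError with the offending line number.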
Note that the pipeline does not handle dots in the fasta headers well, so if your headers contain them, replace them with something else (like _)
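
One way to get rid of the dots is a small script like the sketch below. It is not part of the pipeline; replace_dots_in_headers is a hypothetical helper that rewrites every dot on a header line (lines starting with >) and leaves sequence lines untouched. It writes to a new file, so the fasta index has to be rebuilt for the output.

```python
def replace_dots_in_headers(in_path, out_path, replacement="_"):
    """Copy a fasta file, replacing '.' in header lines with `replacement`.

    Only lines starting with '>' are changed; sequence lines are copied
    unchanged. Returns the number of header lines that were modified.
    """
    changed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">") and "." in line:
                line = line.replace(".", replacement)
                changed += 1
            dst.write(line)
    return changed
```

After running it, point the seqfile at the new fasta file and run samtools faidx on it again.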
then, inside this directory, do:
ln -s /path/to/whole_genome_alignment/* .
to fetch all the necessary scripts
after this you can submit a whole genome alignment by using:
python submit_multi_genome_v2.py --seqfile your_seqfile.txt --prefix output_prefix --split_in_this_amount max_amount_of_jobs
Note that the theoretical maximum number of jobs is set by the number of fasta entries in your reference fasta (these are the chromosomes if the genome is completely assembled)
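
To find that upper bound, you can count the entries in the reference's fasta index, since a .fai file written by samtools faidx has one line per sequence. The sketch below is not part of the pipeline; count_fasta_entries is a hypothetical helper, and it assumes the index sits next to the fasta file under the default name (fastafile.fa.fai).

```python
def count_fasta_entries(fai_path):
    """Count the sequences listed in a samtools .fai index.

    Each non-empty line of the index describes one fasta entry, so the
    line count is the number of scaffolds or chromosomes.
    """
    with open(fai_path) as fai:
        return sum(1 for line in fai if line.strip())
```

For example, count_fasta_entries("fastafile.fa.fai") gives the largest value that makes sense for --split_in_this_amount; asking for more than that should not create additional work to distribute.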
the script will submit jobs to SLURM corresponding to the different pipeline stages, with dependencies so they are scheduled after the previous stage has finished.
Stages are:
1. alignment (MUMmer)
2. merging then splitting the alignments by scaffolds of the reference species
3. merging pairwise alignments into multi alignments for these scaffolds (i.e. local re-alignment)
4. concatenate results and cleanup
you can keep an eye on the jobs by typing squeue -u yourusername. If you see any job with the reason DependencyNeverSatisfied, something went wrong: the job it depends on most likely failed, so this job will never start.
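
If you would rather not scan the squeue output by eye, the check can be scripted. The sketch below is not part of the pipeline; both function names are hypothetical. It asks squeue for job ID and reason only (-h suppresses the header, -o "%i %r" selects the two fields) and lists the jobs stuck on DependencyNeverSatisfied.

```python
import subprocess


def query_squeue(user):
    """Return squeue output for `user` as 'jobid reason' lines (no header)."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def jobs_with_unsatisfiable_dependency(squeue_output):
    """Return the job IDs whose reason is DependencyNeverSatisfied."""
    job_ids = []
    for line in squeue_output.splitlines():
        fields = line.split(None, 1)  # job ID, then the rest of the line
        if len(fields) == 2 and fields[1].strip() == "DependencyNeverSatisfied":
            job_ids.append(fields[0])
    return job_ids
```

A typical call would be jobs_with_unsatisfiable_dependency(query_squeue("yourusername")); an empty list means no job is blocked for that reason. Such jobs usually stay in the queue until you remove them with scancel.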
good luck ;)