Whole genome alignment pipeline

dependencies:
MUMmer 4.x (https://github.com/mummer4/mummer); make sure the directory containing its binaries is on your PATH
samtools
SLURM (made to run on a cluster with SLURM as queue manager)

usage:
in your working directory, you will need fasta files of your genomes, fasta indexes of these files (for f in *.fa; do samtools faidx "$f"; done), and a seqfile with this format (species name and fasta file name, separated by a tab):
*species_name	fastafile.fa
species_name2	fastafile2.fa
species_name3	fastafile3.fa

the star (*) in front of a species name marks the reference species.
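
Before submitting, it can help to check that the seqfile is well formed. The following is a minimal sketch, not part of the pipeline: parse_seqfile is a hypothetical helper, and it assumes two whitespace-separated columns per line and exactly one starred reference species.

```python
def parse_seqfile(text):
    """Return (reference_name, {species_name: fasta_file}) from seqfile text.

    Assumption: each non-empty line has two whitespace-separated columns,
    and exactly one species name carries a leading '*'.
    """
    reference = None
    genomes = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        name, fasta = fields
        if name.startswith("*"):
            if reference is not None:
                raise ValueError(f"line {lineno}: more than one reference species")
            name = name[1:]  # drop the star from the stored name
            reference = name
        if name in genomes:
            raise ValueError(f"line {lineno}: duplicate species name {name!r}")
        genomes[name] = fasta
    if reference is None:
        raise ValueError("no reference species marked with '*'")
    return reference, genomes
```

To use it, read the file with open("your_seqfile.txt").read() and pass the text in; a ValueError points at the offending line.
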
Note that the pipeline does not handle dots in the fasta headers well, so if your headers contain dots, please replace them with something else (like _).
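
One way to replace the dots is sketched below. This is an illustration only, not part of the pipeline: sanitize_headers is a hypothetical helper, and it replaces every dot on a header line (including any description after the sequence name), which may be more than you want.

```python
def sanitize_headers(lines, replacement="_"):
    """Yield fasta lines with dots replaced on header lines only.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield line.replace(".", replacement)
        else:
            yield line
```

For example, open the original fasta for reading and a new file for writing, then call out.writelines(sanitize_headers(infile)). The .fai index stores the sequence names, so run samtools faidx on the new file afterwards.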

then, inside this directory, do:
ln -s /path/to/whole_genome_alignment/* .
to link all the necessary scripts into it.

after this you can submit a whole genome alignment by using:
python submit_multi_genome_v2.py --seqfile your_seqfile.txt --prefix output_prefix --split_in_this_amount max_amount_of_jobs

Note that the theoretical maximum number of jobs is the number of fasta entries in your reference fasta (that is, the number of chromosomes if the genome is completely assembled).
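
To find that maximum, you can count the entries in the reference's .fai index, which has one line per fasta entry. The sketch below is an illustration; count_fai_entries and effective_jobs are hypothetical helpers, and whether the submit script caps the value itself is not stated here.

```python
def count_fai_entries(fai_text):
    """Count fasta entries: a .fai index has one non-empty line per entry."""
    return sum(1 for line in fai_text.splitlines() if line.strip())


def effective_jobs(requested, fai_text):
    """Cap the requested number of jobs at the number of reference entries."""
    if requested < 1:
        raise ValueError("requested number of jobs must be at least 1")
    return min(requested, count_fai_entries(fai_text))
```

For example, effective_jobs(50, open("fastafile.fa.fai").read()) gives a sensible value for --split_in_this_amount when the reference has fewer than 50 entries.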

the script submits jobs to SLURM for each pipeline stage, with dependencies so that each stage is scheduled only after the previous stage has finished.
Stages are:
1. alignment (MUMmer)
2. merging then splitting the alignments by scaffolds of the reference species
3. merging pairwise alignments into multi alignments for these scaffolds (e.g. local re-alignment)
4. concatenating results and cleaning up
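
For readers unfamiliar with SLURM dependencies, the general idea of chaining stages can be sketched as follows. This is not the code of submit_multi_genome_v2.py: sbatch_command and submit are hypothetical helpers, and the use of the afterok dependency type is an assumption. The sbatch options --parsable and --dependency are standard SLURM options.

```python
import subprocess


def sbatch_command(script, depends_on=()):
    """Build an sbatch command that waits for the given job IDs to succeed."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints only the job ID
    if depends_on:
        ids = ":".join(str(job_id) for job_id in depends_on)
        cmd.append("--dependency=afterok:" + ids)
    cmd.append(script)
    return cmd


def submit(script, depends_on=()):
    """Submit a job script and return its job ID as a string."""
    result = subprocess.run(
        sbatch_command(script, depends_on),
        capture_output=True, text=True, check=True,
    )
    # With --parsable the output is "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]
```

A later stage would then be submitted with the job IDs returned by the earlier stage, for example submit("stage2.sh", depends_on=stage1_ids), where stage2.sh and stage1_ids are placeholder names.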

you can keep an eye on the jobs by typing squeue -u yourusername; if you see any jobs with reason DependencyNeverSatisfied, then something went wrong in an earlier stage and those jobs will never start
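
If you prefer to check this from a script, the sketch below lists jobs stuck on a failed dependency. It is an illustration, not part of the pipeline: query_squeue and find_stuck_jobs are hypothetical helpers. The squeue options used (-h for no header, -o with %i for job ID and %r for reason) are standard SLURM options.

```python
import subprocess


def query_squeue(user):
    """Return squeue output with one 'jobid reason' line per job."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %r"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_stuck_jobs(squeue_output):
    """Return IDs of jobs whose reason is DependencyNeverSatisfied."""
    stuck = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "DependencyNeverSatisfied":
            stuck.append(fields[0])
    return stuck
```

Calling find_stuck_jobs(query_squeue("yourusername")) returns an empty list while everything is on track. Stuck jobs stay in the queue until you remove them with scancel.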

good luck ;)