VCF Quality and Population Structure Pipeline
Author information
Author: Nino Menger
Under the guidance of: Mirte Bosse
Date finished:
Goal and functionality
The pipeline visualises the quality and structure of multi-sample VCF data. These visualisations can be used to select individuals for further analysis. PCA plots are created to give insight into population structure. Sequencing depth heatmaps are created to give insight into data quality.
Furthermore, scripts are included to estimate the sex of all individuals based on X chromosome sequencing depth.
Please read 'Thesis/NinoMenger_s1098386_Basftu_Thesis_v1.0.0.pdf' for a full overview of the output, functionality and limitations of the pipeline.
Run the pipeline
The pipeline is built to run on the Anunna cluster, using the SLURM system to queue jobs. To run it, follow these steps:
1- Download the entire project and place it on the Anunna cluster.
This can be done with the following command:
git clone https://git.wur.nl/NinoMenger/vcf-quality-and-population-structure-pipeline
2- Make sure your multi-sample VCF data is ready for use.
The VCF data must contain multiple samples and must be split by chromosome across multiple VCF files.
Furthermore, a tab-separated annotation file must be present. This annotation file can contain a maximum of five columns, of which the first contains the sample identifiers. The other four columns can be used to classify the samples as the user wishes. For the original project, the following four classification columns were used: species, domestication status, continental origin and specific origin. For the 'Multisample_VCFs_Sscrofa11.1' data set, which was used to create the pipeline, a script capable of converting the original annotation file to a more organised one is included.
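The annotation file requirements above can be checked before starting the pipeline. The following is a minimal Python sketch, not part of the pipeline itself: the function name `check_annotation` and the sample identifiers in the demo are hypothetical, and it only checks what is stated above (tab-separated, at most five columns, sample identifier in the first column).

```python
import csv
import io

MAX_COLUMNS = 5  # sample identifier plus up to four classification columns


def check_annotation(handle):
    """Return the sample identifiers; raise ValueError on a malformed row."""
    samples = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        # Skip completely empty lines.
        if not row or all(not field.strip() for field in row):
            continue
        if len(row) > MAX_COLUMNS:
            raise ValueError(
                f"line {line_number}: {len(row)} columns, "
                f"expected at most {MAX_COLUMNS}"
            )
        sample_id = row[0].strip()
        if not sample_id:
            raise ValueError(f"line {line_number}: empty sample identifier")
        samples.append(sample_id)
    return samples


if __name__ == "__main__":
    # Hypothetical annotation rows: identifier, species, domestication
    # status, continental origin, specific origin.
    demo = io.StringIO(
        "SAMPLE01\tSus scrofa\tWild\tEurope\tNetherlands\n"
        "SAMPLE02\tSus scrofa\tDomestic\tAsia\tChina\n"
    )
    print(check_annotation(demo))
```

For a real file, the function can be called as `check_annotation(open(path, newline=""))`, where `path` points to the annotation file.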
3- Set up the 'SnakefileConfig.yaml' file
This file lets the user set up the pipeline according to their wishes. For the 'Multisample_VCFs_Sscrofa11.1' data set, which was used to create the pipeline, all settings are already optimised. For other data sets, the variables listed below have to be changed. The downloaded config file can be used as an example of what the settings should look like.
INFILES: File structure of the multi-sample VCF data set. Must be structured the following way: [path to directory containing all files][base file name]{CHR}[end of file name including the extension (vcf.gz)]
ANNOTATIONFILE: [path including file name to the sample annotation file discussed above]
SEQDEPTH: Sequencing depth output file. Must be structured the following way: [path including file name to the desired location]
HEATMAP: Sequencing depth heatmap output structure. Must be structured the following way: [path including the base of the file name to the desired location]{GROUP}[.png extension]
PCA: PCA output structure. Must be structured the following way: [path including the base of the file name to the desired location]
SD_CHROMOSOMES: The number of chromosomes to derive the sequencing depth from (starts at chromosome 1, ends at the X chromosome).
PCA_CHROMOSOME: Which chromosome to base the PCA on.
GROUPS: Defines the groups to visualise separately in the plots. Samples are assigned to a group with regex statements that are used to select samples in the annotation file. Must be structured the following way:
[Group name]:
    [name of the second column of the annotation file]: "[regex used to select individuals from the annotation file]"
    [... the above for the other three columns in the annotation file]
    Title: "[Description of the group, used in the plot titles]"
    POI: ["[Column names that need to be visualised in the plot]"]
[... the above for the other groups the user would like to visualise]
The column names used here do not refer to column names in the annotation file, since its columns are unnamed. However, the names are used in the plots. The following regex statement can be used to select all samples from a certain column: ".*".
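Two conventions from the settings above can be illustrated with a short Python sketch: filling in the {CHR} placeholder of INFILES, and selecting the samples of a group with one regex per classification column. This is an illustration only and not the pipeline's own code. The function names, file path and sample rows are hypothetical, and the use of `re.search` (a pattern may match anywhere in the field) is an assumption; whether the pipeline anchors its patterns is not stated here.

```python
import re


def expand_infiles(pattern, chromosomes):
    """Fill the {CHR} placeholder of an INFILES-style pattern per chromosome."""
    return [pattern.replace("{CHR}", str(chrom)) for chrom in chromosomes]


def select_samples(rows, column_regexes):
    """Return the IDs of samples whose classification columns all match.

    rows: lists with the sample ID first, then the classification columns.
    column_regexes: one regex per classification column, in column order.
    """
    compiled = [re.compile(regex) for regex in column_regexes]
    selected = []
    for row in rows:
        sample_id, fields = row[0], row[1:]
        # Assumption: a column matches if the regex is found anywhere in it.
        if all(rx.search(field) for rx, field in zip(compiled, fields)):
            selected.append(sample_id)
    return selected


if __name__ == "__main__":
    # Hypothetical path and annotation rows, for illustration only.
    print(expand_infiles("/data/vcf/cohort_chr{CHR}.vcf.gz", [1, 2, "X"]))
    rows = [
        ["S1", "Sus scrofa", "Wild", "Europe", "Netherlands"],
        ["S2", "Sus scrofa", "Domestic", "Asia", "China"],
        ["S3", "Sus scrofa", "Domestic", "Europe", "Spain"],
    ]
    # ".*" keeps every value of a column; "Domestic" restricts the second one.
    print(select_samples(rows, [".*", "Domestic", ".*", ".*"]))
```

Under this assumption, a pattern such as "^Europe$" selects only exact matches, while ".*" leaves a column unrestricted.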
4- Check the cluster config file: 'clusterConfig.yaml'.
This file is used by the SLURM system to reserve resources, send emails about progress and write log files. The resources are already set in the downloaded file, but the other quality-of-life settings can be altered if desired.
5- Run RUNME.sh
Run the following command to run the pipeline: bash RUNME.sh
Used programs and versions
snakemake 7.8.3
PLINK 1.90b3.38
VCFtools 0.1.16
R 4.2.0