Map population to reference genome
First follow the instructions in the step-by-step guide on how to use my pipelines.
For background, see the introduction to Snakemake.
Create conda environment:
conda env create --name <name-of-pipeline> --file population-variant-calling.yml
This environment contains snakemake and the other packages that are needed to run the pipeline.
Activate environment:
conda activate <name-of-pipeline>
Create HPC config file:
Snakemake needs this file to prepare and submit jobs to the cluster.
mkdir -p ~/.config/snakemake/<name-of-pipeline>
nano ~/.config/snakemake/<name-of-pipeline>/config.yaml
Add the following and save the file (Ctrl+X in nano):
jobs: 10
cluster: "sbatch -t 2-0:0:0 --mem=16000 -c8 --job-name={rule} --output=logs_slurm/{rule}_%j.out --error=logs_slurm/{rule}_%j.err"
use-conda: true
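The cluster line above is a template: Snakemake replaces `{rule}` with the name of the rule being submitted, and SLURM replaces `%j` with the job ID. With these settings each job requests a time limit of 2 days (`-t 2-0:0:0`), 16000 MB of memory (`--mem=16000`) and 8 CPUs (`-c8`). The sketch below only illustrates the substitution. It uses Python's `str.format` as a stand-in for what Snakemake does internally, and `bwa_map` is a hypothetical rule name, not necessarily one used by this pipeline.

```python
# Illustration only: Snakemake fills in {rule} itself; str.format stands in for that step.
CLUSTER_TEMPLATE = (
    "sbatch -t 2-0:0:0 --mem=16000 -c8 --job-name={rule} "
    "--output=logs_slurm/{rule}_%j.out --error=logs_slurm/{rule}_%j.err"
)


def render_cluster_command(rule):
    """Return the sbatch command line for one rule (hypothetical helper)."""
    return CLUSTER_TEMPLATE.format(rule=rule)


if __name__ == "__main__":
    # "bwa_map" is a made-up rule name used only for this example.
    print(render_cluster_command("bwa_map"))
```

The `%j` placeholder is left untouched by the substitution, so SLURM can still expand it when the job starts.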
ABOUT
This is a pipeline to map short reads from several individuals to a reference assembly. It outputs the mapped reads and a qualimap report.
Tools used:
- Bwa-mem2 - mapping
- Samtools - processing
- Samblaster - marking duplicates
- Qualimap - mapping summary
Figure: Pipeline workflow
Edit config.yaml with the paths to your files
ASSEMBLY: /path/to/assembly
OUTDIR: /path/to/outdir
PATHS_WITH_FILES:
  path1: /path/to/dir
- ASSEMBLY - path to the assembly file
- OUTDIR - directory where snakemake will run and where the results will be written.
  If you want the results to be written to this directory (not to a new directory), comment out OUTDIR: /path/to/outdir
- PATHS_WITH_FILES - directory that can contain subdirectories where the fastq reads are located. You can add several paths by adding a line such as
    path2: /path/to/dir
  under PATHS_WITH_FILES. The line you add must be indented.
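Before starting a run it can help to check that the paths in config.yaml exist. The following is a minimal sketch, not part of the pipeline: `check_config` is a hypothetical helper that takes the configuration as a Python dict (for example, as returned by a YAML parser) and reports missing paths. It treats OUTDIR as optional, since that line may be commented out.

```python
from pathlib import Path
from typing import Dict, List


def check_config(config: Dict) -> List[str]:
    """Return a list of problems found in the config dict (hypothetical helper)."""
    problems = []

    assembly = config.get("ASSEMBLY")
    if not assembly or not Path(str(assembly)).is_file():
        problems.append(f"ASSEMBLY is not an existing file: {assembly}")

    # OUTDIR is not checked: it is optional and may not exist before the first run.

    paths = config.get("PATHS_WITH_FILES") or {}
    if not isinstance(paths, dict) or not paths:
        # A missing or unindented entry leaves PATHS_WITH_FILES empty.
        problems.append("PATHS_WITH_FILES needs at least one indented 'name: path' entry")
    else:
        for name, directory in paths.items():
            if not Path(str(directory)).is_dir():
                problems.append(f"PATHS_WITH_FILES.{name} is not an existing directory: {directory}")

    return problems
```

An empty list means that the assembly file and every directory under PATHS_WITH_FILES were found.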
The script searches the subdirectories of each directory listed under PATHS_WITH_FILES for files with a fastq extension.
Example: if path1 is /lustre/nobackup/WUR/ABGC/shared/Chicken/Africa/X201SC20031230-Z01-F006_multipath, the subdirectory structure could be:
/lustre/nobackup/WUR/ABGC/shared/Chicken/Africa/X201SC20031230-Z01-F006_multipath
├── X201SC20031230-Z01-F006_1
│   └── raw_data
│       ├── a109_26_15_1_H
│       │   ├── a109_26_15_1_H_FDSW202597655-1r_HWFFFDSXY_L3_1.fq.gz
│       │   ├── a109_26_15_1_H_FDSW202597655-1r_HWFFFDSXY_L3_2.fq.gz
│       │   └── MD5.txt
│       └── a20_10_16_1_H
│           ├── a20_10_16_1_H_FDSW202597566-1r_HWFFFDSXY_L3_1.fq.gz
│           ├── a20_10_16_1_H_FDSW202597566-1r_HWFFFDSXY_L3_2.fq.gz
│           └── MD5.txt
└── X201SC20031230-Z01-F006_2
    └── raw_data
        ├── a349_Be_17_1_C
        │   ├── a349_Be_17_1_C_FDSW202597895-1r_HWFFFDSXY_L3_1.fq.gz
        │   ├── a349_Be_17_1_C_FDSW202597895-1r_HWFFFDSXY_L3_2.fq.gz
        │   └── MD5.txt
        └── a360_Be_05_1_H
            ├── a360_Be_05_1_H_FDSW202597906-1r_HWFFFDSXY_L3_1.fq.gz
            ├── a360_Be_05_1_H_FDSW202597906-1r_HWFFFDSXY_L3_2.fq.gz
            └── MD5.txt
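The search through the subdirectories can be pictured with the sketch below. It is not the pipeline's own code: `find_fastq_by_sample` is a hypothetical helper, the list of extensions is an assumption, and using the name of the directory that holds the reads as the sample name is also an assumption based on the example tree above.

```python
import os
from collections import defaultdict

# Assumed set of fastq extensions; the pipeline may accept a different set.
FASTQ_SUFFIXES = (".fq", ".fastq", ".fq.gz", ".fastq.gz")


def find_fastq_by_sample(root):
    """Map each directory that directly holds fastq files to its sorted file paths."""
    samples = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        # Assumption: the directory name is the sample name, as in the example tree.
        sample = os.path.basename(os.path.normpath(dirpath))
        for filename in filenames:
            if filename.endswith(FASTQ_SUFFIXES):
                samples[sample].append(os.path.join(dirpath, filename))
    return {sample: sorted(paths) for sample, paths in samples.items()}
```

Run on the example directory, this would return one entry per sample directory (such as a109_26_15_1_H), each listing its two .fq.gz files and skipping MD5.txt.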
RESULTS
- <run_date>_files.txt - dated file with an overview of the files used to run the pipeline (for documentation purposes)
- processed_reads - directory with the BAM files containing the mapped reads for every sample
- mapping_stats - directory containing the qualimap results and a summary of the qualimap results for all samples in sample_quality_summary.tsv
  - qualimap - contains the qualimap results per sample
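To look at the summary for all samples after a run, the .tsv file can be loaded with Python's standard library. This is a minimal sketch that assumes the file starts with a header row. The column names are not documented here, so the code does not rely on any particular column, and `read_quality_summary` is a hypothetical helper.

```python
import csv


def read_quality_summary(tsv_path):
    """Read a tab-separated summary into a list of dicts keyed by the header row."""
    with open(tsv_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each returned dict corresponds to one row of the file, so `len(read_quality_summary(path))` gives the number of rows read.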