Assemble nanopore reads and do variant calling with short and long reads
Owner: https://github.com/CarolinaPB copied from: https://github.com/CarolinaPB/nanopore-assembly
First follow the instructions here:
Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake
ABOUT
This is a pipeline that uses Flye
to create a nanopore assembly. It also does variant calling with long and short reads.
The pipeline starts by using porechop
to trim the adaptors, then it uses Flye
to create the assembly. After that, ntLink-arks
from Lonstitch
is used to scaffold the assembly using the nanopore reads. The scaffolded assembly is polished with polca
. Bwa-mem2
is used to map the short reads to the assembly and Freebayes
to do variant calling using these reads. Minimap2
is used to map the long reads to the assembly, and longshot
for variant calling using these.
In the end, in addition to your assembly and variant calling results, you'll also get assembly statistics and busco scores before and after the polishing.
Tools used:
- Porechop - trim adaptors
- Flye - assembly
- Seqtk - convert fasta to one line fasta
- LongStitch (ntLink-arks) - scaffolding with nanopore reads
- BUSCO - assess assembly completeness
- MaSuRCA (polca) - polish assembly
- Python - get assembly stats
- Minimap2 - map long reads to reference. Genome alignment
- R - pafCoordsDotPlotly - plot genome alignment, to compare the new assembly with another genome
![]() |
---|
Pipeline workflow |
Copy the pipeline:
Copy the pipeline to your directory, where you would run the pipeline.
cp -r /lustre/nobackup/WUR/ABGC/shared/pipelines_version2/nanopore-assembly <directory where you want to save it to>
Install conda if it is not yet installed
See installation instructions below.
Activate environmet:
Since the pipelines can take a while to run, it’s best if you use a screen session. By using a screen session, Snakemake stays “active” in the shell while it’s running, there’s no risk of the connection going down and Snakemake stopping. screen -S <name of session>
to start a new screen session and screen -r <name of session
to reattach to the screen session. In the screen session activate the conda environment.
conda activate /lustre/nobackup/WUR/ABGC/shared/pipelines_version2/envs/nanopore-assembly2
Edit config.yaml with the paths to your files
LONGREADS: <nanopore_reads.fq.gz>
SHORTREADS:
- /path/to/short/reads_1.fq.gz
- /path/to/short/reads_2.fq.gz
GENOME_SIZE: <approximate genome size>
PREFIX: <prefix>
OUTDIR: /path/to/outdir
BUSCO_LINEAGE:
- <lineage>
# genome alignment parameters:
COMPARISON_GENOME (to plot against the new assembly):
<species>: /path/to/genome/fasta
# filter alignments less than cutoff X bp
MIN_ALIGNMENT_LENGTH: 10000
MIN_QUERY_LENGTH: 50000
- LONGREADS - name of file with long reads. This file should be in the working directory (where this config and the Snakefile are)
- SHORTREADS - paths to short reads fq.gz of sample, used for polishing (if no short reads available leave it empty)
- GENOME_SIZE - approximate genome size
haploid genome size (bp)(e.g. '3e9' for human genome)
from longstitch - PREFIX - prefix for the created files
- OUTDIR - directory where snakemake will run and where the results will be written to
If you want the results to be written to this directory (not to a new directory), open config.yaml and comment outOUTDIR: /path/to/outdir
- BUSCO_LINEAGE - lineage used for busco. Can be one or more (one per line). To see available lineages run
busco --list-datasets
- COMPARISON_GENOME - genome for whole genome comparison. Add your species name and the path to the fasta file. ex:
chicken: /path/to/chicken.fna.gz
. You can add several genomes, one on each line.- If you don't want to run the genome alignment step, comment out
COMPARISON_GENOME:
<species>: /path/to/genome/fasta
- MIN_ALIGNMENT_LENGTH and MIN_QUERY_LENGTH - parameters for plotting. If your plot is coming out blank or if there's an error with the plotting step, try lowering these thresholds. This happens because the alignments are not large enough.
If you have your long reads in several fastq files and need to create one file compressed file with all the reads:
- In your pipeline directory create one file with all the reads
cat /path/to/fastq/directory/*.fastq > <name of file>.fq
- Compress the file you just created:
gzip <name of file>.fq
If no shortreads of the sample available for polising remove "after" in the rule all: expand("busco_{prefix}{busco_cat}polish{lineage}/short_summary.specific.{lineage}.busco{prefix}_{busco_cat}polish{lineage}.txt", prefix=PREFIX, lineage=BUSCO_LINEAGE, busco_cat = ["before"]),
Run the pipeline on the cluster:
First test the pipeline with a dry run: snakemake -np
. This will show you the steps and commands that will be executed. Check the commands and file names to see if there’s any mistake. If all looks ok, you can now run your pipeline
To run the pipeline, run the following command:
snakemake -j 2 --cluster-config cluster.yaml --cluster "sbatch --mem={cluster.mem} --time {cluster.time} --cpus-per-task {cluster.threads} --job-name={cluster.name} --output={cluster.output} --error={cluster.error}"
RESULTS
The most important files are and directories are:
- <run_date>_files.txt dated file with an overview of the files used to run the pipeline (for documentation purposes)
-
results directory that contains
-
{prefix}_oneline.k32.w100.ntLink-arks.longstitch-scaffolds.fa.PolcaCorrected.fa final assembly
-
assembly_stats_<prefix>.txt file with assembly statistics for the final assembly
-
genome_alignment directory with results and figure from whole genome alignment
- {prefix}_{species}.png
-
-
mapped directory that contains the bam file with long reads mapped to the new assembly
- {prefix}_longreads.mapped.sorted.bam
-
busco_{prefix}_before_polish_ and busco_{prefix}_after_polish directories - contain busco results before and after polishing
- short_summary.specific.{lineage}.{prefix}_before_polish.txt
- short_summary.specific.{lineage}.{prefix}_after_polish.txt"
- other_files - directory containing other files created during the pipeline
- assembly - directory containing files created during the assembly step
If you want to do variant calling in the same pipeline with short and longreads use: https://git.wur.nl/kim.lensing/nanopore-assembly2