-
Derks, Martijn authoredDerks, Martijn authored
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions
Variant annotation
make a new folder to work in, and go there:
mkdir annotation_session
cd annotation_session
Create an Env.Var. to the communal session directory that holds some of the files you need:
SESSIONDIR=/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation
And make sure python3 can be found:
module load python/3.4.2
Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions. The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:
tabix -h $SESSIONDIR/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
tabix -p vcf BT18.vcf.gz
Have a look at what is inside this file:
gunzip -c BT18.vcf.gz | grep -v '^##' | more
or
tabix BT18.vcf.gz 18:1-10000 | more
Make sure you understand the basic features of the VCF. Have a look here, if you haven't done already:
http://www.1000genomes.org/wiki/analysis/vcf4.0
Annotating VCF with rs-numbers
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file:
/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so:
## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
module load vcftools/gcc/0.1.12b ## first load vcftools
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail
Now make a new VCF file for all the variants called for chromosome 18 using this tool:
tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
tabix -p vcf BT18_rsnumbers.vcf.gz
Again, inspect the results. What do you notice in the third column? Do all variants now have an rs-number annotated?
Extracting variants using BEDtools
You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily achieve using the tabix command described above. However, sometimes you may want to extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or Python. However, BedTools provides many utilities for manipulating interval data, such as BED, VCF, or GTF/GFF formats.
Say you are interested in the variants in all the genes lying between coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of genes can be found in a GTF file. The GTF file for the UMD3.1 genome build can be found at Ensembl and downloaded using the command line.
## NOTE: the compute nodes you are working on do not have internet connectivity.
## This line of code is just for completeness):
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
The file is already downloaded and can be found here: