|
|
Code and tutorials for CIHEAM course. Variation annotation and function
|
|
|
prediction sessions
|
|
|
|
|
|
Variant annotation
|
|
|
------------------
|
|
|
|
|
|
=
|
|
|
|
|
|
### Slicing and dicing of VCF files
|
|
|
|
|
|
VCF files can get very big. Being able to manipulate them and extract
|
|
|
relevant information is important. In practice, flat text files such as
|
|
|
VCF are currently the basis for further analysis - see the Imputation
|
|
|
and Population practical sessions.
|
|
|
|
|
|
The first task involves selecting all called variants of chromosome 18,
|
|
|
bgzip-ping them on-the-fly, and then indexing that file with tabix:
|
|
|
|
|
|
` tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz`\
|
|
|
` tabix -p vcf BT18.vcf.gz`
|
|
|
|
|
|
### Annotating VCF with rs-numbers
|
|
|
|
|
|
The VCF file you are working does not yet have the dbSNP identifiers
|
|
|
(rs-numbers) annotated. The rs-numbers of all currently known variants
|
|
|
as databased in dbSNP can be found in this file:
|
|
|
|
|
|
` /path/to/ BT18.vcf.gz`
|
|
|
|
|
|
Have a look at what is inside this file:
|
|
|
` gunzip -c BT18.vcf.gz | grep -v '^##' | more`
|
|
|
|
|
|
or
|
|
|
|
|
|
` tabix BT18.vcf.gz 18:1-10000 | more`
|
|
|
|
|
|
You can use vcftools to do many things related to VCF files. This
|
|
|
includes annotating the VCF based on other interval-file formatted
|
|
|
input, including other VCF files. For instance, you can annotate an
|
|
|
interval of the VCF on the fly, by selecting that interval using tabix,
|
|
|
and then piping it to vcf-annotate, like so:
|
|
|
|
|
|
` tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID`
|
|
|
|
|
|
Now make a new VCF file for all the variants called for chromosome 18
|
|
|
using this tool:
|
|
|
|
|
|
` tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz`\
|
|
|
` tabix -p vcf BT18_rsnumbers.vcf.gz`
|
|
|
|
|
|
### Extracting variants using BEDtools
|
|
|
|
|
|
You will very often want to extract variants based on interval files,
|
|
|
e.g. a certain region on a chromosome. This can be easily achieve using
|
|
|
the tabix command described above. However, sometimes you may want to
|
|
|
extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or
|
|
|
Python. However, BedTools provides many utilities for manipulating
|
|
|
interval data, such as BED, VCF, or GTF/GFF formats.
|
|
|
|
|
|
Say you are interested in the variants in all the genes lying between
|
|
|
coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of
|
|
|
genes can be found in a GTF file. The GTF file for the UMD3.1 genome
|
|
|
build can be found at Ensembl and downloaded using the command line
|
|
|
(NOTE: the compute nodes you are working on do not have internet
|
|
|
connectivity. This line of code is just for completeness):
|
|
|
|
|
|
` wget `[`ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz`](ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz)
|
|
|
|
|
|
The file is already downloaded and can be found here:
|
|
|
|
|
|
` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
|
|
|
|
|
|
Have a quick look at what is in this file:
|
|
|
|
|
|
gunzip -c /path/to/Bos\_taurus.UMD3.1.78.gtf.gz | more
|
|
|
|
|
|
Familiarise yourself with the GTF format if you have not already. For
|
|
|
more information:
|
|
|
|
|
|
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
|
|
|
|
|
|
You can select the appropriate region from the GTF file:
|
|
|
|
|
|
` gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf`
|
|
|
|
|
|
Subsequently you can intersect the VCF with the GTF file that contains
|
|
|
just the genes between coordinates 1 and 2 million:
|
|
|
|
|
|
` bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf`
|
|
|
|
|
|
Have a quick look at the output, and count the number of variants in
|
|
|
this VCF (e.g. by using 'wc -l').
|
|
|
|
|
|
Now do the same, but select only the exonic variants in the same
|
|
|
interval, from 1 to 2 million.
|
|
|
|
|
|
How many variants do you have now?
|
|
|
### Variant Effect Predictor
|
|
|
|
|
|
Various tools exist to make more precise annotations than what you can
|
|
|
achieve using Bedtools and a GTF. Among them are Variant Effect
|
|
|
Predictor (VEP), snpEff, and Annovar. For the current practical we will
|
|
|
use former two.
|
|
|
|
|
|
VEP uses a cache directory that has the annotation information. On the
|
|
|
infrastructure you are working on that directory can be found here:
|
|
|
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`
|
|
|
|
|
|
You can create a file that will provide annotation information for each
|
|
|
SNP:
|
|
|
|
|
|
` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite`
|
|
|
|
|
|
Have a look inside the output file and study its content. Why are there
|
|
|
no SNPs in this list that are intergenic?
|
|
|
|
|
|
For some purposes, it might be more convenient to have the annotations
|
|
|
directly in the VCF. You could for instance do the following:
|
|
|
|
|
|
` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf --fork 4 --canonical --sift b --coding_only --offline --force_overwrite --vcf `\
|
|
|
` bgzip BT18.vep.vcf`\
|
|
|
` tabix -p vcf BT18.vep.vcf.gz`
|
|
|
|
|
|
Again, study and interpret your outcome by looking inside the file.
|
|
|
Specifically, note WHERE the annotation ended up.
|
|
|
|
|
|
When dealing with deleterious alleles, we might be interested to learn
|
|
|
how frequent that allele is in the population we are studying. Our
|
|
|
expectation is that genuinly deleterious alleles should be relatively
|
|
|
rare in a population. If not they might hint at either a sign of
|
|
|
inbreeding, or, perhaps, something that might be deleterious in the wild
|
|
|
but advantageous in the context of domestication/breeding. The VCF file
|
|
|
holds the (reference) allele frequency information. The 8th column of
|
|
|
the VCF file in fact is a bunch of fields concatenated with the ';'
|
|
|
delimitor. One of the fields has the tag 'AF=', where AF stands for
|
|
|
Allele Frequency. We can expand the number of space-delimited fields
|
|
|
easily and then look for the 11th column. We need to remove the 'AF='
|
|
|
tag though, but if we do that we can further select for instance for
|
|
|
allele frequencies between 0 and 15%.
|
|
|
|
|
|
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`
|
|
|
|