Changes

Megens, Hendrik-Jan · f185af35
--- a/home.markdown
+++ b/home.markdown
+Code and tutorials for CIHEAM course. Variation annotation and function
+prediction sessions
+
+Variant annotation
+------------------
+
+=
+
+### Slicing and dicing of VCF files
+
+VCF files can get very big. Being able to manipulate them and extract
+relevant information is important. In practice, flat text files such as
+VCF are currently the basis for further analysis - see the Imputation
+and Population practical sessions.
+
+The first task involves selecting all called variants of chromosome 18,
+bgzip-ping them on-the-fly, and then indexing that file with tabix:
+
+` tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz`\
+` tabix -p vcf BT18.vcf.gz`
+
+### Annotating VCF with rs-numbers
+
+The VCF file you are working does not yet have the dbSNP identifiers
+(rs-numbers) annotated. The rs-numbers of all currently known variants
+as databased in dbSNP can be found in this file:
+
+` /path/to/ BT18.vcf.gz`
+
+Have a look at what is inside this file:
+` gunzip -c BT18.vcf.gz | grep -v '^##' | more`
+
+or
+
+` tabix BT18.vcf.gz 18:1-10000 | more`
+
+You can use vcftools to do many things related to VCF files. This
+includes annotating the VCF based on other interval-file formatted
+input, including other VCF files. For instance, you can annotate an
+interval of the VCF on the fly, by selecting that interval using tabix,
+and then piping it to vcf-annotate, like so:
+
+` tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID`
+
+Now make a new VCF file for all the variants called for chromosome 18
+using this tool:
+
+` tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz`\
+` tabix -p vcf BT18_rsnumbers.vcf.gz`
+
+### Extracting variants using BEDtools
+
+You will very often want to extract variants based on interval files,
+e.g. a certain region on a chromosome. This can be easily achieve using
+the tabix command described above. However, sometimes you may want to
+extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or
+Python. However, BedTools provides many utilities for manipulating
+interval data, such as BED, VCF, or GTF/GFF formats.
+
+Say you are interested in the variants in all the genes lying between
+coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of
+genes can be found in a GTF file. The GTF file for the UMD3.1 genome
+build can be found at Ensembl and downloaded using the command line
+(NOTE: the compute nodes you are working on do not have internet
+connectivity. This line of code is just for completeness):
+
+` wget `[`ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz`](ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz)
+
+The file is already downloaded and can be found here:
+
+` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
+
+Have a quick look at what is in this file:
+
+gunzip -c /path/to/Bos\_taurus.UMD3.1.78.gtf.gz | more
+
+Familiarise yourself with the GTF format if you have not already. For
+more information:
+
+` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
+
+You can select the appropriate region from the GTF file:
+
+` gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf`
+
+Subsequently you can intersect the VCF with the GTF file that contains
+just the genes between coordinates 1 and 2 million:
+
+` bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf`
+
+Have a quick look at the output, and count the number of variants in
+this VCF (e.g. by using 'wc -l').
+
+Now do the same, but select only the exonic variants in the same
+interval, from 1 to 2 million.
+
+How many variants do you have now?
+### Variant Effect Predictor
+
+Various tools exist to make more precise annotations than what you can
+achieve using Bedtools and a GTF. Among them are Variant Effect
+Predictor (VEP), snpEff, and Annovar. For the current practical we will
+use former two.
+
+VEP uses a cache directory that has the annotation information. On the
+infrastructure you are working on that directory can be found here:
+
+` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`
+
+You can create a file that will provide annotation information for each
+SNP:
+
+` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite`
+
+Have a look inside the output file and study its content. Why are there
+no SNPs in this list that are intergenic?
+
+For some purposes, it might be more convenient to have the annotations
+directly in the VCF. You could for instance do the following:
+
+` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf --fork 4 --canonical --sift b --coding_only --offline --force_overwrite --vcf `\
+` bgzip BT18.vep.vcf`\
+` tabix -p vcf BT18.vep.vcf.gz`
+
+Again, study and interpret your outcome by looking inside the file.
+Specifically, note WHERE the annotation ended up.
+
+When dealing with deleterious alleles, we might be interested to learn
+how frequent that allele is in the population we are studying. Our
+expectation is that genuinly deleterious alleles should be relatively
+rare in a population. If not they might hint at either a sign of
+inbreeding, or, perhaps, something that might be deleterious in the wild
+but advantageous in the context of domestication/breeding. The VCF file
+holds the (reference) allele frequency information. The 8th column of
+the VCF file in fact is a bunch of fields concatenated with the ';'
+delimitor. One of the fields has the tag 'AF=', where AF stands for
+Allele Frequency. We can expand the number of space-delimited fields
+easily and then look for the 11th column. We need to remove the 'AF='
+tag though, but if we do that we can further select for instance for
+allele frequencies between 0 and 15%.
+
+` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`
+