Skip to content
Snippets Groups Projects
Commit 8e302298 authored by Megens, Hendrik-Jan's avatar Megens, Hendrik-Jan
Browse files

remove file

parent 0883fdf2
Branches
No related tags found
No related merge requests found
Code and tutorials for CIHEAM course. Variation annotation and function
prediction sessions
Variant annotation
------------------
=
### Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract
relevant information is important. In practice, flat text files such as
VCF are currently the basis for further analysis - see the Imputation
and Population practical sessions.
The first task involves selecting all called variants of chromosome 18,
bgzip-ping them on-the-fly, and then indexing that file with tabix:
` tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz`\
` tabix -p vcf BT18.vcf.gz`
### Annotating VCF with rs-numbers
The VCF file you are working does not yet have the dbSNP identifiers
(rs-numbers) annotated. The rs-numbers of all currently known variants
as databased in dbSNP can be found in this file:
` /path/to/ BT18.vcf.gz`
Have a look at what is inside this file:
` gunzip -c BT18.vcf.gz | grep -v '^##' | more`
or
` tabix BT18.vcf.gz 18:1-10000 | more`
You can use vcftools to do many things related to VCF files. This
includes annotating the VCF based on other interval-file formatted
input, including other VCF files. For instance, you can annotate an
interval of the VCF on the fly, by selecting that interval using tabix,
and then piping it to vcf-annotate, like so:
` tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID`
Now make a new VCF file for all the variants called for chromosome 18
using this tool:
` tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz`\
` tabix -p vcf BT18_rsnumbers.vcf.gz`
### Extracting variants using BEDtools
You will very often want to extract variants based on interval files,
e.g. a certain region on a chromosome. This can be easily achieve using
the tabix command described above. However, sometimes you may want to
extract a substantial number of intervals. You could use tabix in a
loop, either as a shell-script or in a scripting language like Perl or
Python. However, BedTools provides many utilities for manipulating
interval data, such as BED, VCF, or GTF/GFF formats.
Say you are interested in the variants in all the genes lying between
coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of
genes can be found in a GTF file. The GTF file for the UMD3.1 genome
build can be found at Ensembl and downloaded using the command line
(NOTE: the compute nodes you are working on do not have internet
connectivity. This line of code is just for completeness):
` wget `[`ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz`](ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz)
The file is already downloaded and can be found here:
` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
Have a quick look at what is in this file:
gunzip -c /path/to/Bos\_taurus.UMD3.1.78.gtf.gz | more
Familiarise yourself with the GTF format if you have not already. For
more information:
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
You can select the appropriate region from the GTF file:
` gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf`
Subsequently you can intersect the VCF with the GTF file that contains
just the genes between coordinates 1 and 2 million:
` bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf`
Have a quick look at the output, and count the number of variants in
this VCF (e.g. by using 'wc -l').
Now do the same, but select only the exonic variants in the same
interval, from 1 to 2 million.
How many variants do you have now?
### Variant Effect Predictor
Various tools exist to make more precise annotations than what you can
achieve using Bedtools and a GTF. Among them are Variant Effect
Predictor (VEP), snpEff, and Annovar. For the current practical we will
use former two.
VEP uses a cache directory that has the annotation information. On the
infrastructure you are working on that directory can be found here:
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`
You can create a file that will provide annotation information for each
SNP:
` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite`
Have a look inside the output file and study its content. Why are there
no SNPs in this list that are intergenic?
For some purposes, it might be more convenient to have the annotations
directly in the VCF. You could for instance do the following:
` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf --fork 4 --canonical --sift b --coding_only --offline --force_overwrite --vcf `\
` bgzip BT18.vep.vcf`\
` tabix -p vcf BT18.vep.vcf.gz`
Again, study and interpret your outcome by looking inside the file.
Specifically, note WHERE the annotation ended up.
When dealing with deleterious alleles, we might be interested to learn
how frequent that allele is in the population we are studying. Our
expectation is that genuinly deleterious alleles should be relatively
rare in a population. If not they might hint at either a sign of
inbreeding, or, perhaps, something that might be deleterious in the wild
but advantageous in the context of domestication/breeding. The VCF file
holds the (reference) allele frequency information. The 8th column of
the VCF file in fact is a bunch of fields concatenated with the ';'
delimitor. One of the fields has the tag 'AF=', where AF stands for
Allele Frequency. We can expand the number of space-delimited fields
easily and then look for the 11th column. We need to remove the 'AF='
tag though, but if we do that we can further select for instance for
allele frequencies between 0 and 15%.
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`
Analysing the annotations a bit further, you could discover that some
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt `
Take two or three genes that have the highest number of deleterious
variants. Manually look them up at the Ensembl website (ensembl.org).
Check the 'orthologous' status. What do you observe? How would you
interpret the validity of the gene annotation and/or the likelihood of
off-site mapping of reads?
This provides us with a strategy to filter a bit more rigorously for
genes that may actually contain genuine deleterious alleles. Ideally we
would a priory filter for genes that have proper one-to-one orthologous
relations with other mammalian species. Time and lack of direct internet
connection makes that unfeasible for this practical. However, genes that
have only one deleterious variant segregating in the study population at
a low frequency would certainly be better candidates than genes that
have multiple deleterious variants, as the above example indicates. To
filter you can do the following:
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt`
How many genes are there that meet this condition?
### snpEff
Another popular variant annotation tool is 'snpEff'. Among the
advantages over VEP is that it is applicable to a wider range of
species. Furthermore, it is blazingly fast. For chromosome 18, we will
also use this tool:
` java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt`
Study the outcome. Try to filter out 'intergenic' SNPs and other things
that are perhaps a bit less relevant (use your skills in 'grep-ping',
etc.). What are some of the differences in how the variants are
annotated, compared to VEP?
You will notice that the variants that have an effect are labelled
'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
many are there?
Compare the results by selecting the 'MODERATE' and "HIGH" variants, and
compare that with the 'deleterious' variants as annotated (through SIFT)
in VEP. You can use, e.g. vcf-compare.
Inferring function
------------------
### Provean
` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
` 235,G,E`\
` 5,V,A`\
` 22,C,Y`\
` 34,T,I`\
` 51,D,N`\
` 53,K,N`\
` 59,S,Y`\
` 61,C,R`\
` 64,S,L`\
` 67,H,P`\
` 68,I,T`\
` 75,A,V`\
` 108,T,K`\
` 115,A,T`\
` 124,E,D`\
` 130,F,Y`\
` 133,L,P`\
` 142,P,A `\
` 155,F,I`\
` 158,E,G`\
` 186,I,V `\
` 21,G,D`
` ## PROVEAN scores ##`\
` # VARIATION SCORE`\
` 235,G,E 0.076`\
` 5,V,A 0.287`\
` 22,C,Y -8.028`\
` 34,T,I -1.932`\
` 51,D,N -1.613`\
` 53,K,N 1.565`\
` 59,S,Y -2.140`\
` 61,C,R -5.826`\
` 64,S,L -0.437`\
` 67,H,P -0.511`\
` 68,I,T -2.664`\
` 75,A,V -1.061`\
` 108,T,K -4.051`\
` 115,A,T 1.557`\
` 124,E,D -1.587`\
` 130,F,Y -0.983`\
` 133,L,P 3.203`\
` 142,P,A -2.058`\
` 155,F,I 0.502`\
` 158,E,G -0.077`\
` 186,I,V 0.220`\
` 21,G,D -6.014`
At the end of the VEP exercise we created a filtered list of genes that
have only a single deleterious variant annotated. Among these genes was
BLOC1S3 gene, ENSBTAG00000007070.
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
` GENE=ENSBTAG00000007070`\
`` PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'` ``\
` cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var`
` provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;`
` faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa`
` cat ENSBTAG00000007070.fa >>ENSBTAG00000007070.sss.fasta`\
` mafft --auto ENSBTAG00000007070.sss.fasta >ENSBTAG00000007070.sss.fasta.out`
Transfer the alignment to your local environment and view (seaview?).
What do you observe for the 26th amino-acid of the cow protein sequence?
In this 'implicit evolutionary context' there is one alternative amino
acid that occurs at this site. What do the D amino acid and this
evolutionary viable alternative share in terms of properties?
Through the orthology with human, try to find some possible phenotypic
consequences of the alternative allele (e.g. through OMIM). Could there
be a reason that the variant is at a relatively high frequency in this
cattle population?
### Polyphen
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
What goes in:
` test1033.fa`\
` >ENSSSCG00000001099ENSSSCT00000001195`\
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
` test1033.coord`\
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment