... | ... | @@ -211,13 +211,25 @@ tabix -p vcf BT18.vep.vcf.gz |
|
|
~~~~
|
|
|
|
|
|
Again, study and interpret your outcome by looking inside the file
|
|
|
'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.
|
|
|
'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For
|
|
|
this you have to understand the VCF format in detail. If not done
|
|
|
already, further familiarise yourself with the VCF format.
|
|
|
|
|
|
` `[`http://www.1000genomes.org/wiki/analysis/vcf4.0`](http://www.1000genomes.org/wiki/analysis/vcf4.0)
|
|
|
|
|
|
The description states that the 8th column is the "INFO" column. The VEP
|
|
|
annotation has been added with the "CSQ" field tag.
|
|
|
|
|
|
18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10
|
|
|
|
|
|
The two VEP runs have generated reports in HTML format. Transfer them to
|
|
|
your local computer and study them.
|
|
|
|
|
|
You can compare one or a few examples manually by checking them at the
|
|
|
Ensembl VEP website:
|
|
|
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Tools/VEP`](http://www.ensembl.org/Bos_taurus/Tools/VEP)
|
|
|
|
|
|
### A further look at the VEP output
|
|
|
|
|
|
From studying the HTML reports it should be clear that the information
|
... | ... | @@ -229,11 +241,12 @@ how to extract information from it. |
|
|
|
|
|
Although the VCF now looks even more complex than before, there is a
|
|
|
good reason to add the VEP annotation data the way it was done. The 8th
|
|
|
column of the VCF provides a lot of information, each field separated by
|
|
|
a ';'. In addition, each field contains a field name and a value,
|
|
|
delimited by a '='. The VEP annotation has its own field name, or tag:
|
|
|
'CSQ'. Within this field, the values are again structured, but this time
|
|
|
delimited by a '|'. What we in fact have now is a nested data structure:
|
|
|
column of the VCF (the "INFO" column) provides a lot of information,
|
|
|
each field separated by a ';'. In addition, each field contains a field
|
|
|
name and a value, delimited by a '='. The VEP annotation has its own
|
|
|
field name, or tag: 'CSQ'. Within this field, the values are again
|
|
|
structured, but this time delimited by a '|'. What we in fact have now
|
|
|
is a nested data structure:
|
|
|
|
|
|
~~~~ {.python}
|
|
|
['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0',
|
... | ... | @@ -269,8 +282,8 @@ deleterious alleles and count them (by omitting the last part you can |
|
|
retrieve the relevant variations themselves).
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
|
|
|
awk '$11<0.15&&$11>0' | wc -l
|
|
|
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
|
|
|
awk '$11<0.15&&$11>0' | wc -l
|
|
|
~~~~
|
|
|
|
|
|
Analysing the annotations a bit further, you could discover that there
|
... | ... | @@ -370,7 +383,7 @@ snpEff is Java-based. The path to the snpEff.jar Java archive is |
|
|
available as an environment variable. Do:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
echo $SNP
|
|
|
echo $SNP
|
|
|
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
|
|
|
~~~~
|
|
|
|
... | ... | @@ -414,6 +427,39 @@ Bonus question: Compare the results by selecting the 'MODERATE' and |
|
|
"HIGH" variants, and compare that with the 'deleterious' variants as
|
|
|
annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
|
|
|
|
|
|
#### If time permits: a human example
|
|
|
|
|
|
Of all species, human so far has the best annotation information for
|
|
|
variation (or anything else for that matter). The reason, obviously, is
|
|
|
the clinical relevance. In the coming years, increased knowledge on
|
|
|
regulatory sequences, e.g. through the ENCODE project, will further
|
|
|
increase the knowledge on relevance of non-exonic variation. Again, the
|
|
|
human/clinical genomics community will lead the way here. The domestic
|
|
|
animal genomics community will follow and profit from those insights.
|
|
|
|
|
|
Currently, there are already important resources of information on
|
|
|
clinically relevant variations. You can find more information here:
|
|
|
|
|
|
` `[`http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/`](http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
|
|
|
|
|
|
Among the variation sets available is the 'common&clinical' - a set of
|
|
|
variants that has clinical relevance, and a relatively high occurrence
|
|
|
in human populations (i.e. not the de novo mutations often found to be
|
|
|
the reason for genetic disorders).
|
|
|
|
|
|
We can annotate these SNPs using snpEff:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 $SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf
|
|
|
~~~~
|
|
|
|
|
|
Copy the HTML report to your local computer and study it. A few things
|
|
|
will be apparent that are different to the cattle example. You will see
|
|
|
consequences that you have not seen in cattle, notably
|
|
|
"TF\_binding\_site\_variant". Another is that, because this is a
|
|
|
non-random sample of variants, its distribution is non-random as well.
|
|
|
Exonic variants, most notably, are highly overrepresented.
|
|
|
|
|
|
Inferring function
|
|
|
------------------
|
|
|
|
... | ... | @@ -437,30 +483,32 @@ Protein sequence (Fasta): |
|
|
|
|
|
and variation, with coordinates pertaining the above protein sequences,
|
|
|
the second column the reference amino acid, and the third/last column
|
|
|
the alternative amino acid (alternative allele): `
|
|
|
235,G,E
|
|
|
5,V,A
|
|
|
22,C,Y
|
|
|
34,T,I
|
|
|
51,D,N
|
|
|
53,K,N
|
|
|
59,S,Y
|
|
|
61,C,R
|
|
|
64,S,L
|
|
|
67,H,P
|
|
|
68,I,T
|
|
|
75,A,V
|
|
|
108,T,K
|
|
|
115,A,T
|
|
|
124,E,D
|
|
|
130,F,Y
|
|
|
133,L,P
|
|
|
142,P,A
|
|
|
155,F,I
|
|
|
158,E,G
|
|
|
186,I,V
|
|
|
21,G,D
|
|
|
` Output looks like this: `
|
|
|
the alternative amino acid (alternative allele):
|
|
|
|
|
|
` 235,G,E`\
|
|
|
` 5,V,A`\
|
|
|
` 22,C,Y`\
|
|
|
` 34,T,I`\
|
|
|
` 51,D,N`\
|
|
|
` 53,K,N`\
|
|
|
` 59,S,Y`\
|
|
|
` 61,C,R`\
|
|
|
` 64,S,L`\
|
|
|
` 67,H,P`\
|
|
|
` 68,I,T`\
|
|
|
` 75,A,V`\
|
|
|
` 108,T,K`\
|
|
|
` 115,A,T`\
|
|
|
` 124,E,D`\
|
|
|
` 130,F,Y`\
|
|
|
` 133,L,P`\
|
|
|
` 142,P,A `\
|
|
|
` 155,F,I`\
|
|
|
` 158,E,G`\
|
|
|
` 186,I,V `\
|
|
|
` 21,G,D`
|
|
|
|
|
|
Output looks like this: `
|
|
|
## PROVEAN scores ##
|
|
|
# VARIATION SCORE
|
|
|
235,G,E 0.076
|
... | ... | @@ -618,7 +666,9 @@ in.coord |
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
|
|
|
|
|
|
` run_pph.pl -s in.fa in.coord`
|
|
|
~~~~ {.bash}
|
|
|
run_pph.pl -s in.fa in.coord
|
|
|
~~~~
|
|
|
|
|
|
What comes out:
|
|
|
|
... | ... | |