<p>Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.</p>
<p>Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format.</p>
<p>The description states that the 8th column is the "INFO" column. The VEP annotation has been added with the "CSQ" field tag.</p>
<pre><code>18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10</code></pre>
<p>The two VEP runs have generated reports in HTML format. Transfer them to your local computer and study them.</p>
<h3id="a-further-look-at-the-vep-output">A further look at the VEP output</h3>
<p>From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it.</p>
<p>Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:</p>
<p>Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the "INFO" column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:</p>
<p>When dealing with deleterious alleles, we might be interested to learn how frequent that allele is in the population we are studying. Our expectation is that genuinly deleterious alleles should be relatively rare in a population. If not they might hint at either a sign of inbreeding, or, perhaps, something that might be deleterious in the wild but advantageous in the context of domestication/breeding. The VCF file holds the (reference) allele frequency information. The 8th column of the VCF file in fact is a bunch of fields concatenated with the ';' delimitor. One of the fields has the tag 'AF=', where AF stands for (Alternative) Allele Frequency. We can expand the number of tab-delimited fields easily by replacing ';' for '\t', and then look for the 11th column. We need to remove the 'AF=' tag though, but if we do that we can further select for instance for allele frequencies between 0 and 15%.</p>
<p>The next shell-oneliner will give you the variants with rare (<15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves).</p>
<p>Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields.</p>
<p>The Python script will take VCF input from a stream:</p>
<p>Another popular variant annotation tool is 'snpEff'. Among the advantages over VEP is that it is applicable to a wider range of species. Furthermore, it is blazingly fast. For chromosome 18, we will also use this tool:</p>
<p>snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do:</p>
<p>and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele): <code>
235,G,E
5,V,A
22,C,Y
34,T,I
51,D,N
53,K,N
59,S,Y
61,C,R
64,S,L
67,H,P
68,I,T
75,A,V
108,T,K
115,A,T
124,E,D
130,F,Y
133,L,P
142,P,A
155,F,I
158,E,G
186,I,V
21,G,D
</code> Output looks like this: <code>
<p>and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele):</p>
<p><code> ENSSSCG00000001099ENSSSCT00000001195 368 S A</code><br/><code> ENSSSCG00000001099ENSSSCT00000001195 453 R H</code><br/><code> ENSSSCG00000001099ENSSSCT00000001195 431 K T</code></p>
<p><code> /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord </code><br/><code> #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin</code><br/><code> ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63</code><br/><code> ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76</code><br/><code> ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80</code></p>
<p>The fasta and coordinate files can be made similar to what you've done for Provean. Polyphen is much more human-oriented than Provean (which is more species agnostic) and has many more features, not all of which work very well for non-human species. Since Polyphen won't run properly in your present infrastructure we will not practice it actively.</p>
Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.
For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format.
http://www.1000genomes.org/wiki/analysis/vcf4.0
The description states that the 8th column is the "INFO" column. The VEP annotation has been added with the "CSQ" field tag.
<pre style="white-space: pre-wrap;
white-space: -moz-pre-wrap;
white-space: -pre-wrap;
...
...
@@ -169,7 +174,7 @@ The two VEP runs have generated reports in HTML format. Transfer them to your lo
From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it.
Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:
Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the "INFO" column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:
@@ -188,8 +193,8 @@ When dealing with deleterious alleles, we might be interested to learn how frequ
The next shell-oneliner will give you the variants with rare (<15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves).
Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields.
...
...
@@ -239,7 +244,7 @@ Another popular variant annotation tool is 'snpEff'. Among the advantages over V
snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do:
and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele):
<code>
235,G,E
5,V,A
22,C,Y
...
...
@@ -315,7 +319,7 @@ and variation, with coordinates pertaining the above protein sequences, the seco