tutorial

df161079 · Megens, Hendrik-Jan · 9359c283 · df161079 · df161079 · df161079
Commit df161079 authored 10 years ago by Megens, Hendrik-Jan
--- a/README.html
+++ b/README.html
@@ -91,12 +91,14 @@ code > span.er { color: #ff0000; font-weight: bold; }
  <span class="kw">--fork</span> 4 --canonical --sift b --offline --force_overwrite --vcf 
 <span class="kw">bgzip</span> BT18.vep.vcf
 <span class="kw">tabix</span> -p vcf BT18.vep.vcf.gz</code></pre>
-<p>Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.</p>
+<p>Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format.</p>
+<p><code> </code><a href="http://www.1000genomes.org/wiki/analysis/vcf4.0"><code>http://www.1000genomes.org/wiki/analysis/vcf4.0</code></a></p>
+<p>The description states that the 8th column is the &quot;INFO&quot; column. The VEP annotation has been added with the &quot;CSQ&quot; field tag.</p>
 <pre><code>18  53205918    .   G   A   0.329662    . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02)   GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10  0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10</code></pre>
 <p>The two VEP runs have generated reports in HTML format. Transfer them to your local computer and study them.</p>
 <h3 id="a-further-look-at-the-vep-output">A further look at the VEP output</h3>
 <p>From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it.</p>
-<p>Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:</p>
+<p>Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the &quot;INFO&quot; column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure:</p>
 <pre class="sourceCode python"><code class="sourceCode python">  [<span class="st">&#39;18&#39;</span>, <span class="st">&#39;53205918&#39;</span>, <span class="st">&#39;.&#39;</span>, <span class="st">&#39;G&#39;</span>, <span class="st">&#39;A&#39;</span>, <span class="st">&#39;0.329662&#39;</span>, <span class="st">&#39;.&#39;</span>, {<span class="st">&#39;PAIREDR&#39;</span>: <span class="st">&#39;0.985294&#39;</span>, <span class="st">&#39;DPB&#39;</span>: <span class="st">&#39;70&#39;</span>, <span class="st">&#39;MQMR&#39;</span>: <span class="st">&#39;60&#39;</span>, <span class="st">&#39;PQA&#39;</span>: <span class="st">&#39;0&#39;</span>, <span class="st">&#39;PAO&#39;</span>: <span class="st">&#39;0&#39;</span>, 
  <span class="co">&#39;SAR&#39;</span>: <span class="st">&#39;1&#39;</span>, <span class="st">&#39;AB&#39;</span>: <span class="st">&#39;0&#39;</span>, <span class="st">&#39;GTI&#39;</span>: <span class="st">&#39;0&#39;</span>, <span class="st">&#39;NUMALT&#39;</span>: <span class="st">&#39;1&#39;</span>, <span class="st">&#39;RO&#39;</span>: <span class="st">&#39;68&#39;</span>, <span class="st">&#39;DP&#39;</span>: <span class="st">&#39;70&#39;</span>, <span class="st">&#39;DPRA&#39;</span>: <span class="st">&#39;1.03279&#39;</span>, 
  <span class="co">&#39;CSQ&#39;</span>: [<span class="st">&#39;A&#39;</span>, <span class="st">&#39;ENSBTAG00000018834&#39;</span>, <span class="st">&#39;ENSBTAT00000025071&#39;</span>, <span class="st">&#39;Transcript&#39;</span>, <span class="st">&#39;missense_variant&#39;</span>, 
@@ -110,8 +112,8 @@ code > span.er { color: #ff0000; font-weight: bold; }
  <span class="co">&#39;0/0:5:5:178:0:0:0,-1.38409,-10&#39;</span>, <span class="st">&#39;0/0:14:14:482:0:0:0,-3.82032,-10&#39;</span>]</code></pre>
 <p>When dealing with deleterious alleles, we might be interested to learn how frequent that allele is in the population we are studying. Our expectation is that genuinly deleterious alleles should be relatively rare in a population. If not they might hint at either a sign of inbreeding, or, perhaps, something that might be deleterious in the wild but advantageous in the context of domestication/breeding. The VCF file holds the (reference) allele frequency information. The 8th column of the VCF file in fact is a bunch of fields concatenated with the ';' delimitor. One of the fields has the tag 'AF=', where AF stands for (Alternative) Allele Frequency. We can expand the number of tab-delimited fields easily by replacing ';' for '\t', and then look for the 11th column. We need to remove the 'AF=' tag though, but if we do that we can further select for instance for allele frequencies between 0 and 15%.</p>
 <p>The next shell-oneliner will give you the variants with rare (&lt;15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves).</p>
-<pre class="sourceCode bash"><code class="sourceCode bash">  <span class="kw">gunzip</span> -c BT18.vep.vcf.gz <span class="kw">|</span> <span class="kw">grep</span> deleterious <span class="kw">|</span> <span class="kw">sed</span> <span class="st">&#39;s/;/\t/g&#39;</span> <span class="kw">|</span> <span class="kw">awk</span> <span class="st">&#39;{gsub(&quot;AF=&quot;,&quot;&quot;,$11); print}&#39;</span> <span class="kw">|</span> <span class="kw">\</span>
-     <span class="kw">awk</span> <span class="st">&#39;$11&lt;0.15&amp;&amp;$11&gt;0&#39;</span> <span class="kw">|</span> <span class="kw">wc</span> -l</code></pre>
+<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">gunzip</span> -c BT18.vep.vcf.gz <span class="kw">|</span> <span class="kw">grep</span> deleterious <span class="kw">|</span> <span class="kw">sed</span> <span class="st">&#39;s/;/\t/g&#39;</span> <span class="kw">|</span> <span class="kw">awk</span> <span class="st">&#39;{gsub(&quot;AF=&quot;,&quot;&quot;,$11); print}&#39;</span> <span class="kw">|</span> <span class="kw">\</span>
+  <span class="kw">awk</span> <span class="st">&#39;$11&lt;0.15&amp;&amp;$11&gt;0&#39;</span> <span class="kw">|</span> <span class="kw">wc</span> -l</code></pre>
 <p>Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields.</p>
 <p>The Python script will take VCF input from a stream:</p>
 <pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">gunzip</span> -c BT18.vep.vcf.gz <span class="kw">|</span> <span class="kw">python3</span> <span class="ot">$SESSIONDIR</span>/print_fields_from_vcf.py -m vep <span class="kw">|</span> <span class="kw">grep</span> deleterious <span class="kw">|</span> <span class="kw">more</span> </code></pre>
@@ -136,7 +138,7 @@ code > span.er { color: #ff0000; font-weight: bold; }
 <h3 id="snpeff">snpEff</h3>
 <p>Another popular variant annotation tool is 'snpEff'. Among the advantages over VEP is that it is applicable to a wider range of species. Furthermore, it is blazingly fast. For chromosome 18, we will also use this tool:</p>
 <p>snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do:</p>
-<pre class="sourceCode bash"><code class="sourceCode bash">  <span class="kw">echo</span> <span class="ot">$SNP</span>
+<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">echo</span> <span class="ot">$SNP</span>
  <span class="kw">/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar</span></code></pre>
 <p>The snpEff program can be used like this:</p>
 <pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">java</span> -Xmx4G -jar <span class="ot">$snpEff</span> -dataDir <span class="ot">$SESSIONDIR</span>/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz <span class="kw">&gt;</span>BT18_snpEff.vcf
@@ -163,30 +165,9 @@ Protein sequence (Fasta):
  PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
  SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
 </code></p>
-<p>and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele): <code>
-  235,G,E
-  5,V,A
-  22,C,Y
-  34,T,I
-  51,D,N
-  53,K,N
-  59,S,Y
-  61,C,R
-  64,S,L
-  67,H,P
-  68,I,T
-  75,A,V
-  108,T,K
-  115,A,T
-  124,E,D
-  130,F,Y
-  133,L,P
-  142,P,A 
-  155,F,I
-  158,E,G
-  186,I,V 
-  21,G,D
-</code> Output looks like this: <code>
+<p>and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele):</p>
+<p><code> 235,G,E</code><br /><code> 5,V,A</code><br /><code> 22,C,Y</code><br /><code> 34,T,I</code><br /><code> 51,D,N</code><br /><code> 53,K,N</code><br /><code> 59,S,Y</code><br /><code> 61,C,R</code><br /><code> 64,S,L</code><br /><code> 67,H,P</code><br /><code> 68,I,T</code><br /><code> 75,A,V</code><br /><code> 108,T,K</code><br /><code> 115,A,T</code><br /><code> 124,E,D</code><br /><code> 130,F,Y</code><br /><code> 133,L,P</code><br /><code> 142,P,A </code><br /><code> 155,F,I</code><br /><code> 158,E,G</code><br /><code> 186,I,V </code><br /><code> 21,G,D</code></p>
+<p>Output looks like this: <code>
  ## PROVEAN scores ##
  # VARIATION   SCORE
  235,G,E   0.076
@@ -257,7 +238,7 @@ Protein sequence (Fasta):
 <p><code> &gt;ENSSSCG00000001099ENSSSCT00000001195</code><br /><code> MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL</code><br /><code> FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN</code><br /><code> PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD</code><br /><code> WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN</code><br /><code> INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV</code><br /><code> ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG</code><br /><code> YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI</code><br /><code> VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD</code><br /><code> EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ</code><br /><code> TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG </code></p>
 <p>in.coord</p>
 <p><code> ENSSSCG00000001099ENSSSCT00000001195 368        S A</code><br /><code> ENSSSCG00000001099ENSSSCT00000001195 453        R H</code><br /><code> ENSSSCG00000001099ENSSSCT00000001195 431        K T</code></p>
-<p><code> run_pph.pl -s in.fa in.coord</code></p>
+<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">run_pph.pl</span> -s in.fa in.coord</code></pre>
 <p>What comes out:</p>
 <p><code> /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord </code><br /><code> #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin</code><br /><code> ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63</code><br /><code> ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76</code><br /><code> ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80</code></p>
 <p>The fasta and coordinate files can be made similar to what you've done for Provean. Polyphen is much more human-oriented than Provean (which is more species agnostic) and has many more features, not all of which work very well for non-human species. Since Polyphen won't run properly in your present infrastructure we will not practice it actively.</p>

--- a/README.md
+++ b/README.md
@@ -211,7 +211,14 @@ tabix -p vcf BT18.vep.vcf.gz
 ~~~~

 Again, study and interpret your outcome by looking inside the file
-'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.
+'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For
+this you have to understand the VCF format in detail. If not done
+already, further familiarise yourself with the VCF format.
+
+` `[`http://www.1000genomes.org/wiki/analysis/vcf4.0`](http://www.1000genomes.org/wiki/analysis/vcf4.0)
+
+The description states that the 8th column is the "INFO" column. The VEP
+annotation has been added with the "CSQ" field tag.

    18  53205918    .   G   A   0.329662    . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02)   GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10  0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10

@@ -229,11 +236,12 @@ how to extract information from it.

 Although the VCF now looks even more complex than before, there is a
 good reason to add the VEP annotation data the way it was done. The 8th
-column of the VCF provides a lot of information, each field separated by
-a ';'. In addition, each field contains a field name and a value,
-delimited by a '='. The VEP annotation has its own field name, or tag:
-'CSQ'. Within this field, the values are again structured, but this time
-delimited by a '|'. What we in fact have now is a nested data structure:
+column of the VCF (the "INFO" column) provides a lot of information,
+each field separated by a ';'. In addition, each field contains a field
+name and a value, delimited by a '='. The VEP annotation has its own
+field name, or tag: 'CSQ'. Within this field, the values are again
+structured, but this time delimited by a '|'. What we in fact have now
+is a nested data structure:

 ~~~~ {.python}
  ['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', 
@@ -269,8 +277,8 @@ deleterious alleles and count them (by omitting the last part you can
 retrieve the relevant variations themselves).

 ~~~~ {.bash}
-  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
-     awk '$11<0.15&&$11>0' | wc -l
+gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
+  awk '$11<0.15&&$11>0' | wc -l
 ~~~~

 Analysing the annotations a bit further, you could discover that there
@@ -370,7 +378,7 @@ snpEff is Java-based. The path to the snpEff.jar Java archive is
 available as an environment variable. Do:

 ~~~~ {.bash}
-  echo $SNP
+echo $SNP
  /home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
 ~~~~

@@ -437,30 +445,32 @@ Protein sequence (Fasta):

 and variation, with coordinates pertaining the above protein sequences,
 the second column the reference amino acid, and the third/last column
-the alternative amino acid (alternative allele): `
-  235,G,E
-  5,V,A
-  22,C,Y
-  34,T,I
-  51,D,N
-  53,K,N
-  59,S,Y
-  61,C,R
-  64,S,L
-  67,H,P
-  68,I,T
-  75,A,V
-  108,T,K
-  115,A,T
-  124,E,D
-  130,F,Y
-  133,L,P
-  142,P,A 
-  155,F,I
-  158,E,G
-  186,I,V 
-  21,G,D
-` Output looks like this: `
+the alternative amino acid (alternative allele):
+
+` 235,G,E`\
+` 5,V,A`\
+` 22,C,Y`\
+` 34,T,I`\
+` 51,D,N`\
+` 53,K,N`\
+` 59,S,Y`\
+` 61,C,R`\
+` 64,S,L`\
+` 67,H,P`\
+` 68,I,T`\
+` 75,A,V`\
+` 108,T,K`\
+` 115,A,T`\
+` 124,E,D`\
+` 130,F,Y`\
+` 133,L,P`\
+` 142,P,A `\
+` 155,F,I`\
+` 158,E,G`\
+` 186,I,V `\
+` 21,G,D`
+
+Output looks like this: `
  ## PROVEAN scores ##
  # VARIATION   SCORE
  235,G,E   0.076
@@ -618,7 +628,9 @@ in.coord
 ` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
 ` ENSSSCG00000001099ENSSSCT00000001195 431        K T`

-` run_pph.pl -s in.fa in.coord`
+~~~~ {.bash}
+run_pph.pl -s in.fa in.coord
+~~~~

 What comes out:


--- a/README.wiki
+++ b/README.wiki
@@ -155,6 +155,11 @@ tabix -p vcf BT18.vep.vcf.gz
 </source>

 Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. 
+For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format.
+  http://www.1000genomes.org/wiki/analysis/vcf4.0
+
+The description states that the 8th column is the "INFO" column. The VEP annotation has been added with the "CSQ" field tag.
+
 <pre style="white-space: pre-wrap; 
 white-space: -moz-pre-wrap; 
 white-space: -pre-wrap; 
@@ -169,7 +174,7 @@ The two VEP runs have generated reports in HTML format. Transfer them to your lo

 From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it.

-Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure: 
+Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the "INFO" column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure: 
 <source lang='python'>
  ['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', 
  'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279', 
@@ -188,8 +193,8 @@ When dealing with deleterious alleles, we might be interested to learn how frequ

 The next shell-oneliner will give you the variants with rare (<15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves).  
 <source lang='bash'>
-  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
-     awk '$11<0.15&&$11>0' | wc -l
+gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
+  awk '$11<0.15&&$11>0' | wc -l
 </source>

 Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields. 
@@ -239,7 +244,7 @@ Another popular variant annotation tool is 'snpEff'. Among the advantages over V
 snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do:

 <source lang='bash'>
-  echo $SNP
+echo $SNP
  /home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
 </source>

@@ -292,7 +297,6 @@ Protein sequence (Fasta):
 </code>

 and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele):
-<code>
  235,G,E
  5,V,A
  22,C,Y
@@ -315,7 +319,7 @@ and variation, with coordinates pertaining the above protein sequences, the seco
  158,E,G
  186,I,V 
  21,G,D
-</code>
+
 Output looks like this:
 <code>
  ## PROVEAN scores ##
@@ -434,7 +438,9 @@ in.coord
  ENSSSCG00000001099ENSSSCT00000001195 453        R H
  ENSSSCG00000001099ENSSSCT00000001195 431        K T

-  run_pph.pl -s in.fa in.coord
+<source lang='bash'>
+run_pph.pl -s in.fa in.coord
+</source>

 What comes out:
  /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord