Changes

Megens, Hendrik-Jan · 13ca5cf6
--- a/home.markdown
+++ b/home.markdown
@@ -211,13 +211,25 @@ tabix -p vcf BT18.vep.vcf.gz
 ~~~~
 Again, study and interpret your outcome by looking inside the file
-'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.
+'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For
+this you have to understand the VCF format in detail. If not done
+already, further familiarise yourself with the VCF format.
+` `[`http://www.1000genomes.org/wiki/analysis/vcf4.0`](http://www.1000genomes.org/wiki/analysis/vcf4.0)
+The description states that the 8th column is the "INFO" column. The VEP
+annotation has been added with the "CSQ" field tag.
    18  53205918    .   G   A   0.329662    . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02)   GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10  0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10
 The two VEP runs have generated reports in HTML format. Transfer them to
 your local computer and study them.
+You can compare one or a few examples manually by checking them at the
+Ensembl VEP website:
+` `[`http://www.ensembl.org/Bos_taurus/Tools/VEP`](http://www.ensembl.org/Bos_taurus/Tools/VEP)
 ### A further look at the VEP output
 From studying the HTML reports it should be clear that the information
@@ -229,11 +241,12 @@ how to extract information from it.
 Although the VCF now looks even more complex than before, there is a
 good reason to add the VEP annotation data the way it was done. The 8th
-column of the VCF provides a lot of information, each field separated by
+column of the VCF (the "INFO" column) provides a lot of information,
-a ';'. In addition, each field contains a field name and a value,
+each field separated by a ';'. In addition, each field contains a field
-delimited by a '='. The VEP annotation has its own field name, or tag:
+name and a value, delimited by a '='. The VEP annotation has its own
-'CSQ'. Within this field, the values are again structured, but this time
+field name, or tag: 'CSQ'. Within this field, the values are again
-delimited by a '|'. What we in fact have now is a nested data structure:
+structured, but this time delimited by a '|'. What we in fact have now
+is a nested data structure:
 ~~~~ {.python}
  ['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', 
@@ -269,8 +282,8 @@ deleterious alleles and count them (by omitting the last part you can
 retrieve the relevant variations themselves).
 ~~~~ {.bash}
-  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
+gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
-     awk '$11<0.15&&$11>0' | wc -l
+  awk '$11<0.15&&$11>0' | wc -l
 ~~~~
 Analysing the annotations a bit further, you could discover that there
@@ -370,7 +383,7 @@ snpEff is Java-based. The path to the snpEff.jar Java archive is
 available as an environment variable. Do:
 ~~~~ {.bash}
-  echo $SNP
+echo $SNP
  /home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
 ~~~~
@@ -414,6 +427,39 @@ Bonus question: Compare the results by selecting the 'MODERATE' and
 "HIGH" variants, and compare that with the 'deleterious' variants as
 annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
+#### If time permits: a human example
+Of all species, human so far has the best annotation information for
+variation (or anything else for that matter). The reason, obviously, is
+the clinical relevance. In the coming years, increased knowledge on
+regulatory sequences, e.g. through the ENCODE project, will further
+increase the knowledge on relevance of non-exonic variation. Again, the
+human/clinical genomics community will lead the way here. The domestic
+animal genomics community will follow and profit from those insights.
+Currently, there are already important resources of information on
+clinically relevant variations. You can find more information here:
+` `[`http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/`](http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
+Among the variation sets available is the 'common&clinical' - a set of
+variants that has clinical relevance, and a relatively high occurrence
+in human populations (i.e. not the de novo mutations often found to be
+the reason for genetic disorders).
+We can annotate these SNPs using snpEff:
+~~~~ {.bash}
+java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 $SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf
+~~~~
+Copy the HTML report to your local computer and study it. A few things
+will be apparent that are different to the cattle example. You will see
+consequences that you have not seen in cattle, notably
+"TF\_binding\_site\_variant". Another is that, because this is a
+non-random sample of variants, its distribution is non-random as well.
+Exonic variants, most notably, are highly overrepresented.
 Inferring function
 ------------------
@@ -437,30 +483,32 @@ Protein sequence (Fasta):
 and variation, with coordinates pertaining the above protein sequences,
 the second column the reference amino acid, and the third/last column
-the alternative amino acid (alternative allele): `
+the alternative amino acid (alternative allele):
-  235,G,E
-  5,V,A
+` 235,G,E`\
-  22,C,Y
+` 5,V,A`\
-  34,T,I
+` 22,C,Y`\
-  51,D,N
+` 34,T,I`\
-  53,K,N
+` 51,D,N`\
-  59,S,Y
+` 53,K,N`\
-  61,C,R
+` 59,S,Y`\
-  64,S,L
+` 61,C,R`\
-  67,H,P
+` 64,S,L`\
-  68,I,T
+` 67,H,P`\
-  75,A,V
+` 68,I,T`\
-  108,T,K
+` 75,A,V`\
-  115,A,T
+` 108,T,K`\
-  124,E,D
+` 115,A,T`\
-  130,F,Y
+` 124,E,D`\
-  133,L,P
+` 130,F,Y`\
-  142,P,A 
+` 133,L,P`\
-  155,F,I
+` 142,P,A `\
-  158,E,G
+` 155,F,I`\
-  186,I,V 
+` 158,E,G`\
-  21,G,D
+` 186,I,V `\
-` Output looks like this: `
+` 21,G,D`
+Output looks like this: `
  ## PROVEAN scores ##
  # VARIATION   SCORE
  235,G,E   0.076
@@ -618,7 +666,9 @@ in.coord
 ` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
 ` ENSSSCG00000001099ENSSSCT00000001195 431        K T`
-` run_pph.pl -s in.fa in.coord`
+~~~~ {.bash}
+run_pph.pl -s in.fa in.coord
+~~~~
 What comes out: