Variant annotation
------------------

### Slicing and dicing of VCF files

VCF files can get very big. Being able to manipulate them and extract
subsets of data from them will come in handy during the variant
prediction and Population practical sessions.

The first task involves selecting all called variants of chromosome 18,
bgzip-ping them on-the-fly, and then indexing that file with tabix:

~~~~ {.bash}
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
tabix -p vcf BT18.vcf.gz
~~~~
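
As an aside, the effect of such a region query can be sketched in plain
Python. This is purely illustrative and not part of the practical: real
tabix uses the .tbi index to jump straight to the requested region
instead of scanning the whole file, which is why the bgzip/tabix
combination scales to huge VCFs.

```python
# Illustrative sketch of a tabix-style region query on a VCF text stream.
# Real tabix uses the .tbi index; this linear scan only shows the logic.
def select_region(vcf_lines, chrom, start=None, end=None, keep_header=True):
    """Yield header lines (like 'tabix -h') and records on chrom in [start, end]."""
    for line in vcf_lines:
        if line.startswith('#'):          # meta/header lines
            if keep_header:
                yield line
            continue
        fields = line.split('\t')
        if fields[0] != chrom:            # CHROM column
            continue
        pos = int(fields[1])              # POS column
        if (start is not None and pos < start) or (end is not None and pos > end):
            continue
        yield line

vcf = [
    '##fileformat=VCFv4.1',
    '#CHROM\tPOS\tID\tREF\tALT',
    '18\t500\t.\tG\tA',
    '17\t600\t.\tC\tT',
    '18\t9000\t.\tT\tC',
]
print(len(list(select_region(vcf, '18'))))          # headers + 2 records
print(len(list(select_region(vcf, '18', 1, 1000)))) # headers + 1 record
```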

### Annotating VCF with rs-numbers

Known variants, as databased in dbSNP, can be found in this file:

Have a look at what is inside this file:

~~~~ {.bash}
gunzip -c BT18.vcf.gz | grep -v '^##' | more
~~~~

or

~~~~ {.bash}
tabix BT18.vcf.gz 18:1-10000 | more
~~~~

You can use vcftools to do many things related to VCF files. This
includes annotating the VCF based on other interval-file formatted
input, including other VCF files. For instance, you can annotate an
interval of the VCF on the fly, by selecting that interval using tabix,
and then piping it to vcf-annotate, like so:

~~~~ {.bash}
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
~~~~

Now make a new VCF file for all the variants called for chromosome 18
using this tool:

~~~~ {.bash}
tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
tabix -p vcf BT18_rsnumbers.vcf.gz
~~~~

### Extracting variants using BEDtools

The GTF file of the gene annotation for the current cow genome (UMD3.1)
build can be found at Ensembl and downloaded using the command line
(NOTE: the compute nodes you are working on do not have internet
connectivity. This line of code is just for completeness):

~~~~ {.bash}
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
~~~~

The file is already downloaded and can be found here:

Have a quick look at what is in this file:

~~~~ {.bash}
gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more
~~~~

Familiarise yourself with the GTF format if you have not already. For
more information:

You can select the appropriate region from the GTF file:

~~~~ {.bash}
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
~~~~
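
For comparison — and purely as a sketch — the same selection can be
written in Python; the indices mirror awk's $1, $3 and $4 (GTF columns
seqname, feature and start):

```python
# Keep 'gene' features on a chromosome whose start coordinate falls in a
# window; mirrors: awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000'
def genes_in_window(gtf_lines, chrom='18', start=1000000, end=2000000):
    selected = []
    for line in gtf_lines:
        if line.startswith('#'):                  # skip GTF comment lines
            continue
        f = line.rstrip('\n').split('\t')
        if f[2] == 'gene' and f[0] == chrom and start < int(f[3]) < end:
            selected.append(line)
    return selected

gtf = [
    '18\tensembl\tgene\t1500000\t1510000\t.\t+\t.\tgene_id "g1";',
    '18\tensembl\texon\t1500000\t1500200\t.\t+\t.\tgene_id "g1";',
    '18\tensembl\tgene\t2500000\t2510000\t.\t-\t.\tgene_id "g2";',
]
print(len(genes_in_window(gtf)))  # 1
```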

Subsequently you can intersect the VCF with the GTF file that contains
just the genes between coordinates 1 and 2 million:

~~~~ {.bash}
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
~~~~

Have a quick look at the output, and count the number of variants in
this VCF (e.g. by using 'wc -l').

There are dedicated annotation tools that go well beyond what you can
achieve using Bedtools and a GTF. Among them are Variant Effect
Predictor (VEP), snpEff, and Annovar. For the current practical we will
use the former two.

VEP uses a cache directory that has the annotation information. On the
infrastructure you are working on that directory can be found here:

` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`

VEP can be found here:

~~~~ {.bash}
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
~~~~

If you invoke it without parameters you will see the following:

` Usage:`\
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
` `\
` Basic options`\
` =============`\
` `\
` --help Display this message and quit`\
` `\
` -i | --input_file Input file`\
` -o | --output_file Output file`\
` --force_overwrite Force overwriting of output file`\
` --species [species] Species to use [default: "human"]`\
` `\
` --everything Shortcut switch to turn on commonly used options. See web`\
` documentation for details [default: off]`\
` --fork [num_forks] Use forking to improve script runtime`\
` `\
` For full option documentation see:`\
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)

For the current session we will use the following options:

` --species bos_taurus`\
` -o BT18.vep # outputfile`\
` --fork 4 # use 4 threads`\
` --canonical`\
` --sift b`\
` --coding_only`\
` --no_intergenic`\
` --offline # using a cache directory`\
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
` --force_overwrite`

You can create a file that will provide annotation information for each
SNP:

~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
~~~~

Have a look inside the output file and study its content. Why are there
no SNPs in this list that are intergenic?

The VEP output you generated before has many advantages, one of which is
that it is nicely tabular and easy to parse. In fact a demonstration of
that will be covered in the 'Provean' practical. For some purposes,
however, it might be more convenient to have the annotations directly in
the VCF. Furthermore, you may want annotations for each and every SNP,
even if e.g. they are intergenic. You could for instance do the
following, omitting the '--no\_intergenic' and '--coding\_only' flags,
and instead adding the '--vcf' flag:

~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
--fork 4 --canonical --sift b --offline --force_overwrite --vcf
bgzip BT18.vep.vcf
tabix -p vcf BT18.vep.vcf.gz
~~~~

Again, study and interpret your outcome by looking inside the file.
Specifically, note WHERE the annotation ended up.

` 18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10`\
` 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10`

### A further look at the VEP output

Although the VCF now looks even more complex than before, there is a
good reason to add the VEP annotation data the way it was done. The 8th
column of the VCF provides a lot of information, each field separated by
a ';'. In addition, each field contains a field name and a value,
delimited by a '='. The VEP annotation has its own field name, or tag:
'CSQ'. Within this field, the values are again structured, but this time
delimited by a '|'. What we in fact have now is a nested data structure:

~~~~ {.python}
['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0',
'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279',
'CSQ': ['A', 'ENSBTAG00000018834', 'ENSBTAT00000025071', 'Transcript', 'missense_variant',
'2132', '1984', '662', 'E/K', 'Gag/Aag', '', '', '1', 'YES', ['deleterious', '0.02']], 'EPP': '7.35324',
'SRP': '31.7504', 'QR': '2421', 'SAF': '1', 'ODDS': '2.54006', 'SRR': '19', 'SRF': '49',
'technology.ILLUMINA': '1', 'QA': '70', 'EPPR': '3.13803', 'SAP': '3.0103', 'RPPR': '11.1853', 'PQR': '0',
'NS': '8', 'CIGAR': '1X', 'TYPE': 'snp', 'AF': '0.125', 'ABP': '0', 'LEN': '1', 'AC': '0', 'MEANALT': '1',
'PAIRED': '1', 'AO': '2', 'AN': '16', 'MQM': '60', 'RUN': '1', 'PRO': '0', 'RPP': '3.0103'}, 'GT:DP:RO:QR:AO:QA:GL',
'0/0:7:7:263:0:0:0,-1.92588,-10', '0/0:7:7:238:0:0:0,-1.92376,-10', '0/0:9:7:266:2:70:-3.90732,0,-10',
'0/0:10:10:362:0:0:0,-2.73805,-10', '0/0:10:10:345:0:0:0,-2.73789,-10', '0/0:8:8:287:0:0:0,-2.19662,-10',
'0/0:5:5:178:0:0:0,-1.38409,-10', '0/0:14:14:482:0:0:0,-3.82032,-10']
~~~~

When dealing with deleterious alleles, we might be interested to learn
how frequent that allele is in the population we are studying. Our
expectation is that genuinely deleterious alleles should be relatively
rare, unless they are deleterious in a natural context
but advantageous in the context of domestication/breeding. The VCF file
holds the (reference) allele frequency information. The 8th column of
the VCF file in fact is a bunch of fields concatenated with the ';'
delimiter. One of the fields has the tag 'AF=', where AF stands for
(Alternative) Allele Frequency. We can expand the number of
tab-delimited fields easily by replacing ';' with '\t', and then look
for the 11th column. We need to remove the 'AF=' tag though, but if we
do that we can further select for instance for allele frequencies
between 0 and 15%.

~~~~ {.bash}
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
~~~~
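
The same selection can be sketched in Python without any column
counting, by turning the INFO field into a dictionary first
(illustrative; `af_in_window` is not part of the practical's scripts):

```python
# Test whether the INFO column of a VCF record has 0 < AF < 0.15,
# replacing the sed/awk column gymnastics with a dictionary lookup.
def af_in_window(info_column, lo=0.0, hi=0.15):
    info = dict(item.split('=', 1)
                for item in info_column.split(';') if '=' in item)
    return lo < float(info.get('AF', 0)) < hi

print(af_in_window('AB=0;AF=0.125;AN=16'))  # True
print(af_in_window('AB=0;AF=0.5;AN=16'))    # False
```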

Analysing the annotations a bit further, you could discover that there
are quite a few variants that have high deleterious allele frequencies.
Not only do some putative deleterious alleles occur at high frequency,
but there is also a number of genes that have quite a few deleterious
alleles annotated in the current population sample. Say you would like
to make a table that contains all genes that have more than one putative
deleterious variant. We would need to isolate both the AF field and the
CSQ field, and from the latter isolate the 'gene' field. While this is
possible with a fairly simple shell scripting hack, it does require an
awful lot of counting to make sure you end up with the right fields.
Ideally you would like to have more control by putting the relevant data
in a complex data structure. Languages such as Perl and particularly
Python allow far more explicit syntax to achieve that, so we will swap
out a complex string of shell commands for a single Python script. That
script does only one thing: it takes lines from the VCF file, puts all
fields in a complex data structure (as seen above) and, in a structured
way, prints a selection of fields.

What this script will do is take VCF input from a stream:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep
~~~~

Notice that the script itself does in fact take an option: 'vep'. This
is done because it can also work for the snpEff annotation, which,
highly inconveniently, is slightly different.

For each line it provides output similar to this:

` # chrom coord AF Gene consequence sift_prediction sift_score`\
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`

Obviously, the functionality of the Python script is limited. However,
it would be quite easy to provide additional statistics and output
options based on the current script, because it makes the data structure
so explicit. This means that the Python script provides you with a means
for expanding its functionality, which would be far more tricky using
shell commands alone.

~~~~ {.bash}
## alternative way, shell hack
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
~~~~

Now we make a file that has the gene names, and counts, for all genes
that have more than one deleterious allele:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
~~~~
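
The counting step of that pipeline can be sketched with
collections.Counter; the (gene, prediction) pairs below stand in for the
columns the script prints and are made-up example values:

```python
from collections import Counter

# Keep genes with at least `min_count` variants predicted 'deleterious';
# the Python equivalent of: sort | uniq -c | awk '$1>1'
def genes_with_multiple_deleterious(records, min_count=2):
    counts = Counter(gene for gene, prediction in records
                     if prediction == 'deleterious')
    return {gene: n for gene, n in counts.items() if n >= min_count}

records = [
    ('ENSBTAG00000018834', 'deleterious'),
    ('ENSBTAG00000018834', 'deleterious'),
    ('ENSBTAG00000007070', 'deleterious'),
    ('ENSBTAG00000001234', 'tolerated'),
]
print(genes_with_multiple_deleterious(records))  # {'ENSBTAG00000018834': 2}
```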

Take two or three genes that have the highest number of deleterious
variants. Manually look them up at the Ensembl website (ensembl.org).

Genes that have only a single deleterious variant, segregating at
a low frequency, would certainly be better candidates than genes that
have multiple deleterious variants, as the above example indicates. To
filter you can do the following:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
~~~~

How many genes are there that meet this condition?

snpEff is another dedicated annotation tool. One of its
advantages over VEP is that it is applicable to a wider range of
species. Furthermore, it is blazingly fast. For chromosome 18, we will
also use this tool.

snpEff is Java-based. The path to the snpEff.jar Java archive is
available as an environment variable. Do:

~~~~ {.bash}
echo $snpEff
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
~~~~

The snpEff program can be used like this:

~~~~ {.bash}
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt
~~~~

The options used here are:

` -dataDir ~/snpEff/data # annotation data goes here`\
` -v UMD3.1.78 # Genome / version`\
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`

Study the outcome. Try to filter out 'intergenic' SNPs and other things
that are perhaps a bit less relevant (use your skills in 'grep-ping',
etc.).
You will notice that the variants that have an effect are labelled
'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
many are there?

You can now easily make tables of the number of different consequences
as annotated by both these annotation tools:

~~~~ {.bash}
cat LeonEnv/testresult_snpEff_Bt.txt | python3 cut_line.py -m snpeff | cut -f6 | sort | uniq -c

gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 cut_line.py -m vep | cut -f5 | sort | uniq -c
~~~~

Bonus question: Compare the results by selecting the 'MODERATE' and
'HIGH' variants, and compare that with the 'deleterious' variants as
annotated (through SIFT) in VEP. You can use, e.g., vcf-compare.

Inferring function
------------------

### Provean

Provean uses an approach where a protein sequence is compared (Psiblast)
to all known protein sequences (the so-called 'non-redundant database').
Subsequent clustering defines a group of 'similar', and implicitly
homologous, sequences. An example of in- and output of a pig sequence:

What goes in: protein sequence (Fasta):

` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding transcript_biotype:protein_coding`\
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`

and variation, with coordinates pertaining to the above protein
sequence, the second column the reference amino acid, and the third/last
column the alternative amino acid (alternative allele):

` 235,G,E`\
` 5,V,A`\
` 22,C,Y`\
` 34,T,I`\
` 51,D,N`\
` 53,K,N`\
` 59,S,Y`\
` 61,C,R`\
` 64,S,L`\
` 67,H,P`\
` 68,I,T`\
` 75,A,V`\
` 108,T,K`\
` 115,A,T`\
` 124,E,D`\
` 130,F,Y`\
` 133,L,P`\
` 142,P,A`\
` 155,F,I`\
` 158,E,G`\
` 186,I,V`\
` 21,G,D`

Output looks like this:

` ## PROVEAN scores ##`\
` # VARIATION SCORE`\
` 235,G,E 0.076`\
` 5,V,A 0.287`\
` 22,C,Y -8.028`\
` 34,T,I -1.932`\
` 51,D,N -1.613`\
` 53,K,N 1.565`\
` 59,S,Y -2.140`\
` 61,C,R -5.826`\
` 64,S,L -0.437`\
` 67,H,P -0.511`\
` 68,I,T -2.664`\
` 75,A,V -1.061`\
` 108,T,K -4.051`\
` 115,A,T 1.557`\
` 124,E,D -1.587`\
` 130,F,Y -0.983`\
` 133,L,P 3.203`\
` 142,P,A -2.058`\
` 155,F,I 0.502`\
` 158,E,G -0.077`\
` 186,I,V 0.220`\
` 21,G,D -6.014`

We will do something similar, although with a somewhat less complex
example, for the cow data. At the end of the VEP exercise we created a
filtered list of genes that have only a single deleterious variant
annotated. Among these genes was the BLOC1S3 gene, ENSBTAG00000007070.

` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)

What we need for the Provean analysis is 1) the protein sequence of the
gene (or rather, the transcript), and 2) a file with the variation in
the same format as listed above. Obviously, if we want to do this for
many genes, we will not parse the relevant information by hand. We will
make a little pipeline for that. A multifasta file that contains all
protein sequences for cow can be found at Ensembl. It is located here on
the machine you are working on:

~~~~ {.bash}
## define the Env.Var. that holds the gene id:
GENE=ENSBTAG00000007070

## search the protein file for the gene, and parse out the corresponding protein name
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`

## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var

## retrieve the protein sequence from the multifasta protein sequence file.
## Note: the 'faOneRecord' program is part of the so-called 'Blat' suite, of which Blat is the best known
faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
~~~~
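
The `PROT=` step above can be made more explicit in Python (a sketch,
assuming Ensembl-style fasta headers where the record name is the first
whitespace-separated word after '>'):

```python
# Find the fasta header that mentions a gene id and return the protein
# record name without the leading '>' -- the Python counterpart of:
# grep $GENE | awk '{print $1}' | sed 's/>//'
def protein_id_for_gene(fasta_lines, gene_id):
    for line in fasta_lines:
        if line.startswith('>') and gene_id in line:
            return line.split()[0][1:]
    return None

fa = ['>ENSBTAP00000009293 pep:known gene:ENSBTAG00000007070',
      'MESQSRRRR']
print(protein_id_for_gene(fa, 'ENSBTAG00000007070'))  # ENSBTAP00000009293
```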

This should result in two files:

ENSBTAP00000009293.fa

` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
` LPDIRGVPGTEPEQDPGPRA`

ENSBTAP00000009293.var

` 26,D,N`
Finally, the actual Provean analysis is done with this command:

~~~~ {.bash}
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
~~~~

Study the result. What is the verdict on the alternative allele
(D--\>N)?

Provean does many things 'under the hood'. When you follow the process
(e.g. using the 'top' command in a second terminal screen), you will
notice that the majority of time is spent on Psiblast, and that it takes
quite a bit of the total memory of the node you are working on
(typically \>10GB of RAM). The similar sequences that Provean fished out
of the nr-database and then clustered are written to a 'supporting set'
(because you asked Provean to save the supporting set by using the
--save\_supporting\_set flag). Briefly investigate what is in the
supporting set (extension .sss.fasta).

The supporting sequences are not aligned. Chances are that among the
supporting sequences there is a cow sequence. It is a bit cumbersome to
figure out which is which (although doable using blastdb tools - we
won't go into that here). Easiest is to take the protein sequence you
used for the Provean analysis and append it to the supporting set fasta
file (using '\>\>'; note that you need TWO of these to append to a file,
otherwise the file will get overwritten if you use only one!).

~~~~ {.bash}
## add protein sequence to supporting set fasta:
cat $PROT.fa >>$PROT.sss.fasta

## align all protein sequences using the multiple sequence aligner mafft:
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
~~~~

Transfer the alignment to your local environment and view it (seaview?).
What do you observe for the 26th amino-acid of the cow protein sequence,
and what could this mean for the variant in the
cattle population?
### Polyphen

The Polyphen analysis is quite similar in what you need compared to
Provean: a fasta file containing the protein sequence, and a file that
lists coordinates and variants:

What goes in: in.fa

` >ENSSSCG00000001099ENSSSCT00000001195`\
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG`

in.coord

` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`

` run_pph.pl -s in.fa in.coord`

What comes out:

` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord`\
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`

The fasta and coordinate files can be made similarly to what you've done
for Provean. Polyphen is much more human-oriented than Provean (which is
more species agnostic) and has many more features, not all of which work
very well for non-human species. Since Polyphen won't run properly on
your present infrastructure we will not practice it actively.