Changes

Megens, Hendrik-Jan · 0d680f25
--- a/home.markdown
+++ b/home.markdown
@@ -4,29 +4,34 @@ prediction sessions
 Variant annotation
 ------------------

+make a new folder to work in, and go there:
+
+~~~~ {.bash}
+mkdir annotation_session
+cd annotation_session
+~~~~
+
+Create an Env.Var. to the communal session directory that holds some of
+the files you need:
+
+~~~~ {.bash}
+SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation
+~~~~
+
 ### Slicing and dicing of VCF files

 VCF files can get very big. Being able to manipulate them and extract
 relevant information is important. In practice, flat text files such as
 VCF are currently the basis for further analysis - see the Imputation
-and Population practical sessions.
-
-The first task involves selecting all called variants of chromosome 18,
-bgzip-ping them on-the-fly, and then indexing that file with tabix:
+and Population practical sessions. The first task involves selecting all
+called variants of chromosome 18, bgzip-ping them on-the-fly, and then
+indexing that file with tabix:

 ~~~~ {.bash}
-  tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
+  tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
  tabix -p vcf BT18.vcf.gz
 ~~~~

-### Annotating VCF with rs-numbers
-
-The VCF file you are working does not yet have the dbSNP identifiers
-(rs-numbers) annotated. The rs-numbers of all currently known variants
-as databased in dbSNP can be found in this file:
-
-` /path/to/ BT18.vcf.gz`
-
 Have a look at what is inside this file:

 ~~~~ {.bash}
@@ -39,6 +44,14 @@ or
  tabix BT18.vcf.gz 18:1-10000 | more
 ~~~~

+### Annotating VCF with rs-numbers
+
+The VCF file you are working does not yet have the dbSNP identifiers
+(rs-numbers) annotated. The rs-numbers of all currently known variants
+as databased in dbSNP can be found in this file:
+
+` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz`\
+
 You can use vcftools to do many things related to VCF files. This
 includes annotating the VCF based on other interval-file formatted
 input, including other VCF files. For instance, you can annotate an
@@ -46,14 +59,15 @@ interval of the VCF on the fly, by selecting that interval using tabix,
 and then piping it to vcf-annotate, like so:

 ~~~~ {.bash}
-  tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
+## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
+tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail 
 ~~~~

 Now make a new VCF file for all the variants called for chromosome 18
 using this tool:

 ~~~~ {.bash}
-  tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
+  tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
  tabix -p vcf BT18_rsnumbers.vcf.gz
 ~~~~

@@ -80,23 +94,23 @@ connectivity. This line of code is just for completeness):

 The file is already downloaded and can be found here:

-` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
+`  /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`

 Have a quick look at what is in this file:

 ~~~~ {.bash}
-gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more
+gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more
 ~~~~

 Familiarise yourself with the GTF format if you have not already. For
 more information:

-` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
+` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)

 You can select the appropriate region from the GTF file:

 ~~~~ {.bash}
-  gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
+  gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
 ~~~~

 Subsequently you can intersect the VCF with the GTF file that contains
@@ -129,48 +143,48 @@ perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_

 If you invoke it without parameters you will see the following:

-` Usage:`\
-` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
-` `\
-` Basic options`\
-` =============`\
-` `\
-` --help                 Display this message and quit`\
-` `\
-` -i | --input_file      Input file`\
-` -o | --output_file     Output file`\
-` --force_overwrite      Force overwriting of output file`\
-` --species [species]    Species to use [default: "human"]`\
-`                        `\
-` --everything           Shortcut switch to turn on commonly used options. See web`\
-`                        documentation for details [default: off]                       `\
-` --fork [num_forks]     Use forking to improve script runtime`\
-` `\
-` For full option documentation see:`\
-` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
-` `
+` Usage:`\
+` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
+` `\
+` Basic options`\
+` =============`\
+` `\
+` --help                 Display this message and quit`\
+` `\
+` -i | --input_file      Input file`\
+` -o | --output_file     Output file`\
+` --force_overwrite      Force overwriting of output file`\
+` --species [species]    Species to use [default: "human"]`\
+`                        `\
+` --everything           Shortcut switch to turn on commonly used options. See web`\
+`                        documentation for details [default: off]                       `\
+` --fork [num_forks]     Use forking to improve script runtime`\
+` `\
+` For full option documentation see:`\
+` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
+` `

 For the current session we will use the following options:

-` --species bos_taurus`\
-` -o BT18.vep   # outputfile`\
-` --fork 4    # use 4 threads`\
-` --canonical`\
-` --sift b`\
-` --coding_only`\
-` --no_intergenic`\
-` --offline   # using a cache directory`\
-` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
-` --force_overwrite`
+` --species bos_taurus`\
+` -o BT18.vep   # outputfile`\
+` --fork 4    # use 4 threads`\
+` --canonical`\
+` --sift b`\
+` --coding_only`\
+` --no_intergenic`\
+` --offline   # using a cache directory`\
+` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
+` --force_overwrite`

 You can create a file that will provide annotation information for each
 SNP:

 ~~~~ {.bash}
-  gunzip -c BT18_rsnumbers.vcf.gz | \
-    perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
-    --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
-    --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
+gunzip -c BT18_rsnumbers.vcf.gz | \
+  perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
+  --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
+  --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
 ~~~~

 Have a look inside the output file and study its content. Why are there
@@ -186,24 +200,24 @@ following, omitting the '--no\_intergenic and --coding\_only flags, and
 in stead adding the --vcf flag:

 ~~~~ {.bash}
-  gunzip -c BT18_rsnumbers.vcf.gz | \ 
-    perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
-    --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
-    --fork 4 --canonical --sift b --offline --force_overwrite --vcf 
-  bgzip BT18.vep.vcf
-  tabix -p vcf BT18.vep.vcf.gz
+gunzip -c BT18_rsnumbers.vcf.gz | \
+  perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
+  --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
+  --fork 4 --canonical --sift b --offline --force_overwrite --vcf 
+bgzip BT18.vep.vcf
+tabix -p vcf BT18.vep.vcf.gz
 ~~~~

 Again, study and interpret your outcome by looking inside the file.
 Specifically, note WHERE the annotation ended up.

-` 18    53205918    .   G   A   0.329662    .   AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
-` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
-` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
-` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
-` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
-` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10`\
-` 0/0:9:7:266:2:70:-3.90732,0,-10   0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10  0/0:14:14:482:0:0:0,-3.82032,-10`
+` 18    53205918    .   G   A   0.329662    .   AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
+` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
+` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
+` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
+` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
+` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10`\
+` 0/0:9:7:266:2:70:-3.90732,0,-10   0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10  0/0:14:14:482:0:0:0,-3.82032,-10`

 ### A further look at the VEP output

@@ -269,7 +283,7 @@ way, prints a selection of fields.
 What this script will VCF input from a stream:

 ~~~~ {.bash}
-  gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep 
+  gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more 
 ~~~~

 Notice that the script itself does in fact take an option: 'vep'. This
@@ -278,8 +292,8 @@ highly inconveniently, is slightly different.

 and for each line provide output similar to this:

-` # chrom coord AF  Gene  consequence  sift_prediction  sift_score`\
-` 18    53205918    0.125   ENSBTAG00000018834  missense_variant    deleterious 0.02`
+` # chrom coord AF  Gene  consequence  sift_prediction  sift_score`\
+` 18    53205918    0.125   ENSBTAG00000018834  missense_variant    deleterious 0.02`

 Obviously, the functionality of the Python script is limited. However,
 it would be quite easy to provide additional statistics and output
@@ -297,7 +311,8 @@ Now we make a file that has the gene names, and counts, for all genes
 that have more than one deleterious allele:

 ~~~~ {.bash}
-  gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt
+gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
+  cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt
 ~~~~

 Take two or three genes that have the highest number of deleterious
@@ -317,7 +332,8 @@ have multiple deleterious variants, as the above example indicates. To
 filter you can do the following:

 ~~~~ {.bash}
-  gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}'  >genes_with1.txt
+gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
+  cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}'  >genes_with1.txt
 ~~~~

 How many genes are there that meet this condition?
@@ -340,15 +356,17 @@ available as an environment variable. Do:
 The snpEff program can be used like this:

 ~~~~ {.bash}
-  java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt
+java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
+bgzip BT18_snpEff.vcf
+tabix -p vcf BT18_snpEff.vcf.gz
 ~~~~

 The options used here are:

-` -dataDir ~/snpEff/data   # annotation data goes here`\
-` -v UMD3.1.78   # Genome / version`\
-` BT18_rsnumbers.vcf.gz   # requires an input file, can be gzipped`\
-` >testresult_snpEff_Bt.txt   # results go to STDOUT and are captured in a file by '>'`
+` -dataDir ~/snpEff/data   # annotation data goes here`\
+` -v UMD3.1.78   # Genome / version`\
+` BT18_rsnumbers.vcf.gz   # requires an input file, can be gzipped`\
+` >testresult_snpEff_Bt.txt   # results go to STDOUT and are captured in a file by '>'`

 Study the outcome. Try to filter out 'intergenic' SNPs and other things
 that are perhaps a bit less relevant (use your skills in 'grep-ping',
@@ -363,9 +381,11 @@ You can now easily make tables for number of different consequences as
 annotated by both these annotation tools:

 ~~~~ {.bash}
-  cat LeonEnv/testresult_snpEff_Bt.txt  | python3 cut_line.py -m snpeff | cut -f6 | sort | uniq -c
+## for snpEff results:
+gunzip -c BT18_snpEff.vcf.gz  | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c

-  gunzip -c LeonEnv/BT18.vep.all.vcf.gz  | python3 cut_line.py -m vep | cut -f5 | sort | uniq -c
+## for VEP results:
+gunzip -c BT18.vep.vcf.gz  | python3 $SESSIONDIR/print_fields_from_vcf.py  -m vep | cut -f5 | sort | uniq -c
 ~~~~

 Bonus question: Compare the results by selecting the 'MODERATE' and
@@ -384,72 +404,72 @@ homologous sequences. An example of in- and output of a pig sequence:

 What goes in: Protein sequence (Fasta):

-` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
-` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
-` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
-` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
-` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
+` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
+` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
+` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
+` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
+` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`

 and variation, with coordinates pertaining the above protein sequences,
 the second column the reference amino acid, and the third/last column
 the alternative amino acid (alternative allele):

-` 235,G,E`\
-` 5,V,A`\
-` 22,C,Y`\
-` 34,T,I`\
-` 51,D,N`\
-` 53,K,N`\
-` 59,S,Y`\
-` 61,C,R`\
-` 64,S,L`\
-` 67,H,P`\
-` 68,I,T`\
-` 75,A,V`\
-` 108,T,K`\
-` 115,A,T`\
-` 124,E,D`\
-` 130,F,Y`\
-` 133,L,P`\
-` 142,P,A `\
-` 155,F,I`\
-` 158,E,G`\
-` 186,I,V `\
-` 21,G,D`
+` 235,G,E`\
+` 5,V,A`\
+` 22,C,Y`\
+` 34,T,I`\
+` 51,D,N`\
+` 53,K,N`\
+` 59,S,Y`\
+` 61,C,R`\
+` 64,S,L`\
+` 67,H,P`\
+` 68,I,T`\
+` 75,A,V`\
+` 108,T,K`\
+` 115,A,T`\
+` 124,E,D`\
+` 130,F,Y`\
+` 133,L,P`\
+` 142,P,A `\
+` 155,F,I`\
+` 158,E,G`\
+` 186,I,V `\
+` 21,G,D`

 Output looks like this:

-` ## PROVEAN scores ##`\
-` # VARIATION   SCORE`\
-` 235,G,E   0.076`\
-` 5,V,A 0.287`\
-` 22,C,Y    -8.028`\
-` 34,T,I    -1.932`\
-` 51,D,N    -1.613`\
-` 53,K,N    1.565`\
-` 59,S,Y    -2.140`\
-` 61,C,R    -5.826`\
-` 64,S,L    -0.437`\
-` 67,H,P    -0.511`\
-` 68,I,T    -2.664`\
-` 75,A,V    -1.061`\
-` 108,T,K   -4.051`\
-` 115,A,T   1.557`\
-` 124,E,D   -1.587`\
-` 130,F,Y   -0.983`\
-` 133,L,P   3.203`\
-` 142,P,A   -2.058`\
-` 155,F,I   0.502`\
-` 158,E,G   -0.077`\
-` 186,I,V   0.220`\
-` 21,G,D    -6.014`
+` ## PROVEAN scores ##`\
+` # VARIATION   SCORE`\
+` 235,G,E   0.076`\
+` 5,V,A 0.287`\
+` 22,C,Y    -8.028`\
+` 34,T,I    -1.932`\
+` 51,D,N    -1.613`\
+` 53,K,N    1.565`\
+` 59,S,Y    -2.140`\
+` 61,C,R    -5.826`\
+` 64,S,L    -0.437`\
+` 67,H,P    -0.511`\
+` 68,I,T    -2.664`\
+` 75,A,V    -1.061`\
+` 108,T,K   -4.051`\
+` 115,A,T   1.557`\
+` 124,E,D   -1.587`\
+` 130,F,Y   -0.983`\
+` 133,L,P   3.203`\
+` 142,P,A   -2.058`\
+` 155,F,I   0.502`\
+` 158,E,G   -0.077`\
+` 186,I,V   0.220`\
+` 21,G,D    -6.014`

 We will do something similar, although with a somewhat less complex
 example, for the cow data. At the end of the VEP exercise we created a
 filtered list of genes that have only a single deleterious variant
 annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.

-` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
+` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)

 What we need for the Provean analysis is 1) the protein sequence of the
 gene (or rather, the transcript), and 2) a file with the variation in
@@ -460,41 +480,42 @@ protein sequences for cow can be found at Ensembl. It is located here on
 the machine you are working on:

 ~~~~ {.bash}
-  ## define the Env.Var. that holds the gene id:
-  GENE=ENSBTAG00000007070
+## define the Env.Var. that holds the gene id:
+GENE=ENSBTAG00000007070
  
-  ## search the protein file for the gene, and parse out the corresponding protein name
-  ## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
-  PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
+## search the protein file for the gene, and parse out the corresponding protein name
+## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
+PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
  
-  ## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
-  ## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
-  ## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
-  cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
+## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
+## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
+## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
+cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
  
-  ## retrieve the protein sequence from the multifasta protein sequence file.
-  ## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
-  faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
+## retrieve the protein sequence from the multifasta protein sequence file.
+## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
+faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
 ~~~~

 This should result in two files that are produced:

 ENSBTAP00000009293.fa

-` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
-` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
-` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
-` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
-` LPDIRGVPGTEPEQDPGPRA`
+` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
+` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
+` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
+` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
+` LPDIRGVPGTEPEQDPGPRA`

 ENSBTAP00000009293.var

-` 26,D,N`
+` 26,D,N`

-Finally, the actual Provean analysis is done with this command:
+Finally, the actual Provean analysis is done with this command (this may
+take a few minutes to run):

 ~~~~ {.bash}
-  provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
+provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
 ~~~~

 Study the result. What is the verdict on the alternative allele?
@@ -519,18 +540,17 @@ for the Provean analysis and add it to the supporting set fasta file
 otherwise the file will get overwritten if you use only one!).

 ~~~~ {.bash}
-  ## add protein sequence to supporting set fasta:
-  cat $PROT.fa >>$PROT.sss.fasta
+## add protein sequence to supporting set fasta:
+cat $PROT.fa >>$PROT.sss.fasta
  
-  ## align all protein sequences using the multiple sequence aligner mafft:
-  mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
+## align all protein sequences using the multiple sequence aligner mafft:
+mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
 ~~~~

 Transfer the alignment to your local environment and view (seaview?).
 What do you observe for the 26th amino-acid of the cow protein sequence?
-In this 'implicit evolutionary context' there is one alternative amino
-acid that occurs at this site. What do the D amino acid and this
-evolutionary viable alternative share in terms of properties?
+In this 'implicit evolutionary context' is there another alternative
+amino acid residue possible?

 Through the orthology with human, try to find some possible phenotypic
 consequences of the alternative allele (e.g. through OMIM). Could there
@@ -545,33 +565,33 @@ lists coordinates and variants:

 What goes in: in.fa

-` >ENSSSCG00000001099ENSSSCT00000001195`\
-` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
-` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
-` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
-` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
-` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
-` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
-` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
-` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
-` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
-` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
+` >ENSSSCG00000001099ENSSSCT00000001195`\
+` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
+` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
+` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
+` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
+` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
+` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
+` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
+` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
+` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
+` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `

 in.coord

-` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
-` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
-` ENSSSCG00000001099ENSSSCT00000001195 431        K T`
+` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
+` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
+` ENSSSCG00000001099ENSSSCT00000001195 431        K T`

-` run_pph.pl -s in.fa in.coord`
+` run_pph.pl -s in.fa in.coord`

 What comes out:

-` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
-` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
-` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
-` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
-` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`
+` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
+` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
+` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
+` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
+` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`

 The fasta and coordinate files can be made similar to what you've done
 for Provean. Polyphen is much more human-oriented than Provean (which is