Changes

Megens, Hendrik-Jan · be9baf9e
--- a/home.markdown
+++ b/home.markdown
@@ -28,20 +28,20 @@ called variants of chromosome 18, bgzip-ping them on-the-fly, and then
 indexing that file with tabix:

 ~~~~ {.bash}
-  tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
-  tabix -p vcf BT18.vcf.gz
+tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
+tabix -p vcf BT18.vcf.gz
 ~~~~

 Have a look at what is inside this file:

 ~~~~ {.bash}
-  gunzip -c BT18.vcf.gz | grep -v '^##' | more
+gunzip -c BT18.vcf.gz | grep -v '^##' | more
 ~~~~

 or

 ~~~~ {.bash}
-  tabix BT18.vcf.gz 18:1-10000 | more
+tabix BT18.vcf.gz 18:1-10000 | more
 ~~~~

 ### Annotating VCF with rs-numbers
@@ -50,7 +50,9 @@ The VCF file you are working does not yet have the dbSNP identifiers
 (rs-numbers) annotated. The rs-numbers of all currently known variants
 as databased in dbSNP can be found in this file:

-` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz`\
+`
+/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
+`

 You can use vcftools to do many things related to VCF files. This
 includes annotating the VCF based on other interval-file formatted
@@ -67,8 +69,8 @@ Now make a new VCF file for all the variants called for chromosome 18
 using this tool:

 ~~~~ {.bash}
-  tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
-  tabix -p vcf BT18_rsnumbers.vcf.gz
+tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
+tabix -p vcf BT18_rsnumbers.vcf.gz
 ~~~~

 ### Extracting variants using BEDtools
@@ -89,12 +91,12 @@ build can be found at Ensembl and downloaded using the command line
 connectivity. This line of code is just for completeness):

 ~~~~ {.bash}
-  wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
+wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
 ~~~~

 The file is already downloaded and can be found here:

-`  /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
+`  /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`

 Have a quick look at what is in this file:

@@ -105,19 +107,19 @@ gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more
 Familiarise yourself with the GTF format if you have not already. For
 more information:

-` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
+` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)

 You can select the appropriate region from the GTF file:

 ~~~~ {.bash}
-  gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
+gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
 ~~~~

 Subsequently you can intersect the VCF with the GTF file that contains
 just the genes between coordinates 1 and 2 million:

 ~~~~ {.bash}
-  bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
+bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
 ~~~~

 Have a quick look at the output, and count the number of variants in
@@ -143,39 +145,39 @@ perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_

 If you invoke it without parameters you will see the following:

-` Usage:`\
-` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
-` `\
-` Basic options`\
-` =============`\
-` `\
-` --help                 Display this message and quit`\
-` `\
-` -i | --input_file      Input file`\
-` -o | --output_file     Output file`\
-` --force_overwrite      Force overwriting of output file`\
-` --species [species]    Species to use [default: "human"]`\
-`                        `\
-` --everything           Shortcut switch to turn on commonly used options. See web`\
-`                        documentation for details [default: off]                       `\
-` --fork [num_forks]     Use forking to improve script runtime`\
-` `\
-` For full option documentation see:`\
-` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
-` `
+` Usage:`\
+` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
+` `\
+` Basic options`\
+` =============`\
+` `\
+` --help                 Display this message and quit`\
+` `\
+` -i | --input_file      Input file`\
+` -o | --output_file     Output file`\
+` --force_overwrite      Force overwriting of output file`\
+` --species [species]    Species to use [default: "human"]`\
+`                        `\
+` --everything           Shortcut switch to turn on commonly used options. See web`\
+`                        documentation for details [default: off]                       `\
+` --fork [num_forks]     Use forking to improve script runtime`\
+` `\
+` For full option documentation see:`\
+` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
+` `

 For the current session we will use the following options:

-` --species bos_taurus`\
-` -o BT18.vep   # outputfile`\
-` --fork 4    # use 4 threads`\
-` --canonical`\
-` --sift b`\
-` --coding_only`\
-` --no_intergenic`\
-` --offline   # using a cache directory`\
-` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
-` --force_overwrite`
+` --species bos_taurus`\
+` -o BT18.vep   # outputfile`\
+` --fork 4    # use 4 threads`\
+` --canonical`\
+` --sift b`\
+` --coding_only`\
+` --no_intergenic`\
+` --offline   # using a cache directory`\
+` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
+` --force_overwrite`

 You can create a file that will provide annotation information for each
 SNP:
@@ -208,19 +210,23 @@ bgzip BT18.vep.vcf
 tabix -p vcf BT18.vep.vcf.gz
 ~~~~

-Again, study and interpret your outcome by looking inside the file.
-Specifically, note WHERE the annotation ended up.
+Again, study and interpret your outcome by looking inside the file
+'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.

-` 18    53205918    .   G   A   0.329662    .   AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
-` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
-` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
-` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
-` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
-` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10`\
-` 0/0:9:7:266:2:70:-3.90732,0,-10   0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10  0/0:14:14:482:0:0:0,-3.82032,-10`
+    18  53205918    .   G   A   0.329662    . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02)   GT:DP:RO:QR:AO:QA:GL    0/0:7:7:263:0:0:0,-1.92588,-10  0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10  0/0:10:10:362:0:0:0,-2.73805,-10    0/0:10:10:345:0:0:0,-2.73789,-10    0/0:8:8:287:0:0:0,-2.19662,-10  0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10
+
+The two VEP runs have generated reports in HTML format. Transfer them to
+your local computer and study them.

 ### A further look at the VEP output

+From studying the HTML reports it should be clear that the information
+you can find there is necessarily limited and general. It is therefore
+important to acquire skills to generate further analyses based on the
+annotation that meet your specific research needs. In this section we
+will delve a little bit deeper in the structure of the annotations and
+how to extract information from it.
+
 Although the VCF now looks even more complex than before, there is a
 good reason to add the VEP annotation data the way it was done. The 8th
 column of the VCF provides a lot of information, each field separated by
@@ -258,8 +264,13 @@ for the 11th column. We need to remove the 'AF=' tag though, but if we
 do that we can further select for instance for allele frequencies
 between 0 and 15%.

+The next shell-oneliner will give you the variants with rare (\<15%)
+deleterious alleles and count them (by omitting the last part you can
+retrieve the relevant variations themselves).
+
 ~~~~ {.bash}
-  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
+  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
+     awk '$11<0.15&&$11>0' | wc -l
 ~~~~

 Analysing the annotations a bit further, you could discover that there
@@ -280,10 +291,10 @@ That script does only one thing: takes lines from the VCF file, put all
 fields in a complex data structure (as seen above) and, in a structured
 way, prints a selection of fields.

-What this script will VCF input from a stream:
+The Python script will take VCF input from a stream:

 ~~~~ {.bash}
-  gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more 
+gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more 
 ~~~~

 Notice that the script itself does in fact take an option: 'vep'. This
@@ -292,8 +303,8 @@ highly inconveniently, is slightly different.

 and for each line provide output similar to this:

-` # chrom coord AF  Gene  consequence  sift_prediction  sift_score`\
-` 18    53205918    0.125   ENSBTAG00000018834  missense_variant    deleterious 0.02`
+` #chrom  coord   AF      Gene                                consequence           sift_prediction  sift_score`\
+` 18    53205918    0.125   ENSBTAG00000018834  missense_variant    deleterious 0.02`

 Obviously, the functionality of the Python script is limited. However,
 it would be quite easy to provide additional statistics and output
@@ -302,11 +313,6 @@ so explicit. This means that the Python script provides you with a means
 for expansion of functionality, while that will be far more tricky using
 shell commands solely.

-~~~~ {.bash}
-  #alternative way, shell hack
-  gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt  
-~~~~
-
 Now we make a file that has the gene names, and counts, for all genes
 that have more than one deleterious allele:

@@ -316,7 +322,10 @@ gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep
 ~~~~

 Take two or three genes that have the highest number of deleterious
-variants. Manually look them up at the Ensembl website (ensembl.org).
+variants. Manually look them up at the Ensembl website.
+
+` `[`http://www.ensembl.org/Bos_taurus/Info/Index`](http://www.ensembl.org/Bos_taurus/Info/Index)
+
 Check the 'orthologous' status. What do you observe? How would you
 interpret the validity of the gene annotation and/or the likelihood of
 off-site mapping of reads?
@@ -338,6 +347,18 @@ gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep

 How many genes are there that meet this condition?

+You can now easily make tables for number of different consequences as
+annotated by both these annotation tools:
+
+~~~~ {.bash}
+## for VEP results:
+gunzip -c BT18.vep.vcf.gz  | python3 $SESSIONDIR/print_fields_from_vcf.py  -m vep | cut -f5 | sort | uniq -c
+~~~~
+
+VEP also made a report in HTML format. Transfer it to your local
+computer and view it, if you haven't done already. How do your 'hacky'
+results compare to the report VEP made?
+
 ### snpEff

 Another popular variant annotation tool is 'snpEff'. Among the
@@ -363,10 +384,10 @@ tabix -p vcf BT18_snpEff.vcf.gz

 The options used here are:

-` -dataDir ~/snpEff/data   # annotation data goes here`\
-` -v UMD3.1.78   # Genome / version`\
-` BT18_rsnumbers.vcf.gz   # requires an input file, can be gzipped`\
-` >testresult_snpEff_Bt.txt   # results go to STDOUT and are captured in a file by '>'`
+` -dataDir ~/snpEff/data   # annotation data goes here`\
+` -v UMD3.1.78   # Genome / version`\
+` BT18_rsnumbers.vcf.gz   # requires an input file, can be gzipped`\
+` >testresult_snpEff_Bt.txt   # results go to STDOUT and are captured in a file by '>'`

 Study the outcome. Try to filter out 'intergenic' SNPs and other things
 that are perhaps a bit less relevant (use your skills in 'grep-ping',
@@ -377,17 +398,18 @@ You will notice that the variants that have an effect are labelled
 'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
 many are there?

-You can now easily make tables for number of different consequences as
-annotated by both these annotation tools:
+You can now again easily make tables for number of different
+consequences as annotated by both these annotation tools:

 ~~~~ {.bash}
 ## for snpEff results:
 gunzip -c BT18_snpEff.vcf.gz  | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c
-
-## for VEP results:
-gunzip -c BT18.vep.vcf.gz  | python3 $SESSIONDIR/print_fields_from_vcf.py  -m vep | cut -f5 | sort | uniq -c
 ~~~~

+Like VEP, snpEff makes a report in HTML format. Transfer it to your
+local computer and study it. How do your 'hacky' results compare to the
+report snpEff made?
+
 Bonus question: Compare the results by selecting the 'MODERATE' and
 "HIGH" variants, and compare that with the 'deleterious' variants as
 annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
@@ -402,74 +424,75 @@ to all known protein sequences (so called 'non-redundant database').
 Subsequent clustering defines a group of 'similar', and implicitly
 homologous sequences. An example of in- and output of a pig sequence:

-What goes in: Protein sequence (Fasta):
+What goes in:

-` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
-` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
-` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
-` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
-` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
+`
+Protein sequence (Fasta):
+  >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
+  MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
+  CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
+  PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
+  SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
+`

 and variation, with coordinates pertaining the above protein sequences,
 the second column the reference amino acid, and the third/last column
-the alternative amino acid (alternative allele):
-
-` 235,G,E`\
-` 5,V,A`\
-` 22,C,Y`\
-` 34,T,I`\
-` 51,D,N`\
-` 53,K,N`\
-` 59,S,Y`\
-` 61,C,R`\
-` 64,S,L`\
-` 67,H,P`\
-` 68,I,T`\
-` 75,A,V`\
-` 108,T,K`\
-` 115,A,T`\
-` 124,E,D`\
-` 130,F,Y`\
-` 133,L,P`\
-` 142,P,A `\
-` 155,F,I`\
-` 158,E,G`\
-` 186,I,V `\
-` 21,G,D`
-
-Output looks like this:
-
-` ## PROVEAN scores ##`\
-` # VARIATION   SCORE`\
-` 235,G,E   0.076`\
-` 5,V,A 0.287`\
-` 22,C,Y    -8.028`\
-` 34,T,I    -1.932`\
-` 51,D,N    -1.613`\
-` 53,K,N    1.565`\
-` 59,S,Y    -2.140`\
-` 61,C,R    -5.826`\
-` 64,S,L    -0.437`\
-` 67,H,P    -0.511`\
-` 68,I,T    -2.664`\
-` 75,A,V    -1.061`\
-` 108,T,K   -4.051`\
-` 115,A,T   1.557`\
-` 124,E,D   -1.587`\
-` 130,F,Y   -0.983`\
-` 133,L,P   3.203`\
-` 142,P,A   -2.058`\
-` 155,F,I   0.502`\
-` 158,E,G   -0.077`\
-` 186,I,V   0.220`\
-` 21,G,D    -6.014`
+the alternative amino acid (alternative allele): `
+  235,G,E
+  5,V,A
+  22,C,Y
+  34,T,I
+  51,D,N
+  53,K,N
+  59,S,Y
+  61,C,R
+  64,S,L
+  67,H,P
+  68,I,T
+  75,A,V
+  108,T,K
+  115,A,T
+  124,E,D
+  130,F,Y
+  133,L,P
+  142,P,A 
+  155,F,I
+  158,E,G
+  186,I,V 
+  21,G,D
+` Output looks like this: `
+  ## PROVEAN scores ##
+  # VARIATION   SCORE
+  235,G,E   0.076
+  5,V,A 0.287
+  22,C,Y    -8.028
+  34,T,I    -1.932
+  51,D,N    -1.613
+  53,K,N    1.565
+  59,S,Y    -2.140
+  61,C,R    -5.826
+  64,S,L    -0.437
+  67,H,P    -0.511
+  68,I,T    -2.664
+  75,A,V    -1.061
+  108,T,K   -4.051
+  115,A,T   1.557
+  124,E,D   -1.587
+  130,F,Y   -0.983
+  133,L,P   3.203
+  142,P,A   -2.058
+  155,F,I   0.502
+  158,E,G   -0.077
+  186,I,V   0.220
+  21,G,D    -6.014
+`

 We will do something similar, although with a somewhat less complex
 example, for the cow data. At the end of the VEP exercise we created a
 filtered list of genes that have only a single deleterious variant
 annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.

-` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
+` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)

 What we need for the Provean analysis is 1) the protein sequence of the
 gene (or rather, the transcript), and 2) a file with the variation in
@@ -501,15 +524,15 @@ This should result in two files that are produced:

 ENSBTAP00000009293.fa

-` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
-` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
-` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
-` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
-` LPDIRGVPGTEPEQDPGPRA`
+` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
+` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
+` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
+` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
+` LPDIRGVPGTEPEQDPGPRA`

 ENSBTAP00000009293.var

-` 26,D,N`
+` 26,D,N`

 Finally, the actual Provean analysis is done with this command (this may
 take a few minutes to run):
@@ -547,7 +570,19 @@ cat $PROT.fa >>$PROT.sss.fasta
 mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
 ~~~~

-Transfer the alignment to your local environment and view (seaview?).
+Transfer the alignment to your local environment and view. If installed
+on your local computer you could use e.g. Seaview. If you are on a
+Debian-based system (e.g. Ubuntu) and you have administrative rights,
+you can simply install it like this:
+
+~~~~ {.bash}
+sudo apt-get install seaview
+~~~~
+
+Alternatively, you can view the alignment through this online tool:
+
+` `[`http://toolkit.tuebingen.mpg.de/alnviz`](http://toolkit.tuebingen.mpg.de/alnviz)
+
 What do you observe for the 26th amino-acid of the cow protein sequence?
 In this 'implicit evolutionary context' is there another alternative
 amino acid residue possible?
@@ -565,33 +600,33 @@ lists coordinates and variants:

 What goes in: in.fa

-` >ENSSSCG00000001099ENSSSCT00000001195`\
-` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
-` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
-` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
-` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
-` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
-` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
-` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
-` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
-` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
-` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
+` >ENSSSCG00000001099ENSSSCT00000001195`\
+` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
+` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
+` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
+` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
+` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
+` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
+` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
+` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
+` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
+` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `

 in.coord

-` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
-` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
-` ENSSSCG00000001099ENSSSCT00000001195 431        K T`
+` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
+` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
+` ENSSSCG00000001099ENSSSCT00000001195 431        K T`

-` run_pph.pl -s in.fa in.coord`
+` run_pph.pl -s in.fa in.coord`

 What comes out:

-` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
-` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
-` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
-` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
-` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`
+` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
+` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
+` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
+` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
+` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`

 The fasta and coordinate files can be made similar to what you've done
 for Provean. Polyphen is much more human-oriented than Provean (which is