Changes

Megens, Hendrik-Jan · 655cf3da
--- a/home.markdown
+++ b/home.markdown
@@ -28,6 +28,7 @@ as databased in dbSNP can be found in this file:
 ` /path/to/ BT18.vcf.gz`

 Have a look at what is inside this file:
+
 ` gunzip -c BT18.vcf.gz | grep -v '^##' | more`

 or
@@ -53,7 +54,8 @@ using this tool:
 You will very often want to extract variants based on interval files,
 e.g. a certain region on a chromosome. This can be easily achieve using
 the tabix command described above. However, sometimes you may want to
-extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or
+extract a substantial number of intervals. You could use tabix in a
+loop, either as a shell-script or in a scripting language like Perl or
 Python. However, BedTools provides many utilities for manipulating
 interval data, such as BED, VCF, or GTF/GFF formats.

@@ -95,6 +97,7 @@ Now do the same, but select only the exonic variants in the same
 interval, from 1 to 2 million.

 How many variants do you have now?
+
 ### Variant Effect Predictor

 Various tools exist to make more precise annotations than what you can
@@ -141,3 +144,163 @@ allele frequencies between 0 and 15%.

 ` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`

+Analysing the annotations a bit further, you could discover that some
+
+` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt  `
+
+Take two or three genes that have the highest number of deleterious
+variants. Manually look them up at the Ensembl website (ensembl.org).
+Check the 'orthologous' status. What do you observe? How would you
+interpret the validity of the gene annotation and/or the likelihood of
+off-site mapping of reads?
+
+This provides us with a strategy to filter a bit more rigorously for
+genes that may actually contain genuine deleterious alleles. Ideally we
+would a priory filter for genes that have proper one-to-one orthologous
+relations with other mammalian species. Time and lack of direct internet
+connection makes that unfeasible for this practical. However, genes that
+have only one deleterious variant segregating in the study population at
+a low frequency would certainly be better candidates than genes that
+have multiple deleterious variants, as the above example indicates. To
+filter you can do the following:
+
+` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt`
+
+How many genes are there that meet this condition?
+
+### snpEff
+
+Another popular variant annotation tool is 'snpEff'. Among the
+advantages over VEP is that it is applicable to a wider range of
+species. Furthermore, it is blazingly fast. For chromosome 18, we will
+also use this tool:
+
+` java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt`
+
+Study the outcome. Try to filter out 'intergenic' SNPs and other things
+that are perhaps a bit less relevant (use your skills in 'grep-ping',
+etc.). What are some of the differences in how the variants are
+annotated, compared to VEP?
+
+You will notice that the variants that have an effect are labelled
+'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
+many are there?
+
+Compare the results by selecting the 'MODERATE' and "HIGH" variants, and
+compare that with the 'deleterious' variants as annotated (through SIFT)
+in VEP. You can use, e.g. vcf-compare.
+
+Inferring function
+------------------
+
+### Provean
+
+` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
+` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
+` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
+` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
+` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
+
+` 235,G,E`\
+` 5,V,A`\
+` 22,C,Y`\
+` 34,T,I`\
+` 51,D,N`\
+` 53,K,N`\
+` 59,S,Y`\
+` 61,C,R`\
+` 64,S,L`\
+` 67,H,P`\
+` 68,I,T`\
+` 75,A,V`\
+` 108,T,K`\
+` 115,A,T`\
+` 124,E,D`\
+` 130,F,Y`\
+` 133,L,P`\
+` 142,P,A `\
+` 155,F,I`\
+` 158,E,G`\
+` 186,I,V `\
+` 21,G,D`
+
+` ## PROVEAN scores ##`\
+` # VARIATION   SCORE`\
+` 235,G,E   0.076`\
+` 5,V,A 0.287`\
+` 22,C,Y    -8.028`\
+` 34,T,I    -1.932`\
+` 51,D,N    -1.613`\
+` 53,K,N    1.565`\
+` 59,S,Y    -2.140`\
+` 61,C,R    -5.826`\
+` 64,S,L    -0.437`\
+` 67,H,P    -0.511`\
+` 68,I,T    -2.664`\
+` 75,A,V    -1.061`\
+` 108,T,K   -4.051`\
+` 115,A,T   1.557`\
+` 124,E,D   -1.587`\
+` 130,F,Y   -0.983`\
+` 133,L,P   3.203`\
+` 142,P,A   -2.058`\
+` 155,F,I   0.502`\
+` 158,E,G   -0.077`\
+` 186,I,V   0.220`\
+` 21,G,D    -6.014`
+
+At the end of the VEP exercise we created a filtered list of genes that
+have only a single deleterious variant annotated. Among these genes was
+BLOC1S3 gene, ENSBTAG00000007070.
+
+` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
+
+` GENE=ENSBTAG00000007070`\
+``  PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'` ``\
+` cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var`
+
+` provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;`
+
+` faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa`
+
+` cat ENSBTAG00000007070.fa >>ENSBTAG00000007070.sss.fasta`\
+` mafft --auto ENSBTAG00000007070.sss.fasta >ENSBTAG00000007070.sss.fasta.out`
+
+Transfer the alignment to your local environment and view (seaview?).
+What do you observe for the 26th amino-acid of the cow protein sequence?
+In this 'implicit evolutionary context' there is one alternative amino
+acid that occurs at this site. What do the D amino acid and this
+evolutionary viable alternative share in terms of properties?
+
+Through the orthology with human, try to find some possible phenotypic
+consequences of the alternative allele (e.g. through OMIM). Could there
+be a reason that the variant is at a relatively high frequency in this
+cattle population?
+
+### Polyphen
+
+` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
+` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
+` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
+` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
+` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`
+
+What goes in:
+
+` test1033.fa`\
+` >ENSSSCG00000001099ENSSSCT00000001195`\
+` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
+` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
+` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
+` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
+` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
+` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
+` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
+` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
+` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
+` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
+
+` test1033.coord`\
+` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
+` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
+` ENSSSCG00000001099ENSSSCT00000001195 431        K T`
\ No newline at end of file