... | ... | @@ -28,6 +28,7 @@ as databased in dbSNP can be found in this file: |
|
|
` /path/to/ BT18.vcf.gz`
|
|
|
|
|
|
Have a look at what is inside this file:
|
|
|
|
|
|
` gunzip -c BT18.vcf.gz | grep -v '^##' | more`
|
|
|
|
|
|
or
|
... | ... | @@ -53,7 +54,8 @@ using this tool: |
|
|
You will very often want to extract variants based on interval files,
|
|
|
e.g. a certain region on a chromosome. This can be easily achieve using
|
|
|
the tabix command described above. However, sometimes you may want to
|
|
|
extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or
|
|
|
extract a substantial number of intervals. You could use tabix in a
|
|
|
loop, either as a shell-script or in a scripting language like Perl or
|
|
|
Python. However, BedTools provides many utilities for manipulating
|
|
|
interval data, such as BED, VCF, or GTF/GFF formats.
|
|
|
|
... | ... | @@ -95,6 +97,7 @@ Now do the same, but select only the exonic variants in the same |
|
|
interval, from 1 to 2 million.
|
|
|
|
|
|
How many variants do you have now?
|
|
|
|
|
|
### Variant Effect Predictor
|
|
|
|
|
|
Various tools exist to make more precise annotations than what you can
|
... | ... | @@ -141,3 +144,163 @@ allele frequencies between 0 and 15%. |
|
|
|
|
|
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`
|
|
|
|
|
|
Analysing the annotations a bit further, you could discover that some
|
|
|
|
|
|
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt `
|
|
|
|
|
|
Take two or three genes that have the highest number of deleterious
|
|
|
variants. Manually look them up at the Ensembl website (ensembl.org).
|
|
|
Check the 'orthologous' status. What do you observe? How would you
|
|
|
interpret the validity of the gene annotation and/or the likelihood of
|
|
|
off-site mapping of reads?
|
|
|
|
|
|
This provides us with a strategy to filter a bit more rigorously for
|
|
|
genes that may actually contain genuine deleterious alleles. Ideally we
|
|
|
would a priory filter for genes that have proper one-to-one orthologous
|
|
|
relations with other mammalian species. Time and lack of direct internet
|
|
|
connection makes that unfeasible for this practical. However, genes that
|
|
|
have only one deleterious variant segregating in the study population at
|
|
|
a low frequency would certainly be better candidates than genes that
|
|
|
have multiple deleterious variants, as the above example indicates. To
|
|
|
filter you can do the following:
|
|
|
|
|
|
` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt`
|
|
|
|
|
|
How many genes are there that meet this condition?
|
|
|
|
|
|
### snpEff
|
|
|
|
|
|
Another popular variant annotation tool is 'snpEff'. Among the
|
|
|
advantages over VEP is that it is applicable to a wider range of
|
|
|
species. Furthermore, it is blazingly fast. For chromosome 18, we will
|
|
|
also use this tool:
|
|
|
|
|
|
` java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt`
|
|
|
|
|
|
Study the outcome. Try to filter out 'intergenic' SNPs and other things
|
|
|
that are perhaps a bit less relevant (use your skills in 'grep-ping',
|
|
|
etc.). What are some of the differences in how the variants are
|
|
|
annotated, compared to VEP?
|
|
|
|
|
|
You will notice that the variants that have an effect are labelled
|
|
|
'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
|
|
|
many are there?
|
|
|
|
|
|
Compare the results by selecting the 'MODERATE' and "HIGH" variants, and
|
|
|
compare that with the 'deleterious' variants as annotated (through SIFT)
|
|
|
in VEP. You can use, e.g. vcf-compare.
|
|
|
|
|
|
Inferring function
|
|
|
------------------
|
|
|
|
|
|
### Provean
|
|
|
|
|
|
` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
|
|
|
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
|
|
|
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
|
|
|
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
|
|
|
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
|
|
|
|
|
|
` 235,G,E`\
|
|
|
` 5,V,A`\
|
|
|
` 22,C,Y`\
|
|
|
` 34,T,I`\
|
|
|
` 51,D,N`\
|
|
|
` 53,K,N`\
|
|
|
` 59,S,Y`\
|
|
|
` 61,C,R`\
|
|
|
` 64,S,L`\
|
|
|
` 67,H,P`\
|
|
|
` 68,I,T`\
|
|
|
` 75,A,V`\
|
|
|
` 108,T,K`\
|
|
|
` 115,A,T`\
|
|
|
` 124,E,D`\
|
|
|
` 130,F,Y`\
|
|
|
` 133,L,P`\
|
|
|
` 142,P,A `\
|
|
|
` 155,F,I`\
|
|
|
` 158,E,G`\
|
|
|
` 186,I,V `\
|
|
|
` 21,G,D`
|
|
|
|
|
|
` ## PROVEAN scores ##`\
|
|
|
` # VARIATION SCORE`\
|
|
|
` 235,G,E 0.076`\
|
|
|
` 5,V,A 0.287`\
|
|
|
` 22,C,Y -8.028`\
|
|
|
` 34,T,I -1.932`\
|
|
|
` 51,D,N -1.613`\
|
|
|
` 53,K,N 1.565`\
|
|
|
` 59,S,Y -2.140`\
|
|
|
` 61,C,R -5.826`\
|
|
|
` 64,S,L -0.437`\
|
|
|
` 67,H,P -0.511`\
|
|
|
` 68,I,T -2.664`\
|
|
|
` 75,A,V -1.061`\
|
|
|
` 108,T,K -4.051`\
|
|
|
` 115,A,T 1.557`\
|
|
|
` 124,E,D -1.587`\
|
|
|
` 130,F,Y -0.983`\
|
|
|
` 133,L,P 3.203`\
|
|
|
` 142,P,A -2.058`\
|
|
|
` 155,F,I 0.502`\
|
|
|
` 158,E,G -0.077`\
|
|
|
` 186,I,V 0.220`\
|
|
|
` 21,G,D -6.014`
|
|
|
|
|
|
At the end of the VEP exercise we created a filtered list of genes that
|
|
|
have only a single deleterious variant annotated. Among these genes was
|
|
|
BLOC1S3 gene, ENSBTAG00000007070.
|
|
|
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
|
|
|
|
|
|
` GENE=ENSBTAG00000007070`\
|
|
|
`` PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'` ``\
|
|
|
` cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var`
|
|
|
|
|
|
` provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;`
|
|
|
|
|
|
` faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa`
|
|
|
|
|
|
` cat ENSBTAG00000007070.fa >>ENSBTAG00000007070.sss.fasta`\
|
|
|
` mafft --auto ENSBTAG00000007070.sss.fasta >ENSBTAG00000007070.sss.fasta.out`
|
|
|
|
|
|
Transfer the alignment to your local environment and view (seaview?).
|
|
|
What do you observe for the 26th amino-acid of the cow protein sequence?
|
|
|
In this 'implicit evolutionary context' there is one alternative amino
|
|
|
acid that occurs at this site. What do the D amino acid and this
|
|
|
evolutionary viable alternative share in terms of properties?
|
|
|
|
|
|
Through the orthology with human, try to find some possible phenotypic
|
|
|
consequences of the alternative allele (e.g. through OMIM). Could there
|
|
|
be a reason that the variant is at a relatively high frequency in this
|
|
|
cattle population?
|
|
|
|
|
|
### Polyphen
|
|
|
|
|
|
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
|
|
|
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
|
|
|
|
|
|
What goes in:
|
|
|
|
|
|
` test1033.fa`\
|
|
|
` >ENSSSCG00000001099ENSSSCT00000001195`\
|
|
|
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
|
|
|
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
|
|
|
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
|
|
|
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
|
|
|
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
|
|
|
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
|
|
|
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
|
|
|
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
|
|
|
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
|
|
|
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
|
|
|
|
|
|
` test1033.coord`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T` |
|
|
\ No newline at end of file |