... | ... | @@ -28,20 +28,20 @@ called variants of chromosome 18, bgzip-ping them on-the-fly, and then |
|
|
indexing that file with tabix:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
|
|
|
tabix -p vcf BT18.vcf.gz
|
|
|
tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
|
|
|
tabix -p vcf BT18.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
Have a look at what is inside this file:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18.vcf.gz | grep -v '^##' | more
|
|
|
gunzip -c BT18.vcf.gz | grep -v '^##' | more
|
|
|
~~~~
|
|
|
|
|
|
or
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix BT18.vcf.gz 18:1-10000 | more
|
|
|
tabix BT18.vcf.gz 18:1-10000 | more
|
|
|
~~~~
|
|
|
|
|
|
### Annotating VCF with rs-numbers
|
... | ... | @@ -50,7 +50,9 @@ The VCF file you are working does not yet have the dbSNP identifiers |
|
|
(rs-numbers) annotated. The rs-numbers of all currently known variants
|
|
|
as databased in dbSNP can be found in this file:
|
|
|
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz`\
|
|
|
`
|
|
|
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
|
|
|
`
|
|
|
|
|
|
You can use vcftools to do many things related to VCF files. This
|
|
|
includes annotating the VCF based on other interval-file formatted
|
... | ... | @@ -67,8 +69,8 @@ Now make a new VCF file for all the variants called for chromosome 18 |
|
|
using this tool:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
|
|
|
tabix -p vcf BT18_rsnumbers.vcf.gz
|
|
|
tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
|
|
|
tabix -p vcf BT18_rsnumbers.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
### Extracting variants using BEDtools
|
... | ... | @@ -89,12 +91,12 @@ build can be found at Ensembl and downloaded using the command line |
|
|
connectivity. This line of code is just for completeness):
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
|
|
|
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
|
|
|
~~~~
|
|
|
|
|
|
The file is already downloaded and can be found here:
|
|
|
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
|
|
|
|
|
|
Have a quick look at what is in this file:
|
|
|
|
... | ... | @@ -105,19 +107,19 @@ gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more |
|
|
Familiarise yourself with the GTF format if you have not already. For
|
|
|
more information:
|
|
|
|
|
|
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
|
|
|
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
|
|
|
|
|
|
You can select the appropriate region from the GTF file:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
|
|
|
gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
|
|
|
~~~~
|
|
|
|
|
|
Subsequently you can intersect the VCF with the GTF file that contains
|
|
|
just the genes between coordinates 1 and 2 million:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
|
|
|
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
|
|
|
~~~~
|
|
|
|
|
|
Have a quick look at the output, and count the number of variants in
|
... | ... | @@ -143,39 +145,39 @@ perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_ |
|
|
|
|
|
If you invoke it without parameters you will see the following:
|
|
|
|
|
|
` Usage:`\
|
|
|
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
|
|
|
` `\
|
|
|
` Basic options`\
|
|
|
` =============`\
|
|
|
` `\
|
|
|
` --help Display this message and quit`\
|
|
|
` `\
|
|
|
` -i | --input_file Input file`\
|
|
|
` -o | --output_file Output file`\
|
|
|
` --force_overwrite Force overwriting of output file`\
|
|
|
` --species [species] Species to use [default: "human"]`\
|
|
|
` `\
|
|
|
` --everything Shortcut switch to turn on commonly used options. See web`\
|
|
|
` documentation for details [default: off] `\
|
|
|
` --fork [num_forks] Use forking to improve script runtime`\
|
|
|
` `\
|
|
|
` For full option documentation see:`\
|
|
|
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
|
|
|
` `
|
|
|
` Usage:`\
|
|
|
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
|
|
|
` `\
|
|
|
` Basic options`\
|
|
|
` =============`\
|
|
|
` `\
|
|
|
` --help Display this message and quit`\
|
|
|
` `\
|
|
|
` -i | --input_file Input file`\
|
|
|
` -o | --output_file Output file`\
|
|
|
` --force_overwrite Force overwriting of output file`\
|
|
|
` --species [species] Species to use [default: "human"]`\
|
|
|
` `\
|
|
|
` --everything Shortcut switch to turn on commonly used options. See web`\
|
|
|
` documentation for details [default: off] `\
|
|
|
` --fork [num_forks] Use forking to improve script runtime`\
|
|
|
` `\
|
|
|
` For full option documentation see:`\
|
|
|
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
|
|
|
` `
|
|
|
|
|
|
For the current session we will use the following options:
|
|
|
|
|
|
` --species bos_taurus`\
|
|
|
` -o BT18.vep # outputfile`\
|
|
|
` --fork 4 # use 4 threads`\
|
|
|
` --canonical`\
|
|
|
` --sift b`\
|
|
|
` --coding_only`\
|
|
|
` --no_intergenic`\
|
|
|
` --offline # using a cache directory`\
|
|
|
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
|
|
|
` --force_overwrite`
|
|
|
` --species bos_taurus`\
|
|
|
` -o BT18.vep # outputfile`\
|
|
|
` --fork 4 # use 4 threads`\
|
|
|
` --canonical`\
|
|
|
` --sift b`\
|
|
|
` --coding_only`\
|
|
|
` --no_intergenic`\
|
|
|
` --offline # using a cache directory`\
|
|
|
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
|
|
|
` --force_overwrite`
|
|
|
|
|
|
You can create a file that will provide annotation information for each
|
|
|
SNP:
|
... | ... | @@ -208,19 +210,23 @@ bgzip BT18.vep.vcf |
|
|
tabix -p vcf BT18.vep.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
Again, study and interpret your outcome by looking inside the file.
|
|
|
Specifically, note WHERE the annotation ended up.
|
|
|
Again, study and interpret your outcome by looking inside the file
|
|
|
'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up.
|
|
|
|
|
|
` 18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
|
|
|
` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
|
|
|
` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
|
|
|
` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
|
|
|
` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
|
|
|
` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10`\
|
|
|
` 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10`
|
|
|
18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10
|
|
|
|
|
|
The two VEP runs have generated reports in HTML format. Transfer them to
|
|
|
your local computer and study them.
|
|
|
|
|
|
### A further look at the VEP output
|
|
|
|
|
|
From studying the HTML reports it should be clear that the information
|
|
|
you can find there is necessarily limited and general. It is therefore
|
|
|
important to acquire skills to generate further analyses based on the
|
|
|
annotation that meet your specific research needs. In this section we
|
|
|
will delve a little bit deeper in the structure of the annotations and
|
|
|
how to extract information from it.
|
|
|
|
|
|
Although the VCF now looks even more complex than before, there is a
|
|
|
good reason to add the VEP annotation data the way it was done. The 8th
|
|
|
column of the VCF provides a lot of information, each field separated by
|
... | ... | @@ -258,8 +264,13 @@ for the 11th column. We need to remove the 'AF=' tag though, but if we |
|
|
do that we can further select for instance for allele frequencies
|
|
|
between 0 and 15%.
|
|
|
|
|
|
The next shell-oneliner will give you the variants with rare (\<15%)
|
|
|
deleterious alleles and count them (by omitting the last part you can
|
|
|
retrieve the relevant variations themselves).
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
|
|
|
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
|
|
|
awk '$11<0.15&&$11>0' | wc -l
|
|
|
~~~~
|
|
|
|
|
|
Analysing the annotations a bit further, you could discover that there
|
... | ... | @@ -280,10 +291,10 @@ That script does only one thing: takes lines from the VCF file, put all |
|
|
fields in a complex data structure (as seen above) and, in a structured
|
|
|
way, prints a selection of fields.
|
|
|
|
|
|
What this script will VCF input from a stream:
|
|
|
The Python script will take VCF input from a stream:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more
|
|
|
~~~~
|
|
|
|
|
|
Notice that the script itself does in fact take an option: 'vep'. This
|
... | ... | @@ -292,8 +303,8 @@ highly inconveniently, is slightly different. |
|
|
|
|
|
and for each line provide output similar to this:
|
|
|
|
|
|
` # chrom coord AF Gene consequence sift_prediction sift_score`\
|
|
|
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`
|
|
|
` #chrom coord AF Gene consequence sift_prediction sift_score`\
|
|
|
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`
|
|
|
|
|
|
Obviously, the functionality of the Python script is limited. However,
|
|
|
it would be quite easy to provide additional statistics and output
|
... | ... | @@ -302,11 +313,6 @@ so explicit. This means that the Python script provides you with a means |
|
|
for expansion of functionality, while that will be far more tricky using
|
|
|
shell commands solely.
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
#alternative way, shell hack
|
|
|
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
|
|
|
~~~~
|
|
|
|
|
|
Now we make a file that has the gene names, and counts, for all genes
|
|
|
that have more than one deleterious allele:
|
|
|
|
... | ... | @@ -316,7 +322,10 @@ gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep |
|
|
~~~~
|
|
|
|
|
|
Take two or three genes that have the highest number of deleterious
|
|
|
variants. Manually look them up at the Ensembl website (ensembl.org).
|
|
|
variants. Manually look them up at the Ensembl website.
|
|
|
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Info/Index`](http://www.ensembl.org/Bos_taurus/Info/Index)
|
|
|
|
|
|
Check the 'orthologous' status. What do you observe? How would you
|
|
|
interpret the validity of the gene annotation and/or the likelihood of
|
|
|
off-site mapping of reads?
|
... | ... | @@ -338,6 +347,18 @@ gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep |
|
|
|
|
|
How many genes are there that meet this condition?
|
|
|
|
|
|
You can now easily make tables for number of different consequences as
|
|
|
annotated by both these annotation tools:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
## for VEP results:
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | cut -f5 | sort | uniq -c
|
|
|
~~~~
|
|
|
|
|
|
VEP also made a report in HTML format. Transfer it to your local
|
|
|
computer and view it, if you haven't done already. How do your 'hacky'
|
|
|
results compare to the report VEP made?
|
|
|
|
|
|
### snpEff
|
|
|
|
|
|
Another popular variant annotation tool is 'snpEff'. Among the
|
... | ... | @@ -363,10 +384,10 @@ tabix -p vcf BT18_snpEff.vcf.gz |
|
|
|
|
|
The options used here are:
|
|
|
|
|
|
` -dataDir ~/snpEff/data # annotation data goes here`\
|
|
|
` -v UMD3.1.78 # Genome / version`\
|
|
|
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
|
|
|
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`
|
|
|
` -dataDir ~/snpEff/data # annotation data goes here`\
|
|
|
` -v UMD3.1.78 # Genome / version`\
|
|
|
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
|
|
|
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`
|
|
|
|
|
|
Study the outcome. Try to filter out 'intergenic' SNPs and other things
|
|
|
that are perhaps a bit less relevant (use your skills in 'grep-ping',
|
... | ... | @@ -377,17 +398,18 @@ You will notice that the variants that have an effect are labelled |
|
|
'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
|
|
|
many are there?
|
|
|
|
|
|
You can now easily make tables for number of different consequences as
|
|
|
annotated by both these annotation tools:
|
|
|
You can now again easily make tables for number of different
|
|
|
consequences as annotated by both these annotation tools:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
## for snpEff results:
|
|
|
gunzip -c BT18_snpEff.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c
|
|
|
|
|
|
## for VEP results:
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | cut -f5 | sort | uniq -c
|
|
|
~~~~
|
|
|
|
|
|
Like VEP, snpEff makes a report in HTML format. Transfer it to your
|
|
|
local computer and study it. How do your 'hacky' results compare to the
|
|
|
report snpEff made?
|
|
|
|
|
|
Bonus question: Compare the results by selecting the 'MODERATE' and
|
|
|
"HIGH" variants, and compare that with the 'deleterious' variants as
|
|
|
annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
|
... | ... | @@ -402,74 +424,75 @@ to all known protein sequences (so called 'non-redundant database'). |
|
|
Subsequent clustering defines a group of 'similar', and implicitly
|
|
|
homologous sequences. An example of in- and output of a pig sequence:
|
|
|
|
|
|
What goes in: Protein sequence (Fasta):
|
|
|
What goes in:
|
|
|
|
|
|
` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
|
|
|
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
|
|
|
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
|
|
|
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
|
|
|
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
|
|
|
`
|
|
|
Protein sequence (Fasta):
|
|
|
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
|
|
|
MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
|
|
|
CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
|
|
|
PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
|
|
|
SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
|
|
|
`
|
|
|
|
|
|
and variation, with coordinates pertaining the above protein sequences,
|
|
|
the second column the reference amino acid, and the third/last column
|
|
|
the alternative amino acid (alternative allele):
|
|
|
|
|
|
` 235,G,E`\
|
|
|
` 5,V,A`\
|
|
|
` 22,C,Y`\
|
|
|
` 34,T,I`\
|
|
|
` 51,D,N`\
|
|
|
` 53,K,N`\
|
|
|
` 59,S,Y`\
|
|
|
` 61,C,R`\
|
|
|
` 64,S,L`\
|
|
|
` 67,H,P`\
|
|
|
` 68,I,T`\
|
|
|
` 75,A,V`\
|
|
|
` 108,T,K`\
|
|
|
` 115,A,T`\
|
|
|
` 124,E,D`\
|
|
|
` 130,F,Y`\
|
|
|
` 133,L,P`\
|
|
|
` 142,P,A `\
|
|
|
` 155,F,I`\
|
|
|
` 158,E,G`\
|
|
|
` 186,I,V `\
|
|
|
` 21,G,D`
|
|
|
|
|
|
Output looks like this:
|
|
|
|
|
|
` ## PROVEAN scores ##`\
|
|
|
` # VARIATION SCORE`\
|
|
|
` 235,G,E 0.076`\
|
|
|
` 5,V,A 0.287`\
|
|
|
` 22,C,Y -8.028`\
|
|
|
` 34,T,I -1.932`\
|
|
|
` 51,D,N -1.613`\
|
|
|
` 53,K,N 1.565`\
|
|
|
` 59,S,Y -2.140`\
|
|
|
` 61,C,R -5.826`\
|
|
|
` 64,S,L -0.437`\
|
|
|
` 67,H,P -0.511`\
|
|
|
` 68,I,T -2.664`\
|
|
|
` 75,A,V -1.061`\
|
|
|
` 108,T,K -4.051`\
|
|
|
` 115,A,T 1.557`\
|
|
|
` 124,E,D -1.587`\
|
|
|
` 130,F,Y -0.983`\
|
|
|
` 133,L,P 3.203`\
|
|
|
` 142,P,A -2.058`\
|
|
|
` 155,F,I 0.502`\
|
|
|
` 158,E,G -0.077`\
|
|
|
` 186,I,V 0.220`\
|
|
|
` 21,G,D -6.014`
|
|
|
the alternative amino acid (alternative allele): `
|
|
|
235,G,E
|
|
|
5,V,A
|
|
|
22,C,Y
|
|
|
34,T,I
|
|
|
51,D,N
|
|
|
53,K,N
|
|
|
59,S,Y
|
|
|
61,C,R
|
|
|
64,S,L
|
|
|
67,H,P
|
|
|
68,I,T
|
|
|
75,A,V
|
|
|
108,T,K
|
|
|
115,A,T
|
|
|
124,E,D
|
|
|
130,F,Y
|
|
|
133,L,P
|
|
|
142,P,A
|
|
|
155,F,I
|
|
|
158,E,G
|
|
|
186,I,V
|
|
|
21,G,D
|
|
|
` Output looks like this: `
|
|
|
## PROVEAN scores ##
|
|
|
# VARIATION SCORE
|
|
|
235,G,E 0.076
|
|
|
5,V,A 0.287
|
|
|
22,C,Y -8.028
|
|
|
34,T,I -1.932
|
|
|
51,D,N -1.613
|
|
|
53,K,N 1.565
|
|
|
59,S,Y -2.140
|
|
|
61,C,R -5.826
|
|
|
64,S,L -0.437
|
|
|
67,H,P -0.511
|
|
|
68,I,T -2.664
|
|
|
75,A,V -1.061
|
|
|
108,T,K -4.051
|
|
|
115,A,T 1.557
|
|
|
124,E,D -1.587
|
|
|
130,F,Y -0.983
|
|
|
133,L,P 3.203
|
|
|
142,P,A -2.058
|
|
|
155,F,I 0.502
|
|
|
158,E,G -0.077
|
|
|
186,I,V 0.220
|
|
|
21,G,D -6.014
|
|
|
`
|
|
|
|
|
|
We will do something similar, although with a somewhat less complex
|
|
|
example, for the cow data. At the end of the VEP exercise we created a
|
|
|
filtered list of genes that have only a single deleterious variant
|
|
|
annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.
|
|
|
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
|
|
|
|
|
|
What we need for the Provean analysis is 1) the protein sequence of the
|
|
|
gene (or rather, the transcript), and 2) a file with the variation in
|
... | ... | @@ -501,15 +524,15 @@ This should result in two files that are produced: |
|
|
|
|
|
ENSBTAP00000009293.fa
|
|
|
|
|
|
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
|
|
|
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
|
|
|
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
|
|
|
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
|
|
|
` LPDIRGVPGTEPEQDPGPRA`
|
|
|
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
|
|
|
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
|
|
|
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
|
|
|
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
|
|
|
` LPDIRGVPGTEPEQDPGPRA`
|
|
|
|
|
|
ENSBTAP00000009293.var
|
|
|
|
|
|
` 26,D,N`
|
|
|
` 26,D,N`
|
|
|
|
|
|
Finally, the actual Provean analysis is done with this command (this may
|
|
|
take a few minutes to run):
|
... | ... | @@ -547,7 +570,19 @@ cat $PROT.fa >>$PROT.sss.fasta |
|
|
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
|
|
|
~~~~
|
|
|
|
|
|
Transfer the alignment to your local environment and view (seaview?).
|
|
|
Transfer the alignment to your local environment and view. If installed
|
|
|
on your local computer you could use e.g. Seaview. If you are on a
|
|
|
Debian-based system (e.g. Ubuntu) and you have administrative rights,
|
|
|
you can simply install it like this:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
sudo apt-get install seaview
|
|
|
~~~~
|
|
|
|
|
|
Alternatively, you can view the alignment through this online tool:
|
|
|
|
|
|
` `[`http://toolkit.tuebingen.mpg.de/alnviz`](http://toolkit.tuebingen.mpg.de/alnviz)
|
|
|
|
|
|
What do you observe for the 26th amino-acid of the cow protein sequence?
|
|
|
In this 'implicit evolutionary context' is there another alternative
|
|
|
amino acid residue possible?
|
... | ... | @@ -565,33 +600,33 @@ lists coordinates and variants: |
|
|
|
|
|
What goes in: in.fa
|
|
|
|
|
|
` >ENSSSCG00000001099ENSSSCT00000001195`\
|
|
|
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
|
|
|
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
|
|
|
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
|
|
|
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
|
|
|
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
|
|
|
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
|
|
|
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
|
|
|
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
|
|
|
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
|
|
|
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
|
|
|
` >ENSSSCG00000001099ENSSSCT00000001195`\
|
|
|
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
|
|
|
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
|
|
|
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
|
|
|
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
|
|
|
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
|
|
|
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
|
|
|
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
|
|
|
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
|
|
|
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
|
|
|
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
|
|
|
|
|
|
in.coord
|
|
|
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
|
|
|
|
|
|
` run_pph.pl -s in.fa in.coord`
|
|
|
` run_pph.pl -s in.fa in.coord`
|
|
|
|
|
|
What comes out:
|
|
|
|
|
|
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
|
|
|
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
|
|
|
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
|
|
|
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
|
|
|
|
|
|
The fasta and coordinate files can be made similar to what you've done
|
|
|
for Provean. Polyphen is much more human-oriented than Provean (which is
|
... | ... | |