... | ... | @@ -4,29 +4,34 @@ prediction sessions |
|
|
Variant annotation
|
|
|
------------------
|
|
|
|
|
|
make a new folder to work in, and go there:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
mkdir annotation_session
|
|
|
cd annotation_session
|
|
|
~~~~
|
|
|
|
|
|
Create an Env.Var. to the communal session directory that holds some of
|
|
|
the files you need:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation
|
|
|
~~~~
|
|
|
|
|
|
### Slicing and dicing of VCF files
|
|
|
|
|
|
VCF files can get very big. Being able to manipulate them and extract
|
|
|
relevant information is important. In practice, flat text files such as
|
|
|
VCF are currently the basis for further analysis - see the Imputation
|
|
|
and Population practical sessions.
|
|
|
|
|
|
The first task involves selecting all called variants of chromosome 18,
|
|
|
bgzip-ping them on-the-fly, and then indexing that file with tabix:
|
|
|
and Population practical sessions. The first task involves selecting all
|
|
|
called variants of chromosome 18, bgzip-ping them on-the-fly, and then
|
|
|
indexing that file with tabix:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
|
|
|
tabix -h /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
|
|
|
tabix -p vcf BT18.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
### Annotating VCF with rs-numbers
|
|
|
|
|
|
The VCF file you are working does not yet have the dbSNP identifiers
|
|
|
(rs-numbers) annotated. The rs-numbers of all currently known variants
|
|
|
as databased in dbSNP can be found in this file:
|
|
|
|
|
|
` /path/to/ BT18.vcf.gz`
|
|
|
|
|
|
Have a look at what is inside this file:
|
|
|
|
|
|
~~~~ {.bash}
|
... | ... | @@ -39,6 +44,14 @@ or |
|
|
tabix BT18.vcf.gz 18:1-10000 | more
|
|
|
~~~~
|
|
|
|
|
|
### Annotating VCF with rs-numbers
|
|
|
|
|
|
The VCF file you are working does not yet have the dbSNP identifiers
|
|
|
(rs-numbers) annotated. The rs-numbers of all currently known variants
|
|
|
as databased in dbSNP can be found in this file:
|
|
|
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz`\
|
|
|
|
|
|
You can use vcftools to do many things related to VCF files. This
|
|
|
includes annotating the VCF based on other interval-file formatted
|
|
|
input, including other VCF files. For instance, you can annotate an
|
... | ... | @@ -46,14 +59,15 @@ interval of the VCF on the fly, by selecting that interval using tabix, |
|
|
and then piping it to vcf-annotate, like so:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
|
|
|
## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
|
|
|
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail
|
|
|
~~~~
|
|
|
|
|
|
Now make a new VCF file for all the variants called for chromosome 18
|
|
|
using this tool:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
|
|
|
tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
|
|
|
tabix -p vcf BT18_rsnumbers.vcf.gz
|
|
|
~~~~
|
|
|
|
... | ... | @@ -80,23 +94,23 @@ connectivity. This line of code is just for completeness): |
|
|
|
|
|
The file is already downloaded and can be found here:
|
|
|
|
|
|
` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
|
|
|
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
|
|
|
|
|
|
Have a quick look at what is in this file:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more
|
|
|
gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more
|
|
|
~~~~
|
|
|
|
|
|
Familiarise yourself with the GTF format if you have not already. For
|
|
|
more information:
|
|
|
|
|
|
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
|
|
|
` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
|
|
|
|
|
|
You can select the appropriate region from the GTF file:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
|
|
|
gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
|
|
|
~~~~
|
|
|
|
|
|
Subsequently you can intersect the VCF with the GTF file that contains
|
... | ... | @@ -129,48 +143,48 @@ perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_ |
|
|
|
|
|
If you invoke it without parameters you will see the following:
|
|
|
|
|
|
` Usage:`\
|
|
|
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
|
|
|
` `\
|
|
|
` Basic options`\
|
|
|
` =============`\
|
|
|
` `\
|
|
|
` --help Display this message and quit`\
|
|
|
` `\
|
|
|
` -i | --input_file Input file`\
|
|
|
` -o | --output_file Output file`\
|
|
|
` --force_overwrite Force overwriting of output file`\
|
|
|
` --species [species] Species to use [default: "human"]`\
|
|
|
` `\
|
|
|
` --everything Shortcut switch to turn on commonly used options. See web`\
|
|
|
` documentation for details [default: off] `\
|
|
|
` --fork [num_forks] Use forking to improve script runtime`\
|
|
|
` `\
|
|
|
` For full option documentation see:`\
|
|
|
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
|
|
|
` `
|
|
|
` Usage:`\
|
|
|
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
|
|
|
` `\
|
|
|
` Basic options`\
|
|
|
` =============`\
|
|
|
` `\
|
|
|
` --help Display this message and quit`\
|
|
|
` `\
|
|
|
` -i | --input_file Input file`\
|
|
|
` -o | --output_file Output file`\
|
|
|
` --force_overwrite Force overwriting of output file`\
|
|
|
` --species [species] Species to use [default: "human"]`\
|
|
|
` `\
|
|
|
` --everything Shortcut switch to turn on commonly used options. See web`\
|
|
|
` documentation for details [default: off] `\
|
|
|
` --fork [num_forks] Use forking to improve script runtime`\
|
|
|
` `\
|
|
|
` For full option documentation see:`\
|
|
|
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)\
|
|
|
` `
|
|
|
|
|
|
For the current session we will use the following options:
|
|
|
|
|
|
` --species bos_taurus`\
|
|
|
` -o BT18.vep # outputfile`\
|
|
|
` --fork 4 # use 4 threads`\
|
|
|
` --canonical`\
|
|
|
` --sift b`\
|
|
|
` --coding_only`\
|
|
|
` --no_intergenic`\
|
|
|
` --offline # using a cache directory`\
|
|
|
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
|
|
|
` --force_overwrite`
|
|
|
` --species bos_taurus`\
|
|
|
` -o BT18.vep # outputfile`\
|
|
|
` --fork 4 # use 4 threads`\
|
|
|
` --canonical`\
|
|
|
` --sift b`\
|
|
|
` --coding_only`\
|
|
|
` --no_intergenic`\
|
|
|
` --offline # using a cache directory`\
|
|
|
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
|
|
|
` --force_overwrite`
|
|
|
|
|
|
You can create a file that will provide annotation information for each
|
|
|
SNP:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18_rsnumbers.vcf.gz | \
|
|
|
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
|
|
|
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
|
|
|
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
|
|
|
gunzip -c BT18_rsnumbers.vcf.gz | \
|
|
|
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
|
|
|
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
|
|
|
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
|
|
|
~~~~
|
|
|
|
|
|
Have a look inside the output file and study its content. Why are there
|
... | ... | @@ -186,24 +200,24 @@ following, omitting the '--no\_intergenic and --coding\_only flags, and |
|
|
in stead adding the --vcf flag:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c BT18_rsnumbers.vcf.gz | \
|
|
|
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
|
|
|
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
|
|
|
--fork 4 --canonical --sift b --offline --force_overwrite --vcf
|
|
|
bgzip BT18.vep.vcf
|
|
|
tabix -p vcf BT18.vep.vcf.gz
|
|
|
gunzip -c BT18_rsnumbers.vcf.gz | \
|
|
|
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
|
|
|
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
|
|
|
--fork 4 --canonical --sift b --offline --force_overwrite --vcf
|
|
|
bgzip BT18.vep.vcf
|
|
|
tabix -p vcf BT18.vep.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
Again, study and interpret your outcome by looking inside the file.
|
|
|
Specifically, note WHERE the annotation ended up.
|
|
|
|
|
|
` 18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
|
|
|
` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
|
|
|
` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
|
|
|
` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
|
|
|
` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
|
|
|
` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10`\
|
|
|
` 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10`
|
|
|
` 18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
|
|
|
` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
|
|
|
` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
|
|
|
` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
|
|
|
` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
|
|
|
` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10`\
|
|
|
` 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10`
|
|
|
|
|
|
### A further look at the VEP output
|
|
|
|
... | ... | @@ -269,7 +283,7 @@ way, prints a selection of fields. |
|
|
What this script will VCF input from a stream:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more
|
|
|
~~~~
|
|
|
|
|
|
Notice that the script itself does in fact take an option: 'vep'. This
|
... | ... | @@ -278,8 +292,8 @@ highly inconveniently, is slightly different. |
|
|
|
|
|
and for each line provide output similar to this:
|
|
|
|
|
|
` # chrom coord AF Gene consequence sift_prediction sift_score`\
|
|
|
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`
|
|
|
` # chrom coord AF Gene consequence sift_prediction sift_score`\
|
|
|
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`
|
|
|
|
|
|
Obviously, the functionality of the Python script is limited. However,
|
|
|
it would be quite easy to provide additional statistics and output
|
... | ... | @@ -297,7 +311,8 @@ Now we make a file that has the gene names, and counts, for all genes |
|
|
that have more than one deleterious allele:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
|
|
|
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
|
|
|
~~~~
|
|
|
|
|
|
Take two or three genes that have the highest number of deleterious
|
... | ... | @@ -317,7 +332,8 @@ have multiple deleterious variants, as the above example indicates. To |
|
|
filter you can do the following:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
|
|
|
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
|
|
|
~~~~
|
|
|
|
|
|
How many genes are there that meet this condition?
|
... | ... | @@ -340,15 +356,17 @@ available as an environment variable. Do: |
|
|
The snpEff program can be used like this:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt
|
|
|
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
|
|
|
bgzip BT18_snpEff.vcf
|
|
|
tabix -p vcf BT18_snpEff.vcf.gz
|
|
|
~~~~
|
|
|
|
|
|
The options used here are:
|
|
|
|
|
|
` -dataDir ~/snpEff/data # annotation data goes here`\
|
|
|
` -v UMD3.1.78 # Genome / version`\
|
|
|
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
|
|
|
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`
|
|
|
` -dataDir ~/snpEff/data # annotation data goes here`\
|
|
|
` -v UMD3.1.78 # Genome / version`\
|
|
|
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
|
|
|
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`
|
|
|
|
|
|
Study the outcome. Try to filter out 'intergenic' SNPs and other things
|
|
|
that are perhaps a bit less relevant (use your skills in 'grep-ping',
|
... | ... | @@ -363,9 +381,11 @@ You can now easily make tables for number of different consequences as |
|
|
annotated by both these annotation tools:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
cat LeonEnv/testresult_snpEff_Bt.txt | python3 cut_line.py -m snpeff | cut -f6 | sort | uniq -c
|
|
|
## for snpEff results:
|
|
|
gunzip -c BT18_snpEff.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c
|
|
|
|
|
|
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 cut_line.py -m vep | cut -f5 | sort | uniq -c
|
|
|
## for VEP results:
|
|
|
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | cut -f5 | sort | uniq -c
|
|
|
~~~~
|
|
|
|
|
|
Bonus question: Compare the results by selecting the 'MODERATE' and
|
... | ... | @@ -384,72 +404,72 @@ homologous sequences. An example of in- and output of a pig sequence: |
|
|
|
|
|
What goes in: Protein sequence (Fasta):
|
|
|
|
|
|
` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
|
|
|
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
|
|
|
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
|
|
|
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
|
|
|
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
|
|
|
` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
|
|
|
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
|
|
|
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
|
|
|
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
|
|
|
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
|
|
|
|
|
|
and variation, with coordinates pertaining the above protein sequences,
|
|
|
the second column the reference amino acid, and the third/last column
|
|
|
the alternative amino acid (alternative allele):
|
|
|
|
|
|
` 235,G,E`\
|
|
|
` 5,V,A`\
|
|
|
` 22,C,Y`\
|
|
|
` 34,T,I`\
|
|
|
` 51,D,N`\
|
|
|
` 53,K,N`\
|
|
|
` 59,S,Y`\
|
|
|
` 61,C,R`\
|
|
|
` 64,S,L`\
|
|
|
` 67,H,P`\
|
|
|
` 68,I,T`\
|
|
|
` 75,A,V`\
|
|
|
` 108,T,K`\
|
|
|
` 115,A,T`\
|
|
|
` 124,E,D`\
|
|
|
` 130,F,Y`\
|
|
|
` 133,L,P`\
|
|
|
` 142,P,A `\
|
|
|
` 155,F,I`\
|
|
|
` 158,E,G`\
|
|
|
` 186,I,V `\
|
|
|
` 21,G,D`
|
|
|
` 235,G,E`\
|
|
|
` 5,V,A`\
|
|
|
` 22,C,Y`\
|
|
|
` 34,T,I`\
|
|
|
` 51,D,N`\
|
|
|
` 53,K,N`\
|
|
|
` 59,S,Y`\
|
|
|
` 61,C,R`\
|
|
|
` 64,S,L`\
|
|
|
` 67,H,P`\
|
|
|
` 68,I,T`\
|
|
|
` 75,A,V`\
|
|
|
` 108,T,K`\
|
|
|
` 115,A,T`\
|
|
|
` 124,E,D`\
|
|
|
` 130,F,Y`\
|
|
|
` 133,L,P`\
|
|
|
` 142,P,A `\
|
|
|
` 155,F,I`\
|
|
|
` 158,E,G`\
|
|
|
` 186,I,V `\
|
|
|
` 21,G,D`
|
|
|
|
|
|
Output looks like this:
|
|
|
|
|
|
` ## PROVEAN scores ##`\
|
|
|
` # VARIATION SCORE`\
|
|
|
` 235,G,E 0.076`\
|
|
|
` 5,V,A 0.287`\
|
|
|
` 22,C,Y -8.028`\
|
|
|
` 34,T,I -1.932`\
|
|
|
` 51,D,N -1.613`\
|
|
|
` 53,K,N 1.565`\
|
|
|
` 59,S,Y -2.140`\
|
|
|
` 61,C,R -5.826`\
|
|
|
` 64,S,L -0.437`\
|
|
|
` 67,H,P -0.511`\
|
|
|
` 68,I,T -2.664`\
|
|
|
` 75,A,V -1.061`\
|
|
|
` 108,T,K -4.051`\
|
|
|
` 115,A,T 1.557`\
|
|
|
` 124,E,D -1.587`\
|
|
|
` 130,F,Y -0.983`\
|
|
|
` 133,L,P 3.203`\
|
|
|
` 142,P,A -2.058`\
|
|
|
` 155,F,I 0.502`\
|
|
|
` 158,E,G -0.077`\
|
|
|
` 186,I,V 0.220`\
|
|
|
` 21,G,D -6.014`
|
|
|
` ## PROVEAN scores ##`\
|
|
|
` # VARIATION SCORE`\
|
|
|
` 235,G,E 0.076`\
|
|
|
` 5,V,A 0.287`\
|
|
|
` 22,C,Y -8.028`\
|
|
|
` 34,T,I -1.932`\
|
|
|
` 51,D,N -1.613`\
|
|
|
` 53,K,N 1.565`\
|
|
|
` 59,S,Y -2.140`\
|
|
|
` 61,C,R -5.826`\
|
|
|
` 64,S,L -0.437`\
|
|
|
` 67,H,P -0.511`\
|
|
|
` 68,I,T -2.664`\
|
|
|
` 75,A,V -1.061`\
|
|
|
` 108,T,K -4.051`\
|
|
|
` 115,A,T 1.557`\
|
|
|
` 124,E,D -1.587`\
|
|
|
` 130,F,Y -0.983`\
|
|
|
` 133,L,P 3.203`\
|
|
|
` 142,P,A -2.058`\
|
|
|
` 155,F,I 0.502`\
|
|
|
` 158,E,G -0.077`\
|
|
|
` 186,I,V 0.220`\
|
|
|
` 21,G,D -6.014`
|
|
|
|
|
|
We will do something similar, although with a somewhat less complex
|
|
|
example, for the cow data. At the end of the VEP exercise we created a
|
|
|
filtered list of genes that have only a single deleterious variant
|
|
|
annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.
|
|
|
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
|
|
|
` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
|
|
|
|
|
|
What we need for the Provean analysis is 1) the protein sequence of the
|
|
|
gene (or rather, the transcript), and 2) a file with the variation in
|
... | ... | @@ -460,41 +480,42 @@ protein sequences for cow can be found at Ensembl. It is located here on |
|
|
the machine you are working on:
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
## define the Env.Var. that holds the gene id:
|
|
|
GENE=ENSBTAG00000007070
|
|
|
## define the Env.Var. that holds the gene id:
|
|
|
GENE=ENSBTAG00000007070
|
|
|
|
|
|
## search the protein file for the gene, and parse out the corresponding protein name
|
|
|
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
|
|
|
PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
|
|
|
## search the protein file for the gene, and parse out the corresponding protein name
|
|
|
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
|
|
|
PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
|
|
|
|
|
|
## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
|
|
|
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
|
|
|
## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
|
|
|
cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
|
|
|
## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
|
|
|
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
|
|
|
## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
|
|
|
cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
|
|
|
|
|
|
## retrieve the protein sequence from the multifasta protein sequence file.
|
|
|
## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
|
|
|
faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
|
|
|
## retrieve the protein sequence from the multifasta protein sequence file.
|
|
|
## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
|
|
|
faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
|
|
|
~~~~
|
|
|
|
|
|
This should result in two files that are produced:
|
|
|
|
|
|
ENSBTAP00000009293.fa
|
|
|
|
|
|
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
|
|
|
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
|
|
|
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
|
|
|
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
|
|
|
` LPDIRGVPGTEPEQDPGPRA`
|
|
|
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
|
|
|
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
|
|
|
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
|
|
|
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
|
|
|
` LPDIRGVPGTEPEQDPGPRA`
|
|
|
|
|
|
ENSBTAP00000009293.var
|
|
|
|
|
|
` 26,D,N`
|
|
|
` 26,D,N`
|
|
|
|
|
|
Finally, the actual Provean analysis is done with this command:
|
|
|
Finally, the actual Provean analysis is done with this command (this may
|
|
|
take a few minutes to run):
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
|
|
|
provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
|
|
|
~~~~
|
|
|
|
|
|
Study the result. What is the verdict on the alternative allele?
|
... | ... | @@ -519,18 +540,17 @@ for the Provean analysis and add it to the supporting set fasta file |
|
|
otherwise the file will get overwritten if you use only one!).
|
|
|
|
|
|
~~~~ {.bash}
|
|
|
## add protein sequence to supporting set fasta:
|
|
|
cat $PROT.fa >>$PROT.sss.fasta
|
|
|
## add protein sequence to supporting set fasta:
|
|
|
cat $PROT.fa >>$PROT.sss.fasta
|
|
|
|
|
|
## align all protein sequences using the multiple sequence aligner mafft:
|
|
|
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
|
|
|
## align all protein sequences using the multiple sequence aligner mafft:
|
|
|
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
|
|
|
~~~~
|
|
|
|
|
|
Transfer the alignment to your local environment and view (seaview?).
|
|
|
What do you observe for the 26th amino-acid of the cow protein sequence?
|
|
|
In this 'implicit evolutionary context' there is one alternative amino
|
|
|
acid that occurs at this site. What do the D amino acid and this
|
|
|
evolutionary viable alternative share in terms of properties?
|
|
|
In this 'implicit evolutionary context' is there another alternative
|
|
|
amino acid residue possible?
|
|
|
|
|
|
Through the orthology with human, try to find some possible phenotypic
|
|
|
consequences of the alternative allele (e.g. through OMIM). Could there
|
... | ... | @@ -545,33 +565,33 @@ lists coordinates and variants: |
|
|
|
|
|
What goes in: in.fa
|
|
|
|
|
|
` >ENSSSCG00000001099ENSSSCT00000001195`\
|
|
|
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
|
|
|
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
|
|
|
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
|
|
|
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
|
|
|
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
|
|
|
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
|
|
|
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
|
|
|
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
|
|
|
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
|
|
|
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
|
|
|
` >ENSSSCG00000001099ENSSSCT00000001195`\
|
|
|
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
|
|
|
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
|
|
|
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
|
|
|
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
|
|
|
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
|
|
|
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
|
|
|
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
|
|
|
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
|
|
|
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
|
|
|
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
|
|
|
|
|
|
in.coord
|
|
|
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
|
|
|
|
|
|
` run_pph.pl -s in.fa in.coord`
|
|
|
` run_pph.pl -s in.fa in.coord`
|
|
|
|
|
|
What comes out:
|
|
|
|
|
|
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
|
|
|
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
|
|
|
` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
|
|
|
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
|
|
|
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`
|
|
|
|
|
|
The fasta and coordinate files can be made similar to what you've done
|
|
|
for Provean. Polyphen is much more human-oriented than Provean (which is
|
... | ... | |