Variant annotation
------------------

### Slicing and dicing of VCF files

VCF files can get very big. Being able to manipulate them and extract
subsets of data from them will come in handy during the variant
prediction and Population practical sessions.

The first task involves selecting all called variants of chromosome 18,
bgzip-ping them on-the-fly, and then indexing that file with tabix:

~~~~ {.bash}
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
tabix -p vcf BT18.vcf.gz
~~~~
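
As an aside, the effect of such a region query can be sketched in plain
Python. This is purely illustrative and not part of the practical: real
tabix uses the .tbi index to jump straight to the requested region
instead of scanning the whole file, which is why the bgzip/tabix
combination scales to huge VCFs.

```python
# Illustrative sketch of a tabix-style region query on a VCF text stream.
# Real tabix uses the .tbi index; this linear scan only shows the logic.
def select_region(vcf_lines, chrom, start=None, end=None, keep_header=True):
    """Yield header lines (like 'tabix -h') and records on chrom in [start, end]."""
    for line in vcf_lines:
        if line.startswith('#'):          # meta/header lines
            if keep_header:
                yield line
            continue
        fields = line.split('\t')
        if fields[0] != chrom:            # CHROM column
            continue
        pos = int(fields[1])              # POS column
        if (start is not None and pos < start) or (end is not None and pos > end):
            continue
        yield line

vcf = [
    '##fileformat=VCFv4.1',
    '#CHROM\tPOS\tID\tREF\tALT',
    '18\t500\t.\tG\tA',
    '17\t600\t.\tC\tT',
    '18\t9000\t.\tT\tC',
]
print(len(list(select_region(vcf, '18'))))          # headers + 2 records
print(len(list(select_region(vcf, '18', 1, 1000)))) # headers + 1 record
```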

### Annotating VCF with rs-numbers

Known variants, as databased in dbSNP, can be found in this file:

Have a look at what is inside this file:

~~~~ {.bash}
gunzip -c BT18.vcf.gz | grep -v '^##' | more
~~~~

or

~~~~ {.bash}
tabix BT18.vcf.gz 18:1-10000 | more
~~~~

You can use vcftools to do many things related to VCF files. This
includes annotating the VCF based on other interval-file formatted
input, including other VCF files. For instance, you can annotate an
interval of the VCF on the fly, by selecting that interval using tabix,
and then piping it to vcf-annotate, like so:

~~~~ {.bash}
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
~~~~

Now make a new VCF file for all the variants called for chromosome 18
using this tool:

~~~~ {.bash}
tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
tabix -p vcf BT18_rsnumbers.vcf.gz
~~~~

### Extracting variants using BEDtools

The GTF file of the gene annotation for the current cow genome (UMD3.1)
build can be found at Ensembl and downloaded using the command line
(NOTE: the compute nodes you are working on do not have internet
connectivity. This line of code is just for completeness):

~~~~ {.bash}
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
~~~~

The file is already downloaded and can be found here:

Have a quick look at what is in this file:

~~~~ {.bash}
gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more
~~~~

Familiarise yourself with the GTF format if you have not already. For
more information:

You can select the appropriate region from the GTF file:

~~~~ {.bash}
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
~~~~
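
For comparison — and purely as a sketch — the same selection can be
written in Python; the indices mirror awk's $1, $3 and $4 (GTF columns
seqname, feature and start):

```python
# Keep 'gene' features on a chromosome whose start coordinate falls in a
# window; mirrors: awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000'
def genes_in_window(gtf_lines, chrom='18', start=1000000, end=2000000):
    selected = []
    for line in gtf_lines:
        if line.startswith('#'):                  # skip GTF comment lines
            continue
        f = line.rstrip('\n').split('\t')
        if f[2] == 'gene' and f[0] == chrom and start < int(f[3]) < end:
            selected.append(line)
    return selected

gtf = [
    '18\tensembl\tgene\t1500000\t1510000\t.\t+\t.\tgene_id "g1";',
    '18\tensembl\texon\t1500000\t1500200\t.\t+\t.\tgene_id "g1";',
    '18\tensembl\tgene\t2500000\t2510000\t.\t-\t.\tgene_id "g2";',
]
print(len(genes_in_window(gtf)))  # 1
```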

Subsequently you can intersect the VCF with the GTF file that contains
just the genes between coordinates 1 and 2 million:

~~~~ {.bash}
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
~~~~

Have a quick look at the output, and count the number of variants in
this VCF (e.g. by using 'wc -l').

There are dedicated annotation tools that go well beyond what you can
achieve using Bedtools and a GTF. Among them are Variant Effect
Predictor (VEP), snpEff, and Annovar. For the current practical we will
use the former two.

VEP uses a cache directory that has the annotation information. On the
infrastructure you are working on that directory can be found here:

` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`

VEP can be found here:

~~~~ {.bash}
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
~~~~

If you invoke it without parameters you will see the following:

` Usage:`\
` perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]`\
` `\
` Basic options`\
` =============`\
` `\
` --help Display this message and quit`\
` `\
` -i | --input_file Input file`\
` -o | --output_file Output file`\
` --force_overwrite Force overwriting of output file`\
` --species [species] Species to use [default: "human"]`\
` `\
` --everything Shortcut switch to turn on commonly used options. See web`\
` documentation for details [default: off]`\
` --fork [num_forks] Use forking to improve script runtime`\
` `\
` For full option documentation see:`\
` `[`http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html`](http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html)

For the current session we will use the following options:

` --species bos_taurus`\
` -o BT18.vep # outputfile`\
` --fork 4 # use 4 threads`\
` --canonical`\
` --sift b`\
` --coding_only`\
` --no_intergenic`\
` --offline # using a cache directory`\
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
` --force_overwrite`

You can create a file that will provide annotation information for each
SNP:

~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
~~~~

Have a look inside the output file and study its content. Why are there
no SNPs in this list that are intergenic?

The VEP output you generated before has many advantages, one of which is
that it is nicely tabular and easy to parse. In fact a demonstration of
that will be covered in the 'Provean' practical. For some purposes,
however, it might be more convenient to have the annotations directly in
the VCF. Furthermore, you may want annotations for each and every SNP,
even if e.g. they are intergenic. You could for instance do the
following, omitting the '--no\_intergenic' and '--coding\_only' flags,
and instead adding the '--vcf' flag:

~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
--fork 4 --canonical --sift b --offline --force_overwrite --vcf
bgzip BT18.vep.vcf
tabix -p vcf BT18.vep.vcf.gz
~~~~

Again, study and interpret your outcome by looking inside the file.
Specifically, note WHERE the annotation ended up.

` 18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB`\
` =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;`\
` ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;`\
` RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;`\
` CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|`\
` YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10`\
` 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10`

### A further look at the VEP output

Although the VCF now looks even more complex than before, there is a
good reason to add the VEP annotation data the way it was done. The 8th
column of the VCF provides a lot of information, each field separated by
a ';'. In addition, each field contains a field name and a value,
delimited by a '='. The VEP annotation has its own field name, or tag:
'CSQ'. Within this field, the values are again structured, but this time
delimited by a '|'. What we in fact have now is a nested data structure:

~~~~ {.python}
['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0',
'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279',
'CSQ': ['A', 'ENSBTAG00000018834', 'ENSBTAT00000025071', 'Transcript', 'missense_variant',
'2132', '1984', '662', 'E/K', 'Gag/Aag', '', '', '1', 'YES', ['deleterious', '0.02']], 'EPP': '7.35324',
'SRP': '31.7504', 'QR': '2421', 'SAF': '1', 'ODDS': '2.54006', 'SRR': '19', 'SRF': '49',
'technology.ILLUMINA': '1', 'QA': '70', 'EPPR': '3.13803', 'SAP': '3.0103', 'RPPR': '11.1853', 'PQR': '0',
'NS': '8', 'CIGAR': '1X', 'TYPE': 'snp', 'AF': '0.125', 'ABP': '0', 'LEN': '1', 'AC': '0', 'MEANALT': '1',
'PAIRED': '1', 'AO': '2', 'AN': '16', 'MQM': '60', 'RUN': '1', 'PRO': '0', 'RPP': '3.0103'}, 'GT:DP:RO:QR:AO:QA:GL',
'0/0:7:7:263:0:0:0,-1.92588,-10', '0/0:7:7:238:0:0:0,-1.92376,-10', '0/0:9:7:266:2:70:-3.90732,0,-10',
'0/0:10:10:362:0:0:0,-2.73805,-10', '0/0:10:10:345:0:0:0,-2.73789,-10', '0/0:8:8:287:0:0:0,-2.19662,-10',
'0/0:5:5:178:0:0:0,-1.38409,-10', '0/0:14:14:482:0:0:0,-3.82032,-10']
~~~~

When dealing with deleterious alleles, we might be interested to learn
how frequent that allele is in the population we are studying. Our
expectation is that genuinely deleterious alleles should be relatively
rare, unless they are deleterious in a natural context
but advantageous in the context of domestication/breeding. The VCF file
holds the (reference) allele frequency information. The 8th column of
the VCF file in fact is a bunch of fields concatenated with the ';'
delimiter. One of the fields has the tag 'AF=', where AF stands for
(Alternative) Allele Frequency. We can expand the number of
tab-delimited fields easily by replacing ';' with '\t', and then look
for the 11th column. We need to remove the 'AF=' tag though, but if we
do that we can further select for instance for allele frequencies
between 0 and 15%.

~~~~ {.bash}
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
~~~~
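
The same selection can be sketched in Python without any column
counting, by turning the INFO field into a dictionary first
(illustrative; `af_in_window` is not part of the practical's scripts):

```python
# Test whether the INFO column of a VCF record has 0 < AF < 0.15,
# replacing the sed/awk column gymnastics with a dictionary lookup.
def af_in_window(info_column, lo=0.0, hi=0.15):
    info = dict(item.split('=', 1)
                for item in info_column.split(';') if '=' in item)
    return lo < float(info.get('AF', 0)) < hi

print(af_in_window('AB=0;AF=0.125;AN=16'))  # True
print(af_in_window('AB=0;AF=0.5;AN=16'))    # False
```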

Analysing the annotations a bit further, you could discover that there
are quite a few variants that have high deleterious allele frequencies.
Not only do some putative deleterious alleles occur at high frequency,
but there is also a number of genes that have quite a few deleterious
alleles annotated in the current population sample. Say you would like
to make a table that contains all genes that have more than one putative
deleterious variant. We would need to isolate both the AF field and the
CSQ field, and from the latter isolate the 'gene' field. While this is
possible with a fairly simple shell scripting hack, it does require an
awful lot of counting to make sure you end up with the right fields.
Ideally you would like to have more control by putting the relevant data
in a complex data structure. Languages such as Perl and particularly
Python allow far more explicit syntax to achieve that, so we will swap
out a complex string of shell commands for a single Python script. That
script does only one thing: it takes lines from the VCF file, puts all
fields in a complex data structure (as seen above) and, in a structured
way, prints a selection of fields.

What this script will do is take VCF input from a stream:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep
~~~~

Notice that the script itself does in fact take an option: 'vep'. This
is done because it can also work for the snpEff annotation, which,
highly inconveniently, is slightly different.

For each line it provides output similar to this:

` # chrom coord AF Gene consequence sift_prediction sift_score`\
` 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02`

Obviously, the functionality of the Python script is limited. However,
it would be quite easy to provide additional statistics and output
options based on the current script, because it makes the data structure
so explicit. This means that the Python script provides you with a means
for expanding its functionality, which would be far more tricky using
shell commands alone.

~~~~ {.bash}
## alternative way, shell hack
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
~~~~

Now we make a file that has the gene names, and counts, for all genes
that have more than one deleterious allele:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
~~~~
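
The counting step of that pipeline can be sketched with
collections.Counter; the (gene, prediction) pairs below stand in for the
columns the script prints and are made-up example values:

```python
from collections import Counter

# Keep genes with at least `min_count` variants predicted 'deleterious';
# the Python equivalent of: sort | uniq -c | awk '$1>1'
def genes_with_multiple_deleterious(records, min_count=2):
    counts = Counter(gene for gene, prediction in records
                     if prediction == 'deleterious')
    return {gene: n for gene, n in counts.items() if n >= min_count}

records = [
    ('ENSBTAG00000018834', 'deleterious'),
    ('ENSBTAG00000018834', 'deleterious'),
    ('ENSBTAG00000007070', 'deleterious'),
    ('ENSBTAG00000001234', 'tolerated'),
]
print(genes_with_multiple_deleterious(records))  # {'ENSBTAG00000018834': 2}
```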

Take two or three genes that have the highest number of deleterious
variants. Manually look them up at the Ensembl website (ensembl.org).

Genes that have only a single deleterious variant, segregating at
a low frequency, would certainly be better candidates than genes that
have multiple deleterious variants, as the above example indicates. To
filter you can do the following:

~~~~ {.bash}
gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
~~~~

How many genes are there that meet this condition?

snpEff is another dedicated annotation tool. One of its
advantages over VEP is that it is applicable to a wider range of
species. Furthermore, it is blazingly fast. For chromosome 18, we will
also use this tool.

snpEff is Java-based. The path to the snpEff.jar Java archive is
available as an environment variable. Do:

~~~~ {.bash}
echo $snpEff
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
~~~~

The snpEff program can be used like this:

~~~~ {.bash}
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt
~~~~

The options used here are:

` -dataDir ~/snpEff/data # annotation data goes here`\
` -v UMD3.1.78 # Genome / version`\
` BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped`\
` >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'`

Study the outcome. Try to filter out 'intergenic' SNPs and other things
that are perhaps a bit less relevant (use your skills in 'grep-ping',
etc.).
You will notice that the variants that have an effect are labelled
'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
many are there?

You can now easily make tables of the number of different consequences
as annotated by both these annotation tools:

~~~~ {.bash}
cat LeonEnv/testresult_snpEff_Bt.txt | python3 cut_line.py -m snpeff | cut -f6 | sort | uniq -c

gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 cut_line.py -m vep | cut -f5 | sort | uniq -c
~~~~

Bonus question: Compare the results by selecting the 'MODERATE' and
'HIGH' variants, and compare that with the 'deleterious' variants as
annotated (through SIFT) in VEP. You can use, e.g., vcf-compare.

Inferring function
------------------

### Provean

Provean uses an approach where a protein sequence is compared (Psiblast)
to all known protein sequences (the so-called 'non-redundant database').
Subsequent clustering defines a group of 'similar', and implicitly
homologous, sequences. An example of in- and output of a pig sequence:

What goes in: protein sequence (Fasta):

` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding transcript_biotype:protein_coding`\
` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`

and variation, with coordinates pertaining to the above protein
sequence, the second column the reference amino acid, and the third/last
column the alternative amino acid (alternative allele):

` 235,G,E`\
` 5,V,A`\
` 22,C,Y`\
` 34,T,I`\
` 51,D,N`\
` 53,K,N`\
` 59,S,Y`\
` 61,C,R`\
` 64,S,L`\
` 67,H,P`\
` 68,I,T`\
` 75,A,V`\
` 108,T,K`\
` 115,A,T`\
` 124,E,D`\
` 130,F,Y`\
` 133,L,P`\
` 142,P,A`\
` 155,F,I`\
` 158,E,G`\
` 186,I,V`\
` 21,G,D`

Output looks like this:

` ## PROVEAN scores ##`\
` # VARIATION SCORE`\
` 235,G,E 0.076`\
` 5,V,A 0.287`\
` 22,C,Y -8.028`\
` 34,T,I -1.932`\
` 51,D,N -1.613`\
` 53,K,N 1.565`\
` 59,S,Y -2.140`\
` 61,C,R -5.826`\
` 64,S,L -0.437`\
` 67,H,P -0.511`\
` 68,I,T -2.664`\
` 75,A,V -1.061`\
` 108,T,K -4.051`\
` 115,A,T 1.557`\
` 124,E,D -1.587`\
` 130,F,Y -0.983`\
` 133,L,P 3.203`\
` 142,P,A -2.058`\
` 155,F,I 0.502`\
` 158,E,G -0.077`\
` 186,I,V 0.220`\
` 21,G,D -6.014`

We will do something similar, although with a somewhat less complex
example, for the cow data. At the end of the VEP exercise we created a
filtered list of genes that have only a single deleterious variant
annotated. Among these genes was the BLOC1S3 gene, ENSBTAG00000007070.

` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)

What we need for the Provean analysis is 1) the protein sequence of the
gene (or rather, the transcript), and 2) a file with the variation in
the same format as listed above. Obviously, if we want to do this for
many genes, we will not parse the relevant information by hand. We will
make a little pipeline for that. A multifasta file that contains all
protein sequences for cow can be found at Ensembl. It is located here on
the machine you are working on:

~~~~ {.bash}
## define the Env.Var. that holds the gene id:
GENE=ENSBTAG00000007070

## search the protein file for the gene, and parse out the corresponding protein name
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`

## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var

## retrieve the protein sequence from the multifasta protein sequence file.
## Note: the 'faOneRecord' program is part of the so-called 'Blat' suite, of which Blat is the best known
faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
~~~~
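
The `PROT=` step above can be made more explicit in Python (a sketch,
assuming Ensembl-style fasta headers where the record name is the first
whitespace-separated word after '>'):

```python
# Find the fasta header that mentions a gene id and return the protein
# record name without the leading '>' -- the Python counterpart of:
# grep $GENE | awk '{print $1}' | sed 's/>//'
def protein_id_for_gene(fasta_lines, gene_id):
    for line in fasta_lines:
        if line.startswith('>') and gene_id in line:
            return line.split()[0][1:]
    return None

fa = ['>ENSBTAP00000009293 pep:known gene:ENSBTAG00000007070',
      'MESQSRRRR']
print(protein_id_for_gene(fa, 'ENSBTAG00000007070'))  # ENSBTAP00000009293
```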

This should result in two files:

ENSBTAP00000009293.fa

` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
` LPDIRGVPGTEPEQDPGPRA`

ENSBTAP00000009293.var

` 26,D,N`
Finally, the actual Provean analysis is done with this command:

~~~~ {.bash}
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
~~~~

Study the result. What is the verdict on the alternative allele
(D--\>N)?

Provean does many things 'under the hood'. When you follow the process
(e.g. using the 'top' command in a second terminal screen), you will
notice that the majority of time is spent on Psiblast, and that it takes
quite a bit of the total memory of the node you are working on
(typically \>10GB of RAM). The similar sequences that Provean fished out
of the nr-database and then clustered are written to a 'supporting set'
(because you asked Provean to save the supporting set by using the
--save\_supporting\_set flag). Briefly investigate what is in the
supporting set (extension .sss.fasta).

The supporting sequences are not aligned. Chances are that among the
supporting sequences there is a cow sequence. It is a bit cumbersome to
figure out which is which (although doable using blastdb tools - we
won't go into that here). Easiest is to take the protein sequence you
used for the Provean analysis and append it to the supporting set fasta
file (using '\>\>'; note that you need TWO of these to append to a file,
otherwise the file will get overwritten if you use only one!).

~~~~ {.bash}
## add protein sequence to supporting set fasta:
cat $PROT.fa >>$PROT.sss.fasta

## align all protein sequences using the multiple sequence aligner mafft:
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
~~~~

Transfer the alignment to your local environment and view it (seaview?).
What do you observe for the 26th amino-acid of the cow protein sequence,
and what could this mean for the variant in the
cattle population?
### Polyphen

The Polyphen analysis is quite similar in what you need compared to
Provean: a fasta file containing the protein sequence, and a file that
lists coordinates and variants:

What goes in: in.fa

` >ENSSSCG00000001099ENSSSCT00000001195`\
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG`

in.coord

` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`

` run_pph.pl -s in.fa in.coord`

What comes out:

` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord`\
` #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin`\
` ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80`

The fasta and coordinate files can be made similarly to what you've done
for Provean. Polyphen is much more human-oriented than Provean (which is
more species agnostic) and has many more features, not all of which work
very well for non-human species. Since Polyphen won't run properly on
your present infrastructure we will not practice it actively.