Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found
Select Git revision
  • master
1 result

Target

Select target project
  • derks047/CIHEAM
  • ABGC_Genomics/CIHEAM
2 results
Select Git revision
  • master
1 result
Show changes
Commits on Source (18)
Code and tutorials for CIHEAM course. Variation annotation and function
Code and tutorials for GeneticDiv course. Variation annotation and function
prediction sessions
Variant annotation
......@@ -15,13 +15,14 @@ Create an Env.Var. to the communal session directory that holds some of
the files you need:
~~~~ {.bash}
SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation
SESSIONDIR=/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation
~~~~
And make sure python3 can be found:
And make sure python3 and bcftools can be found:
~~~~ {.bash}
alias python3='/home/formacion/COMUNES/IAMZ/soft/python-3.4.2/bin/python3.4'
module load python/3.4.2
module load bcftools/gcc/64/1.2
~~~~
### Slicing and dicing of VCF files
......@@ -62,7 +63,7 @@ The VCF file you are working does not yet have the dbSNP identifiers
as databased in dbSNP can be found in this file:
`
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
`
You can use vcftools to do many things related to VCF files. This
......@@ -73,6 +74,7 @@ and then piping it to vcf-annotate, like so:
~~~~ {.bash}
## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
module load vcftools/gcc/0.1.12b ## first load vcftools
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail
~~~~
......@@ -110,7 +112,7 @@ wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gt
The file is already downloaded and can be found here:
` /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
`/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
Have a quick look at what is in this file:
......@@ -133,6 +135,7 @@ Subsequently you can intersect the VCF with the GTF file that contains
just the genes between coordinates 1 and 2 million:
~~~~ {.bash}
module load bedtools/gcc/64/2.26.0 ## first load bedtools
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
~~~~
......@@ -154,7 +157,7 @@ use the former two.
VEP can be found here:
~~~~ {.bash}
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl
~~~~
If you invoke it without parameters you will see the following:
......@@ -190,7 +193,7 @@ For the current session we will use the following options:
` --coding_only`\
` --no_intergenic`\
` --offline # using a cache directory`\
` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
` --dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ # where the cache dir lives`\
` --force_overwrite`
You can create a file that will provide annotation information for each
......@@ -198,8 +201,8 @@ SNP:
~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl \
--dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ --species bos_taurus -o BT18.vep \
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
~~~~
......@@ -217,8 +220,8 @@ and in stead adding the '--vcf' flag:
~~~~ {.bash}
gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl \
--dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ --species bos_taurus -o BT18.vep.vcf \
--fork 4 --canonical --sift b --offline --force_overwrite --vcf
bgzip BT18.vep.vcf
tabix -p vcf BT18.vep.vcf.gz
......@@ -397,14 +400,13 @@ snpEff is Java-based. The path to the snpEff.jar Java archive is
available as an environment variable. Do:
~~~~ {.bash}
echo $snpEff
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
module load snpEff/4.2
~~~~
The snpEff program can be used like this:
~~~~ {.bash}
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.81 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
bgzip BT18_snpEff.vcf
tabix -p vcf BT18_snpEff.vcf.gz
~~~~
......@@ -441,40 +443,6 @@ Bonus question: Compare the results by selecting the 'MODERATE' and
"HIGH" variants, and compare that with the 'deleterious' variants as
annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
#### If time permits: a human example
Of all species, human so far has the best annotation information for
variation (or anything else for that matter). The reason, obviously, is
the clinical relevance. In the coming years, increased knowledge on
regulatory sequences, e.g. through the ENCODE project, will further
increase the knowledge on relevance of non-exonic variation. Again, the
human/clinical genomics community will lead the way here. The domestic
animal genomics community will follow and profit from those insights.
Currently, there are already important resources of information on
clinically relevant variations. You can find more information here:
` `[`http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/`](http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
Among the variation sets available is the 'common&clinical' - a set of
variants that has clinical relevance, and a relatively high occurrence
in human populations (i.e. not the de novo mutations often found to be
the reason for genetic disorders).
We can annotate these SNPs using snpEff:
~~~~ {.bash}
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 \
$SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf
~~~~
Copy the HTML report to your local computer and study it. A few things
will be apparent that are different to the cattle example. You will see
consequences that you have not seen in cattle, notably
"TF\_binding\_site\_variant". Another is that, because this is a
non-random sample of variants, its distribution is non-random as well.
Exonic variants, most notably, are highly overrepresented.
Inferring function
------------------
......@@ -489,7 +457,7 @@ What goes in:
`
Protein sequence (Fasta):
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
......@@ -571,7 +539,7 @@ GENE=ENSBTAG00000007070
## search the protein file for the gene, and parse out the corresponding protein name
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
PROT=`grep $GENE $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | awk '{print $1}' | sed 's/>//'`
## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
......@@ -580,17 +548,17 @@ cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//
## retrieve the protein sequence from the multifasta protein sequence file.
## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
$SESSIONDIR/faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
~~~~
This should result in two files that are produced:
ENSBTAP00000009293.fa
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`
` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`
` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`
` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`
` LPDIRGVPGTEPEQDPGPRA`
ENSBTAP00000009293.var
......@@ -601,7 +569,7 @@ Finally, the actual Provean analysis is done with this command (this may
take a few minutes to run):
~~~~ {.bash}
provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
$SESSIONDIR/provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
~~~~
Study the result. What is the verdict on the alternative allele?
......@@ -630,21 +598,12 @@ otherwise the file will get overwritten if you use only one!).
cat $PROT.fa >>$PROT.sss.fasta
## align all protein sequences using the multiple sequence aligner mafft:
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
~~~~
Transfer the alignment to your local environment and view. If installed
on your local computer you could use e.g. Seaview. If you are on a
Debian-based system (e.g. Ubuntu) and you have administrative rights,
you can simply install it like this:
~~~~ {.bash}
sudo apt-get install seaview
$SESSIONDIR/mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
~~~~
Alternatively, you can view the alignment through this online tool:
You can view the alignment through this online tool:
` `[`http://toolkit.tuebingen.mpg.de/alnviz`](http://toolkit.tuebingen.mpg.de/alnviz)
` `[`https://toolkit.tuebingen.mpg.de/#/tools/alnviz`](https://toolkit.tuebingen.mpg.de/#/tools/alnviz)
What do you observe for the 26th amino-acid of the cow protein sequence?
In this 'implicit evolutionary context' is there another alternative
......@@ -663,26 +622,26 @@ lists coordinates and variants:
What goes in: in.fa
` >ENSSSCG00000001099ENSSSCT00000001195`\
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
` >ENSSSCG00000001099ENSSSCT00000001195`
` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`
` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`
` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`
` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`
` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`
` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`
` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`
` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`
` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`
` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
in.coord
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`\
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`\
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
` ENSSSCG00000001099ENSSSCT00000001195 368 S A`
` ENSSSCG00000001099ENSSSCT00000001195 453 R H`
` ENSSSCG00000001099ENSSSCT00000001195 431 K T`
~~~~ {.bash}
run_pph.pl -s in.fa in.coord
$SESSIONDIR/run_pph.pl -s in.fa in.coord
~~~~
What comes out:
......