Compare revisions

Derks, Martijn · Derks, Martijn · Derks, Martijn · Derks, Martijn · Derks, Martijn · Derks, Martijn
--- a/README.md
+++ b/README.md
-Code and tutorials for CIHEAM course. Variation annotation and function
+Code and tutorials for GeneticDiv course. Variation annotation and function
 prediction sessions

 Variant annotation
@@ -15,13 +15,14 @@ Create an Env.Var. to the communal session directory that holds some of
 the files you need:

 ~~~~ {.bash}
-SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation
+SESSIONDIR=/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation
 ~~~~

-And make sure python3 can be found:
+And make sure python3 and bcftools can be found:

 ~~~~ {.bash}
-alias python3='/home/formacion/COMUNES/IAMZ/soft/python-3.4.2/bin/python3.4'
+module load python/3.4.2
+module load bcftools/gcc/64/1.2
 ~~~~

 ### Slicing and dicing of VCF files
@@ -62,7 +63,7 @@ The VCF file you are working does not yet have the dbSNP identifiers
 as databased in dbSNP can be found in this file:

 `
-/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
+/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
 `

 You can use vcftools to do many things related to VCF files. This
@@ -73,6 +74,7 @@ and then piping it to vcf-annotate, like so:

 ~~~~ {.bash}
 ## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
+module load vcftools/gcc/0.1.12b ## first load vcftools
 tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail 
 ~~~~

@@ -110,7 +112,7 @@ wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gt

 The file is already downloaded and can be found here:

-`  /home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`
+`/lustre/nobackup/WUR/ABGC/shared/IMAGEcourse2018/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz`

 Have a quick look at what is in this file:

@@ -133,6 +135,7 @@ Subsequently you can intersect the VCF with the GTF file that contains
 just the genes between coordinates 1 and 2 million:

 ~~~~ {.bash}
+module load bedtools/gcc/64/2.26.0 ## first load bedtools
 bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
 ~~~~

@@ -154,7 +157,7 @@ use the former two.
 VEP can be found here:

 ~~~~ {.bash}
-perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
+perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl
 ~~~~

 If you invoke it without parameters you will see the following:
@@ -190,7 +193,7 @@ For the current session we will use the following options:
 ` --coding_only`\
 ` --no_intergenic`\
 ` --offline   # using a cache directory`\
-` --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives`\
+` --dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ # where the cache dir lives`\
 ` --force_overwrite`

 You can create a file that will provide annotation information for each
@@ -198,8 +201,8 @@ SNP:

 ~~~~ {.bash}
 gunzip -c BT18_rsnumbers.vcf.gz | \
-  perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
-  --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \
+  perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl \
+  --dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ --species bos_taurus -o BT18.vep \
  --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
 ~~~~

@@ -217,8 +220,8 @@ and in stead adding the '--vcf' flag:

 ~~~~ {.bash}
 gunzip -c BT18_rsnumbers.vcf.gz | \
-  perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \
-  --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \
+  perl /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/variant_effect_predictor.pl \
+  --dir /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP78/cache/ --species bos_taurus -o BT18.vep.vcf \
  --fork 4 --canonical --sift b --offline --force_overwrite --vcf 
 bgzip BT18.vep.vcf
 tabix -p vcf BT18.vep.vcf.gz
@@ -397,14 +400,13 @@ snpEff is Java-based. The path to the snpEff.jar Java archive is
 available as an environment variable. Do:

 ~~~~ {.bash}
-echo $snpEff
-  /home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
+module load snpEff/4.2
 ~~~~

 The snpEff program can be used like this:

 ~~~~ {.bash}
-java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
+snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.81 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf
 bgzip BT18_snpEff.vcf
 tabix -p vcf BT18_snpEff.vcf.gz
 ~~~~
@@ -441,40 +443,6 @@ Bonus question: Compare the results by selecting the 'MODERATE' and
 "HIGH" variants, and compare that with the 'deleterious' variants as
 annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.

-#### If time permits: a human example
-
-Of all species, human so far has the best annotation information for
-variation (or anything else for that matter). The reason, obviously, is
-the clinical relevance. In the coming years, increased knowledge on
-regulatory sequences, e.g. through the ENCODE project, will further
-increase the knowledge on relevance of non-exonic variation. Again, the
-human/clinical genomics community will lead the way here. The domestic
-animal genomics community will follow and profit from those insights.
-
-Currently, there are already important resources of information on
-clinically relevant variations. You can find more information here:
-
-` `[`http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/`](http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
-
-Among the variation sets available is the 'common&clinical' - a set of
-variants that has clinical relevance, and a relatively high occurrence
-in human populations (i.e. not the de novo mutations often found to be
-the reason for genetic disorders).
-
-We can annotate these SNPs using snpEff:
-
-~~~~ {.bash}
-java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 \
-   $SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf
-~~~~
-
-Copy the HTML report to your local computer and study it. A few things
-will be apparent that are different to the cattle example. You will see
-consequences that you have not seen in cattle, notably
-"TF\_binding\_site\_variant". Another is that, because this is a
-non-random sample of variants, its distribution is non-random as well.
-Exonic variants, most notably, are highly overrepresented.
-
 Inferring function
 ------------------

@@ -489,7 +457,7 @@ What goes in:

 `
 Protein sequence (Fasta):
-  >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
+  >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding  
  MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
  CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
  PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
@@ -571,7 +539,7 @@ GENE=ENSBTAG00000007070
  
 ## search the protein file for the gene, and parse out the corresponding protein name
 ## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
-PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
+PROT=`grep $GENE $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | awk '{print $1}' | sed 's/>//'`
  
 ## use the first VEP annotation you did for this. Easiest to work with because completely tabular.
 ## Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
@@ -580,17 +548,17 @@ cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//
  
 ## retrieve the protein sequence from the multifasta protein sequence file.
 ## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
-faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
+$SESSIONDIR/faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
 ~~~~

 This should result in two files that are produced:

 ENSBTAP00000009293.fa

-` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`\
-` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`\
-` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`\
-` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`\
+` >ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding`  
+` MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA`
+` AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH`
+` DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL`
 ` LPDIRGVPGTEPEQDPGPRA`

 ENSBTAP00000009293.var
@@ -601,7 +569,7 @@ Finally, the actual Provean analysis is done with this command (this may
 take a few minutes to run):

 ~~~~ {.bash}
-provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
+$SESSIONDIR/provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error
 ~~~~

 Study the result. What is the verdict on the alternative allele?
@@ -630,21 +598,12 @@ otherwise the file will get overwritten if you use only one!).
 cat $PROT.fa >>$PROT.sss.fasta
  
 ## align all protein sequences using the multiple sequence aligner mafft:
-mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
-~~~~
-
-Transfer the alignment to your local environment and view. If installed
-on your local computer you could use e.g. Seaview. If you are on a
-Debian-based system (e.g. Ubuntu) and you have administrative rights,
-you can simply install it like this:
-
-~~~~ {.bash}
-sudo apt-get install seaview
+$SESSIONDIR/mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out
 ~~~~

-Alternatively, you can view the alignment through this online tool:
+You can view the alignment through this online tool:

-` `[`http://toolkit.tuebingen.mpg.de/alnviz`](http://toolkit.tuebingen.mpg.de/alnviz)
+` `[`https://toolkit.tuebingen.mpg.de/#/tools/alnviz`](https://toolkit.tuebingen.mpg.de/#/tools/alnviz)

 What do you observe for the 26th amino-acid of the cow protein sequence?
 In this 'implicit evolutionary context' is there another alternative
@@ -663,26 +622,26 @@ lists coordinates and variants:

 What goes in: in.fa

-` >ENSSSCG00000001099ENSSSCT00000001195`\
-` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
-` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
-` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
-` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
-` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
-` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
-` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
-` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
-` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
-` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
+` >ENSSSCG00000001099ENSSSCT00000001195`  
+` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`  
+` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`  
+` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`  
+` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`  
+` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`  
+` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`  
+` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`  
+` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`  
+` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`  
+` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `  

 in.coord

-` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
-` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
-` ENSSSCG00000001099ENSSSCT00000001195 431        K T`
+` ENSSSCG00000001099ENSSSCT00000001195 368        S A`  
+` ENSSSCG00000001099ENSSSCT00000001195 453        R H`  
+` ENSSSCG00000001099ENSSSCT00000001195 431        K T`  

 ~~~~ {.bash}
-run_pph.pl -s in.fa in.coord
+$SESSIONDIR/run_pph.pl -s in.fa in.coord
 ~~~~

 What comes out:
No results found