remove file

8e302298 · Megens, Hendrik-Jan · 0883fdf2 · 0883fdf2
Commit 8e302298 authored 10 years ago by Megens, Hendrik-Jan
--- a/tutorial.md
+++ b/tutorial.md
-Code and tutorials for CIHEAM course. Variation annotation and function
-prediction sessions
-
-Variant annotation
------------------
-
-=
-
-### Slicing and dicing of VCF files
-
-VCF files can get very big. Being able to manipulate them and extract
-relevant information is important. In practice, flat text files such as
-VCF are currently the basis for further analysis - see the Imputation
-and Population practical sessions.
-
-The first task involves selecting all called variants of chromosome 18,
-bgzip-ping them on-the-fly, and then indexing that file with tabix:
-
-` tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz`\
-` tabix -p vcf BT18.vcf.gz`
-
-### Annotating VCF with rs-numbers
-
-The VCF file you are working does not yet have the dbSNP identifiers
-(rs-numbers) annotated. The rs-numbers of all currently known variants
-as databased in dbSNP can be found in this file:
-
-` /path/to/ BT18.vcf.gz`
-
-Have a look at what is inside this file:
-
-` gunzip -c BT18.vcf.gz | grep -v '^##' | more`
-
-or
-
-` tabix BT18.vcf.gz 18:1-10000 | more`
-
-You can use vcftools to do many things related to VCF files. This
-includes annotating the VCF based on other interval-file formatted
-input, including other VCF files. For instance, you can annotate an
-interval of the VCF on the fly, by selecting that interval using tabix,
-and then piping it to vcf-annotate, like so:
-
-` tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID`
-
-Now make a new VCF file for all the variants called for chromosome 18
-using this tool:
-
-` tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz`\
-` tabix -p vcf BT18_rsnumbers.vcf.gz`
-
-### Extracting variants using BEDtools
-
-You will very often want to extract variants based on interval files,
-e.g. a certain region on a chromosome. This can be easily achieve using
-the tabix command described above. However, sometimes you may want to
-extract a substantial number of intervals. You could use tabix in a
-loop, either as a shell-script or in a scripting language like Perl or
-Python. However, BedTools provides many utilities for manipulating
-interval data, such as BED, VCF, or GTF/GFF formats.
-
-Say you are interested in the variants in all the genes lying between
-coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of
-genes can be found in a GTF file. The GTF file for the UMD3.1 genome
-build can be found at Ensembl and downloaded using the command line
-(NOTE: the compute nodes you are working on do not have internet
-connectivity. This line of code is just for completeness):
-
-` wget `[`ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz`](ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz)
-
-The file is already downloaded and can be found here:
-
-` /path/to/Bos_taurus.UMD3.1.78.gtf.gz`
-
-Have a quick look at what is in this file:
-
-gunzip -c /path/to/Bos\_taurus.UMD3.1.78.gtf.gz | more
-
-Familiarise yourself with the GTF format if you have not already. For
-more information:
-
-` `[`http://www.ensembl.org/info/website/upload/gff.html`](http://www.ensembl.org/info/website/upload/gff.html)
-
-You can select the appropriate region from the GTF file:
-
-` gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf`
-
-Subsequently you can intersect the VCF with the GTF file that contains
-just the genes between coordinates 1 and 2 million:
-
-` bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf`
-
-Have a quick look at the output, and count the number of variants in
-this VCF (e.g. by using 'wc -l').
-
-Now do the same, but select only the exonic variants in the same
-interval, from 1 to 2 million.
-
-How many variants do you have now?
-
-### Variant Effect Predictor
-
-Various tools exist to make more precise annotations than what you can
-achieve using Bedtools and a GTF. Among them are Variant Effect
-Predictor (VEP), snpEff, and Annovar. For the current practical we will
-use former two.
-
-VEP uses a cache directory that has the annotation information. On the
-infrastructure you are working on that directory can be found here:
-
-` /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/`
-
-You can create a file that will provide annotation information for each
-SNP:
-
-` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite`
-
-Have a look inside the output file and study its content. Why are there
-no SNPs in this list that are intergenic?
-
-For some purposes, it might be more convenient to have the annotations
-directly in the VCF. You could for instance do the following:
-
-` gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf --fork 4 --canonical --sift b --coding_only --offline --force_overwrite --vcf `\
-` bgzip BT18.vep.vcf`\
-` tabix -p vcf BT18.vep.vcf.gz`
-
-Again, study and interpret your outcome by looking inside the file.
-Specifically, note WHERE the annotation ended up.
-
-When dealing with deleterious alleles, we might be interested to learn
-how frequent that allele is in the population we are studying. Our
-expectation is that genuinly deleterious alleles should be relatively
-rare in a population. If not they might hint at either a sign of
-inbreeding, or, perhaps, something that might be deleterious in the wild
-but advantageous in the context of domestication/breeding. The VCF file
-holds the (reference) allele frequency information. The 8th column of
-the VCF file in fact is a bunch of fields concatenated with the ';'
-delimitor. One of the fields has the tag 'AF=', where AF stands for
-Allele Frequency. We can expand the number of space-delimited fields
-easily and then look for the 11th column. We need to remove the 'AF='
-tag though, but if we do that we can further select for instance for
-allele frequencies between 0 and 15%.
-
-` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l`
-
-Analysing the annotations a bit further, you could discover that some
-
-` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt  `
-
-Take two or three genes that have the highest number of deleterious
-variants. Manually look them up at the Ensembl website (ensembl.org).
-Check the 'orthologous' status. What do you observe? How would you
-interpret the validity of the gene annotation and/or the likelihood of
-off-site mapping of reads?
-
-This provides us with a strategy to filter a bit more rigorously for
-genes that may actually contain genuine deleterious alleles. Ideally we
-would a priory filter for genes that have proper one-to-one orthologous
-relations with other mammalian species. Time and lack of direct internet
-connection makes that unfeasible for this practical. However, genes that
-have only one deleterious variant segregating in the study population at
-a low frequency would certainly be better candidates than genes that
-have multiple deleterious variants, as the above example indicates. To
-filter you can do the following:
-
-` gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt`
-
-How many genes are there that meet this condition?
-
-### snpEff
-
-Another popular variant annotation tool is 'snpEff'. Among the
-advantages over VEP is that it is applicable to a wider range of
-species. Furthermore, it is blazingly fast. For chromosome 18, we will
-also use this tool:
-
-` java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt`
-
-Study the outcome. Try to filter out 'intergenic' SNPs and other things
-that are perhaps a bit less relevant (use your skills in 'grep-ping',
-etc.). What are some of the differences in how the variants are
-annotated, compared to VEP?
-
-You will notice that the variants that have an effect are labelled
-'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How
-many are there?
-
-Compare the results by selecting the 'MODERATE' and "HIGH" variants, and
-compare that with the 'deleterious' variants as annotated (through SIFT)
-in VEP. You can use, e.g. vcf-compare.
-
-Inferring function
------------------
-
-### Provean
-
-` >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding`\
-` MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP`\
-` CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD`\
-` PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR`\
-` SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK`
-
-` 235,G,E`\
-` 5,V,A`\
-` 22,C,Y`\
-` 34,T,I`\
-` 51,D,N`\
-` 53,K,N`\
-` 59,S,Y`\
-` 61,C,R`\
-` 64,S,L`\
-` 67,H,P`\
-` 68,I,T`\
-` 75,A,V`\
-` 108,T,K`\
-` 115,A,T`\
-` 124,E,D`\
-` 130,F,Y`\
-` 133,L,P`\
-` 142,P,A `\
-` 155,F,I`\
-` 158,E,G`\
-` 186,I,V `\
-` 21,G,D`
-
-` ## PROVEAN scores ##`\
-` # VARIATION   SCORE`\
-` 235,G,E   0.076`\
-` 5,V,A 0.287`\
-` 22,C,Y    -8.028`\
-` 34,T,I    -1.932`\
-` 51,D,N    -1.613`\
-` 53,K,N    1.565`\
-` 59,S,Y    -2.140`\
-` 61,C,R    -5.826`\
-` 64,S,L    -0.437`\
-` 67,H,P    -0.511`\
-` 68,I,T    -2.664`\
-` 75,A,V    -1.061`\
-` 108,T,K   -4.051`\
-` 115,A,T   1.557`\
-` 124,E,D   -1.587`\
-` 130,F,Y   -0.983`\
-` 133,L,P   3.203`\
-` 142,P,A   -2.058`\
-` 155,F,I   0.502`\
-` 158,E,G   -0.077`\
-` 186,I,V   0.220`\
-` 21,G,D    -6.014`
-
-At the end of the VEP exercise we created a filtered list of genes that
-have only a single deleterious variant annotated. Among these genes was
-BLOC1S3 gene, ENSBTAG00000007070.
-
-` `[`http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293`](http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293)
-
-` GENE=ENSBTAG00000007070`\
-``  PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'` ``\
-` cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var`
-
-` provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;`
-
-` faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa`
-
-` cat ENSBTAG00000007070.fa >>ENSBTAG00000007070.sss.fasta`\
-` mafft --auto ENSBTAG00000007070.sss.fasta >ENSBTAG00000007070.sss.fasta.out`
-
-Transfer the alignment to your local environment and view (seaview?).
-What do you observe for the 26th amino-acid of the cow protein sequence?
-In this 'implicit evolutionary context' there is one alternative amino
-acid that occurs at this site. What do the D amino acid and this
-evolutionary viable alternative share in terms of properties?
-
-Through the orthology with human, try to find some possible phenotypic
-consequences of the alternative allele (e.g. through OMIM). Could there
-be a reason that the variant is at a relatively high frequency in this
-cattle population?
-
-### Polyphen
-
-` /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord `\
-` #o_acc                 o_pos  o_aa1   o_aa2   rsid        acc            pos  aa1 aa2 nt1 nt2         prediction              based_on        effect      site      region        PHAT    dScore  Score1  Score2  MSAv      Nobs   Nstruct     Nfilt  PDB_id  PDB_pos PDB_ch   ident  lengthNormASA   SecStr  MapReg    dVol   dProp  B-fact   H-bonds     AveNHet     MinDHet     AveNInt     MinDInt     AveNSit     MinDSit    Transv  CodPos  CpG  MinDJxn         PfamHit      IdPmax      IdPSNP      IdQmin`\
-` ENSSSCG00000001099ENSSSCT00000001195     368      S       A               ENSSSCG00000001099ENSSSCT00000001195       368    S   A                     benign             alignment                                                    +0.721  -1.180  -1.901     2        68                                                                                                                                                                                                                                                    12.350      12.350       42.63`\
-` ENSSSCG00000001099ENSSSCT00000001195     453      R       H               ENSSSCG00000001099ENSSSCT00000001195       453    R   H                     benign             alignment                                                    +0.351  -2.130  -2.481     2        67                                                                                                                                                                                                                                                    33.194      33.194       93.76`\
-` ENSSSCG00000001099ENSSSCT00000001195     431      K       T               ENSSSCG00000001099ENSSSCT00000001195       431    K   T          possibly damaging             alignment                                                    +1.705  -1.678  -3.383     2        68                                                                                                                                                                                                                                                     2.382                   68.80`
-
-What goes in:
-
-` test1033.fa`\
-` >ENSSSCG00000001099ENSSSCT00000001195`\
-` MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL`\
-` FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN`\
-` PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD`\
-` WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN`\
-` INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV`\
-` ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG`\
-` YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI`\
-` VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD`\
-` EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ`\
-` TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG `
-
-` test1033.coord`\
-` ENSSSCG00000001099ENSSSCT00000001195 368        S A`\
-` ENSSSCG00000001099ENSSSCT00000001195 453        R H`\
-` ENSSSCG00000001099ENSSSCT00000001195 431        K T`