Variant annotation tutorial: Difference between revisions
No edit summary |
No edit summary |
||
Line 32: | Line 32: | ||
tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz | tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz | ||
tabix -p vcf BT18_rsnumbers.vcf.gz | tabix -p vcf BT18_rsnumbers.vcf.gz | ||
== Extracting variants using BEDtools == | == Extracting variants using BEDtools == | ||
Line 64: | Line 63: | ||
How many variants do you have now? | How many variants do you have now? | ||
== Variant Effect Predictor == | == Variant Effect Predictor == | ||
Line 75: | Line 73: | ||
gunzip -c /home/formacion/COMUNES/IAMZ/data/CIHEAM/ | gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite | ||
gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite --vcf | |||
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l | gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l |
Revision as of 22:04, 6 March 2015
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions
Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions.
The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz tabix -p vcf BT18.vcf.gz
Annotating VCF with rs-numbers
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file:
/path/to/ BT18.vcf.gz
Have a look at what is inside this file:
gunzip -c BT18.vcf.gz | grep -v '^##' | more
or
tabix BT18.vcf.gz 18:1-10000 | more
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so:
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
Now make a new VCF file for all the variants called for chromosome 18 using this tool:
tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz tabix -p vcf BT18_rsnumbers.vcf.gz
Extracting variants using BEDtools
You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily achieve using the tabix command described above. However, sometimes you may want to extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or Python. However, BedTools provides many utilities for manipulating interval data, such as BED, VCF, or GTF/GFF formats.
Say you are interested in the variants in all the genes lying between coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of genes can be found in a GTF file. The GTF file for the UMD3.1 genome build can be found at Ensembl and downloaded using the command line (NOTE: the compute nodes you are working on do not have internet connectivity. This line of code is just for completeness):
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz
The file is already downloaded and can be found here:
/path/to/Bos_taurus.UMD3.1.78.gtf.gz
Have a quick look at what is in this file:
gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more
Familiarise yourself with the GTF format if you have not already. For more information:
http://www.ensembl.org/info/website/upload/gff.html
You can select the appropriate region from the GTF file:
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf
Subsequently you can intersect the VCF with the GTF file that contains just the genes between coordinates 1 and 2 million:
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf
Have a quick look at the output, and count the number of variants in this VCF (e.g. by using 'wc -l').
Now do the same, but select only the exonic variants in the same interval, from 1 to 2 million.
How many variants do you have now?
Variant Effect Predictor
Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use former two.
VEP uses a cache directory that has the annotation information. On the infrastructure you are working on that directory can be found here:
/home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/
gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite --vcf
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11>0.85' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c 1 ENSBTAG00000000027 1 ENSBTAG00000000688 1 ENSBTAG00000000945 1 ENSBTAG00000006859 1 ENSBTAG00000011844 1 ENSBTAG00000012606 1 ENSBTAG00000014953 1 ENSBTAG00000017651 1 ENSBTAG00000019443 1 ENSBTAG00000019547 1 ENSBTAG00000023731 1 ENSBTAG00000030502 1 ENSBTAG00000032427 1 ENSBTAG00000032442 1 ENSBTAG00000034090 1 ENSBTAG00000037440 1 ENSBTAG00000037576 1 ENSBTAG00000037581 1 ENSBTAG00000039409 1 ENSBTAG00000039691 2 ENSBTAG00000045571 1 ENSBTAG00000045880 1 ENSBTAG00000046383 2 ENSBTAG00000047277 1 ENSBTAG00000047301 1 ENSBTAG00000047570 1 ENSBTAG00000047577 3 ENSBTAG00000047761 1 ENSBTAG00000048135
11 ENSBTAG00000037699
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
snpEff
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 sample_Bt.vcf >testresult_Bt.txt
Inferring function
from VEP output to Polyphen/Provean input
extracting fasta record.
faOneRecord Bos_taurus.UMD3.1.pep.all.fa `cat Bos_taurus.UMD3.1.pep.all.fa | grep ENSBTAT00000063226 | awk '{print $1}' | sed 's/>//'`
Polyphen
/lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63 ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76 ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80
What goes in:
test1033.fa >ENSSSCG00000001099ENSSSCT00000001195 MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG
test1033.coord ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 431 K T
Provean
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
235,G,E 5,V,A 22,C,Y 34,T,I 51,D,N 53,K,N 59,S,Y 61,C,R 64,S,L 67,H,P 68,I,T 75,A,V 108,T,K 115,A,T 124,E,D 130,F,Y 133,L,P 142,P,A 155,F,I 158,E,G 186,I,V 21,G,D
## PROVEAN scores ## # VARIATION SCORE 235,G,E 0.076 5,V,A 0.287 22,C,Y -8.028 34,T,I -1.932 51,D,N -1.613 53,K,N 1.565 59,S,Y -2.140 61,C,R -5.826 64,S,L -0.437 67,H,P -0.511 68,I,T -2.664 75,A,V -1.061 108,T,K -4.051 115,A,T 1.557 124,E,D -1.587 130,F,Y -0.983 133,L,P 3.203 142,P,A -2.058 155,F,I 0.502 158,E,G -0.077 186,I,V 0.220 21,G,D -6.014