Variant annotation tutorial: Difference between revisions
No edit summary |
|||
Line 14: | Line 14: | ||
== Annotating VCF with rs-numbers == | == Annotating VCF with rs-numbers == | ||
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file: | |||
/path/to/ BT18.vcf.gz | |||
Have a look at what is inside this file: | |||
gunzip -c BT18.vcf.gz | grep -v '^##' | more | |||
or | |||
tabix BT18.vcf.gz 18:1-10000 | more | |||
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, | |||
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | ||
Revision as of 21:33, 6 March 2015
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions
Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions.
The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf tabix -p vcf BT18.vcf.gz
Annotating VCF with rs-numbers
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file:
/path/to/ BT18.vcf.gz
Have a look at what is inside this file:
gunzip -c BT18.vcf.gz | grep -v '^##' | more
or
tabix BT18.vcf.gz 18:1-10000 | more
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance,
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
something
Extracting variants using BEDtools
select a region of BT18 with genes:
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000'
bedtools intersect -a BT18.vcf.gz -b mygenes.gtf | more
Variant Effect Predictor
gunzip -c /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/all.fb.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o test2.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite --vcf
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11>0.85' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c 1 ENSBTAG00000000027 1 ENSBTAG00000000688 1 ENSBTAG00000000945 1 ENSBTAG00000006859 1 ENSBTAG00000011844 1 ENSBTAG00000012606 1 ENSBTAG00000014953 1 ENSBTAG00000017651 1 ENSBTAG00000019443 1 ENSBTAG00000019547 1 ENSBTAG00000023731 1 ENSBTAG00000030502 1 ENSBTAG00000032427 1 ENSBTAG00000032442 1 ENSBTAG00000034090 1 ENSBTAG00000037440 1 ENSBTAG00000037576 1 ENSBTAG00000037581 1 ENSBTAG00000039409 1 ENSBTAG00000039691 2 ENSBTAG00000045571 1 ENSBTAG00000045880 1 ENSBTAG00000046383 2 ENSBTAG00000047277 1 ENSBTAG00000047301 1 ENSBTAG00000047570 1 ENSBTAG00000047577 3 ENSBTAG00000047761 1 ENSBTAG00000048135
11 ENSBTAG00000037699
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
snpEff
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 sample_Bt.vcf >testresult_Bt.txt
Inferring function
from VEP output to Polyphen/Provean input
extracting fasta record.
faOneRecord Bos_taurus.UMD3.1.pep.all.fa `cat Bos_taurus.UMD3.1.pep.all.fa | grep ENSBTAT00000063226 | awk '{print $1}' | sed 's/>//'`
Polyphen
/lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63 ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76 ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80
What goes in:
test1033.fa >ENSSSCG00000001099ENSSSCT00000001195 MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG
test1033.coord ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 431 K T
Provean
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
235,G,E 5,V,A 22,C,Y 34,T,I 51,D,N 53,K,N 59,S,Y 61,C,R 64,S,L 67,H,P 68,I,T 75,A,V 108,T,K 115,A,T 124,E,D 130,F,Y 133,L,P 142,P,A 155,F,I 158,E,G 186,I,V 21,G,D
## PROVEAN scores ## # VARIATION SCORE 235,G,E 0.076 5,V,A 0.287 22,C,Y -8.028 34,T,I -1.932 51,D,N -1.613 53,K,N 1.565 59,S,Y -2.140 61,C,R -5.826 64,S,L -0.437 67,H,P -0.511 68,I,T -2.664 75,A,V -1.061 108,T,K -4.051 115,A,T 1.557 124,E,D -1.587 130,F,Y -0.983 133,L,P 3.203 142,P,A -2.058 155,F,I 0.502 158,E,G -0.077 186,I,V 0.220 21,G,D -6.014