Variant annotation tutorial
Jump to navigation
Jump to search
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions
Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions.
The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:
tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf tabix -p vcf BT18.vcf.gz
Annotating VCF with rs-numbers
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID
something
Extracting variants using BEDtools
select a region of BT18 with genes:
gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000'
bedtools intersect -a BT18.vcf.gz -b mygenes.gtf | more
Variant Effect Predictor
gunzip -c /home/formacion/COMUNES/IAMZ/data/CIHEAM/MULTISAMPLE_VCF/all.fb.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o test2.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite --vcf
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11>0.85' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c 1 ENSBTAG00000000027 1 ENSBTAG00000000688 1 ENSBTAG00000000945 1 ENSBTAG00000006859 1 ENSBTAG00000011844 1 ENSBTAG00000012606 1 ENSBTAG00000014953 1 ENSBTAG00000017651 1 ENSBTAG00000019443 1 ENSBTAG00000019547 1 ENSBTAG00000023731 1 ENSBTAG00000030502 1 ENSBTAG00000032427 1 ENSBTAG00000032442 1 ENSBTAG00000034090 1 ENSBTAG00000037440 1 ENSBTAG00000037576 1 ENSBTAG00000037581 1 ENSBTAG00000039409 1 ENSBTAG00000039691 2 ENSBTAG00000045571 1 ENSBTAG00000045880 1 ENSBTAG00000046383 2 ENSBTAG00000047277 1 ENSBTAG00000047301 1 ENSBTAG00000047570 1 ENSBTAG00000047577 3 ENSBTAG00000047761 1 ENSBTAG00000048135
11 ENSBTAG00000037699
gunzip -c test2.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
snpEff
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 sample_Bt.vcf >testresult_Bt.txt
Inferring function
from VEP output to Polyphen/Provean input
extracting fasta record.
faOneRecord Bos_taurus.UMD3.1.pep.all.fa `cat Bos_taurus.UMD3.1.pep.all.fa | grep ENSBTAT00000063226 | awk '{print $1}' | sed 's/>//'`
Polyphen
/lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63 ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76 ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80
What goes in:
test1033.fa >ENSSSCG00000001099ENSSSCT00000001195 MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG
test1033.coord ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 431 K T
Provean
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
235,G,E 5,V,A 22,C,Y 34,T,I 51,D,N 53,K,N 59,S,Y 61,C,R 64,S,L 67,H,P 68,I,T 75,A,V 108,T,K 115,A,T 124,E,D 130,F,Y 133,L,P 142,P,A 155,F,I 158,E,G 186,I,V 21,G,D
## PROVEAN scores ## # VARIATION SCORE 235,G,E 0.076 5,V,A 0.287 22,C,Y -8.028 34,T,I -1.932 51,D,N -1.613 53,K,N 1.565 59,S,Y -2.140 61,C,R -5.826 64,S,L -0.437 67,H,P -0.511 68,I,T -2.664 75,A,V -1.061 108,T,K -4.051 115,A,T 1.557 124,E,D -1.587 130,F,Y -0.983 133,L,P 3.203 142,P,A -2.058 155,F,I 0.502 158,E,G -0.077 186,I,V 0.220 21,G,D -6.014