Variant annotation tutorial: Difference between revisions

From HPCwiki
Jump to navigation Jump to search
Line 68: Line 68:


Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use former two.  
Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use former two.  
VEP can be found here:
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl
If you invoke it without parameters you will see the following:
  Usage:
  perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]
  Basic options
  =============
 
  --help                Display this message and quit
 
  -i | --input_file      Input file
  -o | --output_file    Output file
  --force_overwrite      Force overwriting of output file
  --species [species]    Species to use [default: "human"]
                       
  --everything          Shortcut switch to turn on commonly used options. See web
                        documentation for details [default: off]                     
  --fork [num_forks]    Use forking to improve script runtime
 
  For full option documentation see:
  http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
 
For the current session we will use the following options:
  --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/
  --species bos_taurus
  -o BT18.vep
  --fork 4
  --canonical
  --sift b
  --coding_only
  --no_intergenic
  --offline
  --force_overwrite


VEP uses a cache directory that has the annotation information. On the infrastructure you are working on that directory can be found here:
VEP uses a cache directory that has the annotation information. On the infrastructure you are working on that directory can be found here:

Revision as of 16:50, 7 March 2015

Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions

Variant annotation

Slicing and dicing of VCF files

VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions.

The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:

 tabix -h allbt.vcf.gz 18 | bgzip >BT18.vcf.gz
 tabix -p vcf BT18.vcf.gz

Annotating VCF with rs-numbers

The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file:

 /path/to/ BT18.vcf.gz

Have a look at what is inside this file:

 gunzip -c BT18.vcf.gz | grep -v '^##' | more

or

 tabix BT18.vcf.gz 18:1-10000 | more

You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so:

 tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID

Now make a new VCF file for all the variants called for chromosome 18 using this tool:

 tabix -h BT18.vcf.gz 18 | vcf-annotate -a BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz
 tabix -p vcf BT18_rsnumbers.vcf.gz

Extracting variants using BEDtools

You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily achieve using the tabix command described above. However, sometimes you may want to extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or Python. However, BedTools provides many utilities for manipulating interval data, such as BED, VCF, or GTF/GFF formats.

Say you are interested in the variants in all the genes lying between coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of genes can be found in a GTF file. The GTF file for the UMD3.1 genome build can be found at Ensembl and downloaded using the command line (NOTE: the compute nodes you are working on do not have internet connectivity. This line of code is just for completeness):

 wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz

The file is already downloaded and can be found here:

 /path/to/Bos_taurus.UMD3.1.78.gtf.gz

Have a quick look at what is in this file:

gunzip -c /path/to/Bos_taurus.UMD3.1.78.gtf.gz | more

Familiarise yourself with the GTF format if you have not already. For more information:

 http://www.ensembl.org/info/website/upload/gff.html

You can select the appropriate region from the GTF file:

 gunzip -c Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf

Subsequently you can intersect the VCF with the GTF file that contains just the genes between coordinates 1 and 2 million:

 bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf

Have a quick look at the output, and count the number of variants in this VCF (e.g. by using 'wc -l').

Now do the same, but select only the exonic variants in the same interval, from 1 to 2 million.

How many variants do you have now?

Variant Effect Predictor

Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use former two.

VEP can be found here: perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl

If you invoke it without parameters you will see the following:

 Usage:
 perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments]
 Basic options
 =============
 
 --help                 Display this message and quit
 
 -i | --input_file      Input file
 -o | --output_file     Output file
 --force_overwrite      Force overwriting of output file
 --species [species]    Species to use [default: "human"]
                        
 --everything           Shortcut switch to turn on commonly used options. See web
                        documentation for details [default: off]                       
 --fork [num_forks]     Use forking to improve script runtime
 
 For full option documentation see:
 http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
 

For the current session we will use the following options:

 --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/
 --species bos_taurus
 -o BT18.vep
 --fork 4
 --canonical
 --sift b
 --coding_only
 --no_intergenic
 --offline 
 --force_overwrite

VEP uses a cache directory that has the annotation information. On the infrastructure you are working on that directory can be found here:

 /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/

You can create a file that will provide annotation information for each SNP:

 gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite

Have a look inside the output file and study its content. Why are there no SNPs in this list that are intergenic?

For some purposes, it might be more convenient to have the annotations directly in the VCF. You could for instance do the following:

 gunzip -c BT18_rsnumbers.vcf.gz | perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf --fork 4 --canonical --sift b --coding_only --offline --force_overwrite --vcf 
 bgzip BT18.vep.vcf
 tabix -p vcf BT18.vep.vcf.gz

Again, study and interpret your outcome by looking inside the file. Specifically, note WHERE the annotation ended up.

 18	53205918	.	G	A	0.329662	.	AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB
 =70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;
 ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;
 RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;
 CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|
 YES|deleterious(0.02)	GT:DP:RO:QR:AO:QA:GL	0/0:7:7:263:0:0:0,-1.92588,-10	0/0:7:7:238:0:0:0,-1.92376,-10
 0/0:9:7:266:2:70:-3.90732,0,-10	0/0:10:10:362:0:0:0,-2.73805,-10	0/0:10:10:345:0:0:0,-2.73789,-10	0/0:8:8:287:0:0:0,-2.19662,-10	0/0:5:5:178:0:0:0,-1.38409,-10	0/0:14:14:482:0:0:0,-3.82032,-10

A further look at the VEP output

 ['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', 
 'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279', 
 'CSQ': ['A', 'ENSBTAG00000018834', 'ENSBTAT00000025071', 'Transcript', 'missense_variant', 
 '2132', '1984', '662', 'E/K', 'Gag/Aag', , , '1', 'YES', ['deleterious', '0.02']], 'EPP': '7.35324', 
 'SRP': '31.7504', 'QR': '2421', 'SAF': '1', 'ODDS': '2.54006', 'SRR': '19', 'SRF': '49', 
 'technology.ILLUMINA': '1', 'QA': '70', 'EPPR': '3.13803', 'SAP': '3.0103', 'RPPR': '11.1853', 'PQR': '0',
'NS': '8', 'CIGAR': '1X', 'TYPE': 'snp', 'AF': '0.125', 'ABP': '0', 'LEN': '1', 'AC': '0', 'MEANALT': '1',
'PAIRED': '1', 'AO': '2', 'AN': '16', 'MQM': '60', 'RUN': '1', 'PRO': '0', 'RPP': '3.0103'}, 'GT:DP:RO:QR:AO:QA:GL', 
 '0/0:7:7:263:0:0:0,-1.92588,-10', '0/0:7:7:238:0:0:0,-1.92376,-10', '0/0:9:7:266:2:70:-3.90732,0,-10', 
 '0/0:10:10:362:0:0:0,-2.73805,-10', '0/0:10:10:345:0:0:0,-2.73789,-10', '0/0:8:8:287:0:0:0,-2.19662,-10',
 '0/0:5:5:178:0:0:0,-1.38409,-10', '0/0:14:14:482:0:0:0,-3.82032,-10']
 18	53205918	0.125	ENSBTAG00000018834	missense_variant	deleterious	0.02

When dealing with deleterious alleles, we might be interested to learn how frequent that allele is in the population we are studying. Our expectation is that genuinly deleterious alleles should be relatively rare in a population. If not they might hint at either a sign of inbreeding, or, perhaps, something that might be deleterious in the wild but advantageous in the context of domestication/breeding. The VCF file holds the (reference) allele frequency information. The 8th column of the VCF file in fact is a bunch of fields concatenated with the ';' delimitor. One of the fields has the tag 'AF=', where AF stands for Allele Frequency. We can expand the number of space-delimited fields easily and then look for the 11th column. We need to remove the 'AF=' tag though, but if we do that we can further select for instance for allele frequencies between 0 and 15%.

 gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | awk '$11<0.15&&$11>0' | wc -l

Analysing the annotations a bit further, you could discover that some

Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample.

 gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/ /g' | awk '{gsub("AF=","",$11); print}' | awk '$11<1&&$11>0' | sed 's/ /#_#/g' | sed 's/|/\t/g' | cut -f2 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt  

However, while this will work, it is ugly and doesn't give you a lot of control.

 gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1'  >genes_with_more_than1.txt


Take two or three genes that have the highest number of deleterious variants. Manually look them up at the Ensembl website (ensembl.org). Check the 'orthologous' status. What do you observe? How would you interpret the validity of the gene annotation and/or the likelihood of off-site mapping of reads?

This provides us with a strategy to filter a bit more rigorously for genes that may actually contain genuine deleterious alleles. Ideally we would a priory filter for genes that have proper one-to-one orthologous relations with other mammalian species. Time and lack of direct internet connection makes that unfeasible for this practical. However, genes that have only one deleterious variant segregating in the study population at a low frequency would certainly be better candidates than genes that have multiple deleterious variants, as the above example indicates. To filter you can do the following:

 gunzip -c LeonEnv/BT18.vep.all.vcf.gz | python3 print_fields_from_vcf.py -m vep | grep deleterious | cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}'  >genes_with1.txt

How many genes are there that meet this condition?

snpEff

Another popular variant annotation tool is 'snpEff'. Among the advantages over VEP is that it is applicable to a wider range of species. Furthermore, it is blazingly fast. For chromosome 18, we will also use this tool:

 java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >testresult_snpEff_Bt.txt

Study the outcome. Try to filter out 'intergenic' SNPs and other things that are perhaps a bit less relevant (use your skills in 'grep-ping', etc.). What are some of the differences in how the variants are annotated, compared to VEP?

You will notice that the variants that have an effect are labelled 'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How many are there?

Compare the results by selecting the 'MODERATE' and "HIGH" variants, and compare that with the 'deleterious' variants as annotated (through SIFT) in VEP.

 cat LeonEnv/testresult_snpEff_Bt.txt  | python3 cut_line.py -m snpeff | cut -f6 | sort | uniq -c
 gunzip -c LeonEnv/BT18.vep.all.vcf.gz  | python3 cut_line.py -m vep | cut -f5 | sort | uniq -c


You can use, e.g. vcf-compare.

Inferring function

Provean

 >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
 MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
 CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
 PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
 SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
 235,G,E
 5,V,A
 22,C,Y
 34,T,I
 51,D,N
 53,K,N
 59,S,Y
 61,C,R
 64,S,L
 67,H,P
 68,I,T
 75,A,V
 108,T,K
 115,A,T
 124,E,D
 130,F,Y
 133,L,P
 142,P,A 
 155,F,I
 158,E,G
 186,I,V 
 21,G,D
 ## PROVEAN scores ##
 # VARIATION	SCORE
 235,G,E	0.076
 5,V,A	0.287
 22,C,Y	-8.028
 34,T,I	-1.932
 51,D,N	-1.613
 53,K,N	1.565
 59,S,Y	-2.140
 61,C,R	-5.826
 64,S,L	-0.437
 67,H,P	-0.511
 68,I,T	-2.664
 75,A,V	-1.061
 108,T,K	-4.051
 115,A,T	1.557
 124,E,D	-1.587
 130,F,Y	-0.983
 133,L,P	3.203
 142,P,A	-2.058
 155,F,I	0.502
 158,E,G	-0.077
 186,I,V	0.220
 21,G,D	-6.014

At the end of the VEP exercise we created a filtered list of genes that have only a single deleterious variant annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.

 http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293


 GENE=ENSBTAG00000007070
 PROT=`cat ~/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
 cat test.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
 provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
 faOneRecord ~/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa
 cat ENSBTAG00000007070.fa >>ENSBTAG00000007070.sss.fasta
 mafft --auto ENSBTAG00000007070.sss.fasta >ENSBTAG00000007070.sss.fasta.out

Transfer the alignment to your local environment and view (seaview?). What do you observe for the 26th amino-acid of the cow protein sequence? In this 'implicit evolutionary context' there is one alternative amino acid that occurs at this site. What do the D amino acid and this evolutionary viable alternative share in terms of properties?

Through the orthology with human, try to find some possible phenotypic consequences of the alternative allele (e.g. through OMIM). Could there be a reason that the variant is at a relatively high frequency in this cattle population?

Polyphen

 /lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord 
 #o_acc               	 o_pos	o_aa1	o_aa2	rsid      	acc       	   pos	aa1	aa2	nt1	nt2	        prediction	            based_on	    effect	    site	  region	    PHAT	dScore	Score1	Score2	MSAv	  Nobs	 Nstruct	 Nfilt	PDB_id	PDB_pos	PDB_ch	 ident	lengthNormASA	SecStr	MapReg	  dVol	 dProp	B-fact	 H-bonds	 AveNHet	 MinDHet	 AveNInt	 MinDInt	 AveNSit	 MinDSit	Transv	CodPos	CpG	 MinDJxn	     PfamHit	  IdPmax	  IdPSNP	  IdQmin
 ENSSSCG00000001099ENSSSCT00000001195	   368	    S	    A	          	ENSSSCG00000001099ENSSSCT00000001195	   368	  S	  A	   	   	            benign	           alignment	          	        	        	        	+0.721	-1.180	-1.901	   2	    68	        	      	             	      	      	      	       	      	      	      	      	      	        	        	        	        	        	        	        	      	      	   	        	            	  12.350	  12.350	   42.63
 ENSSSCG00000001099ENSSSCT00000001195	   453	    R	    H	          	ENSSSCG00000001099ENSSSCT00000001195	   453	  R	  H	   	   	            benign	           alignment	          	        	        	        	+0.351	-2.130	-2.481	   2	    67	        	      	             	      	      	      	       	      	      	      	      	      	        	        	        	        	        	        	        	      	      	   	        	            	  33.194	  33.194	   93.76
 ENSSSCG00000001099ENSSSCT00000001195	   431	    K	    T	          	ENSSSCG00000001099ENSSSCT00000001195	   431	  K	  T	   	   	 possibly damaging	           alignment	          	        	        	        	+1.705	-1.678	-3.383	   2	    68	        	      	             	      	      	      	       	      	      	      	      	      	        	        	        	        	        	        	        	      	      	   	        	            	   2.382	        	   68.80

What goes in:

 test1033.fa
 >ENSSSCG00000001099ENSSSCT00000001195
 MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL
 FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN
 PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD
 WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN
 INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV
 ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG
 YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI
 VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD
 EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ
 TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG 
 test1033.coord
 ENSSSCG00000001099ENSSSCT00000001195 368        S A
 ENSSSCG00000001099ENSSSCT00000001195 453        R H
 ENSSSCG00000001099ENSSSCT00000001195 431        K T