Variant annotation tutorial: Difference between revisions
No edit summary |
No edit summary |
||
(83 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions | Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions | ||
== Variant annotation | == Variant annotation == | ||
make a new folder to work in, and go there: | |||
<source lang='bash'> | |||
mkdir annotation_session | |||
cd annotation_session | |||
</source> | |||
Create an Env.Var. to the communal session directory that holds some of the files you need: | |||
<source lang='bash'> | |||
SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation | |||
</source> | |||
And make sure python3 can be found: | |||
<source lang='bash'> | |||
alias python3='/home/formacion/COMUNES/IAMZ/soft/python-3.4.2/bin/python3.4' | |||
</source> | |||
=== Slicing and dicing of VCF files === | |||
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions. The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix: | |||
<source lang='bash'> | |||
tabix -h $SESSIONDIR/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz | |||
tabix -p vcf BT18.vcf.gz | |||
</source> | |||
Have a look at what is inside this file: | Have a look at what is inside this file: | ||
<source lang='bash'> | |||
gunzip -c BT18.vcf.gz | grep -v '^##' | more | |||
</source> | |||
or | or | ||
<source lang='bash'> | |||
tabix BT18.vcf.gz 18:1-10000 | more | |||
</source> | |||
Make sure you understand the basic features of the VCF. Have a look here, if you haven't done already: | |||
http://www.1000genomes.org/wiki/analysis/vcf4.0 | |||
=== Annotating VCF with rs-numbers === | |||
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file: | |||
<code> | |||
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz | |||
</code> | |||
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so: | You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so: | ||
<source lang='bash'> | |||
## will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown | |||
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail | |||
</source> | |||
Now make a new VCF file for all the variants called for chromosome 18 using this tool: | Now make a new VCF file for all the variants called for chromosome 18 using this tool: | ||
<source lang='bash'> | |||
tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz | |||
tabix -p vcf BT18_rsnumbers.vcf.gz | |||
</source> | |||
Again, inspect the results. What do you notice in the third column? Do all variants now have an rs-number annotated? | |||
=== Extracting variants using BEDtools === | === Extracting variants using BEDtools === | ||
You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily | You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily achieved by using the tabix command described above. However, sometimes you may want to extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or Python. However, BedTools provides many utilities for manipulating interval data, such as BED, VCF, or GTF/GFF formats. | ||
Say you are interested in the variants in all the genes lying between coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of genes can be found in a GTF file. The GTF file for the UMD3.1 genome build can be found at Ensembl and downloaded using the command line. | |||
<source lang='bash'> | |||
## NOTE: the compute nodes you are working on do not have internet connectivity. | |||
## This line of code is just for completeness): | |||
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz | |||
</source> | |||
The file is already downloaded and can be found here: | The file is already downloaded and can be found here: | ||
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz | |||
Have a quick look at what is in this file: | Have a quick look at what is in this file: | ||
<source lang='bash'> | |||
gunzip -c | gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more | ||
</source> | |||
Familiarise yourself with the GTF format if you have not already. For more information: | Familiarise yourself with the GTF format if you have not already. For more information: | ||
Line 53: | Line 89: | ||
You can select the appropriate region from the GTF file: | You can select the appropriate region from the GTF file: | ||
<source lang='bash'> | |||
gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf | |||
</source> | |||
Subsequently you can intersect the VCF with the GTF file that contains just the genes between coordinates 1 and 2 million: | Subsequently you can intersect the VCF with the GTF file that contains just the genes between coordinates 1 and 2 million: | ||
<source lang='bash'> | |||
bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf | |||
</source> | |||
Have a quick look at the output, and count the number of variants in this VCF (e.g. by using 'wc -l'). | Have a quick look at the output, and count the number of variants in this VCF (e.g. by using 'wc -l'). | ||
Line 67: | Line 107: | ||
=== Variant Effect Predictor === | === Variant Effect Predictor === | ||
Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use former two. | Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use the former two. | ||
VEP can be found here: | |||
<source lang='bash'> | |||
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl | |||
</source> | |||
If you invoke it without parameters you will see the following: | |||
/home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ | Usage: | ||
perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments] | |||
Basic options | |||
============= | |||
--help Display this message and quit | |||
-i | --input_file Input file | |||
-o | --output_file Output file | |||
--force_overwrite Force overwriting of output file | |||
--species [species] Species to use [default: "human"] | |||
--everything Shortcut switch to turn on commonly used options. See web | |||
documentation for details [default: off] | |||
--fork [num_forks] Use forking to improve script runtime | |||
For full option documentation see: | |||
http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html | |||
For the current session we will use the following options: | |||
--species bos_taurus | |||
-o BT18.vep # outputfile | |||
--fork 4 # use 4 threads | |||
--canonical | |||
--sift b | |||
--coding_only | |||
--no_intergenic | |||
--offline # using a cache directory | |||
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives | |||
--force_overwrite | |||
You can create a file that will provide annotation information for each SNP: | You can create a file that will provide annotation information for each SNP: | ||
<source lang='bash'> | |||
gunzip -c BT18_rsnumbers.vcf.gz | \ | |||
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \ | |||
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \ | |||
--fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite | |||
</source> | |||
Have a look inside the output file and study its content. Why are there no SNPs in this list that are intergenic? | |||
The VEP output you generated before has many advantages, one of which is that it is nicely tabular and easy to parse. In fact a demonstration of that will be covered in the 'Provean' practical. For some purposes, however, it might be more convenient to have the annotations directly in the VCF. Furthermore, you may want annotations for each and every SNP, even if e.g. they are intergenic. You could for instance do the following, omitting the '--no_intergenic' and '--coding_only' flags, and in stead adding the '--vcf' flag: | |||
<source lang='bash'> | |||
gunzip -c BT18_rsnumbers.vcf.gz | \ | |||
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \ | |||
--dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \ | |||
--fork 4 --canonical --sift b --offline --force_overwrite --vcf | |||
bgzip BT18.vep.vcf | |||
tabix -p vcf BT18.vep.vcf.gz | |||
</source> | |||
Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. | |||
For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format. | |||
http://www.1000genomes.org/wiki/analysis/vcf4.0 | |||
The description states that the 8th column is the "INFO" column. The VEP annotation has been added with the "CSQ" field tag. | |||
<pre style="white-space: pre-wrap; | |||
white-space: -moz-pre-wrap; | |||
white-space: -pre-wrap; | |||
white-space: -o-pre-wrap; | |||
word-wrap: break-word;"> | |||
18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10 | |||
</pre> | |||
The two VEP runs have generated reports in HTML format. Transfer them to your local computer and study them. | |||
You can compare one or a few examples manually by checking them at the Ensembl VEP website: | |||
http://www.ensembl.org/Bos_taurus/Tools/VEP | |||
=== A further look at the VEP output === | |||
From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it. | |||
Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the "INFO" column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure: | |||
<source lang='python'> | |||
['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', | |||
'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279', | |||
'CSQ': ['A', 'ENSBTAG00000018834', 'ENSBTAT00000025071', 'Transcript', 'missense_variant', | |||
'2132', '1984', '662', 'E/K', 'Gag/Aag', '', '', '1', 'YES', ['deleterious', '0.02']], 'EPP': '7.35324', | |||
'SRP': '31.7504', 'QR': '2421', 'SAF': '1', 'ODDS': '2.54006', 'SRR': '19', 'SRF': '49', | |||
'technology.ILLUMINA': '1', 'QA': '70', 'EPPR': '3.13803', 'SAP': '3.0103', 'RPPR': '11.1853', 'PQR': '0', | |||
'NS': '8', 'CIGAR': '1X', 'TYPE': 'snp', 'AF': '0.125', 'ABP': '0', 'LEN': '1', 'AC': '0', 'MEANALT': '1', | |||
'PAIRED': '1', 'AO': '2', 'AN': '16', 'MQM': '60', 'RUN': '1', 'PRO': '0', 'RPP': '3.0103'}, ['GT:DP:RO:QR:AO:QA:GL',[ | |||
'0/0:7:7:263:0:0:0,-1.92588,-10', '0/0:7:7:238:0:0:0,-1.92376,-10', '0/0:9:7:266:2:70:-3.90732,0,-10', | |||
'0/0:10:10:362:0:0:0,-2.73805,-10', '0/0:10:10:345:0:0:0,-2.73789,-10', '0/0:8:8:287:0:0:0,-2.19662,-10', | |||
'0/0:5:5:178:0:0:0,-1.38409,-10', '0/0:14:14:482:0:0:0,-3.82032,-10']]] | |||
</source> | |||
When dealing with deleterious alleles, we might be interested to learn how frequent that allele is in the population we are studying. Our expectation is that genuinly deleterious alleles should be relatively rare in a population. If not they might hint at either a sign of inbreeding, or, perhaps, something that might be deleterious in the wild but advantageous in the context of domestication/breeding. The VCF file holds the (reference) allele frequency information. The 8th column of the VCF file in fact is a bunch of fields concatenated with the ';' delimitor. One of the fields has the tag 'AF=', where AF stands for (Alternative) Allele Frequency. We can expand the number of tab-delimited fields easily by replacing ';' for '\t', and then look for the 11th column. We need to remove the 'AF=' tag though, but if we do that we can further select for instance for allele frequencies between 0 and 15%. | |||
The next shell-oneliner will give you the variants with rare (<15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves). | |||
<source lang='bash'> | |||
gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \ | |||
awk '$11<0.15&&$11>0' | wc -l | |||
</source> | |||
Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields. | |||
The Python script will take VCF input from a stream: | |||
<source lang='bash'> | |||
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more | |||
</source> | |||
Notice that the script itself does in fact take an option: 'vep'. This is done because it can also work for the snpEff annotation, which, highly inconveniently, is slightly different. | |||
and for each line provide output similar to this: | |||
#chrom coord AF Gene consequence sift_prediction sift_score | |||
18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02 | |||
Obviously, the functionality of the Python script is limited. However, it would be quite easy to provide additional statistics and output options based on the current script, because it makes the data structure so explicit. This means that the Python script provides you with a means for expansion of functionality, while that will be far more tricky using shell commands solely. | |||
Now we make a file that has the gene names, and counts, for all genes that have more than one deleterious allele: | |||
<source lang='bash'> | |||
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \ | |||
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt | |||
</source> | |||
Take two or three genes that have the highest number of deleterious variants. Manually look them up at the Ensembl website | Take two or three genes that have the highest number of deleterious variants. Manually look them up at the Ensembl website. | ||
http://www.ensembl.org/Bos_taurus/Info/Index | |||
Check the 'orthologous' status. What do you observe? How would you interpret the validity of the gene annotation and/or the likelihood of off-site mapping of reads? | |||
This provides us with a strategy to filter a bit more rigorously for genes that may actually contain genuine deleterious alleles. Ideally we would a priory filter for genes that have proper one-to-one orthologous relations with other mammalian species. Time and lack of direct internet connection makes that unfeasible for this practical. However, genes that have only one deleterious variant segregating in the study population at a low frequency would certainly be better candidates than genes that have multiple deleterious variants, as the above example indicates. To filter you can do the following: | This provides us with a strategy to filter a bit more rigorously for genes that may actually contain genuine deleterious alleles. Ideally we would a priory filter for genes that have proper one-to-one orthologous relations with other mammalian species. Time and lack of direct internet connection makes that unfeasible for this practical. However, genes that have only one deleterious variant segregating in the study population at a low frequency would certainly be better candidates than genes that have multiple deleterious variants, as the above example indicates. To filter you can do the following: | ||
<source lang='bash'> | |||
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \ | |||
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt | |||
</source> | |||
How many genes are there that meet this condition? | How many genes are there that meet this condition? | ||
You can now easily make tables for number of different consequences as annotated by both these annotation tools: | |||
<source lang='bash'> | |||
## for VEP results: | |||
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | cut -f5 | sort | uniq -c | |||
</source> | |||
VEP also made a report in HTML format. Transfer it to your local computer and view it, if you haven't done already. How do your 'hacky' results compare to the report VEP made? | |||
=== snpEff === | === snpEff === | ||
java -Xmx4G -jar $snpEff -dataDir ~/snpEff/data -v UMD3.1.78 | Another popular variant annotation tool is 'snpEff'. Among the advantages over VEP is that it is applicable to a wider range of species. Furthermore, it is blazingly fast. For chromosome 18, we will also use this tool: | ||
snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do: | |||
<source lang='bash'> | |||
echo $snpEff | |||
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar | |||
</source> | |||
The snpEff program can be used like this: | |||
<source lang='bash'> | |||
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf | |||
bgzip BT18_snpEff.vcf | |||
tabix -p vcf BT18_snpEff.vcf.gz | |||
</source> | |||
The options used here are: | |||
-dataDir ~/snpEff/data # annotation data goes here | |||
-v UMD3.1.78 # Genome / version | |||
BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped | |||
>testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>' | |||
Study the outcome. Try to filter out 'intergenic' SNPs and other things that are perhaps a bit less relevant (use your skills in 'grep-ping', etc.). What are some of the differences in how the variants are annotated, compared to VEP? | |||
You will notice that the variants that have an effect are labelled 'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How many are there? | |||
You can now again easily make tables for number of different consequences as annotated by both these annotation tools: | |||
== | <source lang='bash'> | ||
## for snpEff results: | |||
gunzip -c BT18_snpEff.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c | |||
</source> | |||
Like VEP, snpEff makes a report in HTML format. Transfer it to your local computer and study it. How do your 'hacky' results compare to the report snpEff made? | |||
Bonus question: | |||
Compare the results by selecting the 'MODERATE' and "HIGH" variants, and compare that with the 'deleterious' variants as annotated (through SIFT) in VEP. You can use, e.g. vcf-compare. | |||
==== If time permits: a human example ==== | |||
Of all species, human so far has the best annotation information for variation (or anything else for that matter). The reason, obviously, is the clinical relevance. In the coming years, increased knowledge on regulatory sequences, e.g. through the ENCODE project, will further increase the knowledge on relevance of non-exonic variation. Again, the human/clinical genomics community will lead the way here. The domestic animal genomics community will follow and profit from those insights. | |||
Currently, there are already important resources of information on clinically relevant variations. You can find more information here: | |||
http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/ | |||
Among the variation sets available is the 'common&clinical' - a set of variants that has clinical relevance, and a relatively high occurrence in human populations (i.e. not the de novo mutations often found to be the reason for genetic disorders). | |||
We can annotate these SNPs using snpEff: | |||
<source lang='bash'> | |||
java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 \ | |||
$SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf | |||
</source> | |||
Copy the HTML report to your local computer and study it. A few things will be apparent that are different to the cattle example. You will see consequences that you have not seen in cattle, notably "TF_binding_site_variant". Another is that, because this is a non-random sample of variants, its distribution is non-random as well. Exonic variants, most notably, are highly overrepresented. | |||
== Inferring function == | |||
=== Provean === | === Provean === | ||
Provean uses an approach where a protein sequence is compared (Psiblast) to all known protein sequences (so called 'non-redundant database'). Subsequent clustering defines a group of 'similar', and implicitly homologous sequences. An example of in- and output of a pig sequence: | |||
What goes in: | |||
<code> | |||
Protein sequence (Fasta): | |||
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding | >ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding | ||
MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP | MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP | ||
Line 116: | Line 326: | ||
PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR | PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR | ||
SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK | SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK | ||
</code> | |||
and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele): | |||
235,G,E | 235,G,E | ||
5,V,A | 5,V,A | ||
Line 140: | Line 352: | ||
21,G,D | 21,G,D | ||
Output looks like this: | |||
<code> | |||
## PROVEAN scores ## | ## PROVEAN scores ## | ||
# VARIATION SCORE | # VARIATION SCORE | ||
Line 164: | Line 378: | ||
186,I,V 0.220 | 186,I,V 0.220 | ||
21,G,D -6.014 | 21,G,D -6.014 | ||
</code> | |||
At the end of the VEP exercise we created a filtered list of genes that have only a single deleterious variant annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070. | We will do something similar, although with a somewhat less complex example, for the cow data. At the end of the VEP exercise we created a filtered list of genes that have only a single deleterious variant annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070. | ||
http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293 | http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293 | ||
What we need for the Provean analysis is 1) the protein sequence of the gene (or rather, the transcript), and 2) a file with the variation in the same format as listed above. Obviously, if we want to do this for many genes, we will not parse the relevant information by hand. We will make a little pipeline for that. A multifasta file that contains all protein sequences for cow can be found at Ensembl. It is located here on the machine you are working on: | |||
<source lang='bash'> | |||
## define the Env.Var. that holds the gene id: | |||
GENE=ENSBTAG00000007070 | |||
## search the protein file for the gene, and parse out the corresponding protein name | |||
## Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences | |||
PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'` | |||
## use the first VEP annotation you did for this. Easiest to work with because completely tabular. | |||
## Note: in this case there is only a single missense variant in the gene. This won't work for all genes. | |||
## Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','. | |||
cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var | |||
## retrieve the protein sequence from the multifasta protein sequence file. | |||
## Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known | |||
faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa | |||
</source> | |||
This should result in two files that are produced: | |||
ENSBTAP00000009293.fa | |||
>ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding | |||
MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA | |||
AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH | |||
DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL | |||
LPDIRGVPGTEPEQDPGPRA | |||
ENSBTAP00000009293.var | |||
26,D,N | |||
Finally, the actual Provean analysis is done with this command (this may take a few minutes to run): | |||
<source lang='bash'> | |||
provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error | |||
</source> | |||
Study the result. What is the verdict on the alternative allele? (D-->N) | |||
Provean does many things 'under the hood'. When you follow the process (e.g. using the 'top' command in a second terminal screen), you will notice that the majority of time is spent on Psiblast, and that it takes quite a bit of the total memory of the node you are working on (typically >10GB of ram). The similar sequences that Provean fished out of the nr-database and then clustered are written to a 'supporting set' (because you asked Provean to save the supporting set by using the --save_supporting flag). Briefly investigate what is in the supporting set (extension .sss.fasta). | |||
Transfer the alignment to your local environment and view (seaview | The supporting sequences are not aligned. Chances are that among the supporting sequences there is a cow sequence. Bit cumbersome to figure out which is which (although doable using blastdb tools - we won't go into that here). Easiest is to add the protein sequence you have used for the Provean analysis and add it to the supporting set fasta file (using the '>>', note that you need TWO of these to add to a file, otherwise the file will get overwritten if you use only one!). | ||
<source lang='bash'> | |||
## add protein sequence to supporting set fasta: | |||
cat $PROT.fa >>$PROT.sss.fasta | |||
## align all protein sequences using the multiple sequence aligner mafft: | |||
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out | |||
</source> | |||
Transfer the alignment to your local environment and view. If installed on your local computer you could use e.g. Seaview. If you are on a Debian-based system (e.g. Ubuntu) and you have administrative rights, you can simply install it like this: | |||
<source lang='bash'> | |||
sudo apt-get install seaview | |||
</source> | |||
Alternatively, you can view the alignment through this online tool: | |||
http://toolkit.tuebingen.mpg.de/alnviz | |||
What do you observe for the 26th amino-acid of the cow protein sequence? In this 'implicit evolutionary context' is there another alternative amino acid residue possible? | |||
Through the orthology with human, try to find some possible phenotypic consequences of the alternative allele (e.g. through OMIM). Could there be a reason that the variant is at a relatively high frequency in this cattle population? | Through the orthology with human, try to find some possible phenotypic consequences of the alternative allele (e.g. through OMIM). Could there be a reason that the variant is at a relatively high frequency in this cattle population? | ||
Line 186: | Line 449: | ||
=== Polyphen === | === Polyphen === | ||
The Polyphen analysis is quite similar in what you need compared to Provean: a fasta file containing the protein sequence, and a file that lists coordinates and variants: | |||
What goes in: | What goes in: | ||
in.fa | |||
>ENSSSCG00000001099ENSSSCT00000001195 | >ENSSSCG00000001099ENSSSCT00000001195 | ||
MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL | MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL | ||
Line 206: | Line 465: | ||
TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG | TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG | ||
in.coord | |||
ENSSSCG00000001099ENSSSCT00000001195 368 S A | ENSSSCG00000001099ENSSSCT00000001195 368 S A | ||
ENSSSCG00000001099ENSSSCT00000001195 453 R H | ENSSSCG00000001099ENSSSCT00000001195 453 R H | ||
ENSSSCG00000001099ENSSSCT00000001195 431 K T | ENSSSCG00000001099ENSSSCT00000001195 431 K T | ||
<source lang='bash'> | |||
run_pph.pl -s in.fa in.coord | |||
</source> | |||
What comes out: | |||
/lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord | |||
#o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin | |||
ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63 | |||
ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76 | |||
ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80 | |||
The fasta and coordinate files can be made similar to what you've done for Provean. Polyphen is much more human-oriented than Provean (which is more species agnostic) and has many more features, not all of which work very well for non-human species. Since Polyphen won't run properly in your present infrastructure we will not practice it actively. |
Latest revision as of 11:25, 22 January 2016
Code and tutorials for CIHEAM course. Variation annotation and function prediction sessions
Variant annotation
make a new folder to work in, and go there:
<source lang='bash'> mkdir annotation_session cd annotation_session </source>
Create an Env.Var. to the communal session directory that holds some of the files you need: <source lang='bash'> SESSIONDIR=/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation </source> And make sure python3 can be found: <source lang='bash'> alias python3='/home/formacion/COMUNES/IAMZ/soft/python-3.4.2/bin/python3.4' </source>
Slicing and dicing of VCF files
VCF files can get very big. Being able to manipulate them and extract relevant information is important. In practice, flat text files such as VCF are currently the basis for further analysis - see the Imputation and Population practical sessions. The first task involves selecting all called variants of chromosome 18, bgzip-ping them on-the-fly, and then indexing that file with tabix:
<source lang='bash'> tabix -h $SESSIONDIR/allbt.vcf.gz 18 | bgzip >BT18.vcf.gz tabix -p vcf BT18.vcf.gz </source>
Have a look at what is inside this file:
<source lang='bash'> gunzip -c BT18.vcf.gz | grep -v '^##' | more </source> or <source lang='bash'> tabix BT18.vcf.gz 18:1-10000 | more </source>
Make sure you understand the basic features of the VCF. Have a look here, if you haven't done already:
http://www.1000genomes.org/wiki/analysis/vcf4.0
Annotating VCF with rs-numbers
The VCF file you are working does not yet have the dbSNP identifiers (rs-numbers) annotated. The rs-numbers of all currently known variants as databased in dbSNP can be found in this file:
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/BT_incl_cons.18.vcf.gz
You can use vcftools to do many things related to VCF files. This includes annotating the VCF based on other interval-file formatted input, including other VCF files. For instance, you can annotate an interval of the VCF on the fly, by selecting that interval using tabix, and then piping it to vcf-annotate, like so:
<source lang='bash'>
- will take an interval of 1000bp and annotate the rs numbers; piped into 'tail' means last 10 lines are shown
tabix -h BT18.vcf.gz 18:100000-101000 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | tail </source>
Now make a new VCF file for all the variants called for chromosome 18 using this tool: <source lang='bash'> tabix -h BT18.vcf.gz 18 | vcf-annotate -a $SESSIONDIR/BT_incl_cons.18.vcf.gz -c CHROM,FROM,ID | bgzip >BT18_rsnumbers.vcf.gz tabix -p vcf BT18_rsnumbers.vcf.gz </source>
Again, inspect the results. What do you notice in the third column? Do all variants now have an rs-number annotated?
Extracting variants using BEDtools
You will very often want to extract variants based on interval files, e.g. a certain region on a chromosome. This can be easily achieved by using the tabix command described above. However, sometimes you may want to extract a substantial number of intervals. You could use tabix in a loop, either as a shell-script or in a scripting language like Perl or Python. However, BedTools provides many utilities for manipulating interval data, such as BED, VCF, or GTF/GFF formats.
Say you are interested in the variants in all the genes lying between coordinates 1,000,000 and 2,000,000 on chromosome 18. The annotation of genes can be found in a GTF file. The GTF file for the UMD3.1 genome build can be found at Ensembl and downloaded using the command line.
<source lang='bash'>
- NOTE: the compute nodes you are working on do not have internet connectivity.
- This line of code is just for completeness):
wget ftp://ftp.ensembl.org/pub/release-78/gtf/bos_taurus/Bos_taurus.UMD3.1.78.gtf.gz </source>
The file is already downloaded and can be found here:
/home/formacion/COMUNES/IAMZ/data/CIHEAM/sessions/variant_annotation/Bos_taurus.UMD3.1.78.gtf.gz
Have a quick look at what is in this file: <source lang='bash'> gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | more </source>
Familiarise yourself with the GTF format if you have not already. For more information:
http://www.ensembl.org/info/website/upload/gff.html
You can select the appropriate region from the GTF file:
<source lang='bash'> gunzip -c $SESSIONDIR/Bos_taurus.UMD3.1.78.gtf.gz | awk '$3=="gene"' | awk '$1==18&&$4>1000000&&$4<2000000' >mygenes.gtf </source>
Subsequently you can intersect the VCF with the GTF file that contains just the genes between coordinates 1 and 2 million:
<source lang='bash'> bedtools intersect -a BT18_rsnumbers.vcf.gz -b mygenes.gtf >intersection_genes.vcf </source>
Have a quick look at the output, and count the number of variants in this VCF (e.g. by using 'wc -l').
Now do the same, but select only the exonic variants in the same interval, from 1 to 2 million.
How many variants do you have now?
Variant Effect Predictor
Various tools exist to make more precise annotations than what you can achieve using Bedtools and a GTF. Among them are Variant Effect Predictor (VEP), snpEff, and Annovar. For the current practical we will use the former two.
VEP can be found here: <source lang='bash'> perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl </source>
If you invoke it without parameters you will see the following:
Usage: perl variant_effect_predictor.pl [--cache|--offline|--database] [arguments] Basic options ============= --help Display this message and quit -i | --input_file Input file -o | --output_file Output file --force_overwrite Force overwriting of output file --species [species] Species to use [default: "human"] --everything Shortcut switch to turn on commonly used options. See web documentation for details [default: off] --fork [num_forks] Use forking to improve script runtime For full option documentation see: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
For the current session we will use the following options:
--species bos_taurus -o BT18.vep # outputfile --fork 4 # use 4 threads --canonical --sift b --coding_only --no_intergenic --offline # using a cache directory --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ # where the cache dir lives --force_overwrite
You can create a file that will provide annotation information for each SNP: <source lang='bash'> gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \ --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep \ --fork 4 --canonical --sift b --coding_only --no_intergenic --offline --force_overwrite
</source>
Have a look inside the output file and study its content. Why are there no SNPs in this list that are intergenic?
The VEP output you generated before has many advantages, one of which is that it is nicely tabular and easy to parse. In fact a demonstration of that will be covered in the 'Provean' practical. For some purposes, however, it might be more convenient to have the annotations directly in the VCF. Furthermore, you may want annotations for each and every SNP, even if e.g. they are intergenic. You could for instance do the following, omitting the '--no_intergenic' and '--coding_only' flags, and in stead adding the '--vcf' flag: <source lang='bash'> gunzip -c BT18_rsnumbers.vcf.gz | \
perl /home/formacion/COMUNES/IAMZ/soft/ensembl-tools-release-78/scripts/variant_effect_predictor/variant_effect_predictor.pl \ --dir /home/formacion/COMUNES/IAMZ/data/CIHEAM/ReferenceGenome/VEP/ --species bos_taurus -o BT18.vep.vcf \ --fork 4 --canonical --sift b --offline --force_overwrite --vcf
bgzip BT18.vep.vcf tabix -p vcf BT18.vep.vcf.gz </source>
Again, study and interpret your outcome by looking inside the file 'BT18.vep.vcf.gz'. Specifically, note WHERE the annotation ended up. For this you have to understand the VCF format in detail. If not done already, further familiarise yourself with the VCF format.
http://www.1000genomes.org/wiki/analysis/vcf4.0
The description states that the 8th column is the "INFO" column. The VEP annotation has been added with the "CSQ" field tag.
18 53205918 . G A 0.329662 . AB=0;ABP=0;AC=0;AF=0.125;AN=16;AO=2;CIGAR=1X;DP=70;DPB=70;DPRA=1.03279;EPP=7.35324;EPPR=3.13803;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=8;NUMALT=1;ODDS=2.54006;PAIRED=1;PAIREDR=0.985294;PAO=0;PQA=0;PQR=0;PRO=0;QA=70;QR=2421;RO=68;RPP=3.0103;RPPR=11.1853;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=49;SRP=31.7504;SRR=19;TYPE=snp;technology.ILLUMINA=1;CSQ=A|ENSBTAG00000018834|ENSBTAT00000025071|Transcript|missense_variant|2132|1984|662|E/K|Gag/Aag|||1|YES|deleterious(0.02) GT:DP:RO:QR:AO:QA:GL 0/0:7:7:263:0:0:0,-1.92588,-10 0/0:7:7:238:0:0:0,-1.92376,-10 0/0:9:7:266:2:70:-3.90732,0,-10 0/0:10:10:362:0:0:0,-2.73805,-10 0/0:10:10:345:0:0:0,-2.73789,-10 0/0:8:8:287:0:0:0,-2.19662,-10 0/0:5:5:178:0:0:0,-1.38409,-10 0/0:14:14:482:0:0:0,-3.82032,-10
The two VEP runs have generated reports in HTML format. Transfer them to your local computer and study them.
You can compare one or a few examples manually by checking them at the Ensembl VEP website:
http://www.ensembl.org/Bos_taurus/Tools/VEP
A further look at the VEP output
From studying the HTML reports it should be clear that the information you can find there is necessarily limited and general. It is therefore important to acquire skills to generate further analyses based on the annotation that meet your specific research needs. In this section we will delve a little bit deeper in the structure of the annotations and how to extract information from it.
Although the VCF now looks even more complex than before, there is a good reason to add the VEP annotation data the way it was done. The 8th column of the VCF (the "INFO" column) provides a lot of information, each field separated by a ';'. In addition, each field contains a field name and a value, delimited by a '='. The VEP annotation has its own field name, or tag: 'CSQ'. Within this field, the values are again structured, but this time delimited by a '|'. What we in fact have now is a nested data structure: <source lang='python'>
['18', '53205918', '.', 'G', 'A', '0.329662', '.', {'PAIREDR': '0.985294', 'DPB': '70', 'MQMR': '60', 'PQA': '0', 'PAO': '0', 'SAR': '1', 'AB': '0', 'GTI': '0', 'NUMALT': '1', 'RO': '68', 'DP': '70', 'DPRA': '1.03279', 'CSQ': ['A', 'ENSBTAG00000018834', 'ENSBTAT00000025071', 'Transcript', 'missense_variant', '2132', '1984', '662', 'E/K', 'Gag/Aag', , , '1', 'YES', ['deleterious', '0.02']], 'EPP': '7.35324', 'SRP': '31.7504', 'QR': '2421', 'SAF': '1', 'ODDS': '2.54006', 'SRR': '19', 'SRF': '49', 'technology.ILLUMINA': '1', 'QA': '70', 'EPPR': '3.13803', 'SAP': '3.0103', 'RPPR': '11.1853', 'PQR': '0', 'NS': '8', 'CIGAR': '1X', 'TYPE': 'snp', 'AF': '0.125', 'ABP': '0', 'LEN': '1', 'AC': '0', 'MEANALT': '1', 'PAIRED': '1', 'AO': '2', 'AN': '16', 'MQM': '60', 'RUN': '1', 'PRO': '0', 'RPP': '3.0103'}, ['GT:DP:RO:QR:AO:QA:GL',[ '0/0:7:7:263:0:0:0,-1.92588,-10', '0/0:7:7:238:0:0:0,-1.92376,-10', '0/0:9:7:266:2:70:-3.90732,0,-10', '0/0:10:10:362:0:0:0,-2.73805,-10', '0/0:10:10:345:0:0:0,-2.73789,-10', '0/0:8:8:287:0:0:0,-2.19662,-10', '0/0:5:5:178:0:0:0,-1.38409,-10', '0/0:14:14:482:0:0:0,-3.82032,-10']]]
</source>
When dealing with deleterious alleles, we might be interested to learn how frequent that allele is in the population we are studying. Our expectation is that genuinly deleterious alleles should be relatively rare in a population. If not they might hint at either a sign of inbreeding, or, perhaps, something that might be deleterious in the wild but advantageous in the context of domestication/breeding. The VCF file holds the (reference) allele frequency information. The 8th column of the VCF file in fact is a bunch of fields concatenated with the ';' delimitor. One of the fields has the tag 'AF=', where AF stands for (Alternative) Allele Frequency. We can expand the number of tab-delimited fields easily by replacing ';' for '\t', and then look for the 11th column. We need to remove the 'AF=' tag though, but if we do that we can further select for instance for allele frequencies between 0 and 15%.
The next shell-oneliner will give you the variants with rare (<15%) deleterious alleles and count them (by omitting the last part you can retrieve the relevant variations themselves). <source lang='bash'> gunzip -c BT18.vep.vcf.gz | grep deleterious | sed 's/;/\t/g' | awk '{gsub("AF=","",$11); print}' | \
awk '$11<0.15&&$11>0' | wc -l
</source>
Analysing the annotations a bit further, you could discover that there are quite a few variants that have high deleterious allele frequencies. Not only do some putative deleterious alleles occur at high frequency, but there is also a number of genes that has quite a few deleterious alleles annotated in the current population sample. Say you would like to make a table that contains all genes that have more than one putative deleterious variant. We would need to isolate both the AF field and the CSQ field, and from the latter isolate the 'gene' field. While this is possible with a fairly simple shell scripting hack, it does require an awful lot of counting to make sure you end up with the right fields. Ideally you would like to have more control by putting the relevant data in a complex data structure. Languages such as Perl and particularly Python allow far more explicit syntax to achieve that, so we will swap out some complex string of shell-commands by a single Python script. That script does only one thing: takes lines from the VCF file, put all fields in a complex data structure (as seen above) and, in a structured way, prints a selection of fields.
The Python script will take VCF input from a stream: <source lang='bash'> gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | more </source> Notice that the script itself does in fact take an option: 'vep'. This is done because it can also work for the snpEff annotation, which, highly inconveniently, is slightly different.
and for each line provide output similar to this:
#chrom coord AF Gene consequence sift_prediction sift_score 18 53205918 0.125 ENSBTAG00000018834 missense_variant deleterious 0.02
Obviously, the functionality of the Python script is limited. However, it would be quite easy to provide additional statistics and output options based on the current script, because it makes the data structure so explicit. This means that the Python script provides you with a means for expansion of functionality, while that will be far more tricky using shell commands solely.
Now we make a file that has the gene names, and counts, for all genes that have more than one deleterious allele: <source lang='bash'> gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1>1' >genes_with_more_than1.txt
</source>
Take two or three genes that have the highest number of deleterious variants. Manually look them up at the Ensembl website.
http://www.ensembl.org/Bos_taurus/Info/Index
Check the 'orthologous' status. What do you observe? How would you interpret the validity of the gene annotation and/or the likelihood of off-site mapping of reads?
This provides us with a strategy to filter a bit more rigorously for genes that may actually contain genuine deleterious alleles. Ideally we would a priory filter for genes that have proper one-to-one orthologous relations with other mammalian species. Time and lack of direct internet connection makes that unfeasible for this practical. However, genes that have only one deleterious variant segregating in the study population at a low frequency would certainly be better candidates than genes that have multiple deleterious variants, as the above example indicates. To filter you can do the following: <source lang='bash'> gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | grep deleterious | awk '$3>0&&$3<1' | \
cut -f4 | sort | uniq -c | sed 's/^ \+//' | awk '$1==1' | awk '{print $2}' >genes_with1.txt
</source>
How many genes are there that meet this condition?
You can now easily make tables for number of different consequences as annotated by both these annotation tools: <source lang='bash'>
- for VEP results:
gunzip -c BT18.vep.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m vep | cut -f5 | sort | uniq -c </source>
VEP also made a report in HTML format. Transfer it to your local computer and view it, if you haven't done already. How do your 'hacky' results compare to the report VEP made?
snpEff
Another popular variant annotation tool is 'snpEff'. Among the advantages over VEP is that it is applicable to a wider range of species. Furthermore, it is blazingly fast. For chromosome 18, we will also use this tool:
snpEff is Java-based. The path to the snpEff.jar Java archive is available as an environment variable. Do:
<source lang='bash'> echo $snpEff
/home/formacion/COMUNES/IAMZ/soft/snpEff/snpEff.jar
</source>
The snpEff program can be used like this:
<source lang='bash'> java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v UMD3.1.78 BT18_rsnumbers.vcf.gz >BT18_snpEff.vcf bgzip BT18_snpEff.vcf tabix -p vcf BT18_snpEff.vcf.gz </source>
The options used here are:
-dataDir ~/snpEff/data # annotation data goes here -v UMD3.1.78 # Genome / version BT18_rsnumbers.vcf.gz # requires an input file, can be gzipped >testresult_snpEff_Bt.txt # results go to STDOUT and are captured in a file by '>'
Study the outcome. Try to filter out 'intergenic' SNPs and other things that are perhaps a bit less relevant (use your skills in 'grep-ping', etc.). What are some of the differences in how the variants are annotated, compared to VEP?
You will notice that the variants that have an effect are labelled 'MODERATE' or 'HIGH'. Try to filter out the ones that are 'HIGH'. How many are there?
You can now again easily make tables for number of different consequences as annotated by both these annotation tools:
<source lang='bash'>
- for snpEff results:
gunzip -c BT18_snpEff.vcf.gz | python3 $SESSIONDIR/print_fields_from_vcf.py -m snpeff | cut -f5 | sort | uniq -c </source>
Like VEP, snpEff makes a report in HTML format. Transfer it to your local computer and study it. How do your 'hacky' results compare to the report snpEff made?
Bonus question: Compare the results by selecting the 'MODERATE' and "HIGH" variants, and compare that with the 'deleterious' variants as annotated (through SIFT) in VEP. You can use, e.g. vcf-compare.
If time permits: a human example
Of all species, human so far has the best annotation information for variation (or anything else for that matter). The reason, obviously, is the clinical relevance. In the coming years, increased knowledge on regulatory sequences, e.g. through the ENCODE project, will further increase the knowledge on relevance of non-exonic variation. Again, the human/clinical genomics community will lead the way here. The domestic animal genomics community will follow and profit from those insights.
Currently, there are already important resources of information on clinically relevant variations. You can find more information here:
http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
Among the variation sets available is the 'common&clinical' - a set of variants that has clinical relevance, and a relatively high occurrence in human populations (i.e. not the de novo mutations often found to be the reason for genetic disorders).
We can annotate these SNPs using snpEff:
<source lang='bash'> java -Xmx4G -jar $snpEff -dataDir $SESSIONDIR/snpEff/data -v GRCh38.78 \
$SESSIONDIR/common_and_clinical.vcf >common_clinical_snpEff.vcf
</source>
Copy the HTML report to your local computer and study it. A few things will be apparent that are different to the cattle example. You will see consequences that you have not seen in cattle, notably "TF_binding_site_variant". Another is that, because this is a non-random sample of variants, its distribution is non-random as well. Exonic variants, most notably, are highly overrepresented.
Inferring function
Provean
Provean uses an approach where a protein sequence is compared (Psiblast) to all known protein sequences (so called 'non-redundant database'). Subsequent clustering defines a group of 'similar', and implicitly homologous sequences. An example of in- and output of a pig sequence:
What goes in:
Protein sequence (Fasta):
>ENSSSCP00000018263 pep:novel chromosome:Sscrofa10.2:12:6621092:6624938:1 gene:ENSSSCG00000017236 transcript:ENSSSCT00000018765 gene_biotype:protein_coding t transcript_biotype:protein_coding
MTPRVGAVWLPSALLLLRVPGCLSLSGPPTAMGTKGGSLSVQCRYEEEYIDDKKYWDKSP
CFLSWKHIVETTESAREVRRGRVSIRDDPANLTFTVTLERLTEEDAGTYCCGITAQFSVD
PTHEVEVVVFPALGTSRPPSMPGPPTTLPATTWSFVSERETMANNLGKGPASQDPGQHPR
SKHPSIRLLLLVFLEVPLFLGMLGAVLWVHRPLRSSESRSVAMDPVPGNTAPSAGWK
and variation, with coordinates pertaining the above protein sequences, the second column the reference amino acid, and the third/last column the alternative amino acid (alternative allele):
235,G,E 5,V,A 22,C,Y 34,T,I 51,D,N 53,K,N 59,S,Y 61,C,R 64,S,L 67,H,P 68,I,T 75,A,V 108,T,K 115,A,T 124,E,D 130,F,Y 133,L,P 142,P,A 155,F,I 158,E,G 186,I,V 21,G,D
Output looks like this:
## PROVEAN scores ##
# VARIATION SCORE
235,G,E 0.076
5,V,A 0.287
22,C,Y -8.028
34,T,I -1.932
51,D,N -1.613
53,K,N 1.565
59,S,Y -2.140
61,C,R -5.826
64,S,L -0.437
67,H,P -0.511
68,I,T -2.664
75,A,V -1.061
108,T,K -4.051
115,A,T 1.557
124,E,D -1.587
130,F,Y -0.983
133,L,P 3.203
142,P,A -2.058
155,F,I 0.502
158,E,G -0.077
186,I,V 0.220
21,G,D -6.014
We will do something similar, although with a somewhat less complex example, for the cow data. At the end of the VEP exercise we created a filtered list of genes that have only a single deleterious variant annotated. Among these genes was BLOC1S3 gene, ENSBTAG00000007070.
http://www.ensembl.org/Bos_taurus/Gene/Summary?db=core;g=ENSBTAG00000007070;r=18:53231599-53232201;t=ENSBTAT00000009293
What we need for the Provean analysis is 1) the protein sequence of the gene (or rather, the transcript), and 2) a file with the variation in the same format as listed above. Obviously, if we want to do this for many genes, we will not parse the relevant information by hand. We will make a little pipeline for that. A multifasta file that contains all protein sequences for cow can be found at Ensembl. It is located here on the machine you are working on:
<source lang='bash'>
- define the Env.Var. that holds the gene id:
GENE=ENSBTAG00000007070
- search the protein file for the gene, and parse out the corresponding protein name
- Note: ideally you do this by transcript, as a gene may be represented by multiple transcripts/protein sequences
PROT=`cat $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa | grep $GENE | awk '{print $1}' | sed 's/>//'`
- use the first VEP annotation you did for this. Easiest to work with because completely tabular.
- Note: in this case there is only a single missense variant in the gene. This won't work for all genes.
- Note: the 'sed' statement replaces the '/', as found in the VEP annotation, with a ','.
cat BT18.vep | grep $GENE | grep missense | awk '{print $10","$11}' | sed 's/\//,/' >$PROT.var
- retrieve the protein sequence from the multifasta protein sequence file.
- Note: the 'faOneRecord program is part of the so called 'Blat' suite, of which Blat is the best known
faOneRecord $SESSIONDIR/Bos_taurus.UMD3.1.pep.all.fa $PROT >$PROT.fa </source>
This should result in two files that are produced:
ENSBTAP00000009293.fa
>ENSBTAP00000009293 pep:known chromosome:UMD3.1:18:53231599:53232201:1 gene:ENSBTAG00000007070 transcript:ENSBTAT00000009293 gene_biotype:protein_coding transcript_biotype:protein_coding MESQSRRRRPLRRPETLVQGEAAESDSDLSASSSEEEELYLGPSGPTRGRPTGLRVAGEA AETDSDPEPEPKAAPRDLPPLVVQRETAGEAWAEEEAPAPAPARSLLQLRLAESQARLDH DVAAAVSGVYRRAGRDVAALAGRLAAAQAAGLAAAHSVRLARGDLCALAERLDIVASCRL LPDIRGVPGTEPEQDPGPRA
ENSBTAP00000009293.var
26,D,N
Finally, the actual Provean analysis is done with this command (this may take a few minutes to run):
<source lang='bash'> provean.sh --num_threads 8 -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error </source>
Study the result. What is the verdict on the alternative allele? (D-->N)
Provean does many things 'under the hood'. When you follow the process (e.g. using the 'top' command in a second terminal screen), you will notice that the majority of time is spent on Psiblast, and that it takes quite a bit of the total memory of the node you are working on (typically >10GB of ram). The similar sequences that Provean fished out of the nr-database and then clustered are written to a 'supporting set' (because you asked Provean to save the supporting set by using the --save_supporting flag). Briefly investigate what is in the supporting set (extension .sss.fasta).
The supporting sequences are not aligned. Chances are that among the supporting sequences there is a cow sequence. Bit cumbersome to figure out which is which (although doable using blastdb tools - we won't go into that here). Easiest is to add the protein sequence you have used for the Provean analysis and add it to the supporting set fasta file (using the '>>', note that you need TWO of these to add to a file, otherwise the file will get overwritten if you use only one!).
<source lang='bash'>
- add protein sequence to supporting set fasta:
cat $PROT.fa >>$PROT.sss.fasta
- align all protein sequences using the multiple sequence aligner mafft:
mafft --auto $PROT.sss.fasta >$PROT.sss.fasta.out </source>
Transfer the alignment to your local environment and view. If installed on your local computer you could use e.g. Seaview. If you are on a Debian-based system (e.g. Ubuntu) and you have administrative rights, you can simply install it like this: <source lang='bash'> sudo apt-get install seaview </source>
Alternatively, you can view the alignment through this online tool:
http://toolkit.tuebingen.mpg.de/alnviz
What do you observe for the 26th amino-acid of the cow protein sequence? In this 'implicit evolutionary context' is there another alternative amino acid residue possible?
Through the orthology with human, try to find some possible phenotypic consequences of the alternative allele (e.g. through OMIM). Could there be a reason that the variant is at a relatively high frequency in this cattle population?
Polyphen
The Polyphen analysis is quite similar in what you need compared to Provean: a fasta file containing the protein sequence, and a file that lists coordinates and variants:
What goes in: in.fa
>ENSSSCG00000001099ENSSSCT00000001195 MSSIEQTTEILLCLSPAEAANLKEGINFVRNKSTGKDYILFKNKSRLKACKNMCKHQGGL FIKDIEDLNGRSVKCTKHNWKLDVSSMKYINPPGSFCQDELVVEKDEENGVLLLELNPPN PWDSEPRSPEDLAFGEVQITYLTHACMDLKLGDKRMVFDPWLIGPAFARGWWLLHEPPSD WLERLSRADLIYISHMHSDHLSYPTLKKLAERRPDVPIYVGNTERPVFWNLNQSGVQLTN INVVPFGIWQQVDKNLRFMILMDGVHPEMDTCIIVEYKGHKILNTVDCTRPNGGRLPMKV ALMMSDFAGGASGFPMTFSGGKFTEEWKAQFIKTERKKLLNYKARLVKDLQPRIYCPFAG YFVESHPSDKYIKETNIKNDPNELNNLIKKNSEVVTWTPRPGATLDLGRMLKDPTDSKGI VEPPEGTKIYKDSWDFGPYLNILNAAIGDEIFRHSSWIKEYFTWAGFKDYNLVVRMIETD EDFSPLPGGYDYLVDFLDLSFPKERPSREHPYEEIRSRVDVIRHVVKNGLLWDDLYIGFQ TRLQRDPDIYHHLFWNHFQIKLPLTPPDWKSFLMCSG
in.coord
ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 431 K T
<source lang='bash'> run_pph.pl -s in.fa in.coord </source>
What comes out:
/lustre/nobackup/WUR/ABGC/shared/public_data_store/polyphen/polyphen-2.2.2/bin/run_pph.pl -s test1033.fa test1033.coord #o_acc o_pos o_aa1 o_aa2 rsid acc pos aa1 aa2 nt1 nt2 prediction based_on effect site region PHAT dScore Score1 Score2 MSAv Nobs Nstruct Nfilt PDB_id PDB_pos PDB_ch ident lengthNormASA SecStr MapReg dVol dProp B-fact H-bonds AveNHet MinDHet AveNInt MinDInt AveNSit MinDSit Transv CodPos CpG MinDJxn PfamHit IdPmax IdPSNP IdQmin ENSSSCG00000001099ENSSSCT00000001195 368 S A ENSSSCG00000001099ENSSSCT00000001195 368 S A benign alignment +0.721 -1.180 -1.901 2 68 12.350 12.350 42.63 ENSSSCG00000001099ENSSSCT00000001195 453 R H ENSSSCG00000001099ENSSSCT00000001195 453 R H benign alignment +0.351 -2.130 -2.481 2 67 33.194 33.194 93.76 ENSSSCG00000001099ENSSSCT00000001195 431 K T ENSSSCG00000001099ENSSSCT00000001195 431 K T possibly damaging alignment +1.705 -1.678 -3.383 2 68 2.382 68.80
The fasta and coordinate files can be made similar to what you've done for Provean. Polyphen is much more human-oriented than Provean (which is more species agnostic) and has many more features, not all of which work very well for non-human species. Since Polyphen won't run properly in your present infrastructure we will not practice it actively.