Author: Carolina Pita Barros
Contact: carolina.pitabarros@wur.nl
ABG

For up-to-date documentation see here

Population level variant calling

Path to pipeline: /lustre/nobackup/WUR/ABGC/shared/PIPELINES/population-variant-calling

First follow the instructions here

Step by step guide on how to use my pipelines
Click here for an introduction to Snakemake

ABOUT

This is a pipeline that takes short reads aligned to a genome (in .bam format) and performs population level variant calling with Freebayes. It uses VEP to annotate the resulting VCF, calculates statistics, and calculates and plots a PCA.

It was developed to work with the results of this population mapping pipeline. There are a few Freebayes requirements that you need to take into account if you don't use the mapping pipeline mentioned above to map your reads. You should make sure that:

Alignments have read groups
Alignments are sorted
Duplicates are marked

See here for more details.

Tools used

Freebayes - variant calling using short reads
bcftools - vcf statistics
Plink - compute PCA
R - Plot PCA


Pipeline workflow

Edit config.yaml with the paths to your files

<syntaxhighlight lang="yaml">ASSEMBLY: /path/to/fasta MAPPING_DIR: /path/to/bams/dir PREFIX: <prefix> OUTDIR: /path/to/outdir SPECIES: <species> NUM_CHRS: <number of chromosomes></syntaxhighlight>

ASSEMBLY - path to genome fasta file
MAPPING_DIR - path to directory with bam files to be used
- the pipeline will use all bam files in the directory, if you want to use a subset of those, create a file named bam_list.txt that contains the paths to the bam files you want to use. One path per line.

PREFIX - prefix for the created files
OUTDIR - directory where snakemake will run and where the results will be written to

If you want the results to be written to this directory (not to a new directory), open config.yaml and comment out OUTDIR: /path/to/outdir

SPECIES - species name to be used for VEP
NUM_CHRS - number of chromosomes for your species (necessary for plink). ex: 38

ADDITIONAL SET UP

Configuring VEP

This pipeline uses VEP in offline mode, which increases performance. In order to use it in this mode, the cache for the species used needs to be installed:

For people using WUR's Anunna:

Check if the cache file for your species already exist in /lustre/nobackup/SHARED/cache/. If it doesn't, create it with

/usr/bin/perl /shared/apps/SHARED/ensembl-vep/INSTALL.pl --CACHEDIR /lustre/nobackup/SHARED/cache/ --AUTO c -n --SPECIES <species>

When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>, where "assembly name" is the name of the assembly you want to use.

For those not from WUR:

You can install VEP with

conda install -c bioconda ensembl-vep

and install the cache with

vep_install --CACHEDIR <where/to/install/cache> --AUTO c -n --SPECIES <species>

When multiple assemblies are found you need to run it again with --ASSEMBLY <assembly name>, where "assembly name" is the name of the assembly you want to use.

In the Snakefile, in rule run_vep, replace /shared/apps/SHARED/ensembl-vep/vep with vep

Installing R packages

First load R: module load R/3.6.2

Enter the R environment by writing R and clicking enter. Install the packages:

list.of.packages <- c("optparse", "data.table", "ggplot2")

new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

If you get an error like this:

Warning in install.packages(new.packages) :
'lib = "/shared/apps/R/3.6.2/lib64/R/library"' is not writable

Follow the instructions on how to install R packages locally here and try to install the packages again.

RESULTS

The most important files and directories are:

<run_date>_files.txt dated file with an overview of the files used to run the pipeline (for documentation purposes)
results directory that contains
- final_VCF directory with variant calling VCF files, as well as VCF stats
  - {prefix}.vep.vcf.gz - final VCF file
  - {prefix}.vep.vcf.gz.stats
- PCA PCA results and plot
  - {prefix}.eigenvec and {prefix}.eigenval - file with PCA eigenvectors and eigenvalues, respectively
  - {prefix}.pdf - PCA plot

The VCF file has been filtered for QUAL > 20. Freebayes is ran with parameters --use-best-n-alleles 4 --min-base-quality 10 --min-alternate-fraction 0.2 --haplotype-length 0 --ploidy 2 --min-alternate-count 2. These parameters can be changed in the Snakefile.

Population variant calling pipeline

Contents

Population level variant calling

First follow the instructions here

ABOUT

Tools used

Edit config.yaml with the paths to your files

ADDITIONAL SET UP

Configuring VEP

For people using WUR's Anunna:

For those not from WUR:

Installing R packages

RESULTS

Navigation menu

Population variant calling pipeline

Population level variant calling

First follow the instructions here

ABOUT

Tools used

Edit config.yaml with the paths to your files

ADDITIONAL SET UP

Configuring VEP

For people using WUR's Anunna:

For those not from WUR:

Installing R packages

RESULTS

Navigation menu

Search