Short read mapping pipeline pig

The latest short-read mapping pipeline for the pig project is based on a Python3 script that creates a shell script, which can subsequently be executed from the command line or submitted to the cluster using SLURM.
The latest version of the Python3 script can be found at [https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py GitHub]. The script must be executed with Python 3.




== Prerequisites ==
=== Data sources ===
* path to [[ABGSA | sequence archives]] /lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ (for pig only)
* access to the ABGSA meta-database, currently hosted at scomp1095.wurnet.nl (database 'ABGSAschema')
* path to reference genome, including index for BWA /lustre/nobackup/WUR/ABGC/shared/Pig/Sscrofa_build10_2/Ensembl72/Sus_scrofa.Sscrofa10.2.72.dna.toplevel.fa


=== Hardcoded paths ===
All paths to data and software are currently hardcoded. This is done for transparency (hardcoded == explicit). Hardcoded paths do, however, require editing when migrating to a new environment; switching to environment variables is currently being considered.
* bwa 0.5.9 /cm/shared/apps/WUR/ABGC/bwa/bwa-0.5.9/ (required if using bwa 0.5.9, e.g. for 1000 Bulls project)
* bwa 0.7.5a /cm/shared/apps/WUR/ABGC/bwa/bwa-0.7.5a/ (required if using bwa 0.7.5a, e.g. when using BWA mem)
* samtools 0.1.19 /cm/shared/apps/WUR/ABGC/samtools/samtools-0.1.19/ (required)
* samtools 0.1.12a /cm/shared/apps/WUR/ABGC/samtools/samtools-0.1.12a/ (required if variant calling)
* picard /cm/shared/apps/WUR/ABGC/picard/picard-tools-1.93/ (currently not enabled, not required)
* GATK /cm/shared/apps/WUR/ABGC/GATK/GATK2.6/ (required)
* Mosaik /path/to/mosaik/ref.dat (required when using Mosaik as mapping tool)
* Mosaik Jump Library /path/to/mosaikjump/ref.j15 (required when using Mosaik as mapping tool)
* dbSNPfile=reffolder+'/dbSNP/dbSNP.vcf' (required for recalibration)
* gatk_gvcf /cm/shared/apps/WUR/ABGC/GATK/GATK_gVCFmod/ (required when variant calling)
* gvcftools /cm/shared/apps/WUR/ABGC/gvcftools/gvcftools-0.16/bin/ (required when variant calling)
* helper scripts  /cm/shared/apps/WUR/ABGC/abgsascripts/ (required)
* Variant Effect Predictor (VEP) /cm/shared/apps/WUR/ABGC/variant_effect_predictor/VEP231213/ (required when variant calling)
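Until such a switch is made, the contemplated environment-variable approach could look like the sketch below. The <code>ABGC_*</code> variable names are illustrative assumptions, not variables the current master script reads:

<source lang='bash'>
# Sketch: take tool locations from the environment when set, and fall
# back to the hardcoded defaults listed above. Variable names are
# hypothetical, not read by the current master script.
BWA059_DIR=${ABGC_BWA059_DIR:-/cm/shared/apps/WUR/ABGC/bwa/bwa-0.5.9/}
SAMTOOLS_DIR=${ABGC_SAMTOOLS_DIR:-/cm/shared/apps/WUR/ABGC/samtools/samtools-0.1.19/}
GATK_DIR=${ABGC_GATK_DIR:-/cm/shared/apps/WUR/ABGC/GATK/GATK2.6/}
echo "bwa:      $BWA059_DIR"
echo "samtools: $SAMTOOLS_DIR"
echo "GATK:     $GATK_DIR"
</source>

An installation on a different cluster would then only need to export the variables once, e.g. in a job template, instead of editing the script.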


=== Present in PATH ===
* sickle
* pigz
* bgzip
* tabix
* perl
* python2 (a symlink to Python 2.6 or 2.7 named python2)
* java7 (a symlink to Java 7 named java7)
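These PATH requirements can be verified up front with a small POSIX-shell check. This helper is not part of the pipeline; it simply loops over the tool names listed above:

<source lang='bash'>
#!/bin/sh
# Report every expected tool that cannot be found on PATH, and exit
# non-zero if anything is missing.
missing=0
for tool in sickle pigz bgzip tabix perl python2 java7; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        printf 'MISSING: %s\n' "$tool"
        missing=1
    fi
done
[ "$missing" -eq 0 ] && echo 'all PATH prerequisites found'
exit "$missing"
</source>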


=== Present in the working directory ===
* [[1000Bulls_mapping_pipeline_at_ABGC | cow_schema.db]] (SQLite db for 1000 Bulls project - for cow only at the moment).


== Basic execution ==


<source lang='bash'>
(virtenv)[megen002@nfs01 rundir]$ python3 ABGC_mapping_v2.py -i LW22F04 -a /lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ -r /lustre/nobackup/WUR/ABGC/shared/Pig/Sscrofa_build10_2/Ensembl72/Sus_scrofa.Sscrofa10.2.72.dna.toplevel.fa -t 4
</source>
This should produce a shell script like [https://github.com/hjmegens/NGStools/blob/master/runLW22F08.sh this example], ready for execution from the command line or submission with SLURM.
== Automated runfile creation ==


<source lang='bash'>
# -N suppresses the column-name header, so list.txt holds only the ids
mysql -N -u ABGSAuser -h scomp1095.wurnet.nl -p ABGSAschema -e 'select ABG_individual_id from ABGSAschema_main where archive_name like "ABGSA0%" group by ABG_individual_id' >list.txt
FILES=`cat list.txt`
for ID in $FILES; do python3 ABGC_mapping_v2.py -i $ID -a /lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ -r /lustre/nobackup/WUR/ABGC/shared/Pig/Sscrofa_build10_2/Ensembl72/Sus_scrofa.Sscrofa10.2.72.dna.toplevel.fa -t 4; done
</source>
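With one runfile generated per individual, the whole batch can then be run or submitted in a second loop. The <code>run*.sh</code> naming follows the runLW22F08.sh example above and is an assumption about the generated file names; substitute <code>bash</code> for local runs or <code>sbatch</code> for SLURM:

<source lang='bash'>
# Hand every generated runfile to the chosen executor and report how
# many were handled. 'echo' is a dry-run placeholder; replace it with
# bash (local) or sbatch (SLURM) for real runs.
executor=echo
count=0
for script in run*.sh; do
    [ -e "$script" ] || continue   # glob matched nothing
    "$executor" "$script"
    count=$((count + 1))
done
printf 'handled %d runfile(s)\n' "$count"
</source>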


== Output files ==
 
== See also ==
* [[ABGSA | Animal Breeding & Genomics Sequence Archives]]
* [[1000Bulls_mapping_pipeline_at_ABGC | 1000 Bulls @ABGC implementation of the pipeline]]
== External links ==
[https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py NGStools page on GitHub]

Latest revision as of 20:43, 27 December 2013
