1000Bulls mapping pipeline at ABGC: Difference between revisions

From HPCwiki
Jump to navigation Jump to search
No edit summary
No edit summary
 
(4 intermediate revisions by 2 users not shown)
Line 24: Line 24:
   BOV-WUR-1 HOLNLDM000120873995
   BOV-WUR-1 HOLNLDM000120873995
   BOV-WUR-2 HOLNLDM000811488961
   BOV-WUR-2 HOLNLDM000811488961
The code to generate the runfiles for the first 15 bulls to be analysed:


<source lang='bash'>
<source lang='bash'>
Line 29: Line 31:
for i in `seq 1 15`; do echo $i; python3 ABGC_mapping_v2.py -i BOV-WUR-$i -a /srv/mds01/shared/Bulls1000/F12FPCEUHK0755_alq121122/cleandata/ -r /srv/mds01/shared/Bulls1000/UMD31/umd_3_1.fa -t 12 -s cow -m bwa-aln -b 5.9 -o; done
for i in `seq 1 15`; do echo $i; python3 ABGC_mapping_v2.py -i BOV-WUR-$i -a /srv/mds01/shared/Bulls1000/F12FPCEUHK0755_alq121122/cleandata/ -r /srv/mds01/shared/Bulls1000/UMD31/umd_3_1.fa -t 12 -s cow -m bwa-aln -b 5.9 -o; done
</source>
</source>
* [https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py The Python3-based masterscript]
* [https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py The Python3-based masterscript]
* [https://github.com/hjmegens/NGStools/blob/master/runBOV-WUR-4.sh Example of a runfile created by the master script]
* [https://github.com/hjmegens/NGStools/blob/master/runBOV-WUR-4.sh Example of a runfile created by the master script]
Line 36: Line 39:
<source lang='bash'>
<source lang='bash'>
java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R umd_3_1.fa -I BOV-WUR-2_rh.dedup_st.reA.bam --omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 --summaryCoverageThreshold 150 --minBaseQuality 15 --minMappingQuality 30 --start 1 --stop 1000 --nBins 999 -dt NONE -o BOV-WUR-2_rh.dedup_st.reA.coverage
java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R umd_3_1.fa -I BOV-WUR-2_rh.dedup_st.reA.bam --omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 --summaryCoverageThreshold 150 --minBaseQuality 15 --minMappingQuality 30 --start 1 --stop 1000 --nBins 999 -dt NONE -o BOV-WUR-2_rh.dedup_st.reA.coverage
</source>
Conventions of the mapping pipeline dictate that for all fastq files in a folder, the file containing the second read should immediate follow the file containing the first read lexicographically. Therefore, if necessary, the files can be renamed with the following python script:
<source lang='bash'>
python rename_files.py -t transtable.txt
</source>
The translation table should look like this; note that files are not ordered correctly lexicographically.
  bcded3e7eb29e423cfb7559a81b6f651 bov-wur-16 run0262 bov-wur-16_CGATGT_L004_R1_001.cleaned.fastq.gz
  6f1e32498eb31d138bd4fdacf0cef5d0 bov-wur-16 run0262 bov-wur-16_CGATGT_L004_R2_001.cleaned.fastq.gz
  ec3f4ecd34402d6aaa33c897bf62430d bov-wur-16 run0264 bov-wur-16_CGATGT_L004_R1_001.cleaned.fastq.gz
  6cfbc04bf11fd5d71bb937a9d978056d bov-wur-16 run0264 bov-wur-16_CGATGT_L004_R2_001.cleaned.fastq.gz
  eafca33360fa81c68933363d393252db bov-wur-16 run0264 bov-wur-16_CGATGT_L005_R1_001.cleaned.fastq.gz
  22b24d014613d465b13c0d1a89281074 bov-wur-16 run0264 bov-wur-16_CGATGT_L005_R2_001.cleaned.fastq.gz
  e0e44096300bec49405c6c7591352ab3 bov-wur-16 run0269 bov-wur-16_CGATGT_L002_R1_001.cleaned.fastq.gz
  9e97fa54293f90b1f3d9738791e38603 bov-wur-16 run0269 bov-wur-16_CGATGT_L002_R1_002.cleaned.fastq.gz
  cec51a979f766b18e7a9cbec99b0a433 bov-wur-16 run0269 bov-wur-16_CGATGT_L002_R1_003.cleaned.fastq.gz
  7ac8f3eb1f0ff26503137b4428120e5b bov-wur-16 run0269 bov-wur-16_CGATGT_L002_R2_001.cleaned.fastq.gz
<source lang='python'>
import argparse
import sys
import os
import re
parser = argparse.ArgumentParser( description='Renames a bunch of files, creates dirs, and reorganizes files')
parser.add_argument("-t", "--translatetable", help="translatetable", nargs=1)
def rename():
  translist = translatetable(transfile)
  newlist = fq_filelist(translist)
  for fq in newlist:
    oldname = fq[1]+'/'+fq[2]+'/'+fq[3]
    newname = fq[1]+'/'+fq[4]
    print(oldname+' ---> '+newname)
    os.rename(oldname,newname)
#    do_md5check(fq[0],newname)
   
def do_md5check(md5original,filenm):
      newmd5= os.popen('md5sum '+filenm).read().split()[0]
      if md5original != newmd5:
        print('WARNING! - conflict in md5sums: original --> '+md5original+' :: newmd5 --> '+newmd5)
      else:
        print('md5 ok: original --> '+md5original+' :: newmd5 --> '+newmd5)
def fq_filelist(flist):
  newlist=[]
  for item in flist:
    parts = item[3].split('_',1)
    newname = parts[0]+'_'+item[2]+'_'+parts[1]
    if newname.count('_R1_')>0:
        parts1=newname.rsplit('_R1_',1)
        parts2=parts1[1].split('.',1)
        newname=parts1[0]+'_'+parts2[0]+'_R1.'+parts2[1]
    elif newname.count('_R2_')>0:
        parts1=newname.rsplit('_R2_',1)
        parts2=parts1[1].split('.',1)
        newname=parts1[0]+'_'+parts2[0]+'_R2.'+parts2[1]
    newlist.append(item+[newname])
  return newlist
def translatetable(trans_file):
  flist=[]
  with open(trans_file) as trans:
    for l in trans.readlines():
      l=l.rstrip().split()
      flist.append([l[0],l[1],l[2],l[3]])
  return flist
args = parser.parse_args()
transfile=args.translatetable[0]
if __name__=="__main__":
  rename()
</source>
</source>

Latest revision as of 14:17, 28 March 2014

The 1000 Bulls mapping pipeline as currently implemented at ABGC is an extension of a generic pipeline developed for pig at ABGC. Modifications include use of bwa 5.9 in stead of later versions, as per the requirements of the 1000 Bulls consortium. Furthermore, the script will automate setting metadata in the BAM files such as including the official 1000 Bulls Ids of the individual cows. Another modification is the use of an sqlite-based database that holds some the metadata required for performing the mapping and setting the correct ids. The sqlite database should be called 'cow_schema.db' and should be in the same working directory as the Python3 master script. Currently, the following two tables should be present in the database: cow_schema_main and bulls1K_id.

Code to create the cow_schema_main table:

<source lang='sql'> CREATE TABLE cow_schema_main (archive text not null, seq_file_name text not null primary key, animal_id text not null, md5sum_zipped text not null, tmp_inserte datetime default current_timestamp); </source>

The table will hold one line for each fastq file (seq_file_name). Archive refers to the base directory that holds the fastq file. Animal ID can be a trivial name. Note that the pipeline will assume files to be gzipped plain-text fastq. The md5sums are from the gzipped files.

 Table: cow_schema_main:
 archive	seq_file_name	animal_id	md5sum_zipped
 SZAIPI019130-16	121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_1.fq.gz.clean.dup.clean.gz	BOV-WUR-1	f95b35158fe5802a6d6ca18a0e973e87
 SZAIPI019130-16	121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_2.fq.gz.clean.dup.clean.gz	BOV-WUR-1	333aee62675082d42c2355cc7f21e89b


Code to create the bulls1K_id table: <source lang='sql'> CREATE TABLE bulls1K_id (animal_id text not null, bull1K_id text not null primary key, tmp_inserted datetime default current_timestamp); </source>

The table will hold one line per individual. Its purpose is to connect the trivial name used in the cow_schema_main table to the official 1000 Bulls Id. In addition, it provides an overview of the animals represented in the sequence archives.

 TABLE	bulls1K_id
 animal_id	bull1K_id
 BOV-WUR-1	HOLNLDM000120873995
 BOV-WUR-2	HOLNLDM000811488961

The code to generate the runfiles for the first 15 bulls to be analysed:

<source lang='bash'>

for i in `seq 1 15`; do echo $i; python3 ABGC_mapping_v2.py -i BOV-WUR-$i -a /srv/mds01/shared/Bulls1000/F12FPCEUHK0755_alq121122/cleandata/ -r /srv/mds01/shared/Bulls1000/UMD31/umd_3_1.fa -t 12 -s cow -m bwa-aln -b 5.9 -o; done </source>

Calculating coverage stats:

<source lang='bash'> java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R umd_3_1.fa -I BOV-WUR-2_rh.dedup_st.reA.bam --omitDepthOutputAtEachBase --logging_level ERROR --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 --summaryCoverageThreshold 150 --minBaseQuality 15 --minMappingQuality 30 --start 1 --stop 1000 --nBins 999 -dt NONE -o BOV-WUR-2_rh.dedup_st.reA.coverage </source>

Conventions of the mapping pipeline dictate that for all fastq files in a folder, the file containing the second read should immediate follow the file containing the first read lexicographically. Therefore, if necessary, the files can be renamed with the following python script: <source lang='bash'> python rename_files.py -t transtable.txt </source> The translation table should look like this; note that files are not ordered correctly lexicographically.

 bcded3e7eb29e423cfb7559a81b6f651	bov-wur-16	run0262	bov-wur-16_CGATGT_L004_R1_001.cleaned.fastq.gz
 6f1e32498eb31d138bd4fdacf0cef5d0	bov-wur-16	run0262	bov-wur-16_CGATGT_L004_R2_001.cleaned.fastq.gz
 ec3f4ecd34402d6aaa33c897bf62430d	bov-wur-16	run0264	bov-wur-16_CGATGT_L004_R1_001.cleaned.fastq.gz
 6cfbc04bf11fd5d71bb937a9d978056d	bov-wur-16	run0264	bov-wur-16_CGATGT_L004_R2_001.cleaned.fastq.gz
 eafca33360fa81c68933363d393252db	bov-wur-16	run0264	bov-wur-16_CGATGT_L005_R1_001.cleaned.fastq.gz
 22b24d014613d465b13c0d1a89281074	bov-wur-16	run0264	bov-wur-16_CGATGT_L005_R2_001.cleaned.fastq.gz
 e0e44096300bec49405c6c7591352ab3	bov-wur-16	run0269	bov-wur-16_CGATGT_L002_R1_001.cleaned.fastq.gz
 9e97fa54293f90b1f3d9738791e38603	bov-wur-16	run0269	bov-wur-16_CGATGT_L002_R1_002.cleaned.fastq.gz
 cec51a979f766b18e7a9cbec99b0a433	bov-wur-16	run0269	bov-wur-16_CGATGT_L002_R1_003.cleaned.fastq.gz
 7ac8f3eb1f0ff26503137b4428120e5b	bov-wur-16	run0269	bov-wur-16_CGATGT_L002_R2_001.cleaned.fastq.gz

<source lang='python'> import argparse import sys import os import re

parser = argparse.ArgumentParser( description='Renames a bunch of files, creates dirs, and reorganizes files') parser.add_argument("-t", "--translatetable", help="translatetable", nargs=1)

def rename():

 translist = translatetable(transfile)
 newlist = fq_filelist(translist)
 for fq in newlist:
   oldname = fq[1]+'/'+fq[2]+'/'+fq[3]
   newname = fq[1]+'/'+fq[4]
   print(oldname+' ---> '+newname)
   os.rename(oldname,newname)
  1. do_md5check(fq[0],newname)

def do_md5check(md5original,filenm):

     newmd5= os.popen('md5sum '+filenm).read().split()[0]
     if md5original != newmd5:
        print('WARNING! - conflict in md5sums: original --> '+md5original+' :: newmd5 --> '+newmd5)
     else:
        print('md5 ok: original --> '+md5original+' :: newmd5 --> '+newmd5)

def fq_filelist(flist):

 newlist=[]
 for item in flist:
    parts = item[3].split('_',1)
    newname = parts[0]+'_'+item[2]+'_'+parts[1]
    if newname.count('_R1_')>0:
       parts1=newname.rsplit('_R1_',1)
       parts2=parts1[1].split('.',1)
       newname=parts1[0]+'_'+parts2[0]+'_R1.'+parts2[1]
    elif newname.count('_R2_')>0:
       parts1=newname.rsplit('_R2_',1)
       parts2=parts1[1].split('.',1)
       newname=parts1[0]+'_'+parts2[0]+'_R2.'+parts2[1]
    newlist.append(item+[newname])
 return newlist

def translatetable(trans_file):

 flist=[]
 with open(trans_file) as trans:
   for l in trans.readlines():
     l=l.rstrip().split()
     flist.append([l[0],l[1],l[2],l[3]])
 return flist

args = parser.parse_args() transfile=args.translatetable[0]

if __name__=="__main__":

 rename()

</source>