1000Bulls mapping pipeline at ABGC

From HPCwiki
* [https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py The Python3-based masterscript]
* [https://github.com/hjmegens/NGStools/blob/master/runBOV-WUR-4.sh Example of a runfile created by the master script]
Revision as of 23:26, 26 November 2013

The 1000 Bulls mapping pipeline as currently implemented at ABGC extends a generic pipeline developed for pig at ABGC. It uses bwa 0.5.9 (the script's -b 5.9 option) instead of later versions, as required by the 1000 Bulls consortium. The script also automates setting metadata in the BAM files, including the official 1000 Bulls IDs of the individual animals. A further modification is the use of an SQLite database that holds some of the metadata required for performing the mapping and setting the correct IDs. The SQLite database must be called 'cow_schema.db' and must be in the same working directory as the Python3 master script. Currently, the following two tables should be present in the database:

<source lang='sql'>
CREATE TABLE cow_schema_main (
    archive text not null,
    seq_file_name text not null primary key,
    animal_id text not null,
    md5sum_zipped text not null,
    tmp_inserted datetime default current_timestamp
);
</source>

 Table: cow_schema_main:
 archive	seq_file_name	animal_id	md5sum_zipped
 SZAIPI019130-16	121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_1.fq.gz.clean.dup.clean.gz	BOV-WUR-1	f95b35158fe5802a6d6ca18a0e973e87
 SZAIPI019130-16	121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_2.fq.gz.clean.dup.clean.gz	BOV-WUR-1	333aee62675082d42c2355cc7f21e89b
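
The md5sum_zipped column makes it possible to check that a sequence file on disk is intact before mapping. The following is a minimal sketch of such a check, assuming the checksum was computed over the gzipped file as stored in the archive. The function names md5_of_file and is_intact are hypothetical and are not taken from the master script.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Compare a file's MD5 with the value stored in md5sum_zipped (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file whose digest does not match the stored value would normally be copied or downloaded again before it is passed to the mapping step.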

<source lang='sql'>
CREATE TABLE bulls1K_id (
    animal_id text not null,
    bull1K_id text not null primary key,
    tmp_inserted datetime default current_timestamp
);
</source>

 Table: bulls1K_id:
 animal_id	bull1K_id
 BOV-WUR-1	HOLNLDM000120873995
 BOV-WUR-2	HOLNLDM000811488961
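
The two tables are linked through animal_id, so the sequence files and the official 1000 Bulls ID for one animal can be fetched with a single join. The sketch below illustrates this with Python's sqlite3 module on an in-memory database filled with the example rows above. It is an assumption about how the lookup could be done, not the query used by the master script. The names SCHEMA, demo_connection and files_for_animal are hypothetical, and the timestamp columns are left out for brevity.

```python
import sqlite3

# Simplified schema: the timestamp columns of the real tables are omitted here.
SCHEMA = """
CREATE TABLE cow_schema_main (
    archive text not null,
    seq_file_name text not null primary key,
    animal_id text not null,
    md5sum_zipped text not null
);
CREATE TABLE bulls1K_id (
    animal_id text not null,
    bull1K_id text not null primary key
);
"""


def demo_connection():
    """Build an in-memory database holding the example rows shown on this page."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO cow_schema_main VALUES (?, ?, ?, ?)",
        [
            ("SZAIPI019130-16",
             "121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_1.fq.gz.clean.dup.clean.gz",
             "BOV-WUR-1", "f95b35158fe5802a6d6ca18a0e973e87"),
            ("SZAIPI019130-16",
             "121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_2.fq.gz.clean.dup.clean.gz",
             "BOV-WUR-1", "333aee62675082d42c2355cc7f21e89b"),
        ],
    )
    conn.executemany(
        "INSERT INTO bulls1K_id VALUES (?, ?)",
        [("BOV-WUR-1", "HOLNLDM000120873995"),
         ("BOV-WUR-2", "HOLNLDM000811488961")],
    )
    conn.commit()
    return conn


def files_for_animal(conn, animal_id):
    """Return (bull1K_id, seq_file_name, md5sum_zipped) tuples for one animal."""
    query = """
        SELECT b.bull1K_id, m.seq_file_name, m.md5sum_zipped
        FROM cow_schema_main AS m
        JOIN bulls1K_id AS b ON b.animal_id = m.animal_id
        WHERE m.animal_id = ?
        ORDER BY m.seq_file_name
    """
    return conn.execute(query, (animal_id,)).fetchall()
```

Against the real database, sqlite3.connect('cow_schema.db') would replace demo_connection(). An animal with an entry in bulls1K_id but no files in cow_schema_main yields an empty list, because the join only returns animals present in both tables.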

Running the master script for animals BOV-WUR-1 to BOV-WUR-15:

<source lang='bash'>
for i in $(seq 1 15); do
    echo "$i"
    python3 ABGC_mapping_v2.py -i "BOV-WUR-$i" \
        -a /srv/mds01/shared/Bulls1000/F12FPCEUHK0755_alq121122/cleandata/ \
        -r /srv/mds01/shared/Bulls1000/UMD31/umd_3_1.fa \
        -t 12 -s cow -m bwa-aln -b 5.9 -o
done
</source>

Calculating coverage stats:

<source lang='bash'>
java -Xmx4G -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    -R umd_3_1.fa \
    -I BOV-WUR-2_rh.dedup_st.reA.bam \
    --omitDepthOutputAtEachBase --logging_level ERROR \
    --summaryCoverageThreshold 10 --summaryCoverageThreshold 20 \
    --summaryCoverageThreshold 30 --summaryCoverageThreshold 40 \
    --summaryCoverageThreshold 50 --summaryCoverageThreshold 80 \
    --summaryCoverageThreshold 90 --summaryCoverageThreshold 100 \
    --summaryCoverageThreshold 150 \
    --minBaseQuality 15 --minMappingQuality 30 \
    --start 1 --stop 1000 --nBins 999 -dt NONE \
    -o BOV-WUR-2_rh.dedup_st.reA.coverage
</source>
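
When coverage has to be calculated for many animals, the repeated --summaryCoverageThreshold options are easier to maintain as a list. The following sketch builds the same argument list as the command above for an arbitrary BAM file. The function name depth_of_coverage_cmd and its parameters are hypothetical, and the sketch only assembles the command without running it.

```python
def depth_of_coverage_cmd(bam, reference, thresholds, jar="GenomeAnalysisTK.jar"):
    """Assemble the DepthOfCoverage command shown above as an argument list."""
    # Output prefix: the BAM file name with '.bam' replaced by '.coverage'
    if bam.endswith(".bam"):
        out = bam[: -len(".bam")] + ".coverage"
    else:
        out = bam + ".coverage"
    cmd = [
        "java", "-Xmx4G", "-jar", jar,
        "-T", "DepthOfCoverage",
        "-R", reference,
        "-I", bam,
        "--omitDepthOutputAtEachBase",
        "--logging_level", "ERROR",
    ]
    # One --summaryCoverageThreshold option per requested threshold
    for threshold in thresholds:
        cmd += ["--summaryCoverageThreshold", str(threshold)]
    cmd += [
        "--minBaseQuality", "15",
        "--minMappingQuality", "30",
        "--start", "1", "--stop", "1000", "--nBins", "999",
        "-dt", "NONE",
        "-o", out,
    ]
    return cmd


THRESHOLDS = [10, 20, 30, 40, 50, 80, 90, 100, 150]
example_cmd = depth_of_coverage_cmd(
    "BOV-WUR-2_rh.dedup_st.reA.bam", "umd_3_1.fa", THRESHOLDS
)
```

The resulting list can be passed to subprocess.run(example_cmd, check=True) on a machine where Java and the GATK jar are available.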