This pipeline is derived from a generic pipeline developed for pig at ABGC. Modifications include the use of bwa 5.9 instead of later versions, as required by the 1000 Bulls consortium. Furthermore, the script automates setting metadata in the BAM files, such as the official 1000 Bulls IDs of the individual cows. Another modification is the use of an sqlite-based database that holds some of the metadata required for performing the mapping and setting the correct IDs. The sqlite database should be called <code>cow_schema.db</code> and should be in the same working directory as the Python3 master script. Currently, the following two tables should be present in the database: <code>cow_schema_main</code> and <code>bulls1K_id</code>.

 python3 ABGC_mapping_v2.py -i LW22F08 -a /lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ -r /lustre/nobackup/WUR/ABGC/shared/Pig/Sscrofa_build10_2/Ensembl72/Sus_scrofa.Sscrofa10.2.72.dna.toplevel.fa -t 10
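
The requirement that both tables are present can be checked before starting a run. The following is a minimal sketch using Python's standard <code>sqlite3</code> module; the function name <code>missing_tables</code> is hypothetical and is not part of <code>ABGC_mapping_v2.py</code>.

```python
import sqlite3

# Table names taken from the description above.
REQUIRED_TABLES = {"cow_schema_main", "bulls1K_id"}


def missing_tables(con):
    """Return the set of required tables that are absent from the database.

    `con` is an open sqlite3 connection. This helper is an illustrative
    assumption, not a function of the master script.
    """
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return REQUIRED_TABLES - present
```

A possible use is <code>missing_tables(sqlite3.connect('cow_schema.db'))</code>, run from the working directory of the master script. Note that <code>sqlite3.connect</code> creates an empty database file if none exists, in which case both table names are reported as missing.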

Code to create the <code>cow_schema_main</code> table:

<source lang='sql'>
CREATE TABLE cow_schema_main (archive text not null, seq_file_name text not null primary key, animal_id text not null, md5sum_zipped text not null, tmp_inserted datetime default current_timestamp);
</source>

The table holds one row for each fastq file (<code>seq_file_name</code>). <code>archive</code> refers to the base directory that holds the fastq file. <code>animal_id</code> can be a trivial name. Note that the pipeline assumes the files to be gzipped plain-text fastq. The md5sums are those of the gzipped files.

Table <code>cow_schema_main</code>:

{| class="wikitable"
! archive !! seq_file_name !! animal_id !! md5sum_zipped
|-
| SZAIPI019130-16 || 121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_1.fq.gz.clean.dup.clean.gz || BOV-WUR-1 || f95b35158fe5802a6d6ca18a0e973e87
|-
| SZAIPI019130-16 || 121202_I598_FCD1JRLACXX_L6_SZAIPI019130-16_2.fq.gz.clean.dup.clean.gz || BOV-WUR-1 || 333aee62675082d42c2355cc7f21e89b
|}
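
Filling this table by hand is error-prone, so the following is a minimal sketch of how a row could be registered for one gzipped fastq file. It assumes the table was created with the statement above; the helper names <code>md5sum</code> and <code>register_fastq</code> are hypothetical and are not taken from the master script.

```python
import hashlib
import os
import sqlite3


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file as stored on disk.

    For a .gz file this is the checksum of the gzipped file itself,
    which is what the md5sum_zipped column expects.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large fastq files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_fastq(con, archive, fastq_path, animal_id):
    """Insert one row into cow_schema_main for a gzipped fastq file.

    Hypothetical helper: `archive` is the base directory name and the
    file name is stored without its directory part.
    """
    con.execute(
        "INSERT INTO cow_schema_main "
        "(archive, seq_file_name, animal_id, md5sum_zipped) "
        "VALUES (?, ?, ?, ?)",
        (archive, os.path.basename(fastq_path), animal_id, md5sum(fastq_path)),
    )
    con.commit()
```

Because <code>seq_file_name</code> is the primary key, registering the same file name twice raises <code>sqlite3.IntegrityError</code>, which guards against duplicate entries.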

Code to create the <code>bulls1K_id</code> table:

<source lang='sql'>
CREATE TABLE bulls1K_id (animal_id text not null, bull1K_id text not null primary key, tmp_inserted datetime default current_timestamp);
</source>

The table holds one row per individual. Its purpose is to connect the trivial name used in the <code>cow_schema_main</code> table to the official 1000 Bulls ID. In addition, it provides an overview of the animals represented in the sequence archives.

Table <code>bulls1K_id</code>:

{| class="wikitable"
! animal_id !! bull1K_id
|-
| BOV-WUR-1 || HOLNLDM000120873995
|-
| BOV-WUR-2 || HOLNLDM000811488961
|}
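
The link between the two tables can be illustrated with a join on <code>animal_id</code>. The following is a minimal sketch, assuming both tables exist as defined above; the function name <code>files_for_animal</code> is hypothetical, and the query is not necessarily the one used by <code>ABGC_mapping_v2.py</code>.

```python
import sqlite3


def files_for_animal(con, animal_id):
    """Return (bull1K_id, archive, seq_file_name, md5sum_zipped) tuples
    for every fastq file registered under the given trivial animal name.

    Illustrative helper only; the master script may query differently.
    """
    query = """
        SELECT b.bull1K_id, m.archive, m.seq_file_name, m.md5sum_zipped
        FROM cow_schema_main AS m
        JOIN bulls1K_id AS b ON b.animal_id = m.animal_id
        WHERE m.animal_id = ?
        ORDER BY m.seq_file_name
    """
    return con.execute(query, (animal_id,)).fetchall()
```

With the example rows shown above, <code>files_for_animal(con, 'BOV-WUR-1')</code> would return two tuples, one per fastq file, both carrying the ID <code>HOLNLDM000120873995</code>. An animal that has no entry in <code>bulls1K_id</code> yields an empty list, because the inner join drops its files.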

The code to generate the runfiles for the first 15 bulls to be analysed:

<source lang='bash'>
# Generate one runfile per animal, BOV-WUR-1 through BOV-WUR-15
for i in $(seq 1 15); do
    echo "$i"
    python3 ABGC_mapping_v2.py -i "BOV-WUR-$i" -a /srv/mds01/shared/Bulls1000/F12FPCEUHK0755_alq121122/cleandata/ -r /srv/mds01/shared/Bulls1000/UMD31/umd_3_1.fa -t 12 -s cow -m bwa-aln -b 5.9 -o
done
</source>

* [https://github.com/hjmegens/NGStools/blob/master/ABGC_mapping_v2.py The Python3-based master script]
* [https://github.com/hjmegens/NGStools/blob/master/runBOV-WUR-4.sh Example of a runfile created by the master script]