Provean Sus scrofa: Difference between revisions

Revision as of 13:27, 27 December 2013

This page describes the procedure for mapping all known variants (batch of first 150 pigs, wild boar re-sequencing) at the ABGC.

Pre-requisites

From the Variant Effect Predictor (VEP) output, select only the protein-altering variants and sort them by transcript:
<source lang='bash'>
cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
</source>
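To see what the filter keeps, the pipeline can be tried on a couple of made-up rows. The sketch below assumes the default tab-separated VEP output, in which column 11 (Amino_acids) contains a slash such as A/T only when the amino acid changes. The helper name filter_prot_alt, the demo file and its two rows are hypothetical, and GNU sed is assumed for the \t replacement.

```shell
# Hypothetical wrapper around the filter above, so it can be run on any file.
filter_prot_alt() {
    # Keep rows whose 11th column contains "/", split "chr:pos" into two
    # columns, then sort on the transcript ID (column 6 after the split).
    awk '$11~/\//' "$@" | sed 's/:/\t/' | sort -k6
}

# Two made-up VEP-style rows: a missense variant and a synonymous one.
demo=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    1_100_G/A 1:100 A GENE1 TRANS2 Transcript missense_variant 10 10 4 A/T Gcc/Acc - - \
    1_200_C/T 1:200 T GENE1 TRANS1 Transcript synonymous_variant 21 21 7 L ctC/ctT - - \
    > "$demo"

filter_prot_alt "$demo"
```

Only the missense row should survive, with the location split into separate chromosome and position columns, which is why the later scripts read the transcript from column 6 and the amino acids from column 12.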

Protein models for Sus scrofa:

 /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa

Automated procedure for mapping

The Provean analysis is somewhat involved because an apparent bug in the program causes conflicts between temporary files. This is particularly problematic when farming out thousands of individual searches (one per protein sequence) on the cluster: the cluster nodes need periodic cleaning of the temporary directories that Provean leaves behind.

Master script to control the submission of jobs and cleaning

The following script submits up to 300 new runs every hour. Note that at the start of each cycle it kills any Provean jobs that are still running and, importantly, removes leftover Provean temporary folders from the /tmp dirs of all nodes. This prevents the error in which Provean fails to create its temporary folders.
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#SBATCH --partition=ABGC_Research
cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt

# TELLER (Dutch for "counter") stays above 99, so the loop runs until the script is stopped.
TELLER=100
echo $TELLER
let TELLER+=1
echo $TELLER
while [ $TELLER -gt 99 ]; do

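 # Cancel every job whose squeue line contains "Provean". If this master script itself
 # runs as a SLURM job named Provean it would be matched too, so it may need to run
 # outside SLURM or under a different job name.
 PROVS=`squeue | grep Provean | sed 's/^ \+//' | sed 's/ \+/\t/' | cut -f1`;
 for PROV in $PROVS; do scancel $PROV; done;
 sleep 10;
 # Remove leftover Provean temporary folders on fat001-fat002 and node001-node060.
 for i in `seq 1 2`; do ssh fat00$i 'rm -rf /tmp/provean*'; done;
 for i in `seq 10 60`; do ssh node0$i 'rm -rf /tmp/provean*'; done;
 for i in `seq 1 9`; do ssh node00$i 'rm -rf /tmp/provean*'; done;
 # Unique transcript IDs from the first 15000 lines of prot_alt.txt.
 TRANS=`cat prot_alt.txt | head -15000 | cut -f6 | sort | uniq`;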
 TELLER2=0;
 for TRAN in $TRANS; do
    if [ $TELLER2 -lt 300 ]; then
      echo "transcript: $TRAN";
      echo "counter before: $TELLER2";
      PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRAN | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`;
      echo "protein: $PROT";
      if [ -f $PROT.sss ];
       then
         echo "$PROT $TRAN already done";
       else
         echo "will do sbatch runProvean_sub.sh $TRAN";
         sbatch runProvean_sub.sh $TRAN;
         let TELLER2+=1;
         echo "counter after: $TELLER2";
      fi;
   fi;
 done;
 sleep 3600;

done

</source>
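The selection logic inside the loop (look up the protein ID, skip proteins that already have a .sss file, stop at the hourly limit) can be tried without a cluster. The following is a minimal dry-run sketch, not the script used at the ABGC: the function name select_pending, the demo FASTA and all IDs in it are hypothetical, and the headers are assumed to put the protein ID first and to mention the transcript ID somewhere on the same line.

```shell
# Hypothetical dry-run helper: prints the transcripts that would be submitted.
# $1 = peptide FASTA, $2 = maximum number of submissions;
# transcript IDs are read from standard input, one per line.
select_pending() {
    fasta=$1
    limit=$2
    count=0
    while read -r tran; do
        [ "$count" -lt "$limit" ] || break
        # Protein ID = first word of the header line that mentions the transcript.
        prot=$(grep "$tran" "$fasta" | sed 's/^>//' | cut -d' ' -f1)
        # A saved supporting set is treated as "already done", as in the master script.
        [ -f "$prot.sss" ] && continue
        echo "$tran"
        count=$((count + 1))
    done
}

# Made-up demo data in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' \
    '>PROT1 pep:known transcript:TRANS1' 'MAAA' \
    '>PROT2 pep:known transcript:TRANS2' 'MCCC' \
    '>PROT3 pep:known transcript:TRANS3' 'MGGG' > demo.pep.fa
touch PROT1.sss    # pretend PROT1 (TRANS1) has already finished

printf 'TRANS1\nTRANS2\nTRANS3\n' | select_pending demo.pep.fa 1
```

With a limit of 1 the sketch skips TRANS1, prints TRANS2 and stops, which mirrors how the master script fills each hourly batch only with transcripts that still need a run.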

The slave script that does the actual submission

The 'runProvean_sub.sh' script referred to in the above script consists of the following code:
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#SBATCH --partition=ABGC_Research

# Transcript ID passed in by the master script.
TRANS=$1
# Protein ID: first word of the FASTA header that mentions the transcript.
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
# Variant list as position,reference,alternative (columns 11 and 12 of prot_alt.txt).
cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
# Annotation columns for the same variants, with the protein ID added.
cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
# Extract the protein sequence (faOneRecord is one of the UCSC Kent utilities).
faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
mv $TRANS.var $PROT.var;
# Run Provean; the saved supporting set ($PROT.sss) doubles as the "already done" marker.
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
</source>
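The chain of awk and sed calls that writes the .var file turns the protein position and the "ref/alt" amino-acid pair into one comma-separated line per variant. A compact sketch of the same conversion is shown below. It is not the code used on this page: the function name to_provean_var and the sample row are hypothetical, and it assumes, as the script above does, that column 11 of prot_alt.txt holds the protein position and column 12 the amino acids.

```shell
# Hypothetical compact version of the awk/sed chain that writes the .var file.
to_provean_var() {
    # Column 11 = protein position, column 12 = "ref/alt" amino acids.
    awk '{ split($12, aa, "/"); print $11 "," aa[1] "," aa[2] }'
}

# One made-up prot_alt.txt row (location already split into two columns).
printf '1_100_G/A\t1\t100\tA\tGENE1\tTRANS2\tTranscript\tmissense_variant\t10\t10\t4\tA/T\tGcc/Acc\t-\t-\n' \
    | to_provean_var
# prints: 4,A,T
```

Doing the split inside awk avoids relying on the GNU-specific \+ and \t escapes in sed, which may matter if the scripts are ever run on a system without GNU sed.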

Alternative: submission per transcript - no cleaning

Individual transcripts can also be submitted using the following script: <source lang='bash'>

#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#SBATCH --partition=ABGC_Research
cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt

# Transcript ID given on the command line.
TRANS=$1
# Protein ID: first word of the FASTA header that mentions the transcript.
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
# Skip proteins whose supporting set has already been saved.
if [ -f $PROT.sss ];
 then
 echo "$PROT $TRANS already done.";
 else
 cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
 cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
 faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
 mv $TRANS.var $PROT.var;
 provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;

fi;

</source>
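Because this variant takes a single transcript ID as its argument, a small loop is needed to submit every transcript in prot_alt.txt. The sketch below is one possible way to do that and is not part of the original procedure: the file name runProvean_single.sh is a hypothetical name for the script above, and the SUBMIT variable is an invented switch that defaults to a dry run which only prints the sbatch commands.

```shell
# Hypothetical submit loop; by default it only prints the commands (dry run).
# Set SUBMIT=sbatch in the environment to submit for real.
SUBMIT=${SUBMIT:-echo sbatch}

submit_all() {
    # Column 6 of prot_alt.txt holds the transcript ID; handle each ID once.
    cut -f6 "$1" | sort -u | while read -r tran; do
        # </dev/null keeps the submit command from consuming the ID list on stdin.
        $SUBMIT runProvean_single.sh "$tran" </dev/null
    done
}

# Made-up three-row table with two distinct transcripts in column 6.
demo_tab=$(mktemp)
printf 'v1\t1\t100\tA\tGENE1\tTRANS1\nv2\t1\t150\tC\tGENE1\tTRANS1\nv3\t2\t300\tT\tGENE2\tTRANS2\n' > "$demo_tab"

submit_all "$demo_tab"
```

Without the hourly cleaning step of the master script, leftover Provean temporary folders are not removed, so this route is probably best kept for small numbers of transcripts.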
