Provean Sus scrofa: Difference between revisions
== See also ==
[[Provean_1.1.3 | Provean on Anunna]]
Latest revision as of 16:02, 15 July 2019
This page describes the procedure for mapping all known variants (batch of first 150 pigs, wild boar re-sequencing) at the ABGC.
== Pre-requisites ==
From Variant Effect Predictor output, select only protein-altering variants and sort by transcript:
<source lang='bash'>
cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
</source>
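The effect of this filter can be checked on a few invented records. The sketch below assumes tab-separated input in which column 11 holds the amino-acid change (for example <code>A/T</code>) and column 2 holds a location of the form <code>chromosome:position</code>; the function names and all values are hypothetical, and <code>\t</code> in <code>sed</code> assumes GNU sed.

```shell
# Three made-up records with 14 tab-separated columns (all values are invented).
make_demo_vep() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    v1 1:1000 A GENE1 TRANS_B Transcript missense_variant 50 40 14 A/T Gcc/Acc - - \
    v2 1:2000 C GENE1 TRANS_B Transcript synonymous_variant 80 70 24 L ctG/ctC - - \
    v3 2:3000 G GENE2 TRANS_A Transcript missense_variant 20 10 4 R/H cGc/cAc - -
}

# Same pipeline as above, reading from standard input:
# keep rows whose column 11 contains "/", split the first ":" into a tab
# (so the transcript moves to column 6), then sort from column 6 onwards.
filter_protein_altering() {
  awk '$11~/\//' | sed 's/:/\t/' | sort -k6
}

make_demo_vep | filter_protein_altering
```

Only the two records with an amino-acid change remain, and the record for <code>TRANS_A</code> is listed before the one for <code>TRANS_B</code>.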
Protein models for Sus scrofa:
/lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa
== Automated procedure for mapping ==
The Provean analysis is somewhat involved because of an apparent bug in the program that causes conflicts between temporary files. This is particularly problematic when farming out thousands of individual searches (i.e. one per protein sequence) on the cluster. The cluster nodes therefore need periodic 'cleaning' of the temporary directories that are left behind.
=== Master script to control the submission of jobs and cleaning ===
The following script submits up to 300 new runs every hour. Note that at the start of each cycle it cancels any Provean jobs that are still in the queue and, importantly, removes leftover Provean temporary folders from the <code>/tmp</code> directories of all nodes. This prevents the error in which Provean fails to create its temporary folders.
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
# TELLER (Dutch for "counter") stays above 99, so the loop below runs until the script is stopped.
TELLER=100
echo $TELLER;
let TELLER+=1;
echo $TELLER;
while [ $TELLER -gt 99 ]; do
  # Cancel every job whose squeue line contains 'Provean'.
  PROVS=`squeue | grep Provean | sed 's/^ \+//' | sed 's/ \+/\t/' | cut -f1`;
  for PROV in $PROVS; do
    scancel $PROV;
  done;
  sleep 10;
  # Remove leftover Provean temporary directories from /tmp on all nodes.
  for i in `seq 1 2`; do ssh fat00$i 'rm -rf /tmp/provean*'; done;
  for i in `seq 10 60`; do ssh node0$i 'rm -rf /tmp/provean*'; done;
  for i in `seq 1 9`; do ssh node00$i 'rm -rf /tmp/provean*'; done;
  # Submit at most 300 transcripts that do not yet have a supporting set (.sss) file.
  TRANS=`cat prot_alt.txt | cut -f6 | sort | uniq`;
  TELLER2=0;
  for TRAN in $TRANS; do
    if [ $TELLER2 -lt 300 ]; then
      echo "transcript: $TRAN";
      echo "teller boven: $TELLER2";
      PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRAN | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`;
      echo "protein: $PROT";
      if [ -f $PROT.sss ]; then
        echo "$PROT $TRAN already done";
      else
        echo "will do sbatch runProvean_sub.sh $TRAN";
        sbatch runProvean_sub.sh $TRAN;
        let TELLER2+=1;
        echo "teller onder: $TELLER2";
      fi;
    fi;
  done;
  # Wait an hour before the next cycle of cancelling, cleaning and submitting.
  sleep 3600;
done
</source>
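The batching rule used in the loop (skip proteins that already have a <code>.sss</code> file, stop after a fixed number of submissions) can be tried out without a cluster. The following is only a sketch under assumptions: <code>pending_batch</code> is a hypothetical helper, and it reads a precomputed two-column transcript-to-protein table instead of searching the FASTA file for every transcript as the master script does.

```shell
# Print at most $3 transcripts whose protein has no <protein>.sss file in $2.
# $1 is a hypothetical "transcript<TAB>protein" table, one pair per line.
# The parentheses run the body in a subshell, so its variables stay local.
pending_batch() (
  tab=$(printf '\t')
  count=0
  while IFS="$tab" read -r tran prot; do
    [ "$count" -ge "$3" ] && break
    if [ ! -f "$2/$prot.sss" ]; then
      echo "$tran"
      count=$((count + 1))
    fi
  done <"$1"
)

# Dry run on invented identifiers in a scratch directory.
work=$(mktemp -d)
printf 'TRANS_A\tPROT_A\nTRANS_B\tPROT_B\nTRANS_C\tPROT_C\n' >"$work/map.tsv"
touch "$work/PROT_A.sss"                 # pretend PROT_A has already finished
pending_batch "$work/map.tsv" "$work" 1  # prints TRANS_B
```

With a limit of 300 instead of 1, the same call would list both unfinished transcripts, which corresponds to one hourly cycle of the master script.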
=== The slave script that does the actual submission ===
The 'runProvean_sub.sh' script referred to in the above script consists of the following code:
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
# The transcript ID is passed as the first argument.
TRANS=$1
# Look up the protein ID in the FASTA header that mentions this transcript.
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
# Write columns 11 and 12 of prot_alt.txt as comma-separated variant records.
cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
# Keep the annotation of each variant, with the protein ID added, for later reference.
cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
# Extract the protein sequence and name the variant file after the protein.
faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
mv $TRANS.var $PROT.var;
# Run Provean; the saved supporting set (.sss) also marks this protein as done.
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
</source>
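The two text-processing steps of this script can be tested in isolation on invented data. The sketch below wraps them in two hypothetical functions, <code>protein_for_transcript</code> and <code>to_var_records</code>; the FASTA header layout (protein ID first, transcript ID later on the same line) and all identifiers are assumptions made for illustration, and the <code>sed</code> expressions rely on GNU sed.

```shell
# Protein ID of the FASTA header line that mentions transcript $1 in file $2.
protein_for_transcript() {
  grep "$1" "$2" | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1
}

# Convert prot_alt.txt-style lines (position in column 11, amino acids in
# column 12, e.g. "14" and "A/T") into comma-separated records ("14,A,T").
to_var_records() {
  awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' |
    awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g'
}

# Demo with made-up identifiers.
demo_dir=$(mktemp -d)
printf '>PROT_A pep:known transcript:TRANS_A\nMKVLG\n' >"$demo_dir/pep.fa"
protein_for_transcript TRANS_A "$demo_dir/pep.fa"
printf 'v1\t1\t1000\tA\tGENE1\tTRANS_A\tTranscript\tmissense_variant\t50\t40\t14\tA/T\tGcc/Acc\t-\t-\n' |
  to_var_records
```

Note that <code>grep</code> matches substrings, so this lookup assumes that no transcript ID is a prefix of another one.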
== Alternative: submission per transcript - no cleaning ==
Individual transcripts can also be submitted using the following script:
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#SBATCH --partition=ABGC_Research
#cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
# The transcript ID is passed as the first argument.
TRANS=$1
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
# Skip proteins for which a supporting set (.sss) already exists.
if [ -f $PROT.sss ];
then
  echo "$PROT $TRANS already done.";
else
  cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
  cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
  faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
  mv $TRANS.var $PROT.var;
  provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
fi;
</source>
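One possible way to drive the per-transcript script over a whole table is sketched below. It assumes the script is saved as <code>provean_single.sh</code> (a hypothetical name) and that the transcript ID sits in column 6, as in <code>prot_alt.txt</code>. By default the helper only prints the <code>sbatch</code> commands; setting the assumed variable <code>SUBMIT=sbatch</code> would actually submit them.

```shell
# Print (or run) one submission per unique transcript in column 6 of table $1.
# $2 is the job script; SUBMIT defaults to "echo sbatch", i.e. a dry run.
submit_all() {
  cut -f6 "$1" | sort -u | while read -r tran; do
    ${SUBMIT:-echo sbatch} "$2" "$tran"
  done
}

# Dry run on an invented three-line table with two distinct transcripts.
demo=$(mktemp)
printf 'v1\t1\t1000\tA\tGENE1\tTRANS_B\n' >"$demo"
printf 'v2\t1\t1500\tC\tGENE1\tTRANS_B\n' >>"$demo"
printf 'v3\t2\t3000\tG\tGENE2\tTRANS_A\n' >>"$demo"
submit_all "$demo" provean_single.sh
rm -f "$demo"
```

Because this route does no cleaning, leftover temporary directories on the nodes are not removed between jobs.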