RNA-seq analysis

From HPCwiki
Jump to navigation Jump to search

Typical commands used for analyzing RNA-seq with Tophat (including Bowtie2 as aligner).

  • Examples are RNA-seq (stranded) from pig aligned against the pig reference genome (S. scrofa - 10.2)
  • Tophat, Bowtie2, Picard and GATK need to be in PATH or loaded as modules (e.g.: module load SHARED/bowtie/2-2.2.1; module load SHARED/tophat/2.0.11)
  • Bowtie2 index (made with bowtie2_build) of reference genome (need only to be made ones)
  • PCR duplicates removed with Picard
  • For allelic expression and rna-editing re-aligning with GATK

For expression analysis

  • Allowing multiply hits (20 times)
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000

#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

#Brain_fontalloop
#Tophat alignment
tophat -p 16 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand  \
-o tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam

#Remove PCR duplicates with Picard MarkDuplicates
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

For allelic expression and rna-editing analysis

  • Unique alignment only
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000

#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

#Brain_fontalloop
#Tophat alignment
tophat -p 16 -g 1 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand  \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam

#Remove PCR duplicates with Picard
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

#Index BAM file for GATK analysis
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

#Re-align with GATK 
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-targetIntervals tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam

#Index final BAM file
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bai

#Remove tmp bam (bai) file
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

Some notes on Tophat options:

  • -g (number of alignments, default 20)
  • -M (prefilter-multihits against genome)
  • --rg-id; --rg-sample; --keep-fasta-order (needed for many downstream analysis like Picard and GATK)