RNA-seq analysis
Jump to navigation
Jump to search
Typical commands used for analyzing RNA-seq with Tophat (including Bowtie2 as aligner).
- Examples are RNA-seq (stranded) from pig aligned against the pig reference genome (S. scrofa - 10.2)
- Tophat, Bowtie2, Picard and GATK need to be in PATH or loaded as modules (e.g.: module load SHARED/bowtie/2-2.2.1; module load SHARED/tophat/2.0.11)
- Bowtie2 index (made with bowtie2_build) of reference genome (need only to be made ones)
- PCR duplicates removed with Picard
- For allelic expression and rna-editing re-aligning with GATK
For expression analysis
- Allowing multiply hits (20 times)
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000
#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp
#Brain_fontalloop
#Tophat alignment
tophat -p 16 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \
-o tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz
#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam
#Remove PCR duplicates with Picard MarkDuplicates
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true
For allelic expression and rna-editing analysis
- Unique alignment only
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000
#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp
#Brain_fontalloop
#Tophat alignment
tophat -p 16 -g 1 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz
#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam
#Remove PCR duplicates with Picard
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true
#Index BAM file for GATK analysis
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai
#Re-align with GATK
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-targetIntervals tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam
#Index final BAM file
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bai
#Remove tmp bam (bai) file
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai
Some notes on Tophat options:
- -g (number of alignments, default 20)
- -M (prefilter-multihits against genome)
- --rg-id; --rg-sample; --keep-fasta-order (needed for many downstream analysis like Picard and GATK)