RNA-seq analysis

From HPCwiki
Jump to navigation Jump to search

Typical commands used for analyzing RNA-seq with Tophat (including Bowtie2 as aligner).

  • Examples are RNA-seq (stranded) from pig aligned against the pig reference genome (S. scrofa - 10.2)
  • Tophat, Bowtie2, Picard and GATK need to be in PATH or loaded as modules (e.g.: module load SHARED/bowtie/2-2.2.1; module load SHARED/tophat/2.0.11)
  • Bowtie2 index (made with bowtie2_build) of reference genome (need only to be made ones)
  • PCR duplicates removed with Picard
  • For allelic expression and rna-editing re-aligning with GATK

For expression analysis

  • Allowing multiply hits (20 times)

<source lang='bash'>

  1. !/bin/bash
  2. SBATCH --time=2-12:00:00
  3. SBATCH -n 1
  4. SBATCH -c 16
  5. SBATCH --output=output_%j.txt
  6. SBATCH --error=error_output_%j.txt
  7. SBATCH --job-name=AG_tophat
  8. SBATCH --mem=20000
  1. Setting java tmp for slurm

export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

  1. Brain_fontalloop
  2. Tophat alignment

tophat -p 16 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \ -o tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

  1. Rename Tophat alignment file

mv tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam

  1. Remove PCR duplicates with Picard MarkDuplicates

java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \ M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

</source>

For allelic expression and rna-editing analysis

  • Unique alignment only

<source lang='bash'>

  1. !/bin/bash
  2. SBATCH --time=2-12:00:00
  3. SBATCH -n 1
  4. SBATCH -c 16
  5. SBATCH --output=output_%j.txt
  6. SBATCH --error=error_output_%j.txt
  7. SBATCH --job-name=AG_tophat
  8. SBATCH --mem=20000
  1. Setting java tmp for slurm

export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

  1. Brain_fontalloop
  2. Tophat alignment

tophat -p 16 -g 1 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \ -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

  1. Rename Tophat alignment file

mv tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam

  1. Remove PCR duplicates with Picard

java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \ M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

  1. Index BAM file for GATK analysis

samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

  1. Re-align with GATK

java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \ -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \ -targetIntervals tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam

  1. Index final BAM file

samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bai

  1. Remove tmp bam (bai) file

rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

</source>

Some notes on Tophat options:

  • -g (number of alignments, default 20)
  • -M (prefilter-multihits against genome)
  • --rg-id; --rg-sample; --keep-fasta-order (needed for many downstream analysis like Picard and GATK)