Assembly & Annotation

From HPCwiki
Derks047 (talk | contribs)
Revision as of 12:03, 21 January 2016

Protocol with typical commands used for de novo assembly and annotation

  • Software
  • Preprocessing
  • Assembly
  • Assembly validation
  • Annotation
  • Submission
  • Visualization

Software

Preprocessing

Quality control:

Check quality of your data using FastQC and fastq_stats.py.

 fastqc ../*.gz

Inspect the report to assess read quality and to identify potential adapters and primers in the sequences.

K-mer analysis:

Use the script kmer_analysis.sh to estimate genomic properties from the k-mer distribution. These properties include genome size and percentage of heterozygosity.

 Preprocessing/kmer_analysis.sh  -m <kmer_size> -c <error-cutoff> -s <hash-size> -t <threads> -i <R1.fastq.gz R2.fastq.gz ...> -o <output_dir>
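The internals of kmer_analysis.sh are not shown on this page, so the following is only a sketch of the usual arithmetic behind a k-mer based genome size estimate: the total number of k-mers divided by the depth of the main peak. It assumes a two-column histogram (depth, number of distinct k-mers), such as the one written by `jellyfish histo`. The function name `estimate_genome_size` and the default cutoff of 5 are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: estimate genome size from a k-mer histogram.
# Input: two whitespace-separated columns: depth, number of distinct k-mers.
# Depths below the error cutoff are treated as sequencing errors and ignored.
estimate_genome_size() {
    local histo=$1 cutoff=${2:-5}
    awk -v cutoff="$cutoff" '
        $1 >= cutoff {
            total += $1 * $2                  # total k-mers above the cutoff
            if ($2 > peak_count) { peak_count = $2; peak_depth = $1 }
        }
        END {
            if (peak_depth == 0) { print "no k-mers above cutoff" > "/dev/stderr"; exit 1 }
            printf "peak_depth=%d genome_size=%d\n", peak_depth, total / peak_depth
        }' "$histo"
}

# Toy histogram (made-up numbers): error k-mers at depth 1-2, main peak at depth 20.
histo=$(mktemp)
printf '%s\n' '1 5000' '2 800' '19 900' '20 1000' '21 900' > "$histo"
estimate_genome_size "$histo" 5   # prints: peak_depth=20 genome_size=2800
rm -f "$histo"
```

For a strongly heterozygous genome the highest peak may be the heterozygous one at half the expected depth, in which case this simple estimate would be roughly twice too large; check the histogram plot before trusting the number.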

Trimming:

Use Trimmomatic to trim Illumina data. Make sure the adapter FASTA file matches the adapters found in the FastQC report.

Use the following script:

 preprocessing/run_trimmomatic.sh -t <num_threads> -f <FW_reads.fastq> -r <RV_reads.fastq>

Error correction

Lighter is a fast tool to error correct your Illumina data.

Use the following script:

 preprocessing/run_lighter_error_correction.sh -g <genome_size> -c <coverage> -f <FW_reads.fastq> -r <RV_reads.fastq>
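The `-c <coverage>` argument is the average sequencing depth. If it is not already known, it can be approximated as the total number of sequenced bases divided by the genome size. The sketch below does this for plain or gzipped FASTQ files; the function name `estimate_coverage` is hypothetical, and standard four-line FASTQ records are assumed.

```shell
#!/usr/bin/env bash
# Sketch: approximate coverage = total sequenced bases / genome size.
# Assumes standard four-line FASTQ records; .gz files are decompressed on the fly.
estimate_coverage() {
    local genome_size=$1; shift
    local f
    for f in "$@"; do
        case $f in
            *.gz) gzip -cd -- "$f" ;;
            *)    cat -- "$f" ;;
        esac
    done | awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }   # sequence line of each record
        END { printf "%.1f\n", bases / g }'
}

# Toy example: two reads of 10 bases each against a 4 bp "genome".
fq=$(mktemp)
printf '%s\n' '@r1' 'ACGTACGTAC' '+' 'IIIIIIIIII' '@r2' 'ACGTACGTAC' '+' 'IIIIIIIIII' > "$fq"
estimate_coverage 4 "$fq"   # prints 5.0
rm -f "$fq"
```

With real data the call would look like `estimate_coverage <genome_size> <FW_reads.fastq> <RV_reads.fastq>`.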

Organelle assembly

Download a suitable reference from the NCBI database. Use the IOGA pipeline to assemble the organellar genome.

 assembly/run_IOGA.sh -a <assembly> -f <fw_reads.fastq> -r <reverse_reads.fastq> -i <insert_size> -t <num_threads> -n <name_prefix>

Map your reads to the newly assembled genome and manually check if it is circular.

Use Pilon to correct remaining errors in the assembly using the mapped reads.

Annotate using MITOS or DOGMA online tools.

Submit here: http://www.ncbi.nlm.nih.gov/LargeDirSubs/dir_submit.cgi

If long-read data is available, use Circlator to circularize your assembly.

Assembly

The choice of software depends heavily on the type of data you have. Below are some examples:


PacBio/Illumina hybrid assembly:

To do a hybrid assembly with DBG2OLC use the following script:

 Assembly/run_dbg2olc.sh -k <kmer_size> -f <shortreads.fastq> -p <long_reads.fastq> -g <genome_size> -a <adaptive_threshold>

Use either SparseAssembler or Platanus to build the contigs from the Illumina data. Platanus often gives better results and should be run with default settings.

Don’t forget to polish the genome with Sparc after the assembly is finished, as described in [1].

PacBio only assembly

Use FALCON if you are considering a PacBio-only assembly. Run it on the cluster and follow the instructions in the manual: https://github.com/PacificBiosciences/FALCON/wiki/Manual

After assembly, don’t forget to polish using Quiver or PBDagcon (same as for DBG2OLC).

 blasr /mnt/nexenta/derks047/xero_cast/nobackup/xero/PacBio/xero_pacbio_sorted_3000.fasta ../quickmerge/merged_default.fasta -bestn 1 -m 5 -out xvis_mapped.m5
 pbdagcon -c 1 xvis_mapped.m5 > xvis_consensus.fasta


Use Pilon to polish with Illumina data:

 Assembly/run_pilon -a <assembly> -b <bam_file> -t <num_threads>
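The `-b <bam_file>` argument expects the Illumina reads mapped to the assembly as a sorted, indexed BAM file. This page does not say how that file is produced; one common route is BWA-MEM followed by samtools, sketched below. The function name `map_reads` and the `DRY_RUN` switch are hypothetical, and bwa and samtools (1.x) are assumed to be on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: map paired-end reads to an assembly and write a sorted, indexed BAM.
# Assumes bwa and samtools (1.x) are on PATH. With DRY_RUN=1 the commands are
# only printed, which is handy for checking them before submitting a job.
map_reads() {
    local assembly=$1 fw=$2 rv=$3 threads=${4:-4}
    local bam="${assembly%.*}.sorted.bam"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "bwa index $assembly"
        echo "bwa mem -t $threads $assembly $fw $rv | samtools sort -@ $threads -o $bam -"
        echo "samtools index $bam"
        return 0
    fi
    bwa index "$assembly" &&
    bwa mem -t "$threads" "$assembly" "$fw" "$rv" |
        samtools sort -@ "$threads" -o "$bam" - &&
    samtools index "$bam"
}

# Print the commands without running them:
DRY_RUN=1 map_reads assembly.fasta FW_reads.fastq RV_reads.fastq 8
```

Dropping `DRY_RUN=1` runs the mapping and leaves `assembly.sorted.bam` plus its index next to the assembly, ready to pass to the Pilon script.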

Illumina scaffolding

Use SSPACE to scaffold using Illumina mate-pair or paired-end data.

 Assembly/run_sspace.sh -a <assembly> -l <libraries.txt> -k <num_links> -t <num_threads> -p <prefix>
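The layout of `<libraries.txt>` is not described on this page. As far as I recall from the SSPACE-standard 3.0 README, each line describes one library with seven space-separated columns: library name, aligner, first read file, second read file, insert size, allowed insert size deviation (as a fraction) and read orientation. Older versions use a different column layout, so treat the sketch below as an assumption and check the README of the installed version. All file names and insert sizes are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: write an SSPACE library file (layout assumed from SSPACE-standard 3.0;
# verify against the README of the installed version).
# Columns: name  aligner  reads_1  reads_2  insert_size  size_error  orientation
cat > libraries.txt <<'EOF'
PE400 bwa PE400_R1.fastq PE400_R2.fastq 400 0.25 FR
MP3000 bwa MP3000_R1.fastq MP3000_R2.fastq 3000 0.5 RF
EOF

# Sanity check: every line must have exactly seven columns.
awk 'NF != 7 { printf "line %d has %d columns\n", NR, NF; bad = 1 } END { exit bad }' libraries.txt \
    && echo "libraries.txt looks well-formed"
```

Paired-end libraries are normally given as FR and mate-pair libraries as RF, which is why the two placeholder lines differ in the last column.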

PacBio scaffolding

SSPACE-LongRead can be used to post-scaffold your generated contigs. Be aware that SSPACE-LongRead does not call a consensus when scaffolding, so gaps are introduced between the joined contigs.

 Assembly/run_sspace_longread.sh -a <assembly> -p <pacbio.fasta> -k <num_links> -t <num_threads>

Use PBJelly after SSPACE-LongRead to fill the gaps and to extend scaffolds.

 Assembly/run_pbjelly.sh -p <protocol> -t <num_threads>

Assembly validation

Annotation

Functional annotation

Submission

Other types of analysis