Assembly & Annotation

From HPCwiki
Jump to navigation Jump to search

Protocol with typical commands used for de novo assembly and annotation

Author: Martijn Derks

All script referred to in this protocol can be found here: /home/WUR/derks047/BifDev/derks047

  • Software
  • Preprocessing
  • Assembly
  • Assembly validation
  • Annotation
  • Submission
  • Visualization

Software

List of tools described in this protocol

Fastqc

Jellyfish

Trimmomatic

Lighter

IOGA

Circlator

DBG2OLC

FALCON

SSPACE

SSPACE-Longread

PBJelly

Sparc

PBDagcon

Pilon

Vecscreen

Haplomerger2

Bowtie2

BBmap

BLASR

CEGMA

BUSCO

Isoblat

RepeatModeler

RepeatMasker

Hisat2

Tophat2

Cufflinks

TransDecoder

Braker1

MAKER2

JBrowse

Freebayes

WoLFPSort

SignalP

Preprocessing

Quality control:

Check quality of your data using FastQC and fastq_stats.py. <source lang='bash'> fastqc ../*.gz </source> Explore the report to do the quality check and identify potential adapters and primers in the sequences.

K-mer analysis:

Us the script kmer_analysis.sh to get the genomic properties based on the k-mer distribution. Genomic properties include genome size and percentage of heterozygosity.

<source lang='bash'> Preprocessing/kmer_analysis.sh -m <kmer_size> -c <error-cutoff> -s <hash-size> -t <threads> -i <R1.fastq.gz R2.fastq.gz ...> -o <output_dir> </source> Trimming:

Use Trimmomatic to trim Illumina data. Make sure your fasta file with the adapters corresponds to the adapters found in the FastQC report. Use the following script: <source lang='bash'> preprocessing/run_trimmomatic.sh -t <num_threads> -f <FW_reads.fastq> -r <RV_reads.fastq> </source>

Error correction

Lighter is a fast tool to error correct your Illumina data. Use the following script: <source lang='bash'> preprocessing/run_lighter_error_correction.sh -g <genome_size> -c <coverage> -f <FW_reads.fastq> -r <RV_reads.fastq> </source>

Organelle assembly

Download a proper reference from the NCBI database. Use the IOGA pipeline to assemble to organellar genome. <source lang='bash'> assembly/run_IOGA.sh -a <assembly> -f <fw_reads.fastq> -r <reverse_reads.fastq> -i <insert_size> -t <num_threads> -n <name_prefix> </source> Map your reads to the newly assembled genome and manually check if it is circular.

Use Pilon to correct remaining errors in the assembly using the mapped reads.

Annotate using MITOS or DOGMA online tools.

Submit here: http://www.ncbi.nlm.nih.gov/LargeDirSubs/dir_submit.cgi

If long-read data is available, use the new software Circlator to make your assembly circular.

Assembly

The type of software used heavily depends on the type of data you have. Below are some examples:

PacBio/Illumina hybrid assembly:

To do a hybrid assembly with DBG2OLC use the following script: <source lang='bash'>Assembly/run_dbg2olc.sh -k <kmer_size> -f <shortreads.fastq> -p <long_reads.fastq> -g <genome_size> -a <adaptive_threshold> </source>

Use both Sparseassembler to build the contigs based on Illumina data or use Platanus (which often gives better results). You should run platanus with default settings.

Don’t forget to polish the genome with Sparc after the assembly is finished, as described in the [1]

PacBio only assembly

Use falcon is you consider to do a PacBio only assembly. Run on the cluster and follow the instructions in the manual: https://github.com/PacificBiosciences/FALCON/wiki/Manual After assembly don’t forget to polish using Quiver or PBDagcon (same as for DBG2OLC).

<source lang='bash'>blasr <reads.fasta> <assembly.fasta> -bestn 1 -m 5 -out mapped.m5 pbdagcon -c 1 mapped.m5 > assembly_consensus.fasta </source>

Use Pilon to polish with Illumina data:

<source lang='bash'>Assembly/run_pilon -a <assembly> -b <bam_file> -t <num_threads> </source>

Illumina scaffolding

Use SSPACE to scaffold using Illumina mate pair or paired end data. <source lang='bash'>Assembly/run_sspace.sh -a <assembly> -l <libraries.txt> -k <num_links> -t <num_threads> -p <prefix> </source>

PacBio scaffolding

SSPACE-LongRead can be used to post-scaffold your generated contigs. Be aware that SSPACE-LongRead does not do consensus calling for scaffolding so there will be a gap introduced.

<source lang='bash'>Assembly/run_sspace_longread.sh -a <assembly> -p <pacbio.fasta> -k <num_links> -t <num_threads> </source>

Use PBJelly after SSPACE-LongRead to fill the gaps and to extend scaffolds.

<source lang='bash'>Assembly/run_pbjelly.sh -p <protocol> -t <num_threads> </source>

Assembly validation

Vecscreen

Use Vecscreen to check for adapters or contamination in your assembly: <source lang='bash'>Assembly/run_vecscreen.sh -g <assembly> -t <num_threads> </source>

Use HaploMerger on a highly heterozygous genome to merge redundant contigs. Make sure you run HaploMerger on a soft masked genome. A new version (v2) has been released [2].

Mapping

Map the Illumina and/or PacBio reads to the assembly to validate the completeness of the assembly

Illumina:

<source lang='bash'>mapping/bowtie2.sh [-h] -a <assembly> -l <name-prefix> -f <FW_reads.fastq> -r <RV_reads.fastq> -t <num_threads> </source> <source lang='bash'>mapping/run_bbmap.sh [-h] -a <assembly> -l <name-prefix> -f <FW_reads.fastq> -r <RV_reads.fastq> -t <num_threads></source>

PacBio:

<source lang='bash'>mapping/run_blasr.sh [-h] -a <assembly> -l <name-prefix> -f <FW_reads.fastq> -r <RV_reads.fastq> -t <num_threads></source>

Completeness

Use CEGMA and BUSCO to evaluate the completeness of the genome in terms of gene space. <source lang='bash'>run_cegma.sh [-h] -a <assembly> -t <num_threads> run_BUSCO.sh [-h] -a <assembly> -t <num_threads> -m <mode:[genome|OGS]> -c <clade> -o <output_prefix> </source> If you have an de novo transcriptome available you can evaluate the completeness of your assembly using Isoblat [3].

Annotation

De novo repeat library

It is important to build a denovo repeat library using RepeatModeler especially for a species that doesn't have many closely related species sequenced:

<source lang='bash'>annotation/run_repeatmodeler.sh [-h] -a <assembly> -l <name-prefix> -t <num_threads> </source>

To mask your genome with the constructed repeat library use:

<source lang='bash'>annotation/run_repeatmasker.sh [-h] -l <denovo_library> -a <assembly> -t <num_threads> </source>

Transcriptome Use hisat2 or tophat2 to map RNA-seq reads to your assembled genome. Make sure you do the quality check and trimming, e.g. using the same approach as for the genomic reads.

Mapping

<source lang='bash'> annotation/run_hisat2.sh [-h] -a <assembly> -l <name-prefix> -f <FW_reads.fastq> -r <RV_reads.fastq> -t <num_threads> annotation/run_tuxedo.sh [-h] -a <assembly> -l <name-prefix> -f <FW_reads.fastq> -r <RV_reads.fastq> -t <num_threads> </source>

Both scripts include the Cufflinks transcript abundance estimations. In addition cufflinks2gff3 from the maker pipeline is run to parse the results to a maker accepted gff3 file. The cufflinks gtf files (or the merged one) can be used in transdecoder to get the full transcripts to use for the annotation:

<source lang='bash'>run_transdecoder.sh [-h] -a <assembly> -g <transcripts.gtf> </source> The .mRNA and gff3 files can be given to maker2 as evidence.

Annotation

Braker

The braker pipeline is an annotation tool based on RNA-seq. It automatically trains AUGUSTUS and GeneMark which can be used in maker. Braker needs a bam file produced by tophat2 or hisat2 to predict gene models. If multiple RNA-seq libraries are available try to merge these bam files (samtools merge) and run braker.

<source lang='bash'>annotation/run_braker.sh [-h] -a <assembly> -b <rnaseq_bam_file> -l <species_name> -t <threads> </source>

The perl script getAnnoFasta.pl from AUGUSTUS can be used to get the protein and transcript sequences from the predicted gene models. These can again be used in maker2 as evidence.

MAKER2

Carefully select your evidence in the maker_opts file for maker2. <source lang='bash'> Est= Set your de novo transcriptome and the transdecoder .mRNA files in this section. Est_gff=Your cufflinks gff3 files Protein= Set your protein evidence. If you have several RNA-seq libraries available try to set little protein evidence, the est2genome is quite slow. Rmlib= Set the denovo repeatlibrary from RepeatModeler. Gmhmm= Set the .mod file from braker. Augustus_species= The augustus species set with braker. </source>

Once you set your evidence you can use this script to run maker2 in parallel: <source lang='bash'> annotation/run_maker.sh [-h] -p <threads> -l <name-prefix> </source>

Functional annotation

Post processing & Functional annotation:

CEGMA and BUSCO validation on your complete set of annotated proteins can be performed similar as for the assembly except that the mode should be changed to OGS.

The post processing of the maker2 output consists of three sections:

<source lang='bash'>annotation/maker_pp.sh –a <assembly> -b <maker_base_name> -g <gene_name_prefix></source>

The first section changes the names of maker2 output files and calculated the assembly and annotation statistics. Also the identified repeats are summarized.

The second section is the functional annotation which should run on the cluster. It will provide the command to be run on the cluster to find homologous with blastp in SwissProt and TrEMBL databases and to identify functional domains in the proteins.

The third section adds the functional annotation to the files and summarizes all the information in an tab delimited file.

Submission

Submission

Check gene properties including intron lengths with GeneProperties.py: <source lang='bash'> GeneProperties.py [-h] ANNOTATION_GFF </source>

Parse blast products:

<source lang='bash'>annotation/blast2product.py [-h] SWISS_BLAST > product_annotation.annotations </source>

Convert gff to tbl format:

GAG [4] is used to convert the genome fasta file and gff file to ncbi .tbl format. gag.py -f <assembly> -g <genes_gff> -a product_annotation.annotations --fix_start_stop

The resulting .tbl file can be found in the gag_output directory Remove unwanted lines in tbl file: Some unwanted lines can be removed from the tbl file.

This is an example for Folsomia candida submission: grep -v "REFERENCE" gag_output/genome.tbl | grep -v "PBARC" | grep -vP "gene\tFcan01" > Fcan01.tbl

Run tbl2asn: Finally tbl2asn can be used to produce the file for submission. If you do not have a template file create one here: [5]

<source lang='bash'>tbl2asn.sh [-h] -d <directory> -t <template_file> -e <evidence></source>

See the help section on how to use the script. Examine the errorsummary file to see if there are no errors, warnings are accepted by genbank. In addition, look in the discrep file to see if there are any other issues.

If not, submit the .sqn file to NCBI.

Other types of analysis

Protein_localization

Script to run signalp and WoLFPSort to get the predict protein cellular locations: <source lang='bash'>annotation/protein_localization_analysis.sh [-h] -p <proteome.fasta> -n <name_prefix> -o <organism_type></source>

Organism type is either animal, plant or fungi.

Variant calling

Use freebayes to call variants for your genome: <source lang='bash'>mapping/run_freebayes.sh [-h] -a <assembly> -b <bam_file> -l <name-prefix> -p <ploidy></source>

Change the script of you want to use different settings than default.

Visualization

JBrowse

Get the latest jbrowse here: http://jbrowse.org/ Unzip in your target directory and run setup.sh to install JBrowse.

Run the script: <source lang='bash'>annotation/set_jbrowse.sh</source>in your JBrowse directory All the tracks are listed in the data/trackList.json file, you can manually set categories or other things you want to add here.

Portal

To set up a portal copy the portal directory from e.g. /mnt/geninf15/prog/www/htdocs/springtail/portal/ Use sed to change organism in the index.html files for each section.

Blast: Viroblast can be used to set up a blast service. Create the blast databases in the db folder. And set the viroblast.ini with the correct names. You can embed the the viroblast URL in the index.html file.

Browser: Embed the browser in the index.html file.

Data: The index.html contains a table with links to all the datafiles, you can manually set the data you would like to put on the portal here.

Search: To get a working search page in your portal copy an example directory from here: /mnt/geninf15/prog/www/htdocs/springtail/genes Set the produced annotation.tsv file in the search directory and set the right file links in genes.py.

Uregions: Copy this directory to start with: /mnt/geninf15/prog/www/htdocs/springtail/upstream_regions

Change the name in the .html files in the templates folder to your species. Set the links to the gene_body gff file and assembly in the extract_gene_positions.py file. Embed the working URL in the index.html