General advice for de novo assembly

De novo assembly is one of the most RAM- and CPU-intensive tasks that can be undertaken in Geneious. See our knowledge base post for an overview of hardware requirements for de novo assembly with the Geneious de novo assembler.

To improve assembly quality and to reduce the time and RAM required for assembly, you should always:

1. Trim read ends using BBDuk. Trim stringently, with a quality ("Q") value of 20 or greater.

2. Aim for assembly coverage of 50-100x. Higher coverage will not usually improve the final result, but will increase the RAM and time required for assembly. See the following link for information on how to calculate expected coverage from your expected genome size, average (trimmed) read length, and number of reads; a worked example is also shown below.
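
Expected coverage is simply total sequenced bases divided by expected genome size. As a rough worked example, here is a minimal Python sketch (the numbers are illustrative only, not from a real data set):

    def expected_coverage(num_reads, avg_read_length, genome_size):
        # Coverage = total sequenced bases / genome size
        return num_reads * avg_read_length / genome_size

    # Illustrative numbers: 8 million reads (4 million pairs) averaging
    # 140 bp after trimming, against an expected 5 Mbp genome.
    cov = expected_coverage(8_000_000, 140, 5_000_000)
    print(f"Expected coverage: {cov:.0f}x")  # 224x -> subsample towards 50-100x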

To de novo assemble a subset of your reads use one of the following options:

a) In the De novo assembly settings window, check the option to Use X% of data and set an appropriate value based on your coverage calculations. Note that this always uses the first X% of reads in your list.

b) Use Workflows → Randomly sample sequences to create a randomly sampled subset of your read list (keeping pairs together); a scripted equivalent is sketched after this list.

c) Use Normalization to reduce the size of your data set as described below.
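
For option b), the same idea can also be scripted outside Geneious. The sketch below is a minimal illustration of random pair subsampling, assuming two in-sync paired FASTQ files with hypothetical names; it is not the Geneious workflow itself:

    import random

    def read_fastq(path):
        # Load a FASTQ file as a list of 4-line records.
        with open(path) as fh:
            lines = fh.read().splitlines()
        return [lines[i:i + 4] for i in range(0, len(lines), 4)]

    def sample_pairs(r1_path, r2_path, fraction, seed=42):
        # Sample the same record indices from both files so mates stay together.
        r1, r2 = read_fastq(r1_path), read_fastq(r2_path)
        assert len(r1) == len(r2), "R1/R2 files are out of sync"
        random.seed(seed)
        keep = sorted(random.sample(range(len(r1)), int(len(r1) * fraction)))
        return [r1[i] for i in keep], [r2[i] for i in keep]

    # e.g. keep ~40% of pairs to bring ~250x coverage down to ~100x
    # (file names are hypothetical)
    sub1, sub2 = sample_pairs("reads_R1.fastq", "reads_R2.fastq", fraction=0.4)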

Will de novo assembly give me a complete, contiguous genome?

All de novo assembly algorithms, regardless of the technique used, are unable to unambiguously assemble across perfect repeats when the repeat unit is longer than the read length or the paired-read insert size. In practice, this means that assembly of genomic Illumina short-read data will in most cases produce multiple non-overlapping contigs.

For example, microbial genomes usually contain multiple copies of the rRNA gene cluster, each of which contains near-perfect repeats of the SSU (16S), LSU (23S), and 5S rRNA genes. These near-identical rRNA clusters will prevent assembly and recovery of a single contiguous consensus sequence. In most cases, other repeat units (duplicated genes, transposons, etc.) will generate further "breaks" in the de novo assembly.

All reads derived from a repeat will end up assembled together at the end of a single contig. As a consequence, coverage of the assembled repeat region will be higher than the average coverage of the unique portion of the genome.
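
One practical consequence is that collapsed repeats can often be spotted after assembly by comparing each contig's mean coverage with the assembly-wide median. A minimal sketch, with illustrative coverage values and an arbitrary 1.8x threshold:

    from statistics import median

    def flag_collapsed_repeats(contig_coverage, ratio=1.8):
        # Report contigs whose mean coverage is well above the median;
        # these often contain collapsed repeat copies.
        med = median(contig_coverage.values())
        return {name: round(cov / med, 1)
                for name, cov in contig_coverage.items() if cov >= ratio * med}

    # Illustrative per-contig mean coverages
    coverages = {"contig_1": 62, "contig_2": 58, "contig_3": 410, "contig_4": 65}
    print(flag_collapsed_repeats(coverages))  # {'contig_3': 6.5} -> ~6 collapsed copies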



Other preprocessing tools

Error correct and Normalize reads

Accessed via menu Sequence → Error correct and normalize reads.

The Error correct and Normalize reads tool utilizes BBNorm, which is designed to normalize assembly coverage by down-sampling reads in high-depth areas of the genome, resulting in a more even coverage distribution. Importantly, normalization will not remove reads in lower-coverage areas.

Normalization can substantially reduce the size of a data set, and for de novo assembly it will therefore reduce both assembly time and RAM requirements.

Note that normalization can potentially "amplify" errors in "difficult to sequence" regions to the point where the errors appear significant. Therefore, if you use normalization, we recommend the strategy depicted in the flow diagram shown below. This combined normalization / de novo assembly / map-to-reference strategy will usually still be far faster than attempting de novo assembly of your full data set.
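
The core idea behind depth normalization can be sketched in a few lines. This toy version is not BBNorm's actual algorithm (BBNorm estimates depth from k-mer counts over multiple passes); it only illustrates down-sampling reads in proportion to their estimated depth:

    import random

    def normalize(reads_with_depth, target=100, seed=1):
        # Keep each read with probability target/depth, so reads from regions
        # above the target depth are down-sampled while reads from regions at
        # or below the target depth are always kept.
        random.seed(seed)
        return [read for read, depth in reads_with_depth
                if random.random() < min(1.0, target / depth)]

    # e.g. normalize([("readA", 400), ("readB", 30)]) always keeps readB
    # and keeps readA about 25% of the time (100/400)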

Merge paired reads

Accessed via menu Sequence → Merge paired reads.

This tool utilizes BBMerge and is designed to merge two overlapping paired reads into a single read. It is useful for generating a consensus from overlapping reads produced by amplicon sequencing.
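
The underlying operation can be illustrated with a toy overlap merge. This is not BBMerge's algorithm (BBMerge scores candidate overlaps and uses base qualities); it simply joins a pair at the longest exact overlap between R1 and the reverse complement of R2:

    def revcomp(seq):
        # Reverse complement (unambiguous ACGT bases only, for brevity)
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def merge_pair(r1, r2, min_overlap=20):
        # Try overlaps from longest to shortest; merge at the first exact match.
        r2rc = revcomp(r2)
        for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
            if r1[-olen:] == r2rc[:olen]:
                return r1 + r2rc[olen:]
        return None  # no sufficient overlap; the pair cannot be merged

For example, two 150 bp reads sequenced from a ~250 bp amplicon fragment overlap by roughly 50 bp and would merge into a single ~250 bp read.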

Remove duplicate reads

Accessed via menu Sequence → Remove duplicate reads.

This tool utilizes Dedupe and is designed to find and remove all contained and overlapping sequences in a dataset.
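
For illustration, exact-duplicate removal can be sketched in a few lines of Python. Dedupe itself goes further than this sketch, also detecting contained and overlapping sequences:

    def remove_exact_duplicates(seqs):
        # Keep the first occurrence of each sequence, drop exact repeats.
        seen, unique = set(), []
        for s in seqs:
            if s not in seen:
                seen.add(s)
                unique.append(s)
        return unique

    print(remove_exact_duplicates(["ACGTACGT", "TTTTAAAA", "ACGTACGT"]))
    # ['ACGTACGT', 'TTTTAAAA']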

Remove chimeras

Accessed via menu Sequence → Remove chimeric reads.

This tool filters chimeric reads from sequencing data by comparing reads to a reference database. You can choose between the bundled public-domain UCHIME algorithm and the faster USEARCH 8, which must be downloaded separately. Note that the free version of USEARCH 8 is limited to 4 GB of RAM and so cannot handle larger NGS data sets.
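
Conceptually, a chimeric read looks like two halves drawn from different parent sequences. The toy check below flags a read when its two halves match different references by exact substring search; UCHIME and USEARCH use far more sophisticated alignment and scoring than this:

    def best_parent(fragment, references):
        # Name of the first reference containing the fragment exactly, if any.
        return next((name for name, ref in references.items()
                     if fragment in ref), None)

    def looks_chimeric(read, references):
        # A read is suspect when its two halves come from different parents.
        half = len(read) // 2
        left = best_parent(read[:half], references)
        right = best_parent(read[half:], references)
        return None not in (left, right) and left != right

    refs = {"refA": "AAAACCCCGGGG", "refB": "TTTTGGGGAAAA"}  # toy references
    print(looks_chimeric("AAAACCTTTTGG", refs))  # True: halves from refA and refB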

Barcode splitting

Accessed via menu Sequence → Separate by barcodes.

This tool will demultiplex custom-barcoded data into separate lists. It includes presets for 454 MID barcodes, or you can define and use your own custom barcode sets.

Note: demultiplexing should always be performed before trimming with BBDuk.
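
As an illustration of what demultiplexing does, the minimal sketch below bins reads by an exact barcode prefix. Real demultiplexing, including the Geneious tool, typically also trims the barcode and tolerates mismatches; the barcode set here is hypothetical:

    from collections import defaultdict

    def demultiplex(reads, barcodes):
        # Bin reads by barcode prefix; unmatched reads go to "undetermined".
        bins = defaultdict(list)
        for read in reads:
            sample = next((name for name, bc in barcodes.items()
                           if read.startswith(bc)), "undetermined")
            bins[sample].append(read)
        return bins

    barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}  # hypothetical barcode set
    bins = demultiplex(["ACGTGGTTCC", "TGCAAATTGG", "NNNNAATTCC"], barcodes)
    # -> sampleA: 1 read, sampleB: 1 read, undetermined: 1 read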


Go to:

Introduction

Overview: Best Practice for preprocessing of NGS reads

Exercise 1: NGS read preprocessing

Exercise 2: De novo assembly of paired-end data