10.3.1 The de novo assembly algorithm
The sequence assembler in Geneious is flexible enough to handle read errors consisting of either incorrect bases or short indels. It can handle reads of any length, including paired reads and mixtures of reads from different sequencing machines (hybrid assemblies).
De novo assemblers are generally either overlap-based or k-mer-based (de Bruijn graph). The Geneious de novo assembler is an overlap assembler, which uses a greedy algorithm similar to that used in multiple sequence alignment:
- For each sequence, a BLAST-like algorithm is used to find the closest matching sequence among all other sequences.
- The highest scoring sequence and its closest matching sequence are merged together into a contig (reverse complementing if necessary). This process is repeated, appending sequences to contigs and joining contigs where necessary.
- For paired read de novo assembly, two sequences with similar expected mate distances are given a higher matching score if their mates also score well against each other. Similarly, a sequence and its mate are given a higher score if they both align to an already formed contig at approximately their expected distance apart. The effect of this heuristic is that paired read de novo assembly starts out by finding two sets of paired reads and forming two contigs. Each of these two contigs contains one sequence from each pair, and the two contigs are expected to be separated by the expected mate distance. Assembly proceeds from there, either adding new paired reads to the contigs or forming new pairs of contigs which eventually merge together. Due to the nature of this algorithm, paired read de novo assembly in Geneious only works well if you have high coverage of paired reads. A hybrid assembly of mostly unpaired data with a few paired reads will not make good use of the paired read data, but this is expected to improve in future versions.
- Each contig generated by a gapped de novo assembly has some minor fine tuning performed on it, both during assembly and upon completion. For each gapped position in a sequence, a base adjacent to the gap is shuffled along into the gap if it is the same base as the most common base in other sequences in the contig at that position. After doing this, if any column now consists entirely of gaps, that column is removed from the contig.
- Other heuristics are applied throughout the assembly to improve the results, such as identifying repeat regions.
- Both the Geneious de novo and reference assemblers use a deterministic method (even when spreading the work across multiple CPUs), so that if you rerun the assembler using the same settings and the same input data, it will always produce the same results.
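
The greedy merge described in the first two steps above can be illustrated with a small Python sketch. This is not the Geneious implementation: it assumes error-free reads and scores a pair by the length of its exact suffix-prefix overlap, whereas Geneious uses a BLAST-like search that tolerates incorrect bases and short indels. The function names (`overlap_len`, `best_pair`, `greedy_assemble`) and the `min_overlap` parameter are hypothetical and chosen only for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def overlap_len(a, b, min_overlap):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_overlap bases exists.
    Assumption: exact matching stands in for a real alignment score.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def best_pair(contigs, min_overlap):
    """Find the highest-scoring pair as (score, i, j, oriented_j), or None.

    Each candidate is tried in both orientations. Ties keep the first
    pair found, so repeated runs give the same answer.
    """
    best = None
    for i, a in enumerate(contigs):
        for j, b in enumerate(contigs):
            if i == j:
                continue
            for oriented in (b, reverse_complement(b)):
                n = overlap_len(a, oriented, min_overlap)
                if n and (best is None or n > best[0]):
                    best = (n, i, j, oriented)
    return best


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the best-overlapping pair until none remains."""
    contigs = list(reads)
    while len(contigs) > 1:
        hit = best_pair(contigs, min_overlap)
        if hit is None:
            break  # remaining contigs do not overlap
        n, i, j, oriented = hit
        merged = contigs[i] + oriented[n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))  # ['ATGCGTACCTTA']
print(greedy_assemble(["ATGCGT", "GGTACG"]))            # ['ATGCGTACC']
```

In the second call the read `GGTACG` only overlaps after reverse complementing, which mirrors the "reverse complementing if necessary" step. The all-against-all search here is quadratic in the number of contigs per merge and is only suitable for toy inputs.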
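
The gap fine tuning step described above (shuffling a base into an adjacent gap, then removing all-gap columns) can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the Geneious code: the contig is modelled as a list of equal-length strings with `-` as the gap character, each sequence is processed in a single left-to-right pass, and the names `consensus_base`, `shuffle_gaps`, `drop_all_gap_columns` and `fine_tune` are hypothetical.

```python
from collections import Counter

GAP = "-"


def consensus_base(rows, skip_row, col):
    """Most common non-gap base in a column, ignoring one row.

    Returns None if every other row has a gap in this column.
    """
    counts = Counter(
        r[col] for i, r in enumerate(rows) if i != skip_row and r[col] != GAP
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def shuffle_gaps(rows):
    """Move a neighbouring base into a gap when it matches the other rows."""
    grid = [list(r) for r in rows]
    for i, row in enumerate(grid):
        for col in range(len(row)):
            if row[col] != GAP:
                continue
            target = consensus_base(grid, i, col)
            if target is None:
                continue
            # Try the left neighbour first, then the right one.
            for nb in (col - 1, col + 1):
                if 0 <= nb < len(row) and row[nb] == target:
                    row[col], row[nb] = row[nb], GAP
                    break
    return ["".join(r) for r in grid]


def drop_all_gap_columns(rows):
    """Remove every column in which all rows contain a gap."""
    keep = [c for c in range(len(rows[0])) if any(r[c] != GAP for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]


def fine_tune(rows):
    """Shuffle bases into gaps, then strip columns that became all gaps."""
    return drop_all_gap_columns(shuffle_gaps(rows))


contig = [
    "ACGT-A",
    "ACG-TA",
    "ACG-TA",
]
print(fine_tune(contig))  # ['ACGTA', 'ACGTA', 'ACGTA']
```

In this example the first sequence has its `T` one column to the left of the other two. Shuffling that `T` into the gap lines it up with the others, which leaves a column made entirely of gaps that is then removed.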