10.4.5 The map to reference algorithm

The reference assembly algorithm is a seed-and-expand style mapper, followed by an optional fine-tuning step that better aligns reads around indels to each other rather than to the reference sequence. Various optimizations and heuristics are applied at each stage, but a general outline of the algorithm is:

  1. First, the reference sequence(s) are indexed to create a table recording every location in the reference at which each possible word (a series of bases of a specified length) occurs.
  2. Reads are processed one at a time. Each word within the read is looked up in the index, and each location found in the reference sequence is used as a seed point from which the matching range is later expanded outwards to the ends of the read.
  3. If a read does not find a perfectly matching seed, the assembler can optionally look for all seeds that differ by a single nucleotide.
  4. Before the seed expansion step, all seeds for a single read that lie on the same diagonal are filtered down to a single seed.
  5. During seed expansion, when a mismatch occurs, a look-ahead is used to decide whether to accept it as a mismatch or to introduce a gap (in either the reference sequence or the read).
  6. The mapper handles circular reference sequences by indexing reference sequence words that span the origin and by allowing the expansion step to wrap past the ends.
  7. All results are given a score based on the number of mismatches and gaps introduced. Normally only the best-scoring match is saved (or one chosen at random from equally best-scoring matches), although there is an option to map the read to all best-scoring locations.
  8. Paired reads are given an additional score penalty based on how far their separation deviates from the expected distance, so that they prefer to map close to the expected distance with as few mismatches as possible. They can still map any distance apart if an ideal location is not found.
  9. The final, optional fine-tuning step shuffles the gaps around so that the reads align better to each other rather than to the reference sequence.
  10. For details on how mapping qualities are calculated, see section 10.5.
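
As an illustration of steps 1, 2 and 4 above, the following Python sketch builds a word index, looks up the words of one read to obtain seeds, and then keeps a single seed per diagonal. It is a minimal sketch under stated assumptions, not the Geneious implementation: the function names, the word length of 4 and the example sequences are all hypothetical, and the optimizations, single-mismatch seeds, seed expansion and circular handling described above are omitted.

```python
from collections import defaultdict


def build_index(reference, word_length):
    """Step 1: record every start position of every word in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - word_length + 1):
        index[reference[pos:pos + word_length]].append(pos)
    return index


def find_seeds(read, index, word_length):
    """Step 2: return (read_offset, reference_position) for each exact word match."""
    seeds = []
    for offset in range(len(read) - word_length + 1):
        word = read[offset:offset + word_length]
        # index.get avoids adding empty entries to the defaultdict
        for ref_pos in index.get(word, ()):
            seeds.append((offset, ref_pos))
    return seeds


def filter_by_diagonal(seeds):
    """Step 4: keep only the first seed seen on each diagonal.

    Two seeds lie on the same diagonal when reference_position - read_offset
    is equal, i.e. they imply the same ungapped placement of the read.
    """
    by_diagonal = {}
    for offset, ref_pos in seeds:
        by_diagonal.setdefault(ref_pos - offset, (offset, ref_pos))
    return sorted(by_diagonal.values())


# Hypothetical example data
reference = "ACGTACGTTTGACCA"
read = "CGTTTGA"
word_length = 4

index = build_index(reference, word_length)
seeds = find_seeds(read, index, word_length)
unique_seeds = filter_by_diagonal(seeds)

print(seeds)         # four seeds, all on diagonal 5
print(unique_seeds)  # a single seed remains after filtering
```

In this example every word of the read matches the reference at the same relative offset, so all four seeds share one diagonal and only one of them needs to be expanded.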

For further details and for a comparison of the Geneious reference assembler to other software, see the Geneious Mapper white paper.
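
The scoring described in steps 7 and 8 of the outline can be pictured with a small Python sketch. Everything numeric here is an assumption made for illustration: the penalty weights, the linear distance penalty, the convention that a lower score is better, and the function names are hypothetical and are not taken from the Geneious mapper.

```python
def alignment_score(mismatches, gaps, mismatch_penalty=1.0, gap_penalty=2.0):
    """Penalty for one read placement (assumed weights; lower is better)."""
    return mismatches * mismatch_penalty + gaps * gap_penalty


def pair_score(score_a, score_b, observed_distance, expected_distance,
               distance_weight=0.01):
    """Combined penalty for a pair: both reads plus a distance deviation term."""
    deviation = abs(observed_distance - expected_distance)
    return score_a + score_b + distance_weight * deviation


def best_pair(candidates, expected_distance):
    """Pick the candidate pair placement with the lowest combined penalty."""
    return min(
        candidates,
        key=lambda c: pair_score(c["score_a"], c["score_b"],
                                 c["distance"], expected_distance),
    )


# Hypothetical candidate placements for one read pair
candidates = [
    # One mismatch in each read, close to the expected distance
    {"name": "near", "score_a": alignment_score(1, 0),
     "score_b": alignment_score(1, 0), "distance": 520},
    # Perfect matches, but far from the expected distance
    {"name": "far", "score_a": alignment_score(0, 0),
     "score_b": alignment_score(0, 0), "distance": 5000},
]

chosen = best_pair(candidates, expected_distance=500)
print(chosen["name"])
```

With these assumed weights the pair near the expected distance is preferred even though it has more mismatches. If the "far" placement were the only candidate it would still be returned, which mirrors the statement that pairs can map any distance apart when no ideal location is found.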