15.2.3 Algorithm Overview

  1. A range within the reference sequence is identified by mapping all reads to the reference sequence and trimming them so that, at each end, no more than 2 of the 10 terminal bp mismatch the reference. The reads are then trimmed further so that no more than 50% of the bases in any shorter terminal stretch mismatch. This trimming removes primers from the ends of sequences. The range of the reference sequence used is the range covered by 99% of the trimmed reads that map.
  2. Untrimmed reads that have more than 2 mismatches in the 10 bp at either end of this reference sequence range are discarded. This ensures that reads of very poor quality at either end (and therefore not true variants of the reference sequence) do not interfere with variant calling.
  3. If the option for the variants region of interest specifies a region within a given number of bp of the cut site, the region of interest is calculated as follows:
    1. Reads are sampled at random, and the intersection of the variant ranges of the sampled reads is taken until it converges to a small range.
    2. The above step is repeated 100 times.
    3. The ranges converged on are sorted by decreasing frequency.
    4. The union of the most frequent ranges is taken until it covers 1/6 of the region of interest size. Ranges with a frequency over 1/5 of the highest frequency are also included, as long as the combined range does not exceed 3/4 of the region of interest size.
    5. The full region of interest is centered on this range.
  4. Read alignments are trimmed to the region of interest and collapsed into clusters of identical sequences. This means that reads that differ from each other only outside the region of interest are considered identical.
  5. Clusters are sorted by decreasing frequency, and a lower-frequency cluster is merged into a higher-frequency cluster if it is likely that the lower-frequency cluster arose from the higher-frequency cluster through sequencing error. Only substitution errors are considered during this merging process. Clusters that differ by an indel are never merged.
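
The last two steps (collapsing reads within the region of interest, then merging likely sequencing errors) can be illustrated with a minimal Python sketch. It is not the actual implementation. The function names, the gapped-string representation of aligned reads, and the merge rule are all assumptions made for illustration: the text above does not state how "likely" is decided, so the sketch simply merges a cluster into a more frequent one when they differ by at most max_subs substitutions and the smaller count is at most max_ratio of the larger. Both parameters are hypothetical.

```python
from collections import Counter

def collapse_reads(aligned_reads, roi_start, roi_end):
    """Trim each aligned read to columns [roi_start, roi_end) and count
    identical trimmed sequences. Reads are assumed to be gapped strings
    ('-' for a gap) that share the same alignment columns."""
    return Counter(read[roi_start:roi_end] for read in aligned_reads)

def substitution_distance(a, b):
    """Return the number of substituted columns between two trimmed reads,
    or None if they differ by an indel (a gap opposite a base)."""
    if len(a) != len(b):
        return None
    subs = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if x == "-" or y == "-":
            return None  # indel difference: never merged
        subs += 1
    return subs

def merge_clusters(counts, max_subs=1, max_ratio=0.1):
    """Merge low-frequency clusters into higher-frequency ones.
    max_subs and max_ratio are hypothetical stand-ins for the unspecified
    likelihood test described in step 5."""
    # Decreasing frequency; ties broken by sequence for reproducibility.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    merged = []  # each entry is [sequence, count]
    for seq, n in ordered:
        for parent in merged:
            d = substitution_distance(seq, parent[0])
            if d is not None and d <= max_subs and n <= max_ratio * parent[1]:
                parent[1] += n  # absorb into the more frequent cluster
                break
        else:
            merged.append([seq, n])
    return [(s, n) for s, n in merged]

# Example: the first two read types differ only outside columns 2..6,
# so they collapse into one cluster before merging.
reads = (["GGACGTAGG"] * 10 + ["CCACGTACC"] * 10
         + ["GGACCTAGG"]    # one substitution inside the region
         + ["GGAC-TAGG"])   # one deletion inside the region
clusters = collapse_reads(reads, 2, 7)
result = merge_clusters(clusters)
print(result)
```

In this example the single-substitution read is absorbed into the dominant cluster, while the read carrying a deletion remains a separate cluster, matching the rule that only substitution errors are considered.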