9.5 Consensus sequences

To display a consensus sequence on your alignment, check the Consensus option under the PIC Display tab.

The consensus sequence is displayed above the alignment or assembly, and shows which residues are conserved (are always the same), and which residues are variable. A consensus is constructed from the most frequent residues at each site (alignment column), so that the total fraction of rows represented by the selected residues in that column reaches at least a specified threshold.

To work with the consensus sequence in a downstream analysis, you must first Extract it from your alignment. To do this, click on Consensus to select the entire sequence, then click Extract to extract it to a new sequence document. Alternatively, go to Tools Generate Consensus Sequence. This operation allows you to choose the options for how your consensus sequence is called (as described above), and then saves it to a separate document. If your consensus sequence contains ‘?’ characters where there are regions with no or low coverage in your assembly, you can split the consensus sequence at these bases to generate multiple sequences by checking the option to Split into separate sequences around ‘?’ calls

    Threshold settings
    Assigning quality to the consensus
    Other settings
Threshold settings

The Threshold determines which base in called in the consensus, and can be set to a percentage, or by using the quality scores on the reads. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X.

For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D.

In the case of ties, either all or none of the involved residues will be selected. For example, if the above case instead had 6 A’s, 2 G’s and 2 T’s, then for a consensus threshold of 60% or below, an A will be called. Above a threshold of 60%, a D will be called.

When the aligned sequences contain quality information in the form of chromatograms or fastq data, you can select Highest Quality to calculate a majority consensus that takes the relative residue quality into account. This sums the total quality for each potential base call, and if the total for a base exceeds 60% of the total quality for all bases, then that base is called. Highest Quality (Raw) uses the raw chromatogram scores.

When Highest Quality (Adjusted) consensus calling is selected, in order to improve consensus accuracy, Geneious will adjust the value of the raw quality scores that contribute to consensus calling such that

  1. Homopolymer regions are made symmetric using the lower of the two quality scores. For example if there are 3 consecutive G’s with quality scores of 14, 12, 10, these will be replaced with quality scores of 10, 12, 10 to reflect that the confidence that the first position is a G is the same as the confidence that the last position is a G. This is important for an alignment against another sequence where there are only two G’s where it is equally valid to position the gap before the first G or after the last G.
  2. Low quality gaps (under quality 30) have their quality scores up to halved so that consensus calling is not biased towards the sequence with a shorter version of a homopolymer. For example if NA-N (the A has quality 20) is aligned to NAAN (the 2 As have quality 10 because of the lower confidence that this homopolymer is 2 bp long), then the gap which would normally be assigned a quality score of 20 (assuming the nearby N has a higher quality), will have it’s quality score halved so that it will be comparable to the quality scores in the other sequence.
  3. For Sanger traces, gap quality scores take on the minimum quality score of the two previous and two next homopolymers. For example G–AAA—CCT the quality of the 3 gaps is the minimum quality of the G, the two outer A’s, the C’s and the T. This is because in an equally good alignment the C’s or A’s could be shifted into this gap, so the gap could be between the G/A or C/T instead.
  4. For Sanger traces, traces are analyzed and the quality score is reduced if the called peak is not an obvious peak or if there is an alternative peak over 50% of the height of the called peak.

Assigning quality to the consensus

The Assign Quality setting enables you to map the quality of the sequences onto the consensus. Choose Highest to map the quality of the highest quality base at each column onto the consensus. Select Total to map the sum of the contributing bases, minus the sum of the non-contributing bases.

For example: if there are two G’s and three A’s in a column, with the G’s having qualities of 16 and 24, and the A’s having qualities of 40, 42, and 50 respectively, then because (40 + 42 + 50) > 60% of (40 + 42 + 50 + 16 + 24), then an A will be called for the consensus. This consensus A will have a quality of (40 + 42 + 50) (16 + 24) = 92 if using Total or 50 if using Highest.

A more complicated example for Highest Quality consensus calling using Total: Assume a column contains 2 A’s with qualities of 30 and 25, 1 G with quality 30 and 1 T with quality 15. Because the total qualities of the A’s is 55 out of 100 for the column, this is not higher than the 60% threshold to call an A. With the G included, the total quality is 30 + 25 + 30 = 85, which is higher than the 60% threshold, so a consensus call of R will be made. The quality assigned to this R will be the sum of the bases that agree with the consensus call minus the bases that disagree, which is 30 + 25 + 30 15 = 70.

When reads have mapping qualities (confidence that the entire read is mapped to the correct location), the mapping quality is combined with the base pair quality to form the quality used during consensus calling. The log scale qualities are combined as probabilities so a very rough rule of thumb is the combined quality will be approximately equal to the minimum of the mapping quality and base call quality, except in cases where the two values are very close in which case the combined quality will be slightly smaller.

Gaps are assigned a quality score equal to half the minimum base call quality on either side of the gap.

For example, if a column has a C with quality 41 and mapping quality 3, a gap with adjacent base calls of quality 41 and 30 with mapping quality 240, and a C with quality 41 and mapping quality 20. The two C’s will have combined quality scores of approximately 3 and 20 respectively. The gap will have an effective combined quality score of 302 = 15. So the consensus call with be a C with quality 20 + 3 15 = 8.

Other settings
Ignore Gaps (alignment documents only):
When this is checked, the consensus is calculated as if each alignment column consisted only of the non-gap characters; otherwise, the gap character is treated like a normal residue, but mixing a gap with any other residue in the consensus always produces the total ambiguity symbol (N and X for nucleotides and amino acids, respectively).
If no coverage call:
For alignments or contigs with a reference sequence, this setting can be used to control what character the consensus sequence should use when the reference sequence has no coverage. Options available are -, X/N, ? or Ref. A ‘?’ represents an unknown character, potentially a gap. If Ref is selected, then the consensus is assigned whatever character the reference sequence has at that position. Note that if any sequence in the alignment/contig has an internal gap in it, that is still considered valid coverage at that position, and this setting will not apply.
Call N if Quality below:
Enables you to change consensus bases to N’s if the quality is below the threshold that you set. This is particularly useful for exporting sequences to file formats which do not preserve quality (for example FASTA).
Call Sanger Heterozygotes(choromatogram assemblies only):
When a Sanger trace has an alternative peak that is at least as high as the specified percentage of the best peak, that trace will contribute a heterozygous call to the consensus calculation. Base calls with a quality score of at least 63 (i.e those manually edited) will not be analyzed for heterozygotes.