To display a consensus sequence on your alignment, check the Consensus option under the Display tab.
The consensus sequence is displayed above the alignment or assembly, and shows which residues are conserved (are always the same), and which residues are variable. A consensus is constructed from the most frequent residues at each site (alignment column), so that the total fraction of rows represented by the selected residues in that column reaches at least a specified threshold.
To work with the consensus sequence in a downstream analysis, you must first Extract it from your alignment. To do this, click on Consensus to select the entire sequence, then click Extract to extract it to a new sequence document. Alternatively, go to Tools → Generate Consensus Sequence. This operation allows you to choose the options for how your consensus sequence is called (as described below), and then saves it to a separate document. If your consensus sequence contains ‘?’ characters where there are regions with no or low coverage in your assembly, you can split the consensus sequence at these bases to generate multiple sequences by checking the option to Split into separate sequences around ‘?’ calls
The Threshold determines which base in called in the consensus, and can be set to a percentage, or by using the quality scores on the reads. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X.
For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D.
In the case of ties, either all or none of the involved residues will be selected. For example, if the above case instead had 6 A’s, 2 G’s and 2 T’s, then for a consensus threshold of 60% or below, an A will be called. Above a threshold of 60%, a D will be called.
When the aligned sequences contain quality information in the form of chromatograms or fastq data, you can select Highest Quality to calculate a majority consensus that takes the relative residue quality into account. This sums the total quality for each potential base call, and if the total for a base exceeds 60% of the total quality for all bases, then that base is called. Highest Quality (Raw) uses the raw chromatogram scores.
When Highest Quality (50% or 60% or 75%) consensus calling is selected, in order to improve consensus accuracy, Geneious will adjust the value of the raw quality scores that contribute to consensus calling such that
The Assign Quality setting enables you to map the quality of the sequences onto the consensus. Choose Highest to map the quality of the highest quality base at each column onto the consensus. Select Total to map the sum of the contributing bases, minus the sum of the non-contributing bases.
For example: if there are two G’s and three A’s in a column, with the G’s having qualities of 16 and 24, and the A’s having qualities of 40, 42, and 50 respectively, then because (40 + 42 + 50) > 60% of (40 + 42 + 50 + 16 + 24), then an A will be called for the consensus. This consensus A will have a quality of (40 + 42 + 50) − (16 + 24) = 92 if using Total or 50 if using Highest.
A more complicated example for Highest Quality consensus calling using Total: Assume a column contains 2 A’s with qualities of 30 and 25, 1 G with quality 30 and 1 T with quality 15. Because the total qualities of the A’s is 55 out of 100 for the column, this is not higher than the 60% threshold to call an A. With the G included, the total quality is 30 + 25 + 30 = 85, which is higher than the 60% threshold, so a consensus call of R will be made. The quality assigned to this R will be the sum of the bases that agree with the consensus call minus the bases that disagree, which is 30 + 25 + 30 − 15 = 70.
When reads have mapping qualities (confidence that the entire read is mapped to the correct location), the mapping quality is combined with the base pair quality to form the quality used during consensus calling. The log scale qualities are combined as probabilities so a very rough rule of thumb is the combined quality will be approximately equal to the minimum of the mapping quality and base call quality, except in cases where the two values are very close in which case the combined quality will be slightly smaller.
Gaps are treated just like a standard base and are assigned a quality score equal to half the minimum base call quality on either side of the gap.
For example, if a column has a C with quality 41 and mapping quality 3, a gap with adjacent base calls of quality 41 and 30 with mapping quality 240, and a C with quality 41 and mapping quality 20. The two C’s will have combined quality scores of approximately 3 and 20 respectively. The gap will have an effective combined quality score of 30∕2 = 15. So the consensus call with be a C with quality 20 + 3 − 15 = 8.