9.5 Consensus sequences

To display a consensus sequence on your alignment, check the Consensus option under the PIC Display tab.

The consensus sequence is displayed above the alignment or assembly, and shows which residues are conserved (are always the same), and which residues are variable. A consensus is constructed from the most frequent residues at each site (alignment column), so that the total fraction of rows represented by the selected residues in that column reaches at least a specified threshold. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X.

For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D.

In the case of ties, either all or none of the involved residues will be selected. For example, if the above case instead had 6 A’s, 2 G’s and 2 T’s, then for a consensus threshold of 60% or below, an A will be called. Above a threshold of 60%, a D will be called.

When Ignore Gaps is checked, the consensus is calculated as if each alignment column consisted only of the non-gap characters; otherwise, the gap character is treated like a normal residue, but mixing a gap with any other residue in the consensus always produces the total ambiguity symbol (N and X for nucleotides and amino acids, respectively).

When the aligned sequences contain quality information in the form of chromatograms or fastq data, you can select Highest Quality to calculate a majority consensus that takes the relative residue quality into account. This sums the total quality for each potential base call, and if the total for a base exceeds 60% of the total quality for all bases, then that base is called.

You can also choose to map the quality of the sequences onto the consensus. Choose Highest to map the quality of the highest quality base at each column onto the consensus. Select Total to map the sum of the contributing bases, minus the sum of the non-contributing bases.

For example: if there are two G’s and three A’s in a column, with the G’s having qualities of 16 and 24, and the A’s having qualities of 40, 42, and 50 respectively, then because (40 + 42 + 50) > 60% of (40 + 42 + 50 + 16 + 24), then an A will be called for the consensus. This consensus A will have a quality of (40 + 42 + 50) (16 + 24) = 92 if using Total or 50 if using Highest.

A more complicated example for Highest Quality consensus calling using Total: Assume a column contains 2 A’s with qualities of 30 and 25, 1 G with quality 30 and 1 T with quality 15. Because the total qualities of the A’s is 55 out of 100 for the column, this is not higher than the 60% threshold to call an A. With the G included, the total quality is 30 + 25 + 30 = 85, which is higher than the 60% threshold, so a consensus call of R will be made. The quality assigned to this R will be the sum of the bases that agree with the consensus call minus the bases that disagree, which is 30 + 25 + 30 15 = 70.

When reads have mapping qualities (confidence that the entire read is mapped to the correct location), the mapping quality is combined with the base pair quality to form the quality used during consensus calling. The log scale qualities are combined as probabilities so a very rough rule of thumb is the combined quality will be approximately equal to the minimum of the mapping quality and base call quality, except in cases where the two values are very close in which case the combined quality will be slightly smaller.

Gaps are assigned a quality score equal to half the minimum base call quality on either side of the gap.

For example, if a column has a C with quality 41 and mapping quality 3, a gap with adjacent base calls of quality 41 and 30 with mapping quality 240, and a C with quality 41 and mapping quality 20.

The two C’s will have combined quality scores of approximately 3 and 20 respectively. The gap will have an effective combined quality score of 302 = 15. So the consensus call with be a C with quality 20 + 3 15 = 8.

For alignments or contigs with a reference sequence, the If no coverage call setting can be used to control what character the consensus sequence should use when the reference sequence has no coverage. Options available are -, X/N, ? or Ref. A ‘?’ represents an unknown character, potentially a gap. If Ref is selected, then the consensus is assigned whatever character the reference sequence has at that position. Note that if any sequence in the alignment/contig has an internal gap in it, that is still considered valid coverage at that position, and this setting will not apply.

Choose Call N if Quality below to change consensus bases to N’s if the quality is below the threshold that you set. This is particularly useful for exporting sequences to file formats which do not preserve quality (for example FASTA).

To work with the consensus sequence in a downstream analysis, you must first Extract it from your alignment. To do this, click on Consensus to select the entire sequence, then click Extract to extract it to a new sequence document. Alternatively, go to Tools Generate Consensus Sequence. This operation allows you to choose the options for how your consensus sequence is called (as described above), and then saves it to a separate document. If your consensus sequence contains ? characters where there are regions of no/low coverage in your assembly, you can split the consensus sequence at these bases to generate multiple sequences by checking the option to Split into separate sequences around ? calls