Find Variations/SNPs

11.1.1 Find Variations/SNPs

Manually investigating every disagreement can be time-consuming on larger contigs. The Find Variations/SNPs feature in the Annotate & Predict menu annotates regions of disagreement. It can be configured to find only disagreements above a minimum threshold, which screens out disagreements caused by read errors. It can also be restricted to coding regions (if the reference sequence has CDS annotations) and can analyze the effect of each variation on the protein translation, so you can quickly distinguish silent from non-silent mutations. Finally, it can calculate p-values for variations and keep only those variations whose p-value does not exceed a specified maximum.

For full details of how the various settings work in the Variation/SNP finder, hover the mouse over them to read the tooltips or click one of the ‘?’ buttons.

P-values

The p-value represents the probability of a sequencing error resulting in observing bases with at least the given sum of qualities. The lower the p-value, the more likely it is that the variation at the given position represents a real variant. Click the down arrow next to the exponent of the Maximum Variant P-Value setting to increase the number of variants found.

When calculating P-Values:

The contig is assumed to have been fine-tuned around indels
Ambiguity characters are ignored (other characters in the column are still used)
Homopolymer region qualities are reduced to be symmetrical across the homopolymer. For example, if a run of 6 G’s has quality values 37, 31, 23, 15, 7, 2, these are treated as though they were 2, 7, 15, 15, 7, 2. This is done because variations may be called at either end of the homopolymer and because reads may come from different strands.
Gaps are assumed to have a quality equal to the minimum quality on either side of them (after adjusting for homopolymers)
When finding variations relative to a reference sequence, the p-value calculated is for the variant, not the change. In other words, the p-values calculated are independent of the reference sequence data.
The approximate p-value method calculates the p-value by first averaging the qualities of each base equal to the proposed SNP and averaging the qualities of each base not equal to the proposed SNP.
Example: Assume you have a column where the reference sequence is an A and there are 3 reads covering that position.
1 read contains an A in the column and the other 2 reads contain a G. All 3 reads have quality 20 (= 99% confidence) at this position. We want to calculate the p-value for calling a G SNP in this column.
Since the quality values are all equal, the p-value is the probability of seeing at least 2 G’s if there is no real variant here. In other words, it is the probability of seeing 2 G’s by chance due to sequencing error plus the probability of seeing 3 G’s by chance due to sequencing error, which is calculated using the binomial distribution: ³C₂ × 0.01² × 0.99 + ³C₃ × 0.01³ ≈ 0.0003 (ⁿCₖ denotes the binomial coefficient)
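The quality adjustments and the equal-quality example above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software's implementation: the function names are hypothetical, the homopolymer rule is assumed to be "take the lower of each quality and its mirror-image quality" (which reproduces the example values given above), and only the case where all qualities are equal is covered.

```python
from math import comb


def symmetrize_homopolymer(quals):
    """Assumed rule: each position takes the lower of its own quality
    and the quality at the mirror-image position in the homopolymer."""
    return [min(a, b) for a, b in zip(quals, reversed(quals))]


def gap_quality(left_quality, right_quality):
    """A gap takes the lower of the two flanking qualities."""
    return min(left_quality, right_quality)


def phred_to_error(quality):
    """Convert a Phred quality score to an error probability (Q20 -> 0.01)."""
    return 10 ** (-quality / 10)


def binomial_tail(n, k, p):
    """Probability of at least k successes in n trials, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Homopolymer example from the text.
print(symmetrize_homopolymer([37, 31, 23, 15, 7, 2]))  # [2, 7, 15, 15, 7, 2]

# Worked example: 3 reads at quality 20, at least 2 of them showing G.
p_value = binomial_tail(3, 2, phred_to_error(20))
print(f"{p_value:.6f}")  # 0.000298
```

The printed p-value rounds to the 0.0003 quoted in the example.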

False SNPs due to strand bias (when sequencing errors tend to occur only on reads in a single direction) can be eliminated by specifying a value for the Minimum Strand-Bias P-Value setting. A Strand-Bias P-Value property is added to each SNP to indicate the probability of seeing a strand bias at least this extreme, assuming that there is no strand bias. When this setting is used, SNPs whose strand-bias p-value is smaller than the specified minimum are excluded from the results.

Strand-Bias >50% P-value example: Assume you have a column covered by 9 reads containing an A, 8 of which are on the forward strand. We calculate the probability of seeing bias at least this extreme, assuming there is no strand bias, which is the probability of seeing either 0, 1, 8, or 9 reads on the forward strand. Using the binomial distribution, this is ⁹C₀ × 0.5⁹ + ⁹C₁ × 0.5⁹ + ⁹C₈ × 0.5⁹ + ⁹C₉ × 0.5⁹ ≈ 0.039 (ⁿCₖ denotes the binomial coefficient)
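The strand-bias example can be reproduced with a short Python sketch. The function name is hypothetical and the code simply sums the two binomial tails described above. It is not the software's implementation, and how the software treats edge cases is not specified here.

```python
from math import comb


def strand_bias_p_value(forward_reads, total_reads):
    """Two-sided probability of a forward/reverse split at least as uneven
    as the observed one, assuming each read is equally likely on either strand."""
    # Size of the smaller strand group in the observed split.
    extreme = min(forward_reads, total_reads - forward_reads)
    # Count every split whose smaller group is no larger than the observed one.
    outcomes = sum(
        comb(total_reads, i)
        for i in range(total_reads + 1)
        if min(i, total_reads - i) <= extreme
    )
    return outcomes * 0.5**total_reads


# 9 reads, 8 on the forward strand: (1 + 9 + 9 + 1) / 512
print(strand_bias_p_value(8, 9))  # 0.0390625
```

With a Minimum Strand-Bias P-Value above 0.039, this SNP would be excluded. With a lower minimum it would be kept.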

Click the up arrow next to the exponent of the Minimum Strand-Bias P-Value setting to increase the number of variants found. If the contig contains any forward/reverse or reverse/forward style paired reads, variants with strand bias that lie less than 1.5 times the insert size from either end of the contig are not filtered out.

Results display

The results of the Variation/SNP finder are added to the reference sequence in the assembly or alignment as an annotation track. Clicking Save and then clicking “Yes” when prompted to apply the changes to the original sequences adds this annotation track to the original reference sequence file. If there is no reference sequence for the alignment or assembly, the annotations are added to the consensus sequence.

The results are also displayed in the annotations table and the following columns can be displayed:

Change: Indicates the reference sequence nucleotides followed by the variant nucleotides. For example ‘C → A’
Coverage: The number of reads that cover the SNP region in the contig. The coverage includes both the reads containing the SNP and other reads at that position.
Reference Frequency: The percentage of reads that agree with the reference sequence at that position. This field will only be present if at least 1 read agrees with the reference sequence.
Variant Frequency: The percentage of reads that have the variation at that position. For variations that span more than a single nucleotide, the variant frequency may appear as a range (e.g. 47.8% – 51.7%) to indicate the minimum/maximum variant frequency over that range.
Polymorphism Type: This may be one of the following.
SNP (Transition): a single nucleotide transition change from the reference sequence
SNP (Transversion): a single nucleotide transversion change from the reference sequence
SNP: At a single position, there are multiple variations from the reference sequence
Substitution: A change of 2 or more adjacent nucleotides from the reference sequence
Insertion: 1 or more nucleotides inserted relative to the reference sequence
Deletion: 1 or more nucleotides deleted relative to the reference sequence
Mixture: multiple variations from the reference sequence which are not all the same length
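As an illustration of the Coverage, Reference Frequency, Variant Frequency and SNP type columns, the following Python sketch summarizes a single column of read bases. The function names are hypothetical, and the sketch assumes that both frequencies are taken as a percentage of the coverage at that position. It covers only single-nucleotide changes and is not the software's implementation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def column_summary(reference_base, read_bases, variant_base):
    """Coverage and percentage frequencies for one column of read bases."""
    coverage = len(read_bases)
    ref_count = read_bases.count(reference_base)
    var_count = read_bases.count(variant_base)
    return {
        "coverage": coverage,
        # Reported only when at least 1 read agrees with the reference.
        "reference_frequency": 100 * ref_count / coverage if ref_count else None,
        "variant_frequency": 100 * var_count / coverage,
    }


def snp_type(reference_base, variant_base):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    if reference_base == variant_base:
        raise ValueError("reference and variant bases are identical")
    same_class = ({reference_base, variant_base} <= PURINES
                  or {reference_base, variant_base} <= PYRIMIDINES)
    return "SNP (Transition)" if same_class else "SNP (Transversion)"


# Reference A, covered by one read with A and two reads with G.
summary = column_summary("A", ["A", "G", "G"], "G")
print(summary["coverage"])   # 3
print(snp_type("A", "G"))    # SNP (Transition)
print(snp_type("C", "A"))    # SNP (Transversion)
```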

For variations inside coding regions (CDS annotations) the following fields can be displayed:

Codon Change: indicates the change in codon. Essentially this is the same as the ‘Change’ field, but extended to include the full codon(s). For example ‘TTC → TTA’
Amino Acid Change: indicates the change (if any) in the amino acid(s) by translating the codon change. For example ‘F → L’
Protein Effect: summarizes the change on the protein as either a substitution, frame shift, truncation (stop codon introduced) or extension (stop codon lost)
Average Quality: the average quality score of all base calls in reads that have the variation at that position.
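The relationship between Codon Change, Amino Acid Change and Protein Effect can be illustrated with a small Python sketch. It assumes the standard genetic code and handles only a single in-frame codon replaced by another codon, so frame shifts are out of scope. The function names and the "None" label for silent changes are hypothetical, and this is not the software's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid_change(ref_codon, var_codon):
    """Translate both codons; '*' denotes a stop codon."""
    return CODON_TABLE[ref_codon], CODON_TABLE[var_codon]


def protein_effect(ref_codon, var_codon):
    """Classify a single in-frame codon change."""
    ref_aa, var_aa = amino_acid_change(ref_codon, var_codon)
    if ref_aa == var_aa:
        return "None"          # silent change (label is an assumption)
    if var_aa == "*":
        return "Truncation"    # stop codon introduced
    if ref_aa == "*":
        return "Extension"     # stop codon lost
    return "Substitution"


print(amino_acid_change("TTC", "TTA"))  # ('F', 'L')
print(protein_effect("TTC", "TTA"))     # Substitution
print(protein_effect("TAC", "TAA"))     # Truncation
```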