Manually investigating every disagreement can be time-consuming on larger contigs. The Find Variations/SNPs feature from the Annotate & Predict menu annotates regions of disagreement and can be configured to report only disagreements above a minimum threshold, which screens out disagreements due to read errors. It can also be configured to find disagreements only in coding regions (if the reference sequence has CDS annotations) and can analyze the effects of variations on the protein translation, so that you can quickly identify silent and non-silent mutations. In addition, it can calculate p-values for variations and keep only those variations at or below a specified maximum p-value.
For full details of how the various settings work in the Variation/SNP finder, hover the mouse over them to read the tooltips or click one of the ‘?’ buttons.
The p-value represents the probability of a sequencing error resulting in observing bases with at least the given sum of qualities. The lower the p-value, the more likely the variation at the given position represents a real variant. Click the down arrow next to the exponent of the Maximum Variant P-Value setting to increase the number of variants found.
When calculating P-Values:
P-value example: Assume you have a column covered by 3 reads. 1 read contains an A in the column and the other 2 reads contain a G. All 3 reads have quality 20 (= 99% confidence) at this position. We want to calculate the p-value for calling a G SNP in this column.
Since the quality values are all equal, the p-value is the probability of seeing at least 2 G’s if there isn’t really a variant here. In other words, it is the probability of seeing 2 G’s by chance due to a sequencing error plus the probability of seeing 3 G’s by chance due to a sequencing error, which is calculated using the binomial distribution: $\binom{3}{2} \cdot 0.01^2 \cdot 0.99 + \binom{3}{3} \cdot 0.01^3 \approx 0.0003$, where $\binom{n}{k}$ is the binomial coefficient.
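The arithmetic in this example can be reproduced with a short Python sketch. It covers only the equal-quality case worked through above; the function names `phred_to_error` and `variant_p_value` are illustrative and are not part of the software, and the conversion from a quality of 20 to an error rate of 0.01 assumes the standard Phred scale.

```python
from math import comb


def phred_to_error(quality: float) -> float:
    """Convert a Phred-scaled quality to an error probability (Q20 -> 0.01)."""
    return 10 ** (-quality / 10)


def variant_p_value(n_reads: int, n_variant: int, quality: float) -> float:
    """Probability of seeing at least n_variant variant bases among n_reads
    purely through sequencing error, assuming every read has the same quality.

    This is the upper tail of a binomial distribution with success
    probability equal to the per-base error rate.
    """
    e = phred_to_error(quality)
    return sum(
        comb(n_reads, k) * e**k * (1 - e) ** (n_reads - k)
        for k in range(n_variant, n_reads + 1)
    )


# The example from the text: 3 reads, 2 of them G, all at quality 20.
p = variant_p_value(n_reads=3, n_variant=2, quality=20)
print(f"{p:.6f}")  # 0.000298, which rounds to the 0.0003 quoted above
```

A variant at this position would pass a Maximum Variant P-Value threshold of 10^-3 but not one of 10^-4.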
False SNPs due to strand-bias (when sequencing errors tend to occur only on reads in a single direction) can be eliminated by specifying a value for the Minimum Strand-Bias P-value setting. A Strand-Bias P-Value property is added to each SNP to indicate the probability of seeing a strand bias at least this extreme assuming that there is no strand bias. SNPs with a smaller strand bias p-value will be excluded from the results when using this setting.
Strand-Bias >50% P-value example: Assume you have a column covered by 9 reads containing an A, 8 of which are on the forward strand. We calculate the probability of seeing bias at least this extreme, assuming there is no strand bias, which is the probability of seeing either 0, 1, 8, or 9 reads on the forward strand. Using the binomial distribution, this is $\binom{9}{0} \cdot 0.5^9 + \binom{9}{1} \cdot 0.5^9 + \binom{9}{8} \cdot 0.5^9 + \binom{9}{9} \cdot 0.5^9 \approx 0.039$, where $\binom{n}{k}$ is the binomial coefficient.
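The strand-bias calculation is a two-sided binomial test with a 50% chance of each read landing on the forward strand. The following Python sketch reproduces the example; `strand_bias_p_value` is a hypothetical helper name, and the code is a sketch of the formula in the text rather than the software's own implementation.

```python
from math import comb


def strand_bias_p_value(n_reads: int, n_forward: int) -> float:
    """Probability of a forward/reverse split at least as lopsided as the
    one observed, assuming each read is equally likely to be on either strand.
    """
    # Size of the larger strand group, e.g. 8 when 8 of 9 reads are forward.
    extreme = max(n_forward, n_reads - n_forward)
    # Count every forward-read total k that is at least as extreme in
    # either direction (k = 0, 1, 8, 9 in the example from the text).
    outcomes = sum(
        comb(n_reads, k)
        for k in range(n_reads + 1)
        if k >= extreme or k <= n_reads - extreme
    )
    return outcomes / 2**n_reads


p = strand_bias_p_value(n_reads=9, n_forward=8)
print(f"{p:.3f}")  # 0.039
```

With a Minimum Strand-Bias P-value of 10^-5, this SNP would be kept, since 0.039 is well above that threshold.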
Click the up arrow next to the exponent of the Minimum Strand-Bias P-Value setting to increase the number of variants found. If there are any forward/reverse or reverse/forward style paired reads, then variants with strand bias that lie less than 1.5 times the insert size from either end of the contig will not be filtered out.
The results of the Variant/SNP finder are added to the reference sequence in the assembly or alignment as an annotation track. Clicking Save and then clicking “Yes” when prompted to apply the changes to the original sequences will add this annotation track to the original reference sequence file. If there is no reference sequence for the alignment or assembly, the annotations are added to the consensus sequence.
The results are also displayed in the annotations table and the following columns can be displayed:
SNP (Transition): a single nucleotide transition change from the reference sequence
SNP (Transversion): a single nucleotide transversion change from the reference sequence
SNP: at a single position, there are multiple variations from the reference sequence
Substitution: a change of 2 or more adjacent nucleotides from the reference sequence
Insertion: 1 or more nucleotides inserted relative to the reference sequence
Deletion: 1 or more nucleotides deleted relative to the reference sequence
Mixture: multiple variations from the reference sequence which are not all the same length
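To make the single-allele categories above concrete, here is a minimal Python sketch that classifies one change from a reference allele to one variant allele. The function name `classify_change` and the convention of using an empty string for absent bases are assumptions made for illustration. The multi-allele categories (plain SNP and Mixture) are not handled, and this is not the software's own logic.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_change(ref: str, alt: str) -> str:
    """Classify a change from ref to alt, assuming both contain only A/C/G/T.

    An empty ref means bases were inserted; an empty alt means bases were deleted.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical; nothing to classify")
    if not ref:
        return "Insertion"
    if not alt:
        return "Deletion"
    if len(ref) == 1 and len(alt) == 1:
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        pair = {ref, alt}
        if pair <= PURINES or pair <= PYRIMIDINES:
            return "SNP (Transition)"
        return "SNP (Transversion)"
    if len(ref) == len(alt):
        # Two or more adjacent nucleotides changed.
        return "Substitution"
    # Unequal, non-empty lengths are outside the single-allele definitions above.
    return "Unclassified"


for ref, alt in [("A", "G"), ("A", "C"), ("", "AT"), ("AC", ""), ("AC", "GT")]:
    print(repr(ref), "->", repr(alt), ":", classify_change(ref, alt))
```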
For variations inside coding regions (CDS annotations) the following fields can be displayed: