Exercise 3: Exploring contig documents

Open the contig document (which should be called "yghJ paired Illumina reads (trimmed) assembled to yghJ CDS (divergent reference)") to see how the reads map to the reference sequence. Under the Advanced settings tab to the right of the sequence viewer, ensure that "Vertically compress contig" is checked. This will display the reads in horizontal rows. Now go back to the General tab and ensure that Colors is set to "Paired Distance". This setting lets you see at a glance whether your paired-end reads map the expected distance apart, based on the insert size you specified when you set up the paired reads. In this contig most of the reads are green, meaning that they map with approximately the expected insert size. Click on the Insert sizes tab above the sequence viewer to see the actual distribution of insert sizes. Most pairs map with inserts of 450-500 bp, close to the expected size of 500 bp.
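The idea behind the Insert sizes tab can be sketched in a few lines of Python. This is only an illustration and not how Geneious computes its display: the function names, the 50 bp bin width, the 20% tolerance and the example insert sizes are all assumptions made up for this sketch.

```python
from collections import Counter


def bin_insert_sizes(insert_sizes, bin_width=50):
    """Count insert sizes per bin; each key is the lower edge of a bin."""
    counts = Counter((size // bin_width) * bin_width for size in insert_sizes)
    return dict(sorted(counts.items()))


def fraction_near_expected(insert_sizes, expected=500, tolerance=0.2):
    """Fraction of pairs whose insert size lies within +/- tolerance of expected."""
    if not insert_sizes:
        return 0.0
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    near = sum(1 for size in insert_sizes if low <= size <= high)
    return near / len(insert_sizes)


# Made-up insert sizes (bp) for eight read pairs; not taken from the tutorial data.
sizes = [455, 470, 480, 490, 495, 510, 900, 120]
histogram = bin_insert_sizes(sizes)
near = fraction_near_expected(sizes)
print(histogram)
print(near)
```

In this toy data set six of the eight pairs fall within 20% of the expected 500 bp, which is the kind of pattern the predominantly green reads in the contig indicate.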

Switch back to the Contig View. At the top of the contig you'll see a consensus sequence. Zoom in so that you can see the sequence bases. This is the consensus of the reads only and does not include the reference sequence. The settings used to call the consensus are found under the Display tab to the right of the sequence viewer. Because these reads have quality scores attached, Highest Quality should be chosen as the Threshold for calling the consensus sequence. This setting calculates a majority consensus that takes into account the relative quality scores for each base at that position (see the Geneious user manual for more information).
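To see why quality scores matter, here is a deliberately simplified sketch of a quality-weighted call at a single position. It is not the Geneious algorithm (the user manual describes that); it simply sums the quality scores per base and picks the largest total. The function name and the example column are assumptions for illustration.

```python
from collections import defaultdict


def quality_weighted_base(column):
    """Return the base with the highest summed quality.

    column is a list of (base, quality) pairs for one alignment position.
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    return max(sorted(totals), key=lambda base: totals[base])


# Three low-quality reads say A, one high-quality read says G (made-up values).
column = [("A", 10), ("A", 12), ("A", 8), ("G", 38)]
called = quality_weighted_base(column)
print(called)
```

A plain majority vote would call A here, but the summed quality for G (38) exceeds that for A (30), so the quality-weighted call is G.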

Now change the Threshold to "100% - Identical". You should see a number of ambiguous bases appear in the sequence. Under this setting an ambiguous base will be inserted anywhere there is a mixture of bases among the reads, even if only one read out of a thousand contains a different base. This setting should not be used for mapping NGS data, as it will introduce ambiguities in the consensus sequence that are the result of read errors, rather than real polymorphisms. If you do not have quality scores on your reads, then a Threshold of 90-99% is most appropriate for ensuring only real polymorphisms appear as ambiguous bases in your consensus sequence. Change the setting to 95% to see how it affects the ambiguities, then change it back to "Highest Quality" for the remainder of the tutorial.
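The effect of the percentage threshold can be illustrated with a small sketch. It assumes the consensus at each position is the smallest set of bases whose combined frequency reaches the threshold, written as an IUPAC code. Geneious may handle ties, gaps and quality differently, so treat the function below (a hypothetical name) as an approximation.

```python
from collections import Counter

# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def threshold_consensus_base(bases, threshold):
    """Call one consensus base from the read bases at a single position.

    Bases are added from most to least common until their combined share
    of the reads reaches the threshold (a fraction between 0 and 1).
    Only A, C, G and T are handled; gaps are ignored in this sketch.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    chosen = set()
    covered = 0
    for base, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.add(base)
        covered += count
        if covered >= threshold * total:
            break
    return IUPAC[frozenset(chosen)]


# 999 reads with A and a single read with G at the same position.
column = "A" * 999 + "G"
strict = threshold_consensus_base(column, 1.0)
relaxed = threshold_consensus_base(column, 0.95)
print(strict, relaxed)
```

With a 100% threshold the single discordant read forces the ambiguity code R (A or G), while at 95% the position is called as A.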

Underneath the consensus sequence you should be able to see a blue coverage graph as in the screenshot below. If you can't see this, click on the Graphs tab to the right of the sequence viewer and enable "Show graphs" and "Coverage". The coverage graph shows how many reads map at each base and can be useful for assessing the quality of your mapping.
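What the coverage graph plots can be sketched as follows: for each reference position, count the reads that span it. The representation of reads as 0-based (start, end) intervals with an exclusive end, and the function name, are assumptions for this example.

```python
def coverage_per_base(reference_length, read_intervals):
    """Return a list with the number of reads covering each reference position.

    read_intervals holds (start, end) pairs, 0-based with end exclusive.
    """
    # Record +1 where a read starts and -1 just past where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in read_intervals:
        delta[start] += 1
        delta[end] -= 1

    # A running sum of the changes gives the depth at each position.
    coverage = []
    depth = 0
    for change in delta[:reference_length]:
        depth += change
        coverage.append(depth)
    return coverage


# Three made-up reads mapped to a 10 bp reference.
reads = [(0, 4), (2, 6), (3, 8)]
coverage = coverage_per_base(10, reads)
print(coverage)
```

Position 3 is spanned by all three reads and so has the highest coverage, while the last two positions are not covered at all.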


There are two tools available in Geneious that allow you to quickly identify regions of high or low coverage:

1. Under the Graphs tab you can highlight regions above or below a certain coverage. Check "Highlight above" and set this to 50. You should now see a yellow bar under the coverage graph covering regions where the coverage is greater than 50.

2. You can annotate regions of high or low coverage by going to Annotate and Predict → Find Low/High Coverage.

We will use the second option to annotate regions of low coverage so that we can exclude these regions when we call SNPs in the next exercise. Check "Find regions with coverage below" and choose "Standard deviations from the mean" = 2. Check both "Merge regions" options and uncheck the High Coverage options. Click OK. You should now see a Coverage annotation track on the reference sequence, with annotations in regions of low coverage. Click Save and choose "Yes" when asked if you want to apply changes to the original sequence.
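The low-coverage criterion used above can be sketched as: flag every position whose coverage is more than two standard deviations below the mean, then join adjacent flagged positions into intervals. Whether Geneious uses the population or sample standard deviation is not stated here, so the use of the population form is an assumption, as are the function name and the example data.

```python
from statistics import mean, pstdev


def low_coverage_regions(coverage, num_sd=2.0):
    """Return (cutoff, regions) for positions below mean - num_sd * SD.

    regions is a list of (start, end) intervals, 0-based with end exclusive;
    consecutive low-coverage positions are reported as one interval.
    """
    cutoff = mean(coverage) - num_sd * pstdev(coverage)
    regions = []
    start = None
    for position, depth in enumerate(coverage):
        if depth < cutoff:
            if start is None:
                start = position  # a new low-coverage run begins
        elif start is not None:
            regions.append((start, position))  # the current run has ended
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return cutoff, regions


# Made-up coverage: a two-base dip to 5x within otherwise uniform 100x coverage.
coverage = [100] * 40 + [5] * 2 + [100] * 40
cutoff, regions = low_coverage_regions(coverage)
print(round(cutoff, 1), regions)
```

Only the two-base dip falls below the cutoff, so a single interval is reported, corresponding to one annotation on the Coverage track.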



