In this exercise you will build on your experience with dotplots and DNA alignments to align a pair of proteins. Click here to select the two protein sequences.
If you are not already in the Dotplot view, select it now. You may need to adjust the sensitivity settings: select High Sensitivity/Slow and set the window size to 10 and the threshold to 23. This should give you a good idea of where the sequences share common features. You should see two major regions of homology like this (red lines):
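To see what the window size and threshold settings control, here is a minimal sketch of a windowed dotplot. It assumes plain identity scoring (1 point per matching residue), which is not the scoring scheme Geneious uses, so the threshold of 23 above does not carry over to this toy. The function name and the two short sequences are made up for illustration.

```python
def dotplot(seq_a, seq_b, window=10, threshold=7):
    """Return the set of (i, j) window start positions scoring >= threshold.

    Scoring is plain identity (1 per matching residue). This is an assumption
    made for illustration, not the scoring Geneious applies.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Compare the two windows residue by residue along the diagonal.
            score = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if score >= threshold:
                dots.add((i, j))
    return dots


# Toy sequences (made up): a shared 11-residue segment with unrelated flanks.
a = "GGGGGMKTAYIAKQRQWWWWW"
b = "PPPMKTAYIAKQRQSSSSSSS"
dots = dotplot(a, b, window=10, threshold=9)
for i, j in sorted(dots):
    print(i, j)
```

Every dot lies on a single diagonal, which is the pattern you are looking for in the Dotplot view. Lowering the threshold or shortening the window lets more chance matches through; raising them removes noise but can also hide weak, genuine similarity.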
Now select Align/Assemble→Pairwise Align to perform a standard protein alignment. Select Geneious Alignment and Reset Defaults if necessary. Click OK and this will create an alignment document. Turn off the annotations and turn on the Identity graph in the Graphs panel. In the Highlighting options select Agreements to Consensus to make the view clearer. You should have a view that looks like this:
What is the level of pairwise identity between these two sequences and do you think it is a fair representation of their sequence homology?
The pairwise identity can be found in the Statistics panel. The sequences share 43.2% identity. However, there are regions of low identity and other regions of high identity so this number does not truly represent the overall homology.
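The point that a single percentage can hide large regional differences can be illustrated with a short sketch. It computes identity over a whole alignment and then over consecutive windows. The definition used here (identical columns divided by alignment length, with gap columns counted as non-identical) is one common convention and is an assumption; it may not match the Statistics panel figure exactly. The toy alignment is made up.

```python
def percent_identity(aln_a, aln_b):
    """Percent of alignment columns in which both sequences have the same residue.

    Columns containing a gap ('-') count as non-identical. This is one common
    definition, assumed here for illustration.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * matches / len(aln_a)


def windowed_identity(aln_a, aln_b, window=10):
    """Percent identity in each consecutive, non-overlapping window."""
    return [
        percent_identity(aln_a[i:i + window], aln_b[i:i + window])
        for i in range(0, len(aln_a), window)
    ]


# Toy alignment (made up): a conserved first half and an unrelated second half.
a = "MKTAYIAKQRGSDEVHLNPW"
b = "MKTAYIAKQRTCFGW-MYQA"
print(percent_identity(a, b))    # 50.0
print(windowed_identity(a, b))   # [100.0, 0.0]
```

The overall figure of 50% describes neither half of this toy alignment, which is the same situation as the conserved and variable regions in the exercise.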
These sequences are part of two Type I restriction enzymes and are the DNA sequence specificity domains. More information about restriction enzymes can be found here. Examine the alignment and you will see that the level of identity varies greatly along the length of the alignment. The regions that are conserved are the helical spacers shown in this sketch where the specificity subunit is shaded:
Depending on the family, these helical spacers are highly conserved. However, the Target Recognition Domains are highly variable and this variability is evidence for horizontal transfer. Without the high level of identity in the helical regions it is unlikely that a decent alignment could be made for these two sequences. The TRD regions can be considered to be aligned purely by chance as the identity is so low. The algorithm has simply computed the cheapest path through this region of mismatch.
In this next example, we have two sequences which are distantly related across their entire length. Click here to select them.
Look at the dotplot and you will see that you need to adjust the sensitivity settings significantly to see an obvious diagonal. Hint: try selecting High Sensitivity and increasing the Window Size.
Again, perform an alignment using default settings and examine the result.
Note the % pairwise identity between these two sequences. Is this a fair representation of their homology?
These sequences share 25.2% identity. Mismatches are spread quite evenly throughout the two sequences so this accurately reflects their overall homology.
The identities here are more evenly spread than in the previous example. This alignment is in the twilight zone for a protein alignment, although it has features that suggest it is a good match. The clumping of the identities and the fact that the alignment covers most of the length of the two sequences are good signs. Since both these sequences are annotated, you can turn on annotations and you will see that various regions in the two sequences line up well. Manipulate the view using the zoom functions and other features to observe the details of the alignment and the relationship between the sequences.
Look at the pattern of identities and regions of mismatch and note how these relate to the annotated domains and regions. The first part of the alignment has a generally higher level of identity than the second part and this corresponds to the HTH region versus the LysR substrate region. This suggests that conservation needs to be higher in the first domain in order to preserve the function.
There is a region labelled "HTH 1 Region" in both sequences. Examine how the pattern of matches and mismatches relates to periodicity in alpha helical structures. Turn on the hydrophobicity graph and see if there is a relationship between this and the pattern of matches. The HTH 1 Region is helical and there are 3.6 residues per turn of a helix. This periodicity does appear to be reflected in the conservation of matches. The hydrophobicity graph supports this: hydrophobic residues tend to be conserved, so as the backbone moves from buried to exposed around the helix, the buried residues are conserved and the exposed residues are variable.
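The 3.6 residues per turn figure means each residue is rotated about 100 degrees around the helix axis from the previous one, so residues i, i+3, i+4 and i+7 sit on roughly the same face. The sketch below places hydropathy values around an ideal helix and sums them as vectors (a hydrophobic moment). It uses the Kyte-Doolittle scale as an assumption; the choice of scale and the two toy peptides are for illustration only and are not taken from the exercise sequences.

```python
import math

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def helix_angles(length, degrees_per_residue=100.0):
    """Angular position of each residue around the helix axis.

    3.6 residues per turn corresponds to 360 / 3.6 = 100 degrees per residue.
    """
    return [(i * degrees_per_residue) % 360.0 for i in range(length)]


def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Magnitude of the vector sum of hydropathy values placed around the helix."""
    delta = math.radians(degrees_per_residue)
    x = sum(KYTE_DOOLITTLE[r] * math.cos(i * delta) for i, r in enumerate(seq))
    y = sum(KYTE_DOOLITTLE[r] * math.sin(i * delta) for i, r in enumerate(seq))
    return math.hypot(x, y)


# Toy peptides (made up): one with hydrophobic residues every 3 to 4 positions,
# and one that is uniformly hydrophobic.
amphipathic = "LKKLLKKLLKKLLKKLLK"
uniform = "L" * 18

print(helix_angles(8))
print(hydrophobic_moment(amphipathic))
print(hydrophobic_moment(uniform))
```

The peptide with the 3 to 4 residue repeat gives a much larger moment than the uniform one, because its hydrophobic residues all point to one side of the helix. A similar repeat in the pattern of conserved positions is what you are looking for in the HTH 1 Region.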
One of these sequences has secondary structure annotations for the LysR Substrate region. Looking at this alignment and other evidence such as patterns of matches, would you think it is likely that the two sequences share the detailed secondary structure? The annotated domains line up nicely and the pattern of hydrophobicity supports the pattern of secondary structure elements, so it is likely that the secondary and tertiary structures of the two proteins are very similar despite the low overall identity.
Realign these two sequences using a strict Blosum90 table and the Smith-Waterman algorithm. Look at what has happened to the alignment identity and length. Aligning these two sequences with Blosum90 and Smith-Waterman truncates the alignment and increases the reported sequence identity. This is because the Blosum90 scoring table penalises mismatches more than Blosum62. For a distantly related pair of sequences, this tends to give a lower overall score, so regions of poor match become harder to traverse. The local alignment algorithm combined with this lower scoring prevents the alignment of the lower identity C-terminal ends of the two sequences, resulting in the higher apparent identity (28.1%). Although the dotplot shows that there is a match beyond approximately residue 190 in each sequence, this stricter alignment is not able to bridge the gap because the score benefit is insufficient.
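The truncation effect can be reproduced with a minimal Smith-Waterman sketch. As a stand-in for the difference between Blosum62 and Blosum90, it uses a flat match score with a mild or a harsh mismatch penalty; this is an analogy only and not the real matrices. The linear gap penalty, the parameter values and the toy sequences are all assumptions for illustration, and Geneious's own implementation may differ.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-4):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never drop below 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (made up): a 10-residue conserved block, 4 mismatches,
# then a shorter 6-residue conserved block.
a = "MKTAYIAKQR" + "GSDE" + "VHLNPW"
b = "MKTAYIAKQR" + "CFCF" + "VHLNPW"

lenient = smith_waterman(a, b, mismatch=-1)  # mild mismatch penalty
strict = smith_waterman(a, b, mismatch=-4)   # harsh mismatch penalty
print(lenient)
print(strict)
```

With the mild penalty the alignment crosses the mismatched stretch and covers all 20 residues at 80% identity. With the harsh penalty, the cost of crossing the mismatches exceeds what the second block can earn back, so the best local alignment stops after the first block: shorter, but 100% identical. This is the same trade-off that produces the truncated, higher-identity alignment in the exercise.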
You should realise now that alignment is not a simple process and that alignments do not necessarily represent biological truth. Use them in conjunction with other available information and take particular care when working with distantly related sequences. This will help you understand what is going on with the search parameters and alignments in BLAST searches and also when you start working with multiple alignments.