Exercise 2: How it works and choosing parameters

How it works

The sequence classifier performs a pairwise local alignment between the query sequence and each sequence in the database. When multiple loci are used, the pairwise alignments for each gene are performed separately. These alignments are then concatenated wherever database sequences have the same name for each gene.

The "overlap identity" between your query and each database sequence is then used to determine the likely taxon of your query sequence, by picking the database sequence with the highest identity to the query. The overlap identity is the pairwise identity over the region that the query and database sequence have in common; data in end gap regions is ignored. The sequence will only be classified if it meets a minimum overlap identity that the user specifies. If multiple database sequences have similar overlap identities, the query sequence will be classified to the taxonomic level that these sequences have in common.
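As a rough illustration of the overlap identity described above, the following Python sketch computes percent identity over the columns where both aligned sequences overlap, skipping leading and trailing gaps. It is not the classifier's actual implementation: the function name overlap_identity is hypothetical, the input is assumed to be two rows of an existing pairwise alignment with "-" as the gap character, and counting internal gaps as mismatches is an assumption.

```python
def overlap_identity(aligned_query: str, aligned_db: str, gap: str = "-") -> float:
    """Percent identity over the region shared by two aligned sequences.

    Columns that fall in the end gaps of either sequence are ignored.
    Assumption: internal gaps count as mismatches.
    """
    if len(aligned_query) != len(aligned_db):
        raise ValueError("aligned sequences must have equal length")

    def residue_span(seq: str) -> tuple[int, int]:
        # Half-open column range covered by the sequence, end gaps excluded.
        start = len(seq) - len(seq.lstrip(gap))
        end = len(seq.rstrip(gap))
        return start, end

    q_start, q_end = residue_span(aligned_query)
    d_start, d_end = residue_span(aligned_db)
    start, end = max(q_start, d_start), min(q_end, d_end)
    if start >= end:
        return 0.0  # the two sequences do not overlap at all

    matches = sum(
        1
        for a, b in zip(aligned_query[start:end], aligned_db[start:end])
        if a == b and a != gap
    )
    return 100.0 * matches / (end - start)


# The query covers only the middle 8 columns; 7 of those 8 columns match.
print(overlap_identity("--ACGTACGT--", "ACACGTTCGTAA"))
```

Because the four end gap columns are skipped, the short query is not penalised for being shorter than the database sequence.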

It is also possible to set cutoffs for classifying sequences at various taxonomic levels. For instance, if 95% is set as a minimum identity to classify at species level, and the top match to the database has an overlap identity of 94.5%, then the query will only be classified to genus level. Thus, you need to know the approximate levels of sequence identity within and between the different taxonomic levels of your database sequences in order to choose the correct settings for classification.


Choosing appropriate parameters

Select the Unknown sequences list and open the Sequence classifier by going to Tools→Classify Sequences. In the Classification panel you'll see options where you can set the Minimum overlap identity to classify at various taxonomic levels. We will look at a multiple alignment of all the database sequences to decide the appropriate parameters to use here.

Close the Classify Sequences window by clicking the Cancel button, and open the control region alignment document. Switch to the Distances tab and choose "% identity" in the Matrix option to display the % identity between sequences. For control region sequences, within-species identity ranges from about 95% to 100%, and between-species identity ranges from 90% to 99%. For the cytochrome b alignment, within-species identity is about 98-100% and between-species identity is 93-99%. Thus, as within-species identity may be as low as 95%, we should set this as the minimum value to classify at species level, and 90%, the lowest between-species identity, as the minimum value to classify at genus level.
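If the identity matrix is exported, the within-species and between-species ranges could also be summarised with a short script rather than read off by eye. The Python sketch below assumes a hypothetical data layout (a dictionary of pairwise identities keyed by pairs of sequence names, plus a name-to-species mapping); the sequence names and identity values are made-up toy data, not values from the tutorial alignments.

```python
from itertools import combinations


def identity_ranges(identity, species):
    """Summarise pairwise % identity within and between species.

    identity: dict mapping frozenset({name_a, name_b}) -> % identity
    species:  dict mapping sequence name -> species label
    Returns two (min, max) tuples, or None for a group with no pairs.
    """
    within, between = [], []
    for a, b in combinations(sorted(species), 2):
        value = identity[frozenset((a, b))]
        if species[a] == species[b]:
            within.append(value)
        else:
            between.append(value)

    def value_range(values):
        return (min(values), max(values)) if values else None

    return value_range(within), value_range(between)


# Toy data: two sequences from one species and one from another.
toy_species = {"A1": "sp_A", "A2": "sp_A", "B1": "sp_B"}
toy_identity = {
    frozenset(("A1", "A2")): 97.0,
    frozenset(("A1", "B1")): 92.0,
    frozenset(("A2", "B1")): 93.5,
}

within_range, between_range = identity_ranges(toy_identity, toy_species)
print("within species:", within_range)
print("between species:", between_range)
```

Following the reasoning above, the lower bound of the within-species range would be a candidate for the species-level cutoff, and the lower bound of the between-species range a candidate for the genus-level cutoff, provided all database sequences belong to the same genus.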


Exercise 3: Running the sequence classifier
Exercise 4: Interpreting the results