The Classify Sequences plugin uses a BLAST-like algorithm to compare your query sequences to those in a specified database. Here we will use kiwi mitochondrial DNA sequences available on GenBank as our database to classify our unknown sequences. We will use sequences from two loci: cytochrome b and the control region. Sequences from these two loci need to be identified separately in both the database and query sequences.
The tutorial folder contains a subfolder called Database sequences. Click on this folder, and then open the control region sequences folder within it to see how the database sequences are formatted. These sequences were downloaded from GenBank, and the sequence names have been edited so that they are in the correct format for the database.
You'll see that each sequence name is in the format sequence name-locus. Each sequence in the database must have a unique name. For these sequences I have used Batch Rename to edit the original GenBank files so that the sequence name contains the organism, followed by the specimen voucher ID and/or haplotype name (if this information was on the original GenBank record), as this information may be useful for classifying the sequences beyond species level.
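If you keep a list of your database sequence names outside Geneious, a quick check for duplicates can catch naming problems before you build the database. The following is only a sketch in Python, not part of the plugin: the function name find_duplicate_names and the example sequence names are made-up placeholders, not the names used in the tutorial files.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the sequence names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical names in the "sequence name-locus" format (placeholders only).
names = [
    "Apteryx mantelli hapA-control region",
    "Apteryx mantelli hapB-control region",
    "Apteryx mantelli hapA-control region",  # accidental repeat
]

for duplicate in find_duplicate_names(names):
    print("Duplicate name:", duplicate)
```

An empty result means every name in the list is unique.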
Because we will be using multiple loci to classify our sequences, the locus name (control region) has been appended to the sequence name with a specific delimiter (in this case "-"). If the sequence name up to the delimiter is identical for different loci, then pairwise alignments containing these sequences are concatenated for the overall result. Note that if you are only using a single locus you do not need to include the locus name in the sequence name.
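To illustrate the naming convention, here is a minimal Python sketch of how names could be split at the delimiter and grouped so that different loci from the same specimen end up together. This is not the plugin's own code. The functions split_name and group_by_base are hypothetical, and the sketch assumes the locus is everything after the last delimiter in the name, which may not match how the plugin parses names.

```python
from collections import defaultdict


def split_name(name, delimiter="-"):
    """Split a name into (base, locus) at the LAST delimiter (an assumption).

    Names without the delimiter are returned as (name, None).
    """
    base, sep, locus = name.rpartition(delimiter)
    if not sep:
        return name, None
    return base, locus


def group_by_base(names, delimiter="-"):
    """Map each base name to a {locus: full name} dictionary."""
    groups = defaultdict(dict)
    for name in names:
        base, locus = split_name(name, delimiter)
        groups[base][locus] = name
    return dict(groups)


# Query names built from the tutorial's convention: name, delimiter, locus.
names = [
    "Unknown1-control region",
    "Unknown1-cytochrome b",
    "Unknown2-control region",
]

for base, loci in group_by_base(names).items():
    print(base, "->", sorted(loci))
```

Names that share the same base, such as the two Unknown1 entries here, are the ones whose results would be combined across loci.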
Now check the sequences in the cytochrome b folder and you'll see that the sequence names are in the same format, but have "-cytochrome b" appended to the end.
Now go back to the top folder and look at the Unknown sequences list. This list contains three query sequences, named Unknown1, Unknown2 and Unknown3. Sequences from cytochrome b and the control region are in separate files, with the locus name appended in the same way as for the database sequences.