Step 3: Batch BLAST OTUs and create a taxonomy database.

In this exercise we will create a curated database specific to our dataset by blasting the OTUs we generated in the previous step to the preformatted NCBI 16S Microbial database. This will enable us to pull out the most relevant accessions to use for taxonomic classification of the entire dataset.


1. Installation of local 16S Microbial database

We will BLAST against the 16S Microbial database from NCBI, which is a curated set of 16S rRNA sequences from bacterial and archaeal type strains. Because this database is relatively small, it is faster to set up a local copy and BLAST against that than to send the sequences to NCBI.

To BLAST against a local database you must first install the custom BLAST executables, if you have not already done so. To do this go to Tools → Add/Remove Databases → Set up BLAST services. Change the Service to Custom BLAST, check Let Geneious do the setup, note the database location and click OK. This will download and install the BLAST program at the location specified. Once the download has finished, navigate to this location on your drive and find the BLAST/data folder. This is where you will put the preformatted files from NCBI.

To get the preformatted BLAST database files, click on this link to go to the BLAST ftp site. The first file in the list should be 16SMicrobial.tar.gz. Click on this file to download it, and then uncompress the file. Once uncompressed, you should have a folder called 16SMicrobial containing files with names like 16SMicrobial.nni, 16SMicrobial.nsi, etc. Move all the files out of the 16SMicrobial folder and into the BLAST/data folder that was created when you set up custom BLAST.
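If you would rather script the unpack-and-move step than do it by hand, it can be sketched in Python as follows. This is only an illustration using the standard library: the function name `install_blast_db` and the temporary folder name are hypothetical, and you would need to supply the paths to your own downloaded archive and to the BLAST/data folder noted during setup.

```python
import shutil
import tarfile
from pathlib import Path


def install_blast_db(archive, blast_data_dir):
    """Unpack a preformatted BLAST database archive and move its files
    into blast_data_dir. Returns the sorted names of the moved files."""
    archive = Path(archive)
    blast_data_dir = Path(blast_data_dir)
    blast_data_dir.mkdir(parents=True, exist_ok=True)

    # Unpack next to the archive in a temporary folder (hypothetical name).
    unpack_dir = archive.parent / "unpacked_blast_db"
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(unpack_dir)

    # The archive may or may not wrap its files in a sub-folder,
    # so collect every regular file before moving anything.
    files = [p for p in unpack_dir.rglob("*") if p.is_file()]
    for path in files:
        shutil.move(str(path), str(blast_data_dir / path.name))

    shutil.rmtree(unpack_dir)
    return sorted(p.name for p in files)
```

Calling `install_blast_db("16SMicrobial.tar.gz", "/path/to/BLAST/data")` with your own paths should leave the database files directly inside the data folder, which is the layout described above.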

Once you have added the 16S Microbial files to your BLAST/data folder, restart Geneious so that it picks up the new database. You are now ready to run the BLAST search.



Further instructions for installing custom BLAST and using preformatted BLAST databases are in chapter 15 of the User manual, and this post on the Geneious Knowledge Base.

Note: If you are not working with 16S sequences and do not have a suitable stand-alone BLAST database, it is possible to blast to NCBI's nt database (with some Entrez filters to exclude environmental, uncultured and unclassified sequences) for this step. Be aware that this will take a lot longer to run than using a local BLAST database targeted to your amplicon sequence.


2. Batch BLAST OTU consensus sequences

Select both the Consensus Sequences and Unused Reads lists then click the Blast button.

Under the Database dropdown menu, you should now see the 16S Microbial database you created - select this as the database. Set the rest of the settings as in the screenshot below. We will only return the top hit (Maximum Hits=1) and retrieve the matching region with annotations, as this will retrieve the taxonomic information from the database as well.

Under the advanced options you can increase the speed of the BLAST search by increasing the number of CPUs. You should set this to 1 less than the total number of CPUs on your machine (e.g. if your computer has a quad-core processor, set it to 3).
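For readers who want to see what these settings correspond to outside the Geneious interface, the sketch below works out the "one less than the total" CPU count and builds an argument list for the standalone BLAST+ `blastn` program. It is an assumption that these options mirror what Geneious passes internally; the helper names and the file names `otus.fasta` and `otu_top_hits.tsv` are hypothetical.

```python
import os


def blast_threads(total_cpus=None):
    """Return one less than the number of CPUs, but never less than 1."""
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1
    return max(1, total_cpus - 1)


def blastn_command(query_fasta, database, threads, out_file="otu_top_hits.tsv"):
    """Build a blastn argument list that keeps at most one subject per query."""
    return [
        "blastn",
        "-query", query_fasta,        # OTU consensus sequences (hypothetical file)
        "-db", database,              # name of the local preformatted database
        "-max_target_seqs", "1",      # counterpart of Maximum Hits = 1
        "-num_threads", str(threads),
        "-outfmt", "6",               # tabular output
        "-out", out_file,
    ]


command = blastn_command("otus.fasta", "16SMicrobial", blast_threads(4))
print(" ".join(command))
```

The list could be handed to `subprocess.run` on a machine where `blastn` is installed and on the PATH; here it is only printed so the example runs without BLAST+ present.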



Click Search to run the BLAST. Note that you will get an Ambiguous Query warning because some consensus sequences contain ambiguities. Click OK to continue.

Once the search has finished, you will see a dialog stating that a small number of sequences had no results. These are most likely either contaminant or incorrectly merged sequences. For each query that does return a result you’ll see one BLAST alignment document in the document table. Select the Alignment View tab to view the alignment between the query and the hit for each document.


3. Creating a sequence classifier database from the BLAST hits

Once the BLAST results are returned we need to do some processing in order to get them in a format where they can be used as a database for the Sequence Classifier. We will perform the following steps:

A. Remove duplicates
B. Download the full hits
C. Extract the BLAST hit regions
D. Create a database for the Sequence Classifier tool


3A. Removing Duplicates

Due to the high stringency of our de novo assembly, it is likely that some OTUs will have returned the same best BLAST hit, so we should check for and remove duplicate sequences.

Select the folder containing the BLAST hits, then select all the returned BLAST hits and go to Edit → Find Duplicates. Set the options to:

Find: Documents with the same name
Search Scope: Current Folder
What to do: Select most recently modified duplicates


This will select all duplicate files. Press Delete to remove them. In this example, doing so reduces the number of BLAST hits in the folder from 140 to 47.
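The logic of this step, keeping one document per name and discarding the rest, can be sketched in a few lines of Python. This is an illustration only, not how Geneious implements Find Duplicates; the function name and the hit names are hypothetical placeholders.

```python
def remove_duplicate_names(hit_names):
    """Keep the first occurrence of each name and drop later duplicates,
    preserving the original order."""
    seen = set()
    unique = []
    for name in hit_names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


# Placeholder names: several OTUs returned the same best hit.
hits = ["hit_A", "hit_B", "hit_A", "hit_C", "hit_B", "hit_A"]
print(remove_duplicate_names(hits))
```

Keeping the first occurrence corresponds to deleting the most recently modified duplicates, on the assumption that the list is ordered from oldest to newest.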

3B.  Download the Hits

At this stage our BLAST hits are only summaries, so we need to download the full sequences. Select all sequences in the Document Table and click the Download Full Sequence/s button. This may take a minute or so to complete.


3C.  Extract the Hits region

To keep our database small and speed up the classification process, we will now extract only the regions of the BLAST hits relevant to our amplicon.

To do this, select all sequences in the Document Table, click on the Annotations tab, and under Type, select Blast Hit. Click in the table and use control/command-A to select all, then click the Extract button to extract all Blast hit regions to a sequence list file.


By default, the list file created will have a name like "Extraction of 47 annotations".
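Conceptually, extracting a hit region means slicing each full-length sequence down to the interval covered by its annotation. A minimal Python sketch of that idea is shown below; it assumes the annotation interval is given as 1-based, inclusive start and end positions, and the function name and example sequence are hypothetical.

```python
def extract_hit_region(sequence, start, end):
    """Return the subsequence covered by a 1-based, inclusive interval."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("interval must lie within the sequence")
    # Convert the 1-based inclusive interval to Python's 0-based slice.
    return sequence[start - 1:end]


full_sequence = "ACGTACGTAC"  # made-up sequence for illustration
print(extract_hit_region(full_sequence, 3, 6))
```

Applying this to every downloaded hit would give one short sequence per annotation, which is what the extracted list file contains.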

3D.  Creating the database

This step is simple: we place the list file we just created into a new folder, which then serves as our 16S database.

To do this, select a suitable location in the Sources panel, right-click and choose New Folder.

Give the folder the name SRR7140083 16S database.

Then go back to the location of your new list file and drag it into the new 16S database folder.


Renaming the database entries

In the last step we will rename the extracted BLAST hits, giving each one the name of its source organism.

To do this, select the database list, go to Edit → Batch Rename and use the settings shown below.

When you select the first dropdown list for the Replace with setting you may notice that there are two occurrences of Organism.  The first refers to metadata associated with the list file, the second refers to metadata associated with individual sequences within the list. Be sure to select the second occurrence of Organism from the Replace with dropdown list.

Once you have selected the appropriate settings, click OK. You will then be shown a preview window that allows you to confirm that the rename operation will change each sequence name from an accession number to a binomial species name.
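The effect of the rename can be pictured as replacing each sequence's name with the value of its own Organism field. The Python sketch below illustrates this and produces an old-name/new-name preview similar in spirit to the confirmation window; the record layout, function name and accession strings are hypothetical placeholders, not Geneious data structures.

```python
def rename_by_organism(records):
    """Return (old_name, new_name) pairs, using each record's organism
    as the new name and keeping the old name when no organism is set."""
    preview = []
    for record in records:
        new_name = record.get("organism") or record["name"]
        preview.append((record["name"], new_name))
    return preview


# Placeholder accessions; organism values stand in for sequence-level metadata.
records = [
    {"name": "ACC_0001", "organism": "Escherichia coli"},
    {"name": "ACC_0002", "organism": "Bacillus subtilis"},
    {"name": "ACC_0003", "organism": ""},
]
for old, new in rename_by_organism(records):
    print(old, "->", new)
```

Falling back to the old name when the organism is empty is a design choice of this sketch, made so that no sequence ends up without a name.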

