Step 2: Clustering reads into OTUs using the de novo assembler.

Amplicon datasets from NGS typically contain millions of reads, so it is not practical to BLAST every sequence to assign taxonomy. Instead, we define operational taxonomic units (OTUs) by clustering the reads by similarity, and BLAST one representative sequence from each OTU. We can then use these BLAST results as a targeted taxonomic database with the Sequence Classifier plugin to quantify the biodiversity in the full read set (described in Step 4).

In this step we will perform a de novo assembly with customized, high-stringency settings to cluster all closely related sequences into separate contigs. The consensus sequence from each contig will represent an OTU. This should reduce the dataset substantially, leaving a much smaller set of sequences for the subsequent batch BLAST.
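To make the idea of similarity-based clustering concrete, here is a minimal Python sketch of greedy clustering with an identity threshold. It is not the algorithm used by the de novo assembler; the function names, the 0.97 default threshold, and the crude position-by-position identity measure (a stand-in for a real alignment) are all assumptions chosen for illustration.

```python
def identity(a: str, b: str) -> float:
    """Crude identity: matching positions divided by the longer length.

    Illustrative stand-in only; a real tool would align the sequences first.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(reads, min_identity=0.97):
    """Assign each read to the first cluster whose representative it matches.

    Returns a list of (representative, members) tuples. The threshold is a
    hypothetical value, not a setting taken from the tutorial.
    """
    clusters = []
    # Longest reads first, so they become cluster representatives.
    for read in sorted(reads, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, read) >= min_identity:
                members.append(read)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((read, [read]))
    return clusters
```

Each representative plays the role that a contig consensus sequence plays in this step: one sequence standing in for a group of closely related reads. Reads that end up alone in a cluster correspond loosely to the unused reads the assembler reports.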

Select the trimmed, merged, and length-filtered read set you prepared in the previous exercise (SRR7140083_50000 (trimmed) (merged) - length 150-260) and go to Align/Assemble → De novo assembly. First set Sensitivity: to "Low sensitivity / Fastest" to load the high-stringency settings, then change Sensitivity: to Custom Sensitivity and adjust the settings to match those shown below. This ensures that each contig comprises only closely related sequences.



Click OK to start the de novo assembler. The assembly may take 5-10 minutes. Once it completes, a new folder containing the assembly report appears under the tutorial folder. Select this folder and view the assembly report.

With our data set, the above settings should return 58 contigs and 86 unused reads. The consensus sequence for each contig is in the Consensus Sequences list, and a second list contains the unused reads, which represent the unclustered unique sequences. In the next step we will BLAST the sequences from these two lists.
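As a quick check on how much this step shrinks the BLAST workload, the arithmetic can be sketched in Python. The contig and unused-read counts come from the text above; the function names are illustrative, and the input read count is a hypothetical placeholder to be replaced with the actual number of sequences in your filtered read set.

```python
def blast_query_count(n_contigs: int, n_unused_reads: int) -> int:
    """One BLAST query per contig consensus plus one per unused read."""
    return n_contigs + n_unused_reads


def reduction_factor(n_input_reads: int, n_queries: int) -> float:
    """How many times fewer sequences need to be BLASTed after clustering."""
    return n_input_reads / n_queries


queries = blast_query_count(n_contigs=58, n_unused_reads=86)
print(queries)  # 144

# Hypothetical placeholder: substitute the real size of your read set.
assumed_input_reads = 50_000
print(round(reduction_factor(assumed_input_reads, queries)))
```

With the counts reported above, the batch BLAST in the next step needs 144 queries in total.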
