4.3 Find Duplicates

Find Duplicates, under the Edit menu, identifies duplicate copies of sequences and other documents. Duplicates can be identified by sequence name, database ID (e.g. accession), or residues/bases. The Search Scope can be set to check within a selected set of documents, all documents in a folder, or the sequences of a single alignment or sequence list.

When searching for duplicates within sequences of a single alignment or sequence list, two options are available for displaying results once the search has run:

Select earlier duplicates in list: Selects all but one copy of each duplicated document, so that the duplicates can easily be deleted or moved to another folder, leaving one copy behind.
Extract unique sequences: Extracts the unique sequences to a new sequence list and modifies each sequence name to show the duplicate count for that sequence. For large data sets, removing duplicates in paired reads, or removing non-exact duplicates, see Remove Duplicate Reads using BBTools.
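
The logic behind these two result options can be illustrated with a minimal Python sketch. This is not the application's implementation: the Record type, both function names, the exact-string comparison of residues, and the " (xN)" name suffix are all assumptions made for illustration, as is the reading that "earlier duplicates" means every copy except the last one in the list.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for one sequence in a list or alignment."""
    name: str
    residues: str


def earlier_duplicate_indices(records: List[Record]) -> List[int]:
    """Return the positions of every copy except the last in each duplicate group.

    Duplicates are compared by exact residue string (an assumption).
    """
    last_seen = {}
    for index, record in enumerate(records):
        last_seen[record.residues] = index  # later copies overwrite earlier ones
    return [
        index
        for index, record in enumerate(records)
        if last_seen[record.residues] != index
    ]


def extract_unique(records: List[Record]) -> List[Record]:
    """Return one record per distinct residue string, with the copy count in its name.

    The " (xN)" suffix is a hypothetical naming format.
    """
    counts = {}
    first_name = {}
    for record in records:
        counts[record.residues] = counts.get(record.residues, 0) + 1
        first_name.setdefault(record.residues, record.name)
    # Dicts preserve insertion order, so output follows first appearance.
    return [
        Record(f"{first_name[residues]} (x{count})", residues)
        for residues, count in counts.items()
    ]


if __name__ == "__main__":
    reads = [
        Record("a", "ACGT"),
        Record("b", "TTGA"),
        Record("c", "ACGT"),
        Record("d", "ACGT"),
    ]
    print(earlier_duplicate_indices(reads))
    for record in extract_unique(reads):
        print(record.name, record.residues)
```

Swapping record.residues for record.name (or a database ID field) as the comparison key would model the other duplicate criteria in the same way.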

If you are searching for duplicates within a folder or multiple selected documents, you can choose to select either the most recently or least recently modified copy.

Remove Duplicate Reads

To identify non-exact duplicates, remove exact duplicates from large data sets, or remove duplicates from paired-read data sets, use Remove Duplicate Reads... from the Sequence menu. This tool runs Dedupe from the BBTools suite.

For a detailed explanation of any Dedupe setting, hover the mouse over the setting, or click the help (question mark) button next to the custom options under More Options.
