3.2 Data input formats

Geneious Prime version 2022.1 can import the following file formats:






Format | Extensions | Data types | Common sources
BED | *.bed | Annotations | UCSC
Common Assembly Format | *.caf | Contigs | Sequencher
Clone Manager molecule | *.cm5 | Sequences and annotations | Clone Manager
Clustal | *.aln | Alignments | ClustalX
CSFASTA | *.csfasta | Color space FASTA | ABI SOLiD
Comma/Tab Separated Values | *.csv, *.tsv | Spreadsheet files | Microsoft Excel
DNAStar | *.seq, *.pro | Nucleotide & protein sequences | DNAStar
DNA Strider | *.str | Sequences | DNA Strider (Mac program), ApE
EMBL/UniProt | *.embl, *.swp | Sequences | EMBL, UniProt
EMBOSS codon usage table | *.cusp, *.cut | Codon usage table | EMBOSS cusp tool
EndNote (8.0 or 9.0) XML | *.xml | Journal article references | EndNote, journal article websites
Excel spreadsheet | *.xlsx, *.xls | Spreadsheet files | Microsoft Excel
FASTA | *.fasta, *.fas, *.fasta.gz etc. | Sequences, alignments | PAUP*, ClustalX, BLAST, FASTA
FASTQ | *.fastq, *.fq, *.fastq.gz etc. | Sequences with quality | Illumina and other NGS sequencers
GCG | *.seq | Sequences | GCG
GCG codon usage table | *.cod | Codon usage table | GCG CodonFrequency tool, https://www.kazusa.or.jp/codon/
GenBank | *.gb, *.xml | Nucleotide & protein sequences | GenBank
Geneious | *.xml, *.geneious | Preferences, databases | Geneious
Geneious Education | *.tutorial.zip | Tutorial, assignment etc. | Geneious
GFF, GFF3, GTF | *.gff, *.gff3, *.gtf | Annotations | NCBI, Ensembl and other genome browsers
MEGA | *.meg | Alignments | MEGA
Molecular structure | *.pdb, *.mol, *.xyz, *.cml, *.gpr, *.hin, *.nwo | 3D molecular structures | 3D structure databases and programs
Newick | *.tre, *.tree, etc. | Phylogenetic trees | PHYLIP, Tree-Puzzle, PAUP*, ClustalX
Nexus | *.nxs, *.nex | Trees, alignments | PAUP*, Mesquite, MrBayes & MacClade
PDB | *.pdb | 3D protein structures | SP3, SP2, SPARKS, Protein Data Bank
PDF | *.pdf | Documents, presentations | Adobe Acrobat, LaTeX, MiKTeX
Phrap ACE | *.ace | Contig assemblies | Phrap/Consed
PileUp | *.msf | Alignments | pileup (GCG)
PIR/NBRF | *.pir | Sequences, alignments | NBRF PIR
Qual | *.qual | Quality file | Associated with a FASTA file
Raw sequence text | *.seq | Sequences | Any file that contains only a sequence
Rich Sequence Format | *.rsf | Sequences, alignments | GCG's NetFetch
SAM/BAM | *.sam, *.bam | Contigs | SAMtools
Sequence Chromatograms | *.ab1, *.scf | Raw sequencing trace & sequence | Sequencing machines
SnapGene sequence | *.dna, *.prot | Sequences and annotations | SnapGene
Text/HTML | *.txt, *.rtf, *.html | Any text | Simple text editors
VCF | *.vcf | Annotations | 1000 Genomes Project
Vector NTI sequence | *.gb, *.gp | Nucleotide & protein sequences | Vector NTI
Vector NTI/AlignX alignment | *.apr | Alignments | Vector NTI, AlignX
Vector NTI Archive | *.ma4, *.pa4, *.oa4, *.ea4, *.ca6 | Nucleotide & protein sequences, enzyme sets and publications | Vector NTI
Vector NTI/ContigExpress | *.cep | Nucleotide sequence assemblies | Vector NTI
Vector NTI database | VNTI Database | Nucleotide & protein sequences, enzyme sets and publications | Vector NTI





BED annotations

The BED format contains sequence annotation information. You can use a BED file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences.
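
As an illustration of what a BED record holds, the following is a minimal sketch in Python. It assumes tab-separated lines with the three mandatory columns (chromosome, start, end) and an optional name column; the names `BedRecord`, `parse_bed_line` and `to_one_based` are hypothetical and are not part of Geneious. BED start positions are 0-based and end positions are exclusive, so the sketch also shows the conversion to 1-based inclusive coordinates.

```python
from typing import NamedTuple, Optional, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int            # 0-based, inclusive (as stored in the file)
    end: int              # 0-based, exclusive
    name: Optional[str]   # optional fourth column


def parse_bed_line(line: str) -> Optional[BedRecord]:
    """Parse one BED line; return None for blank, comment, track or browser lines."""
    line = line.strip()
    if not line or line.startswith(("#", "track", "browser")):
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name)


def to_one_based(record: BedRecord) -> Tuple[int, int]:
    """Convert a BED interval to 1-based, inclusive (start, end) coordinates."""
    return (record.start + 1, record.end)


rec = parse_bed_line("chr1\t99\t200\texon1")
print(rec, to_one_based(rec))
```

The conversion only shifts the start position, because an exclusive 0-based end and an inclusive 1-based end are the same number.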

Clone Manager

Geneious can import annotated sequence files in the standard Clone Manager molecule format (.cm5). This imports the name, description, topology, sequence and annotations. Other fields, restriction cut sites and primer binding sites are currently not imported.

Other Clone Manager formats such as .cx5 and .pd4 are not currently supported for import.

CLUSTAL alignment

The Clustal format is used by the well-known multiple sequence alignment programs ClustalW, ClustalX and Clustal Omega.

Clustal format files store multiple sequence alignments and begin with the word CLUSTAL. An example Clustal file:

CLUSTAL W (1.74) multiple sequence alignment  
 
HQ625570 MRVMGMWRNYPQWWIWGILGLWM--ICSVVGKLWVTVYYGVPVWTDAKATLFCASDAKAY  
HQ625589 MRVKGRSRNYPQWWVWGILGFWMFMICNGVGNRWVTVYYGVPVWKEAKATLFCASDAKAY  
HQ625572 MRVKGILKNYQQWWIWVILGFWMLMICNVVGNQWVTVYYGVPVWREAKATLFCASDAKAY  
HQ625588 MRVMGKWRNCQQWWIWGILGFWIILICN-AEQLWVTVYYGVPVWKEAKTTLFCASDAKAY  
HQ625568 MRVRGTQRNWPQWWIWTSLGFWIILMCR--GNLWVTVYYGVPVWTDAKTTLFCASDAKAY  
HQ625581 MRVMGIPRNWPQWWIWGILGFWIMLMCRVEENSWVTVYYGVPVWKEATTTLFCASDAKAY
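
A file like the one above can be read with a few lines of Python. This is a minimal sketch, assuming the layout shown (a CLUSTAL header line, then blocks of "name sequence" lines, with conservation lines starting with whitespace); `parse_clustal` is a hypothetical helper and says nothing about how Geneious reads the format. The embedded example is shortened from the alignment above and split into two blocks to show how blocks are joined.

```python
def parse_clustal(text: str) -> dict:
    """Return {sequence name: aligned sequence} from Clustal-formatted text."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: missing CLUSTAL header")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines, which start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        # Each block contributes another chunk of the same sequence.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


example = """CLUSTAL W (1.74) multiple sequence alignment

HQ625570 MRVMGMWRNY
HQ625589 MRVKGRSRNY

HQ625570 PQWW--
HQ625589 PQWWVW
"""
alignment = parse_clustal(example)
for name, seq in alignment.items():
    print(name, seq)
```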

CSFASTA format

ABI .csfasta files represent the color calls generated by the SOLiD sequencing system.

CSV/TSV (Comma/Tab Separated Values) and Excel spreadsheet files

Sequences, primers and metadata stored in spreadsheets can be imported into Geneious from .csv, .tsv, .xlsx or .xls files. For files containing sequences (nucleotides, proteins, primers or probes), Geneious will create a new document containing the sequence and any additional fields chosen for import. For more information on importing primers from a spreadsheet, see the PCR Primers section. Files containing only metadata can be imported onto existing sequences in Geneious; see section 3.2.1 for details.

DNAStar sequences

DNAStar .seq and .pro files are used in Lasergene, a sequence analysis tool produced by DNAStar.

DNA Strider sequences

DNA Strider sequence files are generated by the Mac program DNA Strider and contain a single nucleotide or protein sequence.

EMBL/Swiss-Prot sequences

This format is used for nucleotide sequences from the EMBL Nucleotide Sequence Database and protein sequences from UniProt (the Universal Protein Resource).

EndNote 8.0/9.0 XML

EndNote is a popular reference and bibliography manager. EndNote lets you search for journal articles online, import citations, perform searches on your own notes, and insert references into documents. It also generates a bibliography in different styles. Geneious can interoperate with EndNote by importing and exporting files in EndNote's XML (Extensible Markup Language) file format.

FASTA sequences

The FASTA file format is commonly used by many programs and tools, including BLAST, T-Coffee and ClustalX. Each sequence in a FASTA file has a header line beginning with ">", followed by one or more lines containing the raw protein or DNA sequence data. The sequence data may span multiple lines, and the sequences may contain gap characters. An empty line may or may not separate consecutive sequences. Here is an example of three sequences in FASTA format (DNA, protein, aligned DNA):

>Orangutan  
ATGGCTTGTGGTCTGGTCGCCAGCAACCTGAATCTCAAACCTGGAGAGTGCCTTCGAGTG  
 
>gi|532319|pir|TVFV2E|TVFV2E envelope protein  
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT  
QIWQK  
 
>Chicken  
CTACCCCCCTAAAACACTTTGAAGCCTGATCCTCACTA------------------CTGT  
CATCTTAA
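
The rules described above (a ">" header line, sequence data that may span several lines, optional empty lines between records) can be sketched as a small Python reader. `parse_fasta` is a hypothetical helper written for illustration; the input is a shortened version of the example above.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # empty lines between records are optional
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span multiple lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>Orangutan
ATGGCTTGTGGT

>Chicken
CTACCC------
CATCTTAA
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Gap characters are kept as part of the sequence, so an aligned FASTA file yields sequences of equal length.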

FASTQ sequences

FASTQ format stores sequences and their Phred quality scores in a single file. It is typically used to import NGS sequence data, e.g. from Illumina, Ion Torrent, 454 and PacBio sequencers. From Geneious R11 onwards, you can set the read technology and pair reads as part of the FASTQ import process. Note that the native HDF5 file formats from PacBio and Oxford Nanopore are not supported and must be converted to FASTQ for import into Geneious.
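
To show how sequence and quality are combined in one file, here is a minimal Python sketch of a FASTQ reader. It assumes unwrapped four-line records and the Phred+33 quality encoding used by current Illumina software; older files may use a different offset. `parse_fastq` is a hypothetical helper, not a Geneious function.

```python
def parse_fastq(text: str) -> list:
    """Return a list of (name, sequence, phred_scores) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality lengths differ for {header!r}")
        # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
        scores = [ord(char) - 33 for char in qual]
        records.append((header[1:], seq, scores))
    return records


reads = parse_fastq("@read1\nACGT\n+\nII5!\n")
print(reads)
```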

GenBank sequences

Records retrieved from the NCBI website (http://www.ncbi.nlm.nih.gov) can be saved in a number of formats. Records saved in GenBank or INSDSeq XML formats can be imported into Geneious.

Geneious format

The Geneious format can be used to store all your local documents, metadata types and program preferences. A file in Geneious format will usually have a .geneious or .xml extension. This format is useful for sharing documents with other Geneious users and for backing up your Geneious data.

Geneious tutorial

This is an archive containing a whole bundle of files which together comprise a Geneious education document. This format can be used to create assignments for your students, bioinformatics tutorials, and much more. See chapter 18 for information on how to create such files.

GFF annotations

The GFF format contains sequence annotation information (and optional sequences). You can use a GFF file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences. Geneious also supports GFF3 and GTF formats.
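
For orientation, a GFF3 feature line has nine tab-separated columns, with 1-based inclusive coordinates and a final column of `key=value` attributes. The Python sketch below parses one such line; `parse_gff3_line` is a hypothetical helper, and it does not handle GTF, whose attribute column uses a different syntax.

```python
from typing import Optional
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Optional[dict]:
    """Parse one GFF3 feature line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"GFF3 line needs 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 escapes reserved characters with URL-style percent encoding.
            attributes[unquote(key.strip())] = unquote(value.strip())
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attributes,
    }


feature = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=My%20gene")
print(feature)
```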

MEGA alignment

The MEGA format is used by MEGA (Molecular Evolutionary Genetics Analysis).

Molecular structure

Geneious imports a range of molecular structure formats (*.pdb, *.mol, *.xyz, *.cml, *.gpr, *.hin and *.nwo). These formats describe the locations of the atoms in a molecule in 3D.

Newick tree

The Newick format is commonly used to represent phylogenetic trees (such as those inferred from multiple sequence alignments). Newick trees use pairs of parentheses to group related taxa, separated by a comma (,). Some trees include numbers (branch lengths) that indicate the distance on the evolutionary tree from that taxon to its most recent ancestor. If these branch lengths are present, they are prefixed with a colon (:). The Newick format is produced by phylogeny programs such as PHYLIP, PAUP*, Tree-Puzzle and PhyML. Geneious can import and export trees (including bootstrap values and branch lengths) in Newick format.
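
The grouping rules above map naturally onto a recursive parser. The following Python sketch reads a Newick string into nested dictionaries; it assumes unquoted labels and no bracketed comments, and the names `parse_newick` and `leaf_names` are hypothetical. A label on an internal node (here `95`) is often a bootstrap value, but the format itself does not say so; that interpretation is up to the reader of the file.

```python
def parse_newick(text: str) -> dict:
    """Parse a Newick string into nested dicts with name, length and children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick tree must end with ';'")
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Read the label: a taxon name, or a label on an internal node.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if text[pos] == ":":
            pos += 1  # consume ":"; the branch length follows
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node: dict) -> list:
    """Collect the names of all leaves below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)95:0.3,C:0.4);")
print(leaf_names(tree))
```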

Nexus tree

The Nexus format was designed to standardize the exchange of phylogenetic data, including sequences, trees, distance matrices and so on. The format is composed of a number of blocks such as TAXA, TREES and CHARACTERS. Each block contains pre-defined fields. Geneious imports and exports files in Nexus format, and can process the information stored in them for analysis.

If you want to export a tree in a format that preserves bootstrap values, make sure you export with metacomments enabled; otherwise the bootstrap values will be lost.

PDB structure

Protein Data Bank files contain a list of XYZ coordinates that describe the position of atoms in a protein. These are then used to generate a 3D model, which is usually viewed with RasMol or Swiss-PdbViewer. Geneious can read PDB format files and display an interactive 3D view of the protein structure, including support for displaying the protein's secondary structure when the appropriate information is available.
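
The coordinates are stored in fixed-width ATOM and HETATM records, so they are read by column position and not by splitting on whitespace. The Python sketch below extracts the main fields from one record using the standard PDB column layout; `parse_atom_line` is a hypothetical helper, and the example line is built with a format string so that the column widths are explicit.

```python
from typing import Optional


def parse_atom_line(line: str) -> Optional[dict]:
    """Extract atom name, residue, chain and XYZ from a PDB ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "atom": line[12:16].strip(),        # columns 13-16
        "residue": line[17:20].strip(),     # columns 18-20
        "chain": line[21],                  # column 22
        "residue_number": int(line[22:26]), # columns 23-26
        "xyz": (
            float(line[30:38]),             # columns 31-38
            float(line[38:46]),             # columns 39-46
            float(line[46:54]),             # columns 47-54
        ),
    }


# Build an illustrative ATOM record with explicit field widths.
example = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614
)
atom = parse_atom_line(example)
print(atom)
```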

PDF

PDF stands for Portable Document Format, a format developed by Adobe Systems. A PDF file contains the entire description of a document, including text, fonts, graphics, colors, links and images. The advantage of PDF files is that they look the same regardless of the software used to create them. Some word processors can export a document to PDF format. Alternatively, Adobe Acrobat can be used. You can use Geneious to read, store and open PDF files.

ACE/PHRAP assembly

ACE is the format used by the Phrap/Consed package, created by the University of Washington Genome Center. This package is used mainly to assemble sequences.

GCG PileUp alignment

The PileUp format is used by the pileup program, a part of the Genetics Computer Group (GCG) Wisconsin Package.

PIR/NBRF sequences

This format is used by the Protein Information Resource, a database established by the National Biomedical Research Foundation.

Qual quality/Phred scores

A quality file accompanies a sequence file in FASTA format. It must be in the same folder as the sequence file for the quality scores to be used.

RSF rich sequences

RSF (Rich Sequence Format) files contain one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be annotated with descriptive sequence information.

SAM/BAM alignment

The SAM and BAM formats are produced and used by SAMtools. SAM/BAM files contain the results of an assembly in the form of reads and their mappings to reference sequences.
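
As a sketch of how a read's mapping is encoded, the Python below parses one SAM alignment line and derives the reference interval the read covers. It relies on the SAM conventions that POS is 1-based, that flag bit 0x4 marks an unmapped read and 0x10 a reverse-strand read, and that only the CIGAR operations M, D, N, = and X consume reference bases. `reference_span` and `parse_sam_line` are hypothetical helpers; BAM is the binary form of the same data and is not handled here.

```python
import re
from typing import Optional

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by a CIGAR string."""
    return sum(int(count) for count, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def parse_sam_line(line: str) -> Optional[dict]:
    """Parse one SAM alignment line; return None for blank and header lines."""
    if not line.strip() or line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines have at least 11 columns")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    mapped = not (flag & 0x4)
    end = pos + reference_span(cigar) - 1 if mapped and cigar != "*" else None
    return {
        "read": fields[0],
        "reference": fields[2],
        "start": pos,  # 1-based leftmost mapped position
        "end": end,    # 1-based, inclusive
        "reverse": bool(flag & 0x10),
        "mapped": mapped,
    }


hit = parse_sam_line("read1\t16\tchr1\t100\t60\t3S10M2D5M\t*\t0\t0\t*\t*")
print(hit)
```

In the example, the soft-clipped bases (3S) do not count towards the reference span, while the deletion (2D) does, giving 10 + 2 + 5 = 17 reference bases.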

Sequence Chromatograms

Sequence chromatogram documents contain the results of a sequencing run (the trace) and a guess at the sequence data (base calling).

Informally, the trace is a graph showing the concentration of each nucleotide against sequence positions. Base calling software detects peaks in the four traces and assigns the most probable base at more or less even intervals.

SnapGene

Geneious can import annotated DNA sequence files in .dna format and protein sequences in .prot format from SnapGene. Note that for nucleotide sequences longer than 65,536 bases, restriction sites are not imported automatically, but you will be asked if you wish to import them as an enzyme set. You can then re-annotate the sites onto the sequence using “Find restriction sites”.

Text/HTML

Plain text files and simple HTML can be imported and displayed. HTML is a widely used markup language that can apply format and structure to text, and will be interpreted by the sequence viewer. In Geneious R10 and above, text files can also be created and edited in Geneious; see section 4.1.4.

Unformatted sequence

A file containing only a sequence.

VCF variant calls

The VCF format contains sequence annotation information. You can use a VCF file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences.
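
Each VCF data line describes one variant site in at least eight tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with a 1-based POS. The Python sketch below parses those fixed columns; `parse_vcf_line` is a hypothetical helper, and per-sample genotype columns are ignored.

```python
from typing import Optional


def parse_vcf_line(line: str) -> Optional[dict]:
    """Parse the eight fixed columns of a VCF data line; skip headers and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("VCF data lines have at least 8 columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO entries without "=" are flags; store them as True.
            info_map[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position of the first REF base
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "filter": filt,
        "info": info_map,
    }


variant = parse_vcf_line("20\t14370\trs6054257\tG\tA,T\t29\tPASS\tDP=14;DB")
print(variant)
```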

Vector NTI®

In addition to the import of whole VNTI databases (section 3.2.2), Geneious supports the import of several Vector NTI file formats: