3.2 Data input formats


Format	Extensions	Data types	Common sources

BED	*.bed	Annotations	UCSC
Common Assembly Format	*.caf	Contigs	Sequencher
Clone Manager molecule	*.cm5	Sequences and annotations	Clone Manager
Clustal	*.aln	Alignments	ClustalX
CSFASTA	*.csfasta	Color space FASTA	ABI SOLiD
DNAStar	*.seq, *.pro	Nucleotide & protein sequences	DNAStar
DNA Strider	*.str	Sequences	DNA Strider (Mac program), ApE
EMBL/UniProt	*.embl, *.swp	Sequences	EMBL, UniProt
EMBOSS codon usage table	*.cusp, *.cut	Codon usage table	EMBOSS cusp tool
EndNote (8.0 or 9.0) XML	*.xml	Journal article references	EndNote, journal article websites
FASTA	*.fasta, *.fas, *.fasta.gz etc.	Sequences, alignments	PAUP*, ClustalX, BLAST, FASTA
FASTQ	*.fastq, *.fq, *.fastq.gz etc.	Sequences with quality	Illumina and other NGS sequencers
GCG	*.seq	Sequences	GCG
GCG codon usage table	*.cod	Codon usage table	GCG CodonFrequency tool, https://www.kazusa.or.jp/codon/
GenBank	*.gb, *.xml	Nucleotide & protein sequences	GenBank
Geneious	*.xml, *.geneious	Preferences, databases	Geneious
Geneious Education	*.tutorial.zip	Tutorial, assignment etc.	Geneious
GFF, GFF3, GTF	*.gff, *.gff3, *.gtf	Annotations	NCBI, Ensembl and other genome browsers
MEGA	*.meg	Alignments	MEGA
Molecular structure	*.pdb, *.mol, *.xyz, *.cml, *.gpr, *.hin, *.nwo	3D molecular structures	3D structure databases and programs
Newick	*.tre, *.tree etc.	Phylogenetic trees	PHYLIP, Tree-Puzzle, PAUP*, ClustalX
Nexus	*.nxs, *.nex	Trees, alignments	PAUP*, Mesquite, MrBayes & MacClade
PDB	*.pdb	3D protein structures	SP3, SP2, SPARKS, Protein Data Bank
PDF	*.pdf	Documents, presentations	Adobe Acrobat, LaTeX, MiKTeX
Phrap ACE	*.ace	Contig assemblies	Phrap/Consed
PileUp	*.msf	Alignments	PileUp (GCG)
PIR/NBRF	*.pir	Sequences, alignments	NBRF PIR
Qual	*.qual	Quality file	Associated with a FASTA file
Raw sequence text	*.seq	Sequences	Any file that contains only a sequence
Rich Sequence Format	*.rsf	Sequences, alignments	GCG's NetFetch
Comma/Tab Separated Values	*.csv, *.tsv	Spreadsheet files	Microsoft Excel
SAM/BAM	*.sam, *.bam	Contigs	SAMtools
Sequence Chromatograms	*.ab1, *.scf	Raw sequencing trace & sequence	Sequencing machines
SnapGene sequence	*.dna	Sequences and annotations	SnapGene
Text/html	*.txt, *.rtf, *.html	Any text	Simple text editors
VCF	*.vcf	Annotations	1000 Genomes Project
Vector NTI sequence	*.gb, *.gp	Nucleotide & protein sequences	Vector NTI
Vector NTI/AlignX alignment	*.apr	Alignments	Vector NTI, AlignX
Vector NTI Archive	*.ma4, *.pa4, *.oa4, *.ea4, *.ca6	Nucleotide & protein sequences, enzyme sets and publications	Vector NTI
Vector NTI/ContigExpress	*.cep	Nucleotide sequence assemblies	Vector NTI
Vector NTI database	VNTI Database	Nucleotide & protein sequences, enzyme sets and publications	Vector NTI

BED annotations

The BED format contains sequence annotation information. You can use a BED file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences.

Clone Manager

Geneious can import annotated sequence files in the standard Clone Manager molecule format (.cm5). The import includes the name, description, topology, sequence and annotations. It does not currently import other fields, restriction cut sites or primer binding sites.

Other Clone Manager formats such as .cx5 and .pd4 are not currently supported for import.

CLUSTAL alignment

The Clustal format is used by ClustalW and ClustalX, two well-known multiple sequence alignment programs.

Clustal format files store multiple sequence alignments and begin with the word CLUSTAL. An example Clustal file:

CLUSTAL W (1.74) multiple sequence alignment

seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7 -------------------------------------------------KELWEALTCSR

seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
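
As a rough illustration of how this interleaved layout can be read outside Geneious, the following Python sketch joins the blocks belonging to each sequence name. The function name parse_clustal is hypothetical, and the sketch assumes that sequence names contain no spaces and that any conservation lines (which some Clustal files include) start with whitespace.

```python
def parse_clustal(text):
    """Join interleaved Clustal blocks into one aligned string per name."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: first line must start with CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (assumed to start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, residues = parts[0], parts[1]
        # Each block contributes the next stretch of the same aligned sequence.
        alignment[name] = alignment.get(name, "") + residues
    return alignment


# Shortened, made-up input in the same layout as the example above.
example = """CLUSTAL W (1.74) multiple sequence alignment

seq1 --KSKE
seq2 YEGLTT

seq1 GGGKYF
seq2 P-GDAS
"""

for name, aligned in parse_clustal(example).items():
    print(name, aligned)
```

Every aligned string returned for one file should have the same length; a mismatch usually means a block was truncated.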

CSFASTA format

ABI .csfasta files represent the color calls generated by the SOLiD sequencing system.

DNAStar sequences

DNAStar .seq and .pro files are used in Lasergene, a sequence analysis tool produced by DNAStar.

DNA Strider sequences

Sequence files generated by the Mac program DNA Strider. Each file contains one nucleotide or protein sequence.

EMBL/Swiss-Prot sequences

Nucleotide sequences from the EMBL Nucleotide Sequence Database, and protein sequences from UniProt (the Universal Protein Resource).

EndNote 8.0/9.0 XML

EndNote is a popular reference and bibliography manager. EndNote lets you search for journal articles online, import citations, perform searches on your own notes, and insert references into documents. It also generates a bibliography in different styles. Geneious can interoperate with EndNote by exporting and importing files in EndNote's XML (Extensible Markup Language) file format.

FASTA sequences

The FASTA file format is commonly used by many programs and tools, including BLAST, T-Coffee and ClustalX. Each sequence in a FASTA file has a header line beginning with a ">" followed by a number of lines containing the raw protein or DNA sequence data. The sequence data may span multiple lines and may contain gap characters. An empty line may or may not separate consecutive sequences. Here is an example of three sequences in FASTA format (DNA, Protein, Aligned DNA):
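
The layout described above can be sketched in Python as follows. The function name parse_fasta and the three records are made up for illustration; they are not taken from Geneious. The sketch treats everything after ">" as the name, joins multi-line sequence data, and tolerates optional blank lines between records.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are optional
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Hypothetical records: DNA, protein, and aligned DNA with gap characters.
example = """>dna_example
ACGTAC
GTTA

>protein_example
MKVLAA
>aligned_dna_example
AC--GT
"""

for name, sequence in parse_fasta(example):
    print(name, sequence)
```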

FASTQ sequences

FASTQ format stores sequences and Phred qualities in a single file. This format is typically used to import NGS sequence data, e.g. from Illumina, Ion Torrent, 454 and PacBio sequencers. From R11 onwards, you can set read technology and pair reads as part of the FASTQ import process. Note that the native HDF5 file format from PacBio and Oxford Nanopore is not supported and must be converted to FASTQ for import into Geneious.
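
To show how sequences and Phred qualities sit together in one file, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding (quality = ASCII code minus 33); older files may use a different offset. The function name parse_fastq is hypothetical.

```python
def parse_fastq(text, offset=33):
    """Return (name, sequence, qualities) tuples from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header)
        # Assumption: Phred+33 encoding, so "!" is quality 0 and "I" is quality 40.
        scores = [ord(ch) - offset for ch in quality]
        records.append((header[1:], sequence, scores))
    return records


example = "@read1\nACGT\n+\nII5!\n"
print(parse_fastq(example))  # [('read1', 'ACGT', [40, 40, 20, 0])]
```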

GenBank sequences

Records retrieved from the NCBI website (http://www.ncbi.nlm.nih.gov) can be saved in a number of formats. Records saved in GenBank or INSDSeq XML formats can be imported into Geneious.

Geneious format

The Geneious format can be used to store all your local documents, meta-data types and program preferences. A file in Geneious format will usually have a .geneious extension or a .xml extension. This format is useful for sharing documents with other Geneious users and backing up your Geneious data.

Geneious tutorial

This is an archive containing a whole bundle of files which together comprise a Geneious education document. This format can be used to create assignments for your students, bioinformatics tutorials, and much more. See chapter 17 for information on how to create such files.

GFF annotations

The GFF format contains sequence annotation information (and optional sequences). You can use a GFF file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences. Geneious also supports GFF3 and GTF formats.

MEGA alignment

Molecular structure

Geneious imports a range of molecular structure formats. These formats support showing the locations of the atoms in a molecule in 3D:

Newick tree

The Newick format is commonly used to represent phylogenetic trees (such as those inferred from multiple sequence alignments). Newick trees use pairs of parentheses to group related taxa, which are separated by commas (,). Some trees include numbers (branch lengths) that indicate the distance on the evolutionary tree from that taxon to its most recent ancestor. If these branch lengths are present, they are prefixed with a colon (:). The Newick format is produced by phylogeny programs such as PHYLIP, PAUP*, Tree-Puzzle and PhyML. Geneious can import and export trees (including bootstrap values and branch lengths) in Newick format.
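
The parentheses, commas and colons described above can be read with a small recursive parser. The Python sketch below is illustrative only: the names parse_newick and leaf_names are hypothetical, and it assumes unquoted labels and no comments. A label on an internal node (here "90") is how some programs write a bootstrap value.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length and children."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "(" and read the comma-separated group
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("expected ',' or ')' at position %d" % pos)
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":  # a colon introduces the branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text at position %d" % pos)
    return root


def leaf_names(node):
    """Collect the names of all tips below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)90:0.3,C:0.4);")
print(leaf_names(tree))  # ['A', 'B', 'C']
```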

Nexus tree

The Nexus format was designed to standardize the exchange of phylogenetic data, including sequences, trees, distance matrices and so on. The format is composed of a number of blocks such as TAXA, TREES and CHARACTERS. Each block contains pre-defined fields. Geneious imports and exports files in Nexus format, and can process the information stored in them for analysis.
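
As a small illustration of the block structure, the Python sketch below lists the blocks found in a Nexus file. It assumes the usual "#NEXUS" first line and "BEGIN name;" block openers, and it ignores bracketed comments; the function name list_nexus_blocks and the sample file are made up for this example.

```python
import re


def list_nexus_blocks(text):
    """Return the names of the blocks in a Nexus file, in order of appearance."""
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("a Nexus file is expected to start with #NEXUS")
    pattern = re.compile(r"^\s*begin\s+(\w+)\s*;", re.IGNORECASE | re.MULTILINE)
    return [match.group(1).upper() for match in pattern.finditer(text)]


example = """#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=2;
  TAXLABELS A B;
END;
BEGIN TREES;
  TREE t1 = (A:0.1,B:0.2);
END;
"""

print(list_nexus_blocks(example))  # ['TAXA', 'TREES']
```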

If you want to export a tree in a format that preserves bootstrap values, make sure you export with metacomments enabled; otherwise the bootstraps will be lost.

PDB structure

Protein Data Bank (PDB) files contain a list of XYZ coordinates that describe the position of atoms in a protein. These are then used to generate a 3D model, which is usually viewed with RasMol or SPDB Viewer. Geneious can read PDB format files and display an interactive 3D view of the protein structure, including support for displaying the protein's secondary structure when the appropriate information is available.
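
The coordinates live in fixed-width ATOM and HETATM records. The following Python sketch reads one such record using the standard PDB column positions (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and X, Y, Z in 31-38, 39-46 and 47-54). The function name parse_atom_line and the sample atom are hypothetical.

```python
def parse_atom_line(line):
    """Extract the atom name, residue, chain and XYZ coordinates from one record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None  # other record types (HEADER, HELIX, ...) carry no coordinates
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Build a made-up record with format widths so the columns line up exactly.
example = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614
)

print(parse_atom_line(example))
```

Because the format is column-based rather than whitespace-separated, slicing by position is safer than calling split(), which breaks when adjacent fields touch.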

PDF

PDF stands for Portable Document Format and is developed and distributed by Adobe Systems. It contains the entire description of a document including text, fonts, graphics, colors, links and images. The advantage of PDF files is that they look the same regardless of the software used to create them. Some word processors are able to export a document into PDF format. Alternatively, Adobe Acrobat can be used. You can use Geneious to read, store and open PDF files.

ACE/PHRAP assembly

Ace is the format used by the Phrap/Consed package, created by the University of Washington Genome Center. This package is used mainly to assemble sequences.

GCG PileUp alignment

The PileUp format is used by the pileup program, a part of the Genetics Computer Group (GCG) Wisconsin Package.

PIR/NBRF sequences

This format is used by the Protein Information Resource, a database established by the National Biomedical Research Foundation.

Qual quality/Phred scores

A quality file that accompanies a sequence file in FASTA format. It must be in the same folder as the sequence file for the quality scores to be used.
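
For orientation, a Qual file is laid out like a FASTA file, except that each record holds space-separated integer quality scores instead of residues. The Python sketch below reads such a file into a dictionary; the function name parse_qual is hypothetical, and the sketch assumes that the first word of each header is the sequence name used in the matching FASTA file.

```python
def parse_qual(text):
    """Return a dict mapping each sequence name to its list of quality scores."""
    scores = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the first word of the header identifies the sequence.
            name = fields[0] if fields else ""
            scores[name] = []
        elif name is not None:
            scores[name].extend(int(value) for value in line.split())
    return scores


example = ">read1 optional description\n40 40 20\n0\n"
print(parse_qual(example))  # {'read1': [40, 40, 20, 0]}
```

A useful sanity check is that each score list has the same length as the sequence of the same name in the FASTA file.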

Unformatted sequence

RSF rich sequences

RSF (Rich Sequence Format) files contain one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be annotated with descriptive sequence information.

CSV/TSV (Comma/Tab Separated Values) sequences

Sequences such as primer lists are often stored in spreadsheets. Geneious has an importer that, once told which fields the columns of a CSV or TSV file contain, imports the rows and converts them to documents while preserving the additional field contents. It can handle nucleotide and protein sequences, as well as primers and probes. For more information on importing primers from a spreadsheet, see the PCR Primers section.
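
The idea of turning spreadsheet rows into sequence documents while keeping the extra columns can be sketched in Python with the standard csv module. This is not how Geneious implements its importer; the column names "Name" and "Sequence", the extra "Tm" column and the function name read_sequence_table are all assumptions made for the example.

```python
import csv
import io


def read_sequence_table(text, delimiter=","):
    """Convert each spreadsheet row into a small sequence record."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    documents = []
    for row in reader:
        row = dict(row)
        # Assumed column names; a real file may label these differently.
        name = row.pop("Name")
        sequence = row.pop("Sequence").replace(" ", "").upper()
        # Whatever columns remain are kept as additional fields.
        documents.append({"name": name, "sequence": sequence, "fields": row})
    return documents


example = "Name,Sequence,Tm\nfwd1,acgt acgt,58.2\n"
print(read_sequence_table(example))
```

For a TSV file, pass delimiter="\t" instead of the default comma.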

SAM/BAM alignment

The SAM and BAM formats are produced and used by SAMtools. SAM/BAM files contain the results of an assembly in the form of reads and their mappings to reference sequences.
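
To make "reads and their mappings" concrete, the Python sketch below pulls the main mapping fields out of one line of a text SAM file. It relies on the eleven mandatory tab-separated SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and on two FLAG bits: 0x4 for an unmapped read and 0x10 for the reverse strand. The function name parse_sam_line and the sample read are hypothetical; BAM is the compressed binary form and cannot be read this way.

```python
def parse_sam_line(line):
    """Return the mapping of one read, or None for a header line."""
    if line.startswith("@"):
        return None  # header lines describe references and programs
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "sequence": fields[9],
        "mapped": not flag & 0x4,
        "reverse_strand": bool(flag & 0x10),
    }


# Made-up alignment line, joined with tabs so the separators are explicit.
example = "\t".join(
    ["read1", "16", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"]
)
print(parse_sam_line(example))
```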

Chromatograms

Sequence chromatogram documents contain the results of a sequencing run (the trace) and a guess at the sequence data (base calling).

Informally, the trace is a graph showing the concentration of each nucleotide against sequence positions. Base calling software detects peaks in the four traces and assigns the most probable base at more or less even intervals.
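
The peak-picking idea can be shown with a deliberately simplified Python sketch. It is a toy, not the algorithm used by any real base caller: it takes the strongest of the four channels at each sample point and reports a base wherever that signal forms a local maximum. The function name call_bases and the trace values are invented for the example.

```python
def call_bases(traces):
    """Return (sample_index, base) pairs at local maxima of the strongest channel."""
    bases = "ACGT"
    length = len(traces["A"])
    # For every sample point, find which channel is strongest and how high it is.
    best = [max(bases, key=lambda b: traces[b][i]) for i in range(length)]
    height = [traces[best[i]][i] for i in range(length)]
    calls = []
    for i in range(1, length - 1):
        # A peak rises above its left neighbour and is not below its right one.
        if height[i] > height[i - 1] and height[i] >= height[i + 1]:
            calls.append((i, best[i]))
    return calls


# Invented intensities with three clean peaks: A, then C, then G.
traces = {
    "A": [0, 5, 9, 5, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 1, 4, 8, 4, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 0, 0, 1, 5, 9, 5, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

calls = call_bases(traces)
print("".join(base for _, base in calls))  # ACG
```

Real traces are noisy and peaks overlap, which is why production base callers also estimate a quality score for each call.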

SnapGene

Geneious can import annotated DNA sequence files in .dna format from SnapGene. Note that for sequences longer than 65,536 bases, restriction sites are not imported automatically, but you will be asked if you wish to import them as an enzyme set. You can then re-annotate the sites onto the sequence using "Find restriction sites".

Text/HTML

Plain text files and simple HTML can be imported and displayed. HTML is a widely used markup language that can apply format and structure to text, and will be interpreted by the sequence viewer. In Geneious R10 and above, text files can also be created and edited in Geneious; see section 4.1.4.

VCF variant calls

The VCF format contains sequence annotation information. You can use a VCF file to annotate existing sequences in your local database, import entirely new sequences, or import the annotations onto blank sequences.

Vector NTI®