Principal Component Analysis for DESeq2 results

11.2.6 Principal Component Analysis for DESeq2 results

Principal component analysis (PCA) can be used to visualize variation between expression analysis samples. This method is especially useful for quality control, for example in identifying flaws in your experimental design, mislabeled samples, or other issues.

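When you perform a PCA, the normalized differences in expression patterns are used to compute a distance matrix. The X- and Y-axes in a PCA plot correspond to a mathematical transformation of these distances so that the data can be displayed in two dimensions: the X-axis is the first principal component (PC1), which captures the largest share of the variance, and the Y-axis is the second principal component (PC2). Because each axis combines information from many genes rather than representing a single measurement, PCA plots can be challenging to interpret from a biological perspective.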
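As a rough illustration of the transformation described above, the sketch below computes principal component scores with NumPy for a small, made-up matrix of normalized expression values. It is not the code used by the PCA Plot viewer or by DESeq2; the function names (`pca_scores`, `pairwise_distances`) and the toy values are assumptions chosen for the example.

```python
import numpy as np


def pca_scores(expr, n_components=2):
    """Return (scores, explained_variance_ratio) for a samples x genes matrix."""
    expr = np.asarray(expr, dtype=float)
    # Centre each gene (column) so that PCA describes variation between samples.
    centered = expr - expr.mean(axis=0)
    # Singular value decomposition of the centred matrix.
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                        # coordinates of each sample on each PC
    explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per component
    return scores[:, :n_components], explained[:n_components]


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical toy data: 6 samples (rows) x 4 genes (columns),
# three samples from each of two conditions.
toy_expr = np.array([
    [10.0, 5.0, 1.0, 2.0],
    [10.5, 5.2, 1.1, 2.1],
    [9.8, 4.9, 0.9, 1.9],
    [2.0, 5.1, 8.0, 2.0],
    [2.2, 5.0, 8.3, 2.2],
    [1.9, 4.8, 7.9, 1.8],
])

scores, explained = pca_scores(toy_expr)
print(scores.round(2))     # X (PC1) and Y (PC2) coordinates for each sample
print(explained.round(3))  # share of variance explained by PC1 and PC2
```

If all components are kept instead of just two, the Euclidean distances between samples in the PCA coordinates are the same as in the original data; the two-dimensional plot is the view of those distances that preserves as much variance as possible.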

Creating a PCA Plot

A PCA plot will automatically be generated when you compare expression levels using DESeq2. This plot will be available to view in the PCA Plot viewer (Figure 11.1) once you have saved the newly generated differential expression sequence track to your document. If you have multiple differential expression tracks from running DESeq2 more than once, you will have the option to select which track you’d like to show in the PCA Plot viewer.

Figure 11.1: PCA plot viewer for RNA-Seq data from Vibrio fischeri ES114 collected under two conditions with three samples per condition (Thompson et al., Env Microbiol 2017). This plot shows that samples cluster with other samples grown in the same type of medium, and the first component explains most of the variance.

Interpreting PCA Plots

PCA is used primarily as a quality control or exploratory tool. In general, if your samples were produced under two experimental conditions (e.g. treated vs. untreated), the PCA plot should show that a) samples subjected to the same condition cluster together, and b) the clusters are reasonably well separated along the X-axis (“PC1”: the first principal component).
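The expectation above can also be checked numerically. The following is a minimal sketch, assuming you have the PC1 coordinate of each sample available as a list of numbers; the function name `conditions_separate_on_pc1` and the coordinate values are hypothetical and are not produced by the PCA Plot viewer.

```python
import numpy as np


def conditions_separate_on_pc1(pc1_scores, conditions):
    """Return True if two conditions occupy non-overlapping ranges on PC1."""
    pc1 = np.asarray(pc1_scores, dtype=float)
    conditions = np.asarray(conditions)
    groups = np.unique(conditions)
    if len(groups) != 2:
        raise ValueError("this sketch expects exactly two conditions")
    a = pc1[conditions == groups[0]]
    b = pc1[conditions == groups[1]]
    # Separated if every sample of one condition lies to one side of the other.
    return bool(a.max() < b.min() or b.max() < a.min())


# Hypothetical PC1 coordinates for three treated and three untreated samples.
conditions = ["treated", "treated", "treated",
              "untreated", "untreated", "untreated"]
pc1_clean = [-4.1, -3.8, -4.3, 3.9, 4.2, 4.1]
pc1_mixed = [-4.1, 3.8, -4.3, 3.9, -4.2, 4.1]   # one sample per group on the wrong side

print(conditions_separate_on_pc1(pc1_clean, conditions))  # True
print(conditions_separate_on_pc1(pc1_mixed, conditions))  # False
```

A result of False does not identify the cause; it only indicates that the plot deserves a closer look.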

An Example of Using PCA As An Exploratory Tool

The plot in Figure 11.2 shows data from a fictitious bacterial strain that could potentially be useful for bioremediation, cultured in the presence or absence of a halogenated industrial solvent (“Halogen”). The halogen is toxic to the two mutant strains, but not to the wild type. In this case, samples were compared according to the presence (blue) or absence (orange) of the halogen in the culture medium. The mutants contain a deletion in a transcriptional element thought to affect metabolism of the halogen, so the expected result is that expression levels in mutants would be similar to those of wild-type samples grown in the absence of the halogen.

Figure 11.2: PCA plot of expression data from wild-type and mutant strains grown in the presence and absence of a halogenated solvent.

On inspection of the PCA plot in Figure 11.2, two things are apparent:

1. The variance explained by the first principal component (X-axis) is not consistent with the expected result of this experiment, which is a strong indication that further investigation is required. Some possible explanations for this result are:
   - Two of the wild-type samples may have been mislabeled: the sample labeled “wt -Halogen sample 1” was actually grown in the presence of the halogen, and “wt +Halogen sample 2” was actually grown without it.
   - There could be some other explanation that should be investigated (e.g. contaminated medium, a malfunctioning incubator, etc.).
   - These samples might be outliers that you choose to discard.
2. The second principal component (Y-axis) explains quite a lot of variance, particularly the variance between the two mutant strains. While this result might be expected, it could also be worth exploring further.
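One simple way to screen for the kind of mislabeling suggested above is to flag any sample that lies closer to the centre of the other condition than to the centre of its own. The sketch below illustrates the idea; the function name `flag_possible_mislabels` and the coordinates are invented for this example and are not taken from Figure 11.2.

```python
import numpy as np


def flag_possible_mislabels(names, coords, conditions):
    """Return names of samples nearer to another condition's centroid."""
    coords = np.asarray(coords, dtype=float)
    conditions = np.asarray(conditions)
    # Mean PCA position of the samples labelled with each condition.
    centroids = {
        str(c): coords[conditions == c].mean(axis=0)
        for c in np.unique(conditions)
    }
    flagged = []
    for name, point, own in zip(names, coords, conditions):
        own_dist = np.linalg.norm(point - centroids[str(own)])
        other_dist = min(
            np.linalg.norm(point - centre)
            for label, centre in centroids.items()
            if label != str(own)
        )
        if other_dist < own_dist:
            flagged.append(name)
    return flagged


# Hypothetical (PC1, PC2) coordinates for six wild-type samples.
names = [
    "wt -Halogen sample 1", "wt -Halogen sample 2", "wt -Halogen sample 3",
    "wt +Halogen sample 1", "wt +Halogen sample 2", "wt +Halogen sample 3",
]
coords = [
    (3.9, 0.2), (-4.0, 0.1), (-4.2, -0.3),
    (4.1, 0.0), (-3.8, 0.3), (4.3, -0.2),
]
conditions = ["-Halogen"] * 3 + ["+Halogen"] * 3

flagged = flag_possible_mislabels(names, coords, conditions)
print(flagged)  # ['wt -Halogen sample 1', 'wt +Halogen sample 2']
```

A flagged sample is only a candidate for follow-up. Checking lab records or re-running the sample is still needed to tell a labeling error apart from contamination or a genuine outlier.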
