Determining the Extreme Rows (aka Extreme Genes)

When constructing the PCAGinzu object, the dataset to be analyzed is the only required argument. One of the most important optional arguments lets users specify how the extreme data points, aka the "extreme genes" (formerly referred to as "outliers" or "outlier rows"), are selected.

The preferred method that we implemented permits the user to specify a likelihood threshold below which a data point (a row, e.g. a gene) is considered extreme. All of the data points form a 1-dimensional distribution of values when projected along a particular principal component's axis. We assume the values are distributed roughly in a Gaussian shape (we've observed this to be true for most gene expression data).

For each data point we can compute a likelihood of membership in that Gaussian distribution. Points having a likelihood less than or equal to the cutoff (at either end of the axis) are identified as "outliers", aka "extreme points", for that principal component. This threshold can be controlled by specifying the optional outlierCutoff argument, e.g.:

p = pcaGinzu.pcaGinzu(cho,outlierCutoff=0.05)

The lower the likelihood threshold, the fewer points will be included in the extreme point set. Likewise, the higher the threshold, the larger the set of extreme points. In some cases there will be NO extreme points at one or both ends of a principal component's axis. In this case you might consider increasing the threshold.
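
The thresholding idea described above can be sketched in a few lines of plain Python. This is only an illustration, not the pcaGinzu implementation: the function names tail_likelihood and extreme_by_cutoff are hypothetical, and the sketch assumes that "likelihood" means the one-sided Gaussian tail probability of a point's projected value, which may differ from the measure pcaGinzu actually computes.

```python
import math
import statistics


def tail_likelihood(value, mean, std):
    """One-sided Gaussian tail probability of lying at least this far from the mean.

    Assumption: this stands in for the "likelihood of membership" in the text.
    """
    z = abs(value - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def extreme_by_cutoff(projections, cutoff=0.001):
    """Return (low, high) index lists of points whose tail likelihood is <= cutoff.

    `projections` holds each row's value along one principal component's axis.
    """
    mean = statistics.mean(projections)
    std = statistics.pstdev(projections)
    low, high = [], []
    if std == 0:
        return low, high  # all values identical: nothing can be extreme
    for i, value in enumerate(projections):
        if tail_likelihood(value, mean, std) <= cutoff:
            # Sort the point into the low or high end of the axis.
            (high if value > mean else low).append(i)
    return low, high


# Hypothetical projected values: a tight cluster plus one far-out point.
example = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4 + [10.0]
low_idx, high_idx = extreme_by_cutoff(example, cutoff=0.001)
```

With a low cutoff only the far-out point qualifies and the low end of the axis stays empty, which is the "no extreme points at one end" situation mentioned above. Raising the cutoff admits more points at both ends.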

An older ad-hoc method we employed permits the user to specify an explicit number (e.g. 10 or 50) of the most extreme points to select at each end of a principal component's axis. This can be specified using the optional nOutliers argument, e.g.:

p = pcaGinzu.pcaGinzu(cho,nOutliers=10)

Only one of the two above optional arguments, outlierCutoff or nOutliers, can be given to the constructor. If neither argument is given the analysis defaults to outlierCutoff=0.001.
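
For comparison, the fixed-count selection can be sketched the same way. Again this is a hypothetical illustration rather than pcaGinzu's own code: extreme_by_count is an invented name, and capping the count at half the number of rows (so the two ends never overlap) is an assumption of this sketch.

```python
def extreme_by_count(projections, n_outliers=10):
    """Return (low, high) index lists: the n lowest and n highest projected values."""
    # Row indices ordered from the lowest to the highest projected value.
    order = sorted(range(len(projections)), key=lambda i: projections[i])
    # Assumption: never take more than half the rows at either end.
    n = min(n_outliers, len(order) // 2)
    low = order[:n]
    high = order[len(order) - n:]
    return low, high


# Hypothetical projected values along one principal component.
example = [3.0, -2.0, 0.5, 7.0, -4.0, 1.0]
low_idx, high_idx = extreme_by_count(example, n_outliers=2)
```

Because the count is fixed, both ends always receive the same number of points regardless of how the values are distributed.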

To better understand the two selection approaches, we can graphically compare the extreme point sets resulting from each. Figure 1 shows a scatter plot of the data points in the PC1 vs. PC2 space, with the extreme points selected using outlierCutoff=0.05 highlighted. Figure 2 shows the same scatter plot, but with the extreme points selected using nOutliers=10 highlighted. In cases where the distribution along a principal component is slightly skewed, the outlierCutoff approach may identify more extreme points at one end of the distribution than at the other. The nOutliers approach will always generate an equal number of high and low extreme points.

**Figure 1:** Cho PC2 Extreme Genes at outlierCutoff=0.05
$\includegraphics[width=6in]{pca-images/cho-pc1vspc2-oc=0.05.eps}$

**Figure 2:** Cho PC2 Extreme Genes at nOutliers=10
$\includegraphics[width=6in]{pca-images/cho-pc1vspc2-no=10.eps}$

Determining how many extreme points to use, and thus the appropriate settings of these parameters for a particular data set, is somewhat heuristic. In gene expression analysis, identifying fewer extreme genes allows you to obtain small but well-refined gene sets. Recall that these extreme genes are used to identify groups of conditions (e.g. tissues or samples) in which the genes are differentially expressed. If you have a small number of genes, e.g. approximately 20 high and 20 low extreme genes, you will more precisely focus your search on conditions in which only those genes are differentially expressed. If you have a large number of genes, e.g. approximately 200 high and 200 low extreme genes, you get a broader, more inclusive set of genes, and consequently will identify only those conditions in which the larger sets of 200 high vs. 200 low extreme genes are differentially expressed. It is plausible that two analyses, one producing fewer extreme points and one producing more, can each provide informative results that emphasize different phenomena.

Joe Roden 2005-12-13