To better understand how robust our PCA-based methodology is to variations in a dataset, we compared the results of our PCA analysis of the published diabetes dataset (10983 genes vs. 17 NGT + 18 DM2 sample columns) to the results of analyzing modified versions of the dataset (both noise-added and column-deleted variants). From these experiments and our prior experience, we believe that our methodology is robust to reasonable amounts of noise. We expect this conclusion to hold across datasets, but to what extent (i.e., how many PCs will be robustly recovered) is not clear. The conclusions from this analysis should be interpreted with some reserve, because it is not known what level of noise is biologically relevant. Different datasets have different inherent noise, depending on many factors, including how many times the measurements have been replicated, the type and amount of tissue the samples come from, the experimenters themselves, the microarray platform, and even the pre-processing steps preceding the application of our PCA methods. One should not assume that these results can be immediately transferred to other datasets and broadly generalized. However, by performing analyses similar to those shown here, it should be possible to assess the robustness of these methods on any specific dataset.

We compared the results of analysis of the published dataset to
results of analysis of the dataset after noise was added. Gaussian
noise (mean zero, sigma equal to percent noise times mean expression level)
was added to the log base 2 expression data. This experiment was
repeated 5 times at each of 8 different noise levels, and at 3
different settings of *extremeThresh* used to select the
principal component extreme gene (PCEG) sets. PCA interpretation
results are compared by forming the PCEG confusion matrix showing
correspondence of every principal component from the original data
vs. every principal component of the noise-added data. Each cell of
the confusion matrix represents the fraction of genes that intersect
between the corresponding PCEG sets (the count of genes in the
intersection of the two PCEG sets divided by the count of genes in
their union). A single "PCEG overlap" score is calculated as the
average of all row maxima and all column maxima. If two analyses
correspond well, the PCEG
confusion matrix will have a strong diagonal component with diagonal
cell values near 1, and thus will get a PCEG overlap score near 1.
Scores greater than zero indicate that there is partial overlap,
possibly off-diagonal, between genes in corresponding PCEG sets; thus
some fraction of the principal component gene sets is recovered in
spite of the additional noise.
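
The noise step and the overlap score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis. All function names are hypothetical. The sketch assumes that sigma is computed from each gene's own mean expression level (the text does not say whether the mean is per gene or global), and that the row and column maxima are pooled before averaging. How the PCEG sets are selected with *extremeThresh* is not covered; the sets are taken as given.

```python
import random
from statistics import mean


def add_gaussian_noise(log2_expr, pct_noise, rng):
    """Return a noisy copy of a genes x samples table of log2 values.

    Assumption: sigma = pct_noise * |mean expression of that gene (row)|.
    """
    noisy = []
    for row in log2_expr:
        sigma = pct_noise * abs(mean(row))
        noisy.append([x + rng.gauss(0.0, sigma) for x in row])
    return noisy


def pceg_confusion_matrix(sets_a, sets_b):
    """Cell (i, j) = |A_i & B_j| / |A_i | B_j|, or 0.0 if both sets are empty."""
    matrix = []
    for a in sets_a:
        row = []
        for b in sets_b:
            union = len(a | b)
            row.append(len(a & b) / union if union else 0.0)
        matrix.append(row)
    return matrix


def pceg_overlap_score(matrix):
    """Average of all row maxima and all column maxima (pooled; an assumption)."""
    row_max = [max(row) for row in matrix]
    col_max = [max(col) for col in zip(*matrix)]
    return mean(row_max + col_max)


# Toy example with made-up gene identifiers (not real data).
original_sets = [{"g1", "g2", "g3"}, {"g4", "g5"}]
noisy_sets = [{"g1", "g2", "g3"}, {"g4", "g6"}]
confusion = pceg_confusion_matrix(original_sets, noisy_sets)
score = pceg_overlap_score(confusion)

# Noise at a 5% level on a tiny made-up table, with a fixed seed.
noisy_table = add_gaussian_noise([[8.0, 8.5, 7.5], [3.0, 3.2, 2.8]],
                                 0.05, random.Random(0))
```

In the toy example the first pair of sets matches exactly (cell value 1) and the second pair shares one gene out of three (cell value 1/3), so the overlap score is 2/3.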

PCEG confusion matrices for *extremeThresh*=0.01.

PCEG confusion matrices for *extremeThresh*=0.001.

PCEG confusion matrices for *extremeThresh*=0.00001.
