Investigating the stability of principal components analysis of the diabetes dataset

To better understand how stable our PCA-based methodology is under variations in a dataset, we compared the results of our PCA analysis of the published diabetes dataset (10983 genes vs. 17 NGT + 18 DM2 sample columns) with the results of analyzing modified versions of the dataset (variants with noise added and variants with columns deleted). From these experiments and our prior experience, we believe that our methodology is robust to reasonable amounts of noise. We expect this conclusion to hold across datasets, but to what extent (i.e., how many PCs will be robustly recovered) is not clear. The conclusions from this analysis should be interpreted with some reserve, because it is not known what level of noise is biologically relevant. Different datasets have different inherent noise depending on many factors, including how many times the measurements have been replicated, the type and amount of tissue the samples come from, the experimenters themselves, the microarray platform, and even the pre-processing steps preceding the application of our PCA methods. One should not assume that these results can be transferred directly to other datasets or broadly generalized. However, by performing analyses similar to those shown here, it should be possible to assess the robustness of these methods on any specific dataset.

Experiments with noise added

We compared the results of analyzing the published dataset with the results of analyzing the same dataset after noise was added. Gaussian noise (mean zero, with sigma equal to the percent noise times the mean expression level) was added to the log base 2 expression data. This experiment was repeated 5 times at each of 8 different noise levels, and at 3 different settings of extremeThresh, the threshold used to select the principal component extreme gene (PCEG) sets. PCA interpretation results are compared by forming the PCEG confusion matrix, which shows the correspondence of every principal component from the original data vs. every principal component of the noise-added data. Each cell of the confusion matrix represents the fraction of genes shared between the corresponding PCEG sets (the count of genes in the intersection of the two sets divided by the count of genes in their union). A single "PCEG overlap" score is calculated as the average of all row maximums and all column maximums. If two analyses correspond well, the PCEG confusion matrix will have a strong diagonal component with diagonal cell values near 1, and thus will get a PCEG overlap score near 1. Scores greater than zero indicate partial overlap, possibly off-diagonal, between genes in corresponding PCEG sets; thus some fraction of the principal component gene sets is recovered in spite of the additional noise.
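
The noise model and the PCEG overlap score described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the published analysis. The function names (add_noise, pceg_confusion, pceg_overlap) are hypothetical. Two points are assumptions: sigma is taken as the noise fraction times each gene's own mean log2 expression (the text does not say whether the mean is per gene or global), and the score pools all row and column maximums into one average (for a square matrix this equals averaging the two means separately).

```python
import random

def add_noise(log2_expr, noise_frac, rng):
    """Add zero-mean Gaussian noise to a genes x samples matrix (list of rows).

    Assumption: sigma = noise_frac * |mean of that gene's row|.
    """
    noisy = []
    for row in log2_expr:
        sigma = noise_frac * abs(sum(row) / len(row))
        noisy.append([x + rng.gauss(0.0, sigma) for x in row])
    return noisy

def pceg_confusion(sets_a, sets_b):
    """Cell [i][j] = |A_i & B_j| / |A_i | B_j| (0.0 if both sets are empty)."""
    matrix = []
    for a in sets_a:
        row = []
        for b in sets_b:
            union = len(a | b)
            row.append(len(a & b) / union if union else 0.0)
        matrix.append(row)
    return matrix

def pceg_overlap(matrix):
    """Average of all row maximums and all column maximums (pooled)."""
    row_max = [max(r) for r in matrix]
    col_max = [max(c) for c in zip(*matrix)]
    maxima = row_max + col_max
    return sum(maxima) / len(maxima)

# Toy PCEG sets for two principal components (gene names are made up).
original = [{"g1", "g2", "g3"}, {"g4", "g5"}]
noisy = [{"g1", "g2"}, {"g4", "g5", "g6"}]
confusion = pceg_confusion(original, noisy)
score = pceg_overlap(confusion)
```

In this toy case each diagonal cell is 2/3 and each off-diagonal cell is 0, so the overlap score is 2/3; comparing a collection of PCEG sets with itself gives a score of 1.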

Principal Component Extreme Gene (PCEG) confusion matrices for individual experiments

PCEG confusion matrices for extremeThresh=0.01.
PCEG confusion matrices for extremeThresh=0.001.
PCEG confusion matrices for extremeThresh=0.00001.

Figure 1. Summary of PCEG overlap results for noise added experiments

Experiments with balanced column subsets

We performed a number of experiments in which we generated two random data subsets having fewer than the original number of samples (columns), ran PCA interpretation analysis on each random subset, and then compared the results using the PCEG confusion matrix, summarized by the PCEG overlap score as described above. To help ensure that the PCA results from each run were appropriate for comparison, we selected column subsets containing an equal number of samples from both major diagnosis classes: NGT (normal glucose tolerance, n=17) and DM2 (type 2 diabetes mellitus, n=18). We expect the differences between these classes to be the dominant contribution to the observed variance, so we wanted to avoid generating datasets and subsequent analyses that were biased toward one or the other population. For example, for a diagnosis class subset size of 15, one random class-balanced data subset would contain 15 of the 17 NGT samples and 15 of the 18 DM2 samples, selected at random. The analysis comparison experiment was repeated 10 times at each of 8 different column subset sizes (10 through 17), and at 3 different settings of extremeThresh used to select the PCEG sets. The results indicate that, as expected, PCA interpretation of larger subsets produced greater overlap in PCEG results, but smaller subsets still retain some degree of PCEG overlap. Thus, some fraction of the principal component gene sets is recovered in spite of dropping individual columns. These results appear to be mostly independent of the extremeThresh setting.
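
The class-balanced column sampling can be illustrated with a short Python sketch. It is not the original implementation. The function name balanced_subset and the column ordering (17 NGT columns followed by 18 DM2 columns) are assumptions made for the example, and the two subsets are assumed to be drawn independently, so they may share columns.

```python
import random

def balanced_subset(labels, per_class, rng):
    """Return sorted column indices with `per_class` samples from each class."""
    chosen = []
    for cls in sorted(set(labels)):
        cols = [i for i, lab in enumerate(labels) if lab == cls]
        # Sample without replacement within the class.
        chosen.extend(rng.sample(cols, per_class))
    return sorted(chosen)

# Assumed column layout: 17 NGT samples, then 18 DM2 samples.
labels = ["NGT"] * 17 + ["DM2"] * 18
rng = random.Random(0)  # fixed seed so the draw is repeatable

# Two independent class-balanced subsets for subset size 15.
subset_a = balanced_subset(labels, 15, rng)
subset_b = balanced_subset(labels, 15, rng)
```

Each subset here contains 30 column indices, 15 per diagnosis class. In the experiment described above, PCA interpretation would be run on the columns of each subset and the two resulting collections of PCEG sets compared with the PCEG overlap score.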

Principal Component Extreme Gene (PCEG) confusion matrices for individual experiments

PCEG confusion matrices for extremeThresh=0.01.
PCEG confusion matrices for extremeThresh=0.001.
PCEG confusion matrices for extremeThresh=0.00001.

Figure 1. Summary of PCEG overlap results for balanced subset experiments
