To better understand how stable our PCA-based methodology is under variations in a dataset, we compared the results of our PCA analysis of the published diabetes dataset (10983 genes vs. 17 NGT + 18 DM2 sample columns) to the results of analyzing modified versions of the dataset (both noise-added and column-deleted variations). From these experiments and our prior experience, we believe that our methodology is robust to reasonable amounts of noise. We expect this conclusion to hold across datasets, but to what extent (i.e., how many PCs will be robustly recovered) is not clear. The conclusions from this analysis should be interpreted with some caution, because it is not known what level of noise is biologically relevant. Different datasets have different inherent noise, depending on many factors, including how many times the measurements have been replicated, the type and amount of tissue the samples come from, the experimenters themselves, the microarray platform, and even the pre-processing steps preceding the application of our PCA methods. One should not assume that these results can be immediately transferred to other datasets and broadly generalized. However, by performing analyses similar to those shown here, it should be possible to assess the robustness of these methods on any specific dataset.
We compared the results of analyzing the published dataset to the results of analyzing the same dataset after noise was added. Gaussian noise (mean zero, with standard deviation sigma equal to the percent noise times the mean expression level) was added to the log base 2 expression data. This experiment was repeated 5 times at each of 8 different noise levels, and at 3 different settings of extremeThresh, the threshold used to select the principal component extreme gene (PCEG) sets. PCA interpretation results are compared by forming the PCEG confusion matrix, which shows the correspondence of every principal component from the original data vs. every principal component of the noise-added data. Each cell of the confusion matrix represents the fraction of genes shared by the corresponding PCEG sets (the count of genes in the intersection of the two PCEG sets divided by the count of genes in their union). A single "PCEG overlap" score is calculated as the average of all row maxima and all column maxima. If two analyses correspond well, the PCEG confusion matrix will have a strong diagonal component with diagonal cell values near 1, and thus will get a PCEG overlap score near 1. Scores greater than zero indicate that there is partial overlap, possibly off-diagonal, between genes in corresponding PCEG sets; thus some fraction of the principal component gene sets is recovered in spite of the additional noise.
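The noise model and the PCEG overlap score described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the analysis: the function names (`add_gaussian_noise`, `pceg_confusion_matrix`, `pceg_overlap_score`) are hypothetical, the PCEG sets are assumed to be already available as sets of gene identifiers, and sigma is assumed to be computed from the overall mean of the log2 matrix (the text does not say whether the mean is taken globally or per gene).

```python
import random
from statistics import fmean


def add_gaussian_noise(log2_expr, pct_noise, rng):
    """Return a copy of a log2 expression matrix (list of rows) with noise added.

    Assumption: sigma = pct_noise * |overall mean of the matrix|.
    """
    values = [v for row in log2_expr for v in row]
    sigma = pct_noise * abs(fmean(values))
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in log2_expr]


def pceg_confusion_matrix(sets_a, sets_b):
    """Cell (i, j) = |A_i intersect B_j| / |A_i union B_j| for PCEG gene sets."""
    matrix = []
    for a in sets_a:
        row = []
        for b in sets_b:
            union = a | b
            # Two empty sets share nothing; define their overlap as 0.
            row.append(len(a & b) / len(union) if union else 0.0)
        matrix.append(row)
    return matrix


def pceg_overlap_score(matrix):
    """Average of all row maxima and all column maxima of the matrix."""
    row_max = [max(row) for row in matrix]
    col_max = [max(col) for col in zip(*matrix)]
    return fmean(row_max + col_max)


# Toy example with made-up gene identifiers (not real data).
original_pcegs = [{"g1", "g2", "g3"}, {"g4", "g5"}]
noisy_pcegs = [{"g1", "g2"}, {"g5", "g6"}]

confusion = pceg_confusion_matrix(original_pcegs, noisy_pcegs)
print(round(pceg_overlap_score(confusion), 3))  # 0.5

# Noise addition on a tiny made-up matrix, with a fixed seed for repeatability.
noisy = add_gaussian_noise([[8.0, 9.5], [7.2, 10.1]], 0.05, random.Random(0))
```

In the toy example the diagonal cells are 2/3 and 1/3 and the off-diagonal cells are 0, so the row and column maxima average to 0.5. Two identical analyses would give diagonal cells of 1 and a score of 1.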
PCEG confusion matrices for extremeThresh=0.01.
PCEG confusion matrices for extremeThresh=0.001.
PCEG confusion matrices for extremeThresh=0.00001.
PCEG confusion matrices for extremeThresh=0.01.
PCEG confusion matrices for extremeThresh=0.001.
PCEG confusion matrices for extremeThresh=0.00001.