Covariate analysis of the PGC diabetes dataset

To provide a more meaningful demonstration of this feature, we will switch to an analysis of a human diabetes expression dataset [Mootha et al., 2003] from Broad Institute Cancer Program dataset repository We will refer to this as the PGC diabetes dataset.

This dataset contains 10,983 gene probes measured across 43 skeletal muscle samplesfrom 3 diagnosis groups: normal glucose tolerance (NGT, n=17); impaired glucose tolerance (IGT, n=8); and Type 2 diabetes mellitus (DM2, n=18). We used our PCA interpretation software to perform an unsupervised analysis of the DM2 vs. NGT subset (comparable to the previous published result). As described in our Methodology section, PCEG sets were determined using an extremeThresh likelihood threshold of 0.001, which yielded about 50 high and 50 low extreme genes per principal component. For more information on this dataset, see [Roden et al., 2005].

Because this dataset contains more than 50 covariates, it provides us a nice opportunity to interpret each principal component by searching for covariates that correlate well with expression patterns in the PCEG sets.

We can perform the principal component analysis of this dataset as follows:

pgc = LoadExample.LoadPGC()
p = pcaGinzu.pcaGinzuVisualizeMatplotlib(pgc,verbose=True,outlierCutoff=0.001)

The PCAGinzu software has a variety of functions to support the covariate analysis. The scoreColumnLabelingsForPCN() function attempts to identify any column labelings that correlate well with a particular principal component's condition partitioning. Due to the slightly complex nature of the scores we calculate (one score for discrete covariates and three scores for continuous), this function returns a list of ColumnScore objects that contain the results, one ColumnScore per covariate. To analyze all covariates against a single principal componenent e.g. PC12, you can call:

pc12scores = p.scoreColumnLabelingsForPCN(12)

However, because the details of the list of ColumnScore representation are a bit involved and there are so many scores to calculate and examine, we will move ahead and instead show how to generate all covariate scores across all principal components. This complete covariate analysis, which may take a minute or so depending on your hardware and the size of the dataset, can be performed as follows:

covariateScores = pcaGinzuScoring.getAllResults(p,1,35)

This result is a dictionary containing a set of covariate scores, indexed by principal component number. One way we recommend viewing the results is to generate a summary of only those covariates whos scores meet or exceed a given threshold, indicdating they may be interesting and suggestive of further study. Because low scores are better for continuous, and high scores are better for discrete, there are two separate functions for showing summaries of covariate scores. For continous covariates their score must be less than or equal to the given threshold; likewise a discrete covariate's score must be greater than or equal to the given threshold. E.g.:

continuousSummary = pcaGinzuScoring.summarizeContinuousResults(covariateScores,0.01)
discreteSummary = pcaGinzuScoring.summarizeDiscreteResults(covariateScores,0.8)

Each summary returned is a list of the principal components and, for each, a list of any covariates that were observed to correlate well with that principal component. You can observe in the continuous covariate summary for example that at a Wilcoxon significance threshold of 0.01, principal component 12 is well correlated with four different Type2c fiber covariates. One may then "drill down" to the value distributions underlying these scores to investigate the relationship, e.g.:


Again it's unfortunate that the PGC dataset has only two covariates that are discrete: "status" and "Samplename_at_WICGR". Neither column correlated with a principal component at a level exceeding the default average NMI threshold of 0.8. If either did, we would want to drill down to see the underlying confusion matrix that illustrates high mutual information between that covariate and a principal component's Up/Flat/Down column labeling.

The pcaGinzuScoring module has a variety of functions to dump you output the results of covariate analysis to a file. Here we send the complete table of results to a file (after generating a results table dictionary):

resultsTable = pcaGinzuScoring.generateResultsTable(covariateScores)
pcaGinzuScoring.writeResultsTableDictToFile('all-covariate-scores.txt', resultsTable)

The resulting file will contain one column per covariate and one row per principal component. The cells contain the minimum of the three Wilcoxon significance scores for continuous covariates, and the average NMI score for discrete covariates. For continuous covariates, if there are too few data points to compare in any of the Up/Flat/Down partitions (controlled by an optional minSetSize argument to scoreColumnLabelingsForPCN, defaults to minSetSize=2), a value of 2 is reported so that situation can be recognized. For a large number of covariates this is a dense result table and is only useful if you are going to filter the results further by e.g. by loading into a spreadsheet and applying a threshold.

The following example shows how to use writeSignificantResultsToFile to send a covariate analysis significant results summary to a file:

pcaGinzuScoring.writeSignificantResultsToFile('significant-continuous-covariate-scores.txt', continuousSummary)
pcaGinzuScoring.writeSignificantResultsToFile('significant-discrete-covariate-scores.txt', discreteSummary)

The resulting files offers a nice summary of the most relevant covariates across all of the principal components.

It might be useful to analyze a particular labeling of interest to determine which principal component correlates best. For example,

pcaGinzuScoring.displayScoresForColLabeling('status', covariateScores)

Among others, principal component 14 seems to have the highest correlation to the status covariate, and so perhaps warrants a bit of investigation.

Joe Roden 2005-12-13