As an additional means of interpreting principal components, we would like to determine whether any individual principal component can predict the values of condition covariates attached to the dataset columns. There are a number of approaches for determining whether a covariate correlates well with a specific principal component. The approach we took is to identify those covariates that correlate with a principal component's derived condition partitioning. As described in the previous section, a principal component's extreme points can be used to partition the conditions into three groups: "Up", "Flat", and "Down".
For a given principal component PCn, we can compute a score for each condition (e.g. tissue or sample) covariate indicating the degree to which that covariate is correlated with PCn's significant grouping of the conditions into Up, Flat, and Down. A score is generated for each user-supplied column labeling attached to a dataset (i.e. each of the dataset's covariate annotations, e.g. age, tissue, treatment). For discrete covariates, a normalized mutual information (NMI) score is computed, indicating the degree of agreement between the covariate's discrete values and the Up/Flat/Down grouping of the conditions (a higher score means better agreement). For continuous covariates, Wilcoxon rank sum tests are performed, each giving a p-value for the hypothesis that two sets of covariate values are drawn from the same distribution (a smaller p-value means stronger evidence that the two groups differ). Three pairs of condition groups are tested: (Up vs. Flat), (Up vs. Down) and (Flat vs. Down).
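The two scores can be illustrated with a minimal, standard-library-only sketch. It is not the implementation used in this work, and it rests on several assumptions that the text does not specify: the NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), the rank sum p-value is two-sided and uses the normal approximation with a tie correction and no continuity correction, and the function names (`nmi`, `rank_sum_pvalue`, `score_continuous`) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (natural log) of a discrete labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(groups, covariate):
    """NMI between two labelings, assuming arithmetic-mean normalization."""
    n = len(groups)
    joint = Counter(zip(groups, covariate))
    pg, pc = Counter(groups), Counter(covariate)
    # I(G;C) = sum p(g,v) * log( p(g,v) / (p(g) p(v)) )
    mi = sum((c / n) * math.log(c * n / (pg[g] * pc[v]))
             for (g, v), c in joint.items())
    denom = entropy(groups) + entropy(covariate)
    # Convention chosen here: return 0 when both labelings are constant.
    return 2.0 * mi / denom if denom > 0 else 0.0


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return float("nan")  # test is undefined for an empty group
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0  # all values tied: no evidence of a difference
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


def score_continuous(groups, values):
    """P-values for (Up, Flat), (Up, Down) and (Flat, Down)."""
    scores = {}
    for a, b in combinations(["Up", "Flat", "Down"], 2):
        xa = [v for g, v in zip(groups, values) if g == a]
        xb = [v for g, v in zip(groups, values) if g == b]
        scores[(a, b)] = rank_sum_pvalue(xa, xb)
    return scores


# Toy data (invented for illustration): nine conditions, three per group.
groups = ["Up"] * 3 + ["Flat"] * 3 + ["Down"] * 3
tissue = ["liver"] * 3 + ["brain"] * 3 + ["heart"] * 3  # discrete covariate
age = [1, 2, 3, 4, 5, 6, 7, 8, 9]                       # continuous covariate

tissue_score = nmi(groups, tissue)
age_scores = score_continuous(groups, age)
```

In this toy example the tissue labels coincide exactly with the Up/Flat/Down grouping, so the NMI reaches its maximum, while the age values are fully separated between Up and Down, giving that pair the smallest p-value of the three. With groups this small an exact test would be preferable to the normal approximation.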
Beyond identifying extreme points and the condition partitionings they generate, the covariate analysis above gives us an additional data-driven, unsupervised approach to interpreting each principal component. A report of condition covariates that correlate with condition partitions essentially generates a number of hypotheses at a specified significance threshold. It is then up to the researcher to follow up on those hypotheses and verify whether an observed correspondence between, e.g., a particular set of genes, tissues and covariates is relevant and in fact meaningful. Of course, this covariate analysis approach is inherently limited by the number of column covariates available. For both the Cho dataset and the GNF human dataset we have only one column covariate (Time for Cho, and tissue for GNF).