Introduction to Clustering

Clustering Algorithms are a general class of unsupervised machine learning techniques which attempt to find an approximation to the optimal partitioning (or separation) of a given data set into discrete classes or clusters. An optimal partition is is defined as the partitioning for which each data vector in a cluster is more similar to all other data vectors with the same cluster memberships than to all other data vectors with different cluster memberships. It can be shown that this problem is NP-hard (computationally intractable, if you think about it, to be certain you have found the `''true'' optimal clustering you would have to search through all possible clusterings - a rapidly growing set of possibilities as the number of data points grows.).

In the context of microarray data analysis clustering has become a staple technique to provide insights into which genes have similar behavior across the conditions being assayed. Clusters essentially form groups of co-expressed genes, using these data as a starting point biologists can then start to address the more interesting questions regarding why these genes are co-expressed. For instance, are the co-regulated genes part of a similar biochemical pathway. The comparative tools provide utilities to compare clusterings, which can elucidate such things as: how different parameters affect a clustering; how different algorithms partition the data; and how different experimental perturbations affect which cluster, representing a similar expression response, a gene might belong to. This can have potentially profound effects on downstream analysis.

Although we don't stress it in this tutorial, you can cluster conditions (columns) just as easily as you can cluster genes (rows) using the same techniques.

Brandon King 2005-05-27