Introduction to Clustering

Clustering Algorithms are a general class of unsupervised machine learning techniques which attempt to find an approximation to the optimal partition (or seperation) of a given data set into discrete classes or clusters. An optimal partition is is defined as the partitioning for which each data vector in a cluster is more simular to all other data vectors with the same cluster memberships than to all other data vectors with different cluster memberships. It can be shown that this problem is NP-hard (computationally untracktable, if you think about it, to be certain you have found the `''true'' optimal clustering you would have to search through all possible clusterings - a very large number as the number of data points becomes big.).

In the context of microarray data analysis clustering has become a staple technique to provide insights into which genes have simular behavior across the conditions being assayed. Clusters essentially form groups of co-expressed genes, using these data as a starting point biologists can then start to address the more interesting questions regarding why these genes are coexpressed. Are the co-regulated functionally, part of simular biochemical pathway, etc. The comparitive tools provide utilities to explore the effects of both what experimental pertabations might have on clustering but also how different algorithms differ from each other on how they partion the genes. This can have potentially profound effects on downstream analysis.

Although we don't stress it in this tutorial, you can cluster conditions just as easily as you can cluster conditions and the same methods apply.

Brandon King 2004-05-13