DiagEM

The first clustering program we will use is DiagEM. DiagEM is an implementation of the expectation maximization (EM) algorithm that attempts to fit the data vectors to a mixture of Gaussian clusters. The algorithm is so named because it uses only the diagonal of each cluster's covariance matrix, that is, the per-dimension variances, and ignores the off-diagonal covariance terms. Because of this, the Gaussians discovered by DiagEM can stretch or shrink only along the axes of the data space; they cannot be rotated.
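To make the idea concrete, here is a minimal sketch of EM with diagonal covariances in plain Python. It is not the DiagEM code shipped with CompClust; the function names (`log_gauss_diag`, `diag_em`), the random initialization, the fixed iteration count, and the variance floor `min_var` are all assumptions chosen for illustration.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian: a sum of 1-D terms."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def diag_em(data, k, n_iter=50, seed=0, min_var=1e-6):
    """Fit k diagonal Gaussians to data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    means = [list(p) for p in rng.sample(data, k)]  # k data rows as starting means
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each data vector.
        resp = []
        for x in data:
            logs = [
                math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                for j in range(k)
            ]
            top = max(logs)  # subtract the max for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean, and per-axis variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty clusters
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * x[a] for r, x in zip(resp, data)) / nj
                for a in range(d)
            ]
            variances[j] = [
                max(
                    sum(r[j] * (x[a] - means[j][a]) ** 2
                        for r, x in zip(resp, data)) / nj,
                    min_var,  # floor keeps a collapsed cluster from hitting zero
                )
                for a in range(d)
            ]
    # Hard labels taken from the responsibilities of the last E-step.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels
```

Each cluster is described by one mean and one variance per dimension, so no off-diagonal terms are ever estimated. Like any EM fit, the result depends on the starting means, so different seeds can give different clusterings.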

In principle, the more common KMeans algorithm is a simplification of the EM algorithm in which every cluster is limited to a simple high-dimensional sphere instead of the ellipsoid that can be described by a covariance matrix.
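The relationship can be illustrated with a small sketch. Assuming equal cluster weights and a single spherical variance shared by all clusters, picking the most likely Gaussian for a point gives the same answer as picking the nearest centroid, which is the KMeans assignment rule. The function names and the sample centroids are hypothetical.

```python
import math


def nearest_centroid(x, centroids):
    """KMeans-style assignment: index of the centroid closest to x."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[j])),
    )


def most_likely_sphere(x, centroids, var=1.0):
    """Hard EM-style assignment when all clusters share one spherical variance."""
    def log_density(c):
        d = len(x)
        sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return -0.5 * (d * math.log(2.0 * math.pi * var) + sq / var)

    return max(range(len(centroids)), key=lambda j: log_density(centroids[j]))


centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
for point in ([1.0, 0.5], [3.0, 1.0], [0.5, 3.5]):
    # Both rules agree because the log density falls with squared distance.
    assert nearest_centroid(point, centroids) == most_likely_sphere(point, centroids)
```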

The EM algorithm using a full covariance matrix can create Gaussians that not only have different widths along each dimension but can also be rotated in different directions in the data space. Unfortunately, biological datasets frequently have many conditions, which corresponds to a high-dimensional data space. A full covariance matrix in d dimensions has d(d+1)/2 free parameters per cluster, compared with d for a diagonal one, and there is rarely enough data to estimate all of them properly. We therefore standardized on the diagonal EM algorithm.
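A quick calculation shows how fast the gap grows. The helper below is a hypothetical illustration, and the dimension of 20 conditions is an arbitrary example rather than a property of any particular dataset.

```python
def covariance_parameters(d, k, full=True):
    """Number of free covariance parameters for k clusters in d dimensions."""
    # A symmetric d x d matrix has d(d+1)/2 free entries; a diagonal one has d.
    per_cluster = d * (d + 1) // 2 if full else d
    return k * per_cluster


# Example: 20 conditions, 5 clusters.
print(covariance_parameters(20, 5, full=True))   # 1050
print(covariance_parameters(20, 5, full=False))  # 100
```

The cluster means add another d parameters per cluster in both cases, so the difference comes entirely from the covariance terms.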

We will use most of the default parameters, but we will change the number of clusters that DiagEM creates. Change K from two (the default) to five, because we will compare our results to the Cho Classification, which is partitioned into five clusters that represent five stages of the yeast cell cycle.

Choose 'DiagEM' from the 'Clustering' menu and then change K from 2 to 5 as shown below.

**Figure:** Clustering|DiagEM
$\includegraphics[height=4in]{tkImages/compClustTk-Clustering-DiagEMDialog}$

When clustering is finished, a dialog box similar to the one below should pop up.

**Figure:** Clustering|DiagEM
$\includegraphics{tkImages/compClustTk-Clustering-DiagEMDialogComplete}$

Brandon King 2005-05-16