
DiagEM

Requirements:

DiagEM stands for Diagonal EM (Expectation-Maximization) and is named to distinguish it from the FullEM algorithm described in the next section. The "diagonal" refers to the covariance matrix used in the optimization. Typically, this is a full $N$-by-$N$ matrix, where $N$ is the number of dimensions of the dataset, which means there are $\frac{1}{2}kN(N+3)$ parameters to optimize, where $k$ is the number of clusters, $\frac{1}{2}N(N+1)$ is the number of free parameters in the covariance matrix of each cluster, and $N$ is the number of free parameters needed to specify the mean of each cluster. Thus, if $F$ is the number of free parameters,


\begin{displaymath}
F = \frac{1}{2}kN(N+1) + kN
\end{displaymath}


\begin{displaymath}
2F = k(N^2 + N + 2N)
\end{displaymath}


\begin{displaymath}
2F = k(N^2 + 3N)
\end{displaymath}


\begin{displaymath}
F = \frac{1}{2}kN(N+3)
\end{displaymath}

Since this value grows quadratically with $N$, we may not have enough datapoints in our dataset to statistically justify the resulting classification. So, instead of using a full covariance matrix, we only consider the diagonal of the matrix.

By limiting ourselves to the diagonal, we restrict the Gaussians which the algorithm finds to being axis-aligned with the dataset's coordinate system. However, this restriction reduces the number of free parameters to $2kN$ ($N$ variances plus $N$ means per cluster), a vast improvement.
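To get a feel for the difference, here is a small sketch (plain Python; the function names are ours and simply evaluate the two formulas above) comparing the parameter counts:

def full_em_params(k, N):
    # Per cluster: N values for the mean plus N(N+1)/2 free values
    # in a full covariance matrix, i.e. F = k*N*(N+3)/2 in total.
    return k * (N + (N * (N + 1)) // 2)

def diag_em_params(k, N):
    # Per cluster: N values for the mean plus N diagonal variances,
    # i.e. F = 2*k*N in total.
    return 2 * k * N

full_em_params(15, 3)     # 135 -- the dataset used below: k=15, N=3
diag_em_params(15, 3)     # 90

full_em_params(15, 100)   # 77250 -- the gap widens quickly with N
diag_em_params(15, 100)   # 3000

Even for the small three-dimensional dataset used below, the diagonal restriction removes a third of the parameters; for high-dimensional data the savings are dramatic.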

This aside is intended to ensure that you are aware of the limitations of DiagEM and are able to recognize when its results should be taken with a strong dose of NaCl.

So, with that out of the way, on to our first example of clustering! For all of the clustering examples, we will be using a dataset which can be found on the woldlab server. The file itself is located at

/proj/cluster_gazing2/data/synthetic/datasets/synth_t_15c3_p_0750_d_03_v_0d3.txt.gz

Note that it is a gzipped file, so you will need to copy it to your working directory and uncompress it there. You will also need to set an environment variable to indicate which binary to use for running DiagEM. The variable is named DIAGEM_COMMAND. To set it, type the following in your shell:

export DIAGEM_COMMAND=/proj/code/bin/diagem
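If you would rather do both setup steps from inside Python, the following sketch is one way; it uses only the standard gzip, shutil, and os modules, the paths are the ones given above, and the environment variable must be set before the wrapper is run:

import gzip
import os
import shutil

# Decompress the dataset directly into the working directory.
src = '/proj/cluster_gazing2/data/synthetic/datasets/' \
      'synth_t_15c3_p_0750_d_03_v_0d3.txt.gz'
fin = gzip.open(src, 'rb')
fout = open('synth_t_15c3_p_0750_d_03_v_0d3.txt', 'wb')
shutil.copyfileobj(fin, fout)
fout.close()
fin.close()

# Tell the wrapper which binary to run; this must be set before
# DiagEM.run() is called.
os.environ['DIAGEM_COMMAND'] = '/proj/code/bin/diagem'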

Once you have done so, we are ready to work. Start up the Python interpreter and try out the following.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.DiagEM import DiagEM
>>> from compClust.util.DistanceMetrics import *
>>> import MLab
>>> import Numeric
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> ds
Dataset: None, 750 by 3
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['num_iterations'] = 100
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_method'] = 'random_point'
>>> diagem = DiagEM(ds, parameters)
>>> diagem.validate()
1
>>> diagem.run()
1
>>> results = diagem.getLabeling()
>>> results.getLabels()
['8', '9', '6', '7', '4', '5', '2', '3', '1', '14', '15', 
 '12', '13', '10', '11']
>>> len(results.getLabels())
15
>>> map(lambda x : len(results.getRowsByLabel(x)), 
... results.getLabels())
[40, 57, 54, 12, 2, 66, 137, 69, 6, 17, 99, 10, 32, 78, 71]

So, from these results, we can see that the DiagEM algorithm produced 15 clusters (exactly what we asked for), with between 2 and 137 datapoints per cluster. Now let's make a group of subsets, one for each class, and compare them on a class-by-class basis. Specifically, we'll look at how closely correlated the classes are to each other.

>>> classes = map(ds.subsetRows, 
... map(results.getRowsByLabel, results.getLabels()))
>>> class_means = map(lambda x : MLab.mean(x.getData()), 
... classes)
>>> def corr_matrix(means):
...    matrix = []
...    for mean in means:
...       corrs = []
...       for m in means:
...          corrs.append(CorrelationDistance(mean, m))
...       matrix.append(corrs)
...    return matrix
... 
>>> corrs = corr_matrix(class_means)
>>> corrs_rank = map(ranks, Numeric.array(corrs))
>>> corrs_most_similar = \
... map(lambda x : x.tolist().index(14), corrs_rank)
>>> corrs_most_similar
[11, 4, 3, 8, 9, 2, 12, 8, 3, 4, 14, 0, 14, 2, 10]

What this analysis shows is which class each class is most strongly correlated with, other than itself. Some interesting things to notice are the mutual neighbors (classes 0/11, 10/14, 3/8, and 4/9). This suggests that even though these clusters may be far apart, they express similar data.
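With more clusters, spotting such mutual pairs by eye becomes tedious. Here is a short sketch (plain Python, operating on the corrs_most_similar list computed above) that extracts them automatically:

def mutual_pairs(most_similar):
    # Collect the pairs (i, j) where i's most-similar class is j and
    # j's most-similar class is i; requiring i < j avoids reporting
    # each pair twice.
    pairs = []
    for i in range(len(most_similar)):
        j = most_similar[i]
        if most_similar[j] == i and i < j:
            pairs.append((i, j))
    return pairs

mutual_pairs(corrs_most_similar)  # [(0, 11), (3, 8), (4, 9), (10, 14)]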

