
DiagEM

Requirements:

DiagEM stands for Diagonal EM (Expectation-Maximization) and is named to distinguish it from the FullEM algorithm described in the next section. The "diagonal" refers to the covariance matrix used in the optimization. Typically, this is a full $N$-by-$N$ matrix, where $N$ is the number of dimensions of the dataset, which means there are $\frac{1}{2}kN(N+3)$ parameters to optimize, where $k$ is the number of clusters, $\frac{1}{2}N(N+1)$ is the number of free parameters in the covariance matrix of each cluster, and $N$ is the number of free parameters needed to specify the mean of each cluster. Thus, if $F$ is the number of free parameters,


\begin{displaymath}
F = \frac{1}{2}kN(N+1) + kN
\end{displaymath}


\begin{displaymath}
2F = k(N^2 + N + 2N)
\end{displaymath}


\begin{displaymath}
2F = k(N^2 + 3N)
\end{displaymath}


\begin{displaymath}
F = \frac{1}{2}kN(N+3)
\end{displaymath}

Since this value grows quadratically with $N$, we may not have enough datapoints in our dataset to statistically justify the resulting classification. So, instead of using a full covariance matrix, we only consider the diagonal of the matrix.

By limiting ourselves to the diagonal, we restrict the Gaussians which the algorithm finds to being axis-aligned with the dataset's coordinate system. However, this restriction reduces the number of free parameters to $2kN$ ($N$ variances plus $N$ means per cluster), a vast improvement.
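To get a feel for the difference, here is a small sketch (plain Python; the function names are ours and simply evaluate the two formulas above) comparing the parameter counts:

def full_em_params(k, N):
    # Per cluster: N values for the mean plus N(N+1)/2 free values
    # in a full covariance matrix, i.e. F = k*N*(N+3)/2 in total.
    return k * (N + (N * (N + 1)) // 2)

def diag_em_params(k, N):
    # Per cluster: N values for the mean plus N diagonal variances,
    # i.e. F = 2*k*N in total.
    return 2 * k * N

full_em_params(15, 3)     # 135 -- the dataset used below: k=15, N=3
diag_em_params(15, 3)     # 90

full_em_params(15, 100)   # 77250 -- the gap widens quickly with N
diag_em_params(15, 100)   # 3000

Even for the small three-dimensional dataset used below, the diagonal restriction removes a third of the parameters; for high-dimensional data the savings are dramatic.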

This aside is intended to ensure that you are aware of the limitations of DiagEM and are able to recognize when its results should be taken with a strong dose of NaCl.

So, with that out of the way, on to our first example of clustering! For all of the clustering examples, we will be using a dataset which can be found on the woldlab server. The file itself is located at

/proj/cluster_gazing2/data/synthetic/datasets/synth_t_15c3_p_0750_d_03_v_0d3.txt.gz

Note that it is a gzipped file, so you will need to copy it to your working directory and uncompress it there. You will also need to set an environment variable to indicate which binary to use for running DiagEM. The variable is named DIAGEM_COMMAND. To set it, type the following in your shell:

export DIAGEM_COMMAND=/proj/code/bin/diagem
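If you would rather do both setup steps from inside Python, the following sketch is one way; it uses only the standard gzip, shutil, and os modules, the paths are the ones given above, and the environment variable must be set before the wrapper is run:

import gzip
import os
import shutil

# Decompress the dataset directly into the working directory.
src = '/proj/cluster_gazing2/data/synthetic/datasets/' \
      'synth_t_15c3_p_0750_d_03_v_0d3.txt.gz'
fin = gzip.open(src, 'rb')
fout = open('synth_t_15c3_p_0750_d_03_v_0d3.txt', 'wb')
shutil.copyfileobj(fin, fout)
fout.close()
fin.close()

# Tell the wrapper which binary to run; this must be set before
# DiagEM.run() is called.
os.environ['DIAGEM_COMMAND'] = '/proj/code/bin/diagem'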

Once you have done so, we are ready to work. Start up the Python interpreter and try out the following.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.DiagEM import DiagEM
>>> from compClust.util.DistanceMetrics import *
>>> import MLab
>>> import Numeric
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> ds
Dataset: None, 750 by 3
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['num_iterations'] = 100
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_method'] = 'random_point'
>>> diagem = DiagEM(ds, parameters)
>>> diagem.validate()
1
>>> diagem.run()
1
>>> results = diagem.getLabeling()
>>> results.getLabels()
['8', '9', '6', '7', '4', '5', '2', '3', '1', '14', '15', 
 '12', '13', '10', '11']
>>> len(results.getLabels())
15
>>> map(lambda x : len(results.getRowsByLabel(x)), 
... results.getLabels())
[40, 57, 54, 12, 2, 66, 137, 69, 6, 17, 99, 10, 32, 78, 71]

So, from these results, we can see that the DiagEM algorithm produced 15 clusters (exactly what we asked for), with between 2 and 137 datapoints per cluster. Now let's make a group of subsets, one for each class, and compare them on a class-by-class basis. Specifically, we'll look at how closely correlated the classes are to each other.

>>> classes = map(ds.subsetRows, 
... map(results.getRowsByLabel, results.getLabels()))
>>> class_means = map(lambda x : MLab.mean(x.getData()), 
... classes)
>>> def corr_matrix(means):
...    matrix = []
...    for mean in means:
...       corrs = []
...       for m in means:
...          corrs.append(CorrelationDistance(mean, m))
...       matrix.append(corrs)
...    return matrix
... 
>>> corrs = corr_matrix(class_means)
>>> corrs_rank = map(ranks, Numeric.array(corrs))
>>> corrs_most_similar = \
... map(lambda x : x.tolist().index(14), corrs_rank)
>>> corrs_most_similar
[11, 4, 3, 8, 9, 2, 12, 8, 3, 4, 14, 0, 14, 2, 10]

What this analysis shows is which class each class is most strongly correlated with, other than itself. Some interesting things to notice are the mutual neighbors (classes 0/11, 10/14, 3/8, and 4/9). This suggests that even though these clusters may be far apart, they express similar data.
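With more clusters, spotting such mutual pairs by eye becomes tedious. Here is a short sketch (plain Python, operating on the corrs_most_similar list computed above) that extracts them automatically:

def mutual_pairs(most_similar):
    # Collect the pairs (i, j) where i's most-similar class is j and
    # j's most-similar class is i; requiring i < j avoids reporting
    # each pair twice.
    pairs = []
    for i in range(len(most_similar)):
        j = most_similar[i]
        if most_similar[j] == i and i < j:
            pairs.append((i, j))
    return pairs

mutual_pairs(corrs_most_similar)  # [(0, 11), (3, 8), (4, 9), (10, 14)]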

