
FullEM

Requirements:

The FullEM algorithm is very similar to the DiagEM algorithm described above, except that it uses the full covariance matrix and must therefore solve for $\frac{1}{2}kN(N+3)$ unknowns. In exchange, the full covariance matrix allows each Gaussian to take on an arbitrary rotation, which can fit the data more tightly. As an example, let's re-cluster the same dataset we used in the DiagEM section and compare the results.
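
Before running anything, it is worth seeing what that parameter count means in practice. The quick check below simply evaluates the formula above for the dataset used in this section (k = 15 clusters, N = 3 dimensions); it is plain Python, not part of the compClust API.

>>> # Free parameters FullEM must estimate: (1/2) * k * N * (N + 3)
>>> k, N = 15, 3
>>> k * N * (N + 3) / 2
135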

>>> from compClust.mlx.dataset import Dataset
>>> from compClust.mlx.wrapper.FullEM import FullEM
>>> from compClust.util.DistanceMetrics import *
>>> import MLab
>>> import Numeric
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> ds
Dataset: None, 750 by 3
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['seed'] = 1234.0
>>> fullem = FullEM(ds, parameters)
>>> fullem.validate()
1
>>> fullem.run()
...
... (a bunch of FullEM messages print out)
...
1
>>> results = fullem.getLabeling()
>>> map(lambda x : len(results.getRowsByLabel(x)), 
... results.getLabels())
[71, 49, 12, 73, 50, 30, 22, 35, 87, 29, 64, 66, 76, 
 26, 60]
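
The standard deviations quoted in the next paragraph are easy to reproduce from the class counts themselves. The sketch below is plain Python rather than anything from compClust, and it assumes the sample (N - 1) standard deviation, which agrees with the 23.0 figure quoted below.

>>> counts = map(lambda x : len(results.getRowsByLabel(x)),
...              results.getLabels())
>>> n = len(counts)
>>> mean = reduce(lambda a, b : a + b, counts) / float(n)
>>> ss = reduce(lambda a, b : a + b, [(c - mean) ** 2 for c in counts])
>>> (ss / (n - 1)) ** 0.5   # roughly 23.0 for the counts above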

So, it looks like the results from FullEM are more consistent than those from DiagEM. The DiagEM class counts had a standard deviation of 38.4, while the FullEM counts have a standard deviation of only 23.0, an excellent improvement. Remember, this synthetic dataset is composed of a total of 45 clusters: 15 top-level clusters, each made up of 3 sub-clusters.

Let's see what happens if we try to find all 45 clusters at once instead of the 15 top-level clusters.

>>> parameters['k'] = 45
>>> fullem2 = FullEM(ds, parameters)
>>> fullem2.validate()
1
>>> fullem2.run()
...
1
>>> results2 = fullem2.getLabeling()
>>> map(lambda x : len(results2.getRowsByLabel(x)), 
... results2.getLabels())
[25, 8, 18, 14, 22, 16, 13, 50, 16, 31, 19, 20, 13, 
 4, 71, 18, 19, 20, 6, 16, 4, 14, 19, 14, 13, 15, 21, 
 22, 14, 15, 5, 8, 17, 10, 16, 20, 4, 21, 4, 18, 10, 
 22, 11, 10, 4]
>>>

We would expect each cluster to contain roughly 17 data points (750/45 ≈ 16.7), and many clusters are indeed in the 15-20 point range, though there are several outliers. The standard deviation of this clustering is 11.6, which, although smaller than 23.0 in absolute terms, is much larger relative to the expected cluster size than our previous result with k = 15. Although we can't be sure without further analysis, this result may indicate that we are getting a less stable clustering by over-specifying k. We will continue this analysis later in the tutorial.
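
To make the relative comparison concrete, divide each standard deviation by the mean class count for its run. The two lines below use only the figures already quoted in the text.

>>> 23.0 / (750 / 15.0)    # k = 15: roughly 0.46
>>> 11.6 / (750 / 45.0)    # k = 45: roughly 0.70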

