Requirements: the fullem executable.
The FullEM algorithm is very similar to the DiagEM algorithm described above, except that it uses the full covariance matrix: instead of estimating only the d per-dimension variances, it must solve for all d(d+1)/2 unique entries of each symmetric covariance matrix. In exchange for the extra unknowns, a full covariance matrix allows each Gaussian to take on an arbitrary rotation, which can fit the data more tightly.
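To make the trade-off concrete, here is a minimal sketch in plain Python of the per-component parameter counts (covariance_params is a hypothetical helper written for this tutorial, not part of compClust):

>>> def covariance_params(d, full):
...     # diagonal: one variance per dimension;
...     # full: the d*(d+1)/2 unique entries of a symmetric d-by-d matrix
...     if full:
...         return d * (d + 1) / 2
...     return d
...
>>> covariance_params(3, False)   # DiagEM on our 3-dimensional data
3
>>> covariance_params(3, True)    # FullEM on our 3-dimensional data
6

With that trade-off in mind, let's re-cluster the same dataset as we did in the DiagEM section and compare the results.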
>>> from compClust.mlx.dataset import Dataset
>>> from compClust.mlx.wrapper.FullEM import FullEM
>>> from compClust.util.DistanceMetrics import *
>>> import MLab
>>> import Numeric
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> ds
Dataset: None, 750 by 3
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['seed'] = 1234.0
>>> fullem = FullEM(ds, parameters)
>>> fullem.run()
# ...
# A bunch of fullEM messages print out
# ...
>>> results = fullem.getLabeling()
>>> map(lambda x : len(results.getRowsByLabel(x)),
...     results.getLabels())
[71, 49, 12, 73, 50, 30, 22, 35, 87, 29, 64, 66, 76, 26, 60]
So the results from FullEM look more consistent than those from DiagEM: the DiagEM class counts had a standard deviation of 38.4, while the FullEM counts have a standard deviation of only 23.0, a marked improvement. Remember that this synthetic dataset is composed of a total of 45 clusters: 15 top-level clusters made of 3 sub-clusters each.
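If you want to verify a figure like the 23.0 above, the standard deviation can be computed directly from the class counts. A minimal sketch, assuming MLab.std behaves as in the old Numeric distribution (counts is a variable introduced here for illustration):

>>> counts = Numeric.array(map(lambda x : len(results.getRowsByLabel(x)),
...                            results.getLabels()), Numeric.Float)
>>> MLab.std(counts)   # prints roughly 23.0, the figure quoted above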
Let's see what happens if we try to find all 45 clusters at once instead of the 15 top-level clusters.
>>> parameters['k'] = 45
>>> fullem2 = FullEM(ds, parameters)
>>> fullem2.validate()
1
>>> fullem2.run()
# ...
# A bunch of fullEM messages print out
# ...
>>> results2 = fullem2.getLabeling()
>>> map(lambda x : len(results2.getRowsByLabel(x)),
...     results2.getLabels())
[25, 8, 18, 14, 22, 16, 13, 50, 16, 31, 19, 20, 13, 4, 71, 18, 19, 20, 6, 16, 4, 14, 19, 14, 13, 15, 21, 22, 14, 15, 5, 8, 17, 10, 16, 20, 4, 21, 4, 18, 10, 22, 11, 10, 4]
We would expect each cluster to have a size of about 16 (750/45), and many clusters are in the 15-20 datapoint range, though there are several outliers. The standard deviation of this clustering is 11.6, which is much larger relative to the expected cluster size than our previous result with k = 15 (roughly 11.6/16.7 = 0.70 here, versus 23.0/50 = 0.46 before). Although we can't be sure without further analysis, this result may indicate that we are getting a less stable result by over-specifying k. We will continue this analysis later in the tutorial.
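The same sanity check as before, again assuming MLab.std and MLab.mean behave as in the old Numeric distribution (counts2 is introduced here for illustration):

>>> counts2 = Numeric.array(map(lambda x : len(results2.getRowsByLabel(x)),
...                             results2.getLabels()), Numeric.Float)
>>> MLab.mean(counts2)   # mean cluster size, 750/45, roughly 16.7
>>> MLab.std(counts2)    # roughly 11.6
>>> # relative spread: roughly 11.6/16.7 = 0.70, versus 23.0/50 = 0.46 for k = 15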