
KMeans

Requirements:

The KMeans algorithm is a very straightforward clustering algorithm which simply assigns each point to the nearest cluster centroid. KMeans can be interpreted as a special case of the EM algorithms described previously in which the cluster membership weights are binary: each point belongs entirely to one cluster rather than fractionally to several. Because of this similarity to EM, it is appropriate to contrast KMeans results with the EM results.
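
To make the connection concrete, here is a minimal sketch of a single KMeans iteration using numpy. This is only an illustration of the algorithm, not the compClust implementation:

import numpy as np

def kmeans_iteration(points, means):
    """One KMeans iteration: a hard-assignment ("binary") EM step.
    points is an (n, d) array, means is a (k, d) array. Illustrative only."""
    # E-step analogue: each point is assigned entirely to its nearest
    # centroid, rather than fractionally to all of them as in EM.
    dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # M-step analogue: each centroid becomes the mean of its members.
    # A centroid with no members is left unchanged here; the wrapper's
    # collapse handling is discussed below.
    new_means = means.copy()
    for j in range(len(means)):
        members = points[labels == j]
        if len(members) > 0:
            new_means[j] = members.mean(axis=0)
    return labels, new_means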

>>> from compClust.mlx.dataset import Dataset
>>> from compClust.mlx.wrapper.KMeans import KMeans
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_means'] = 'random'
>>> parameters['max_iterations'] = 100           
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
1
>>> results = kmeans.getLabeling()
>>> map(lambda x : len(results.getRowsByLabel(x)),
... results.getLabels())
[169, 135, 109, 134, 203]

Notice that even though we asked for 15 clusters, KMeans returned only 5. Whenever KMeans finds that a cluster has collapsed to a single point, it backs off and tries again with k = k-1. To get around this, we can either set a different random seed or turn on k-strict mode. The back-off behaves roughly like the sketch below; after that, let's try setting the seed.
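
A minimal sketch of that back-off loop, assuming a hypothetical run_kmeans_once callable that reports whether any cluster collapsed; this is not the wrapper's actual code:

def kmeans_with_backoff(points, k, run_kmeans_once):
    """Retry with one fewer cluster whenever a cluster collapses.
    run_kmeans_once is a hypothetical callable returning
    (labels, means, collapsed); illustrative only."""
    while k > 1:
        labels, means, collapsed = run_kmeans_once(points, k)
        if not collapsed:
            return labels, means    # found k stable clusters
        k = k - 1                   # a cluster collapsed; back off, retry
    raise ValueError("could not find a stable clustering")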

>>> parameters['seed'] = 1234
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
1
>>> results = kmeans.getLabeling()
>>> len(results.getLabels())
6

This time we got only 6 clusters, which is still not what we asked for, so let's try turning on k-strict mode.

>>> parameters['k_strict'] = 'true'
>>> parameters['max_restarts'] = 10
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
-1

A return value of -1 indicates that KMeans was unable to find exactly 15 clusters. The last alternative is to try a different initialization method. The Church initialization places all of the means as far apart from each other as possible, which sounds promising, so let's try it; a rough sketch of the idea is shown below, followed by the wrapper call.
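
For intuition, a farthest-first placement can be sketched like this. It illustrates the general idea only and is not the compClust 'church' implementation:

import numpy as np

def farthest_first_means(points, k, seed=None):
    """Spread k starting means as far apart as possible (illustrative)."""
    rng = np.random.RandomState(seed)
    means = [points[rng.randint(len(points))]]   # seed with a random point
    while len(means) < k:
        # distance from every point to its nearest already-chosen mean
        d = np.min([((points - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(points[d.argmax()])         # add the farthest point
    return np.array(means)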

>>> parameters['init_means'] = 'church'
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
1

Good, KMeans ran successfully and we now have 15 clusters to examine.

>>> results = kmeans.getLabeling()
>>> map(lambda x : len(results.getRowsByLabel(x)),
... results.getLabels())
[38, 19, 32, 50, 65, 29, 71, 57, 81, 21, 50, 34, 62, 119, 22]

This looks like a fairly good partitioning that compares well with DiagEM. In fact, the standard deviation of the points per cluster from KMeans is 27.2, which is almost as good as FullEM. To be fair, DiagEM using Church initialization turns in a standard deviation of 33.7 for its results, which is an improvement over its earlier runs, but still not up to the level of KMeans.
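
For reference, that balance figure is just the sample standard deviation (n-1 in the denominator) of the cluster sizes. A small helper along these lines, written against the labeling methods used above, would compute it; the helper itself is hypothetical, not part of compClust:

import math

def cluster_size_std(labeling):
    """Sample standard deviation of the points-per-cluster counts.
    labeling is assumed to expose the getLabels()/getRowsByLabel()
    methods used above; this helper is illustrative only."""
    sizes = [len(labeling.getRowsByLabel(l)) for l in labeling.getLabels()]
    n = len(sizes)
    mean = float(sum(sizes)) / n
    var = sum([(s - mean) ** 2 for s in sizes]) / (n - 1)
    return math.sqrt(var)

# cluster_size_std(results) with the 15 sizes listed above gives
# approximately 27.2, matching the figure quoted in the text.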

