Leveraging Labelings & Views

Although not fancy, the Labeling/View classes let you analyze a dataset in several ways. One of the most common operations is to check how the classes from one clustering map onto the classes from another. That is, how consistent are the clusterings? Do they agree on which points belong together? To illustrate this technique, let's compare the results of the KMeans and FullEM clustering algorithms.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.FullEM import FullEM
>>> from compClust.mlx.wrapper.KMeans import KMeans
>>> from compClust.util.unique import unique
>>> #
... # Create some duplicate subsets
...
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> kmeans_data = ds.subsetRows(range(750))
>>> fullem_data = ds.subsetRows(range(750))
>>> #
... # Build the parameters table; note that it is
... # perfectly acceptable to use the same hash for
... # different algorithms as long as it contains
... # all the required parameters for both.
...
>>> p = {}
>>> p['k'] = 5 
>>> p['max_iterations'] = 1000
>>> p['init_means'] = 'church'
>>> p['distance_metric'] = 'euclidean'
>>> p['seed'] = 1234
>>> kmeans = KMeans(kmeans_data, p)
>>> kmeans.validate()
1
>>> fullem = FullEM(fullem_data, p)
>>> fullem.validate()
1
>>> kmeans.run()
1
>>> fullem.run()
1
>>> kmeans_labeling = kmeans.getLabeling()
>>> fullem_labeling = fullem.getLabeling() 
>>> classes = map(lambda x : 
... fullem_data.subset(fullem_labeling, x), 
... fullem_labeling.getLabels())
>>> #
... # Build a set of labelings
...
>>> class1labels = classes[0].labelUsing(kmeans_labeling)
>>> class2labels = classes[1].labelUsing(kmeans_labeling)
>>> class3labels = classes[2].labelUsing(kmeans_labeling)
>>> class4labels = classes[3].labelUsing(kmeans_labeling)
>>> class5labels = classes[4].labelUsing(kmeans_labeling)
>>> all_class_labels = [class1labels, class2labels, 
... class3labels, class4labels, class5labels]
>>> for i in range(5):         
...    print i,
...    print unique(all_class_labels[i].getRowLabels())
... 
0 ['2', '4']
1 ['0', '4']
2 ['0', '1']
3 ['3']
4 ['2', '4']

By inspection, the two algorithms show a non-trivial amount of agreement. Notice that class 3 produced by FullEM contains only points from cluster 3 of KMeans, and that KMeans cluster 3 appears nowhere else. This indicates 100% agreement on the points that compose this cluster. The other clusters do not agree quite as well, but no FullEM class contains points from more than two KMeans clusters.
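
The check made by inspection above can also be done programmatically. The following is a minimal sketch in plain Python that does not use compClust at all: the helper names labels_per_class and exact_matches are hypothetical, and the two label lists are made-up toy data, not output from the session above. It assumes each labeling is available as a simple list with one label per datapoint, in the same point order.

```python
# Plain-Python sketch; none of these names come from compClust.
# The label lists are hypothetical toy data, not results from the session.

def labels_per_class(primary, secondary):
    """Map each primary label to the sorted unique secondary labels
    assigned to the same points."""
    seen = {}
    for p, s in zip(primary, secondary):
        seen.setdefault(p, set()).add(s)
    return dict((p, sorted(s)) for p, s in seen.items())


def exact_matches(primary, secondary):
    """Return (primary_label, secondary_label) pairs whose member
    points are identical in both labelings."""
    forward = labels_per_class(primary, secondary)
    backward = labels_per_class(secondary, primary)
    pairs = []
    for p in sorted(forward):
        # A class matches exactly when it maps to a single secondary
        # label and that label maps back to this class only.
        if len(forward[p]) == 1:
            s = forward[p][0]
            if backward[s] == [p]:
                pairs.append((p, s))
    return pairs


# Hypothetical labelings of eight points by two algorithms.
labels_a = ['0', '0', '1', '1', '1', '2', '2', '2']
labels_b = ['x', 'y', 'y', 'y', 'z', 'w', 'w', 'w']

summary = labels_per_class(labels_a, labels_b)
matches = exact_matches(labels_a, labels_b)
```

Here summary plays the role of the per-class listing printed in the session, and matches picks out the classes on which both labelings agree completely, which is the situation described for cluster 3. Checking the mapping in both directions matters: a class that maps to a single label is only a full match if that label does not also show up in another class.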

Lucas Scharenbroich 2003-08-27