
Running MCCV

Within the schema, MCCV is typically used to find the optimal number of clusters in a dataset by generating fitness curves. Let's work through an example of that use.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.MCCV import MCCV
>>> from compClust.mlx.wrapper.DiagEM import DiagEM
>>> parameters = {}
>>> parameters['mccv_parameter_name'] = 'k'
>>> parameters['mccv_parameter_values'] = range(10,30)
>>> parameters['mccv_test_fraction'] = 0.5
>>> parameters['mccv_num_trials'] = 6
>>> parameters['num_iterations'] = 250
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_method'] = 'church_means'
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> diagem = DiagEM()
>>> mccv = MCCV(ds, parameters, diagem)
>>> mccv.validate()
1
>>> #
... # .run() takes a while....
...
>>> mccv.run()
1
>>> fitness_table = mccv.getFitnessTable()
>>> fitness_scores = map(lambda x : x[2], fitness_table)
>>> fitness_scores.sort()
>>> best = fitness_scores[-1]
>>> best
-889.39357467205207
>>> #
... # our best score was -889, what index was this?
...
>>> map(lambda x : x[2], fitness_table).index(best)
15
>>> fitness_table[15]
[25, 25, -889.39357467205207]
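
The sort-then-index lookup above can also be done in a single pass. The following is a minimal sketch, not part of compClust: the helper name best_row is hypothetical, and it assumes each fitness-table row is laid out as [requested k, clusters obtained, fitness score], as the output above suggests. It is written for a current Python interpreter rather than the Python 2 used in the sessions.

```python
def best_row(fitness_table):
    """Return (index, row) for the row with the highest fitness score.

    Hypothetical helper. Assumes each row is
    [requested k, clusters obtained, fitness score].
    """
    # enumerate() keeps the row index next to the row, so a single max()
    # call yields both the best row and its position in the table.
    return max(enumerate(fitness_table), key=lambda pair: pair[1][2])


# Three rows excerpted from the k = 10..29 run in this section.
excerpt = [
    [24, 24, -925.541551492542],
    [25, 25, -889.39357467205207],
    [26, 26, -894.10952528835833],
]

index, row = best_row(excerpt)
print(index)
print(row)
```

Because the index and the row come back together, there is no need to recompute the list of scores a second time just to locate the winner.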

So MCCV decided that 25 clusters is the optimal choice of k for this dataset, even though we know that k = 15 or k = 45 would be a better choice. Let's see what MCCV got for k = 15.

>>> for i in range(0,20):
...    print fitness_table[i]
... 
[10, 10, -1033.5521324482584]
[11, 11, -1065.8959512850583]
[12, 12, -1007.1365541756992]
[13, 13, -994.33707715086541]
[14, 14, -997.98790638981552]
[15, 15, -1034.5531481223265]
[16, 16, -1021.524089374353]
[17, 17, -959.20473715505]
[18, 18, -959.0622038406899]
[19, 19, -941.34529509893059]
[20, 20, -938.26333395371717]
[21, 21, -912.91811083260927]
[22, 22, -913.51046227335962]
[23, 23, -918.66792167430026]
[24, 24, -925.541551492542]
[25, 25, -889.39357467205207]
[26, 26, -894.10952528835833]
[27, 27, -900.9496687048254]
[28, 28, -1008.228402419493]
[29, 29, -1007.0726811982155]
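
To put a number on how a particular k fared, the table can be sorted by score and the row's position read off. This is a small sketch using the rows printed above as literal data. The helper rank_of is a hypothetical name, and the column layout [requested k, clusters obtained, fitness score] is an assumption based on the output.

```python
# Rows copied from the k = 10..29 fitness table printed above.
# Assumed layout: [requested k, clusters obtained, fitness score].
fitness_table = [
    [10, 10, -1033.5521324482584],
    [11, 11, -1065.8959512850583],
    [12, 12, -1007.1365541756992],
    [13, 13, -994.33707715086541],
    [14, 14, -997.98790638981552],
    [15, 15, -1034.5531481223265],
    [16, 16, -1021.524089374353],
    [17, 17, -959.20473715505],
    [18, 18, -959.0622038406899],
    [19, 19, -941.34529509893059],
    [20, 20, -938.26333395371717],
    [21, 21, -912.91811083260927],
    [22, 22, -913.51046227335962],
    [23, 23, -918.66792167430026],
    [24, 24, -925.541551492542],
    [25, 25, -889.39357467205207],
    [26, 26, -894.10952528835833],
    [27, 27, -900.9496687048254],
    [28, 28, -1008.228402419493],
    [29, 29, -1007.0726811982155],
]


def rank_of(table, requested_k):
    """Return the 1-based rank (1 = best score) of the row for requested_k."""
    # Sort from highest (best) score to lowest, then find the requested k.
    ordered = sorted(table, key=lambda row: row[2], reverse=True)
    return [row[0] for row in ordered].index(requested_k) + 1


rank_15 = rank_of(fitness_table, 15)
rank_25 = rank_of(fitness_table, 25)
print(rank_15)
print(rank_25)
```

On this table, k = 15 ranks 19th of the 20 values tried, while k = 25 ranks first.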

Strangely, k = 15 has one of the worst scores in the entire fitness table! Perhaps this is an artifact of the sub-clustering. Let's see what MCCV returns when looping from k = 40 to 49.

>>> mccv.getParameters()['mccv_parameter_values'] = range(40,50)
>>> mccv.run()
1
>>> fitness_table = mccv.getFitnessTable()
>>> best = max(map(lambda x : x[2], fitness_table))
>>> map(lambda x : x[2], fitness_table).index(best)
0
>>> fitness_table[0]
[40, 39, -907.89855160200625]
>>> for i in range(0,10):
...    print fitness_table[i]
... 
[40, 39, -907.89855160200625]
[41, 40, -918.08015090848448]
[42, 41, -929.90250751382689]
[43, 41, -925.97350376475447]
[44, 42, -949.32511844651754]
[45, 43, -960.61586289174261]
[46, 44, -949.83066376493514]
[47, 44, -973.45056832036857]
[48, 46, -1004.8372568468484]
[49, 47, -1006.718583971027]

Again, MCCV chose something we didn't expect. The best score came from the run where we asked for 40 clusters but got back only 39, so MCCV effectively chose k = 39 for this problem. The run that requested k = 45 scored about -960 and returned only 43 clusters; in fact, none of the trials ended up with exactly 45 clusters.
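
The gap between requested and obtained cluster counts can be checked directly from the table. The sketch below uses the rows from the k = 40..49 run as literal data and again assumes the layout [requested k, clusters obtained, fitness score]. The variable names are illustrative only.

```python
# Rows copied from the k = 40..49 run above.
# Assumed layout: [requested k, clusters obtained, fitness score].
fitness_table = [
    [40, 39, -907.89855160200625],
    [41, 40, -918.08015090848448],
    [42, 41, -929.90250751382689],
    [43, 41, -925.97350376475447],
    [44, 42, -949.32511844651754],
    [45, 43, -960.61586289174261],
    [46, 44, -949.83066376493514],
    [47, 44, -973.45056832036857],
    [48, 46, -1004.8372568468484],
    [49, 47, -1006.718583971027],
]

# Rows where the number of clusters returned differs from the number requested.
mismatched = [row for row in fitness_table if row[0] != row[1]]

# Keep the best score seen for each cluster count actually obtained.
best_by_obtained = {}
for requested, obtained, score in fitness_table:
    if obtained not in best_by_obtained or score > best_by_obtained[obtained]:
        best_by_obtained[obtained] = score

# The obtained cluster count whose best score is highest overall.
effective_k = max(best_by_obtained, key=best_by_obtained.get)

print(len(mismatched))
print(effective_k)
print(45 in best_by_obtained)
```

Indexing the scores by the obtained count rather than the requested one makes it clear which values of k were actually evaluated, and that 45 is not among them.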

The moral of this section is that MCCV is a great way to perform a near-exhaustive search of your dataspace, but it is not a universal cure-all. The answer depends greatly on the choice of model for a clustering, since the model determines exactly what the "peak of the fitness curve" actually means.
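
To make the dependence on the model concrete, here is a rough sketch of a generic Monte Carlo cross-validation loop. It is not compClust's implementation. It assumes that MCCV follows the usual recipe of repeated random train/test splits, with test_fraction and num_trials playing the roles of mccv_test_fraction and mccv_num_trials. The fit and score callables, the toy one-dimensional "model", and the synthetic data are all hypothetical stand-ins.

```python
import random


def mccv_scores(data, k_values, fit, score,
                test_fraction=0.5, num_trials=6, seed=0):
    """Average held-out score for each candidate k over random splits.

    fit(train, k) and score(model, test) are hypothetical callables that
    stand in for whatever clustering model is being wrapped.
    """
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    table = []
    for k in k_values:
        total = 0.0
        for _ in range(num_trials):
            # Each trial draws a fresh random train/test partition.
            shuffled = list(data)
            rng.shuffle(shuffled)
            test, train = shuffled[:n_test], shuffled[n_test:]
            total += score(fit(train, k), test)
        table.append([k, total / num_trials])
    return table


def fit_quantile_centers(train, k):
    """Toy model: k centers at evenly spaced positions in the sorted data."""
    ordered = sorted(train)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]


def neg_mean_sq_error(centers, test):
    """Toy fitness: negative mean squared distance to the nearest center."""
    return -sum(min((x - c) ** 2 for c in centers) for x in test) / len(test)


# Synthetic 1-D data: three well-separated groups of 40 points each.
data_rng = random.Random(1)
data = [data_rng.gauss(center, 0.3)
        for center in (0.0, 5.0, 10.0) for _ in range(40)]

table = mccv_scores(data, [1, 2, 3], fit_quantile_centers, neg_mean_sq_error)
for row in table:
    print(row)
```

Swapping in a different score function changes the shape of the resulting curve without touching the loop itself, which is why the location of the peak says as much about the model as about the data.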


Lucas Scharenbroich 2003-08-27