Typically, within this schema, MCCV is used to find an optimal number of clusters in a dataset by generating fitness curves. Let's work through an example of this style of use.
>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.MCCV import MCCV
>>> from compClust.mlx.wrapper.DiagEM import DiagEM
>>> parameters = {}
>>> parameters['mccv_parameter_name'] = 'k'
>>> parameters['mccv_parameter_values'] = range(10,30)
>>> parameters['mccv_test_fraction'] = 0.5
>>> parameters['mccv_num_trials'] = 6
>>> parameters['num_iterations'] = 250
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_method'] = 'church_means'
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> diagem = DiagEM()
>>> mccv = MCCV(ds, parameters, diagem)
>>> mccv.validate()
1
>>> # .run() takes a while....
>>> mccv.run()
1
>>> fitness_table = mccv.getFitnessTable()
>>> fitness_scores = map(lambda x : x[2], fitness_table)
>>> fitness_scores.sort()
>>> best = fitness_scores[-1]
>>> best
-889.39357467205207
>>> # our best score was -889, what index was this?
>>> map(lambda x : x[2], fitness_table).index(best)
15
>>> fitness_table[15]
[25, 25, -889.39357467205207]
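As an aside, you can skip the sort-and-index dance and pull out the highest-scoring row in a single step. This is just a small sketch, assuming each row of the fitness table is a [requested k, returned k, score] list as above and that your Python version's max() accepts a key argument (2.5 or later):

>>> # grab the row with the highest fitness score directly
>>> best_row = max(fitness_table, key=lambda row: row[2])
>>> best_row
[25, 25, -889.39357467205207]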
So, MCCV decided that 25 clusters is the optimal choice of k for this dataset, even though we know that k = 15 or k = 45 are the better choices. Let's see what MCCV got for k = 15.
>>> for i in range(0,20):
...     print fitness_table[i]
...
[10, 10, -1033.5521324482584]
[11, 11, -1065.8959512850583]
[12, 12, -1007.1365541756992]
[13, 13, -994.33707715086541]
[14, 14, -997.98790638981552]
[15, 15, -1034.5531481223265]
[16, 16, -1021.524089374353]
[17, 17, -959.20473715505]
[18, 18, -959.0622038406899]
[19, 19, -941.34529509893059]
[20, 20, -938.26333395371717]
[21, 21, -912.91811083260927]
[22, 22, -913.51046227335962]
[23, 23, -918.66792167430026]
[24, 24, -925.541551492542]
[25, 25, -889.39357467205207]
[26, 26, -894.10952528835833]
[27, 27, -900.9496687048254]
[28, 28, -1008.228402419493]
[29, 29, -1007.0726811982155]
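Since the point of this exercise is to examine a fitness curve, it is often easier to plot the scores than to read them out of the table. Here is a minimal sketch using pylab/matplotlib (which is not part of compClust, so this assumes you have it installed); it plots the requested k (column 0) against the fitness score (column 2):

import pylab
ks = [row[0] for row in fitness_table]       # requested number of clusters
scores = [row[2] for row in fitness_table]   # fitness score for that k
pylab.plot(ks, scores, 'o-')
pylab.xlabel('requested k')
pylab.ylabel('MCCV fitness score')
pylab.show()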
Strangely, k = 15 gets one of the worst scores in the entire fitness table! Perhaps this is an artifact of the sub-clustering. Let's see what MCCV returns when we loop over k = 40 through 49.
>>> mccv.getParameters()['mccv_parameter_values'] = range(40,50)
>>> mccv.run()
1
>>> fitness_table = mccv.getFitnessTable()
>>> best = max(map(lambda x : x[2], fitness_table))
>>> map(lambda x : x[2], fitness_table).index(best)
0
>>> fitness_table[0]
[40, 39, -907.89855160200625]
>>> for i in range(0,10):
...     print fitness_table[i]
...
[40, 39, -907.89855160200625]
[41, 40, -918.08015090848448]
[42, 41, -929.90250751382689]
[43, 41, -925.97350376475447]
[44, 42, -949.32511844651754]
[45, 43, -960.61586289174261]
[46, 44, -949.83066376493514]
[47, 44, -973.45056832036857]
[48, 46, -1004.8372568468484]
[49, 47, -1006.718583971027]
Again, MCCV chose something we didn't expect: it picked a trial where we asked for 40 clusters but only got back 39, so it effectively chose k = 39 for this problem. The score on the row where we asked for k = 45 was about -960, but that trial only converged to 43 clusters; in fact, none of the trials ended up with k = 45.
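One way to make that explicit is to index the scores by the number of clusters each model actually returned (the second column) rather than the number requested (the first column). A quick sketch, using the fitness table from the k = 40..49 run above (where two requested values collapse to the same returned k, the later row simply overwrites the earlier one):

>>> # map each returned k to a fitness score
>>> by_returned_k = {}
>>> for asked, got, score in fitness_table:
...     by_returned_k[got] = score
...
>>> ks = by_returned_k.keys()
>>> ks.sort()
>>> ks
[39, 40, 41, 42, 43, 44, 46, 47]
>>> 45 in by_returned_k
False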
The moral of this section is that MCCV is a great way to perform a near-exhaustive search of your dataspace, but it is not a universal cure-all. The answer depends greatly on the choice of Model for a clustering, as that determines exactly what the ``peak of the fitness curve'' actually means.