The Distance from Means model keeps a list of means, one mean per class, and evaluates fitness by computing each point's distance to its nearest mean. The exact formula is
\[
f(X) = \left( \frac{1}{|X|} \sum_{x \in X} d(x, \mu_x) \right)^{-1} \tag{4}
\]
where \(\mu_x\) is the mean nearest to \(x\) and \(d\) is the chosen distance metric. One known shortcoming of this fitness function is that as the number of clusters for a given dataset goes up, the fitness increases as well. This is because as more means are packed into a finite volume, the average distance from each datapoint to its nearest mean decreases. This reduces the summation and increases the score.
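To make the effect concrete, here is a minimal sketch, not the library's code: the function name nearest_mean_fitness, the NumPy usage, and the uniform random data are illustrative assumptions. It computes the fitness in the form assumed for Equation (4) and shows that packing more means into a fixed volume tends to raise the score even when the means are placed at random:

import numpy as np

def nearest_mean_fitness(points, means):
    # Euclidean distance from each point to every mean: shape (n_points, n_means).
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    # Reciprocal of the average nearest-mean distance (the form assumed for Eq. 4).
    return 1.0 / dists.min(axis=1).mean()

rng = np.random.default_rng(0)
points = rng.uniform(size=(750, 3))      # data confined to the unit cube

for k in (15, 30, 50, 100):
    means = rng.uniform(size=(k, 3))     # even random means get closer as k grows
    print(k, round(nearest_mean_fitness(points, means), 3))

The printed score tends to rise with k simply because more means leave less room between any point and its nearest mean, regardless of whether the extra means capture real structure.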
Let's try an example that illustrates this problem. The KMeans algorithm uses a Distance from Means model by default, so we'll use it.
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> evens = ds.subsetRows(range(0,750,2))
>>> odds = ds.subsetRows(range(1,750,2))
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_means'] = 'church'
>>> parameters['max_iterations'] = 1000
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
1
>>> model = kmeans.getModel()
>>> model.evaluateFitness(odds.getData())
5.6023786988275539
>>> # ...
>>> # Try for many different k's ...
>>> parameters['k'] = 30
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
12.404585259842882
>>> parameters['k'] = 50
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
12.404585259842882
>>> parameters['k'] = 100
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
20.875393776420339
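To sweep several values of k without retyping, a short loop over the same calls used in the session works; this is a sketch that assumes ds, odds, and parameters are still defined as above:

for k in (15, 30, 50, 100):
    parameters['k'] = k                  # reuse the same parameter dict
    kmeans = KMeans(ds, parameters)      # fresh run for each k
    kmeans.validate()
    kmeans.run()
    # Score on the held-out odd rows, as in the session above.
    print(k, kmeans.getModel().evaluateFitness(odds.getData()))

As in the transcript, the reported fitness keeps climbing as k grows, which is exactly the shortcoming described above.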