
Distance from Means

The Distance from Means model keeps a list of means, one mean per class, and evaluates fitness by computing the distance from each point to its nearest mean. The exact formula is


\begin{displaymath}
F({\bf x}) = \frac{N}{\sum_{i=1}^N (\mu_a - x_i)^2}
\end{displaymath} (4)

where $\mu_a$ is the mean nearest to $x_i$ and $N$ is the number of points. One known shortcoming of this fitness function is that as the number of clusters for a given dataset goes up, the fitness increases as well. This is because as more means are placed in a finite volume, the average distance from a point to its nearest mean decreases. This reduces the summation and increases the score.
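The effect can be sketched directly. The snippet below is a hypothetical standalone implementation of the formula above (not the library's code), using 1-D points for simplicity; it shows the fitness rising as more means are used, even on structureless uniform data.

```python
import random

def distance_from_means_fitness(points, means):
    """F(x) = N / sum_i (mu_a - x_i)^2, where mu_a is the mean
    nearest to point x_i (1-D case for illustration)."""
    total = sum(min((m - x) ** 2 for m in means) for x in points)
    return len(points) / total

random.seed(0)
points = [random.uniform(0, 10) for _ in range(200)]

# More means -> each point lies closer to its nearest mean -> the
# summation shrinks -> the fitness rises, with no real structure gained.
few_means = [2.5, 7.5]
many_means = [i * 0.5 for i in range(20)]

print(distance_from_means_fitness(points, few_means))
print(distance_from_means_fitness(points, many_means))
```

The second printed value is larger, mirroring the inflation seen with increasing k in the transcript below.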

Let's try an example that illustrates this problem. The KMeans algorithm uses a Distance from Means model by default, so we'll use it.

>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> evens = ds.subsetRows(range(0,750,2))
>>> odds = ds.subsetRows(range(1,750,2))
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_means'] = 'church'
>>> parameters['max_iterations'] = 1000
>>> kmeans = KMeans(ds, parameters)
>>> kmeans.validate()
1
>>> kmeans.run()
1
>>> model = kmeans.getModel()
>>> model.evaluateFitness(odds.getData())
5.6023786988275539
>>> #
... # Try for many different k's
...
>>> parameters['k'] = 30
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
12.404585259842882
>>> parameters['k'] = 50
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
12.404585259842882
>>> parameters['k'] = 100
>>> kmeans.run()
1
>>> kmeans.getModel().evaluateFitness(odds.getData())
20.875393776420339



Lucas Scharenbroich 2003-08-27