
Mixture of Gaussians

The Mixture of Gaussians model represents a dataset by a set of weighted Gaussian components, each described by a mean vector and a covariance matrix. Each class is centered at its mean and has a Gaussian which extends as described by its covariance matrix. Each class also has a weight associated with it, which is simply the number of points assigned to that class divided by the total number of points in the dataset.

The formula for computing the fitness of a dataset given a model is defined by


\begin{displaymath}
L = \prod_{x \in X} \sum_k \frac{w_k}{\sqrt{(2\pi)^d \vert\Sigma_{k}\vert}} e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}
\end{displaymath} (2)

where $w_k$ is the weight of cluster $k$, $\mu_k$ is the mean of cluster $k$, $\Sigma_k$ is the covariance matrix of cluster $k$, $d$ is the dimensionality of the data, and $X$ is the set of test datapoints.
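A direct translation of equation 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the toolkit's API: the function name and argument layout are invented for the example.

```python
import numpy as np

def mixture_likelihood(X, weights, means, covs):
    """Naive likelihood of a Mixture of Gaussians (equation 2).

    X       : (n, d) array of test points
    weights : k mixture weights w_k
    means   : k mean vectors mu_k
    covs    : k covariance matrices Sigma_k

    Illustrative names only; not the toolkit's API.
    """
    n, d = X.shape
    total = 1.0
    for x in X:
        p = 0.0
        for w, mu, cov in zip(weights, means, covs):
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
            diff = x - mu
            expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
            p += w * norm * np.exp(expo)
        total *= p  # cumulative product underflows quickly for large n
    return total
```

As written, this product of per-point densities will underflow to zero for any sizable dataset, which motivates the log-likelihood form below.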

Typically, we compute the log-likelihood instead, since the exponentials in equation 2 will often underflow machine precision and the cumulative product will drive the answer to zero rather quickly. The formula for the log-likelihood then becomes


\begin{displaymath}
\log L = \sum_{x \in X} \log \sum_k w_k \, e^{-\frac{1}{2}\left(d\log 2\pi + \log \vert\Sigma_{k}\vert + (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)}
\end{displaymath} (3)
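Equation 3 is usually evaluated with the log-sum-exp trick so that the per-component terms never leave log space until the last moment. A minimal sketch, again with illustrative names rather than the toolkit's API:

```python
import numpy as np

def mixture_log_likelihood(X, weights, means, covs):
    """Log-likelihood of a Mixture of Gaussians (equation 3),
    computed with the log-sum-exp trick to avoid underflow.

    Illustrative names only; not the toolkit's API.
    """
    n, d = X.shape
    ll = 0.0
    for x in X:
        comp = np.empty(len(weights))
        for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
            diff = x - mu
            # log of one weighted component density
            comp[k] = (np.log(w)
                       - 0.5 * (d * np.log(2 * np.pi)
                                + np.log(np.linalg.det(cov))
                                + diff @ np.linalg.inv(cov) @ diff))
        m = comp.max()
        ll += m + np.log(np.exp(comp - m).sum())  # log-sum-exp
    return ll
```

Because each point contributes an additive term, the total stays well within floating-point range even for thousands of points, which is why fitness scores like the one below come out as sums on the order of $-10^3$ rather than vanishing products.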

Typical values for fitness will be on the order of $-1 \times 10^3$. Now let's try an example where we divide the dataset in half, putting even rows in one subset and odd rows in another. We will then cluster one of the subsets via the DiagEM wrapper (since it produces a Mixture of Gaussians model) and score the model based on the other subset.

>>>
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> evens = ds.subsetRows(range(0,750,2))
>>> odds = ds.subsetRows(range(1,750,2))
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['num_iterations'] = 1000
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['init_method'] = 'church_means'
>>> diagem = DiagEM(evens, parameters)
>>> diagem.validate()
1
>>> diagem.run() 
1
>>> model = diagem.getModel()
>>> model
MoG_Model(k=15, ...)
>>> model.evaluateFitness(odds.getData())
-963.03029940902013

The actual fitness number returned is meaningless on its own; fitness scores are useful only as relative comparisons. It is important to remember that since there is no intrinsic scale associated with a fitness score (it is model-dependent), you cannot say that a fitness of -500 is twice as good as a fitness of -1000. All you know is that it is a better score.


Lucas Scharenbroich 2003-08-27