XClust

Requirements:

environment variable XCLUST_COMMAND set to the path of the xclust executable.
parameter 'cluster_on' selects whether to cluster the rows or the columns of the data.
parameter 'transform_method' set to the pre-processing transformation to apply to the data. Usually set to 'none', since all transforms can be applied by View.
parameter 'distance_metric' set to the distance measure used when computing the distance between points.
parameter 'agglomerate_method' set to the method used to prune the resulting tree into 'k' clusters.
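
The requirements above can be sketched in plain Python before the wrapper is constructed. This is only an illustration: the executable path is a hypothetical placeholder, the parameter values are copied from the example session in this section, and the `required` and `missing` names are helpers introduced here, not part of compClust. The source does not say when the wrapper reads XCLUST_COMMAND, so the sketch assumes it is safest to set it first.

```python
import os

# Hypothetical install location; substitute the real path to the xclust binary.
os.environ['XCLUST_COMMAND'] = '/usr/local/bin/xclust'

# Values taken from the example session in this section.
parameters = {
    'k': 15,
    'cluster_on': 'rows',
    'transform_method': 'none',
    'distance_metric': 'euclidean',
    'agglomerate_method': 'clusterNumber',
}

# Illustrative check (not a compClust API): confirm that every parameter
# named in the requirements list is present before building the wrapper.
required = ['cluster_on', 'transform_method',
            'distance_metric', 'agglomerate_method']
missing = [name for name in required if name not in parameters]
print(missing)
```

An empty list means every listed parameter has been supplied; the wrapper's own validate() call remains the authoritative check.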

XClust is a wrapper around the XCluster algorithm created by Prof. Gavin Sherlock of Stanford University. XCluster is a bottom-up (agglomerative) clustering algorithm and, as such, runs in $O(N^2 \log N)$ time. In our experience, however, XClust produces consistently good results over a wide variety of datasets, which makes it a reasonable first choice for clustering a dataset when there is no obvious favorite.
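
To make the bottom-up idea concrete, here is a minimal sketch of agglomerative clustering in pure Python. It is not XCluster's implementation: it assumes Euclidean distance and average linkage purely for illustration, it stops merging once k clusters remain rather than building a full tree and pruning it, and its naive search for the closest pair is cubic rather than $O(N^2 \log N)$. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until k remain."""
    if k < 1:
        raise ValueError('k must be at least 1')
    # Start with every point in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Naive scan over all cluster pairs for the smallest linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
clusters = agglomerate(points, 3)
print(clusters)
```

On this toy input the two tight pairs merge first and the isolated point stays in its own cluster.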

To see just how well XClust can perform, let's cluster our favorite dataset again.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.XClust import XClust
>>> ds = Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['transform_method'] = 'none'
>>> parameters['cluster_on'] = 'rows'
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['agglomerate_method'] = 'clusterNumber'
>>> xclust = XClust(ds, parameters)
>>> xclust.validate()
1
>>> xclust.run()
1
>>> results = xclust.getLabeling()
>>> [len(results.getRowsByLabel(x)) for x in results.getLabels()]
[57, 65, 51, 82, 13, 86, 36, 12, 71, 22, 87, 29, 76, 13, 50]

These results look much better than what we've been able to produce before. The standard deviation of the number of points per cluster is 27.6, which is better than DiagEM's, although FullEM still does better.
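
The 27.6 figure can be reproduced from the cluster sizes printed above. The sketch below assumes the quoted value is the sample standard deviation (dividing by n - 1); the population form (dividing by n) would give roughly 26.7 instead. The variable names are illustrative.

```python
import math

# Cluster sizes copied from the XClust run above.
sizes = [57, 65, 51, 82, 13, 86, 36, 12, 71, 22, 87, 29, 76, 13, 50]

n = len(sizes)
mean = sum(sizes) / float(n)

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = sum((s - mean) ** 2 for s in sizes) / (n - 1)
sample_sd = math.sqrt(sample_var)

print(round(sample_sd, 1))
```

The sizes sum to 750 points across 15 clusters, so the mean cluster size is 50.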

Lucas Scharenbroich 2003-08-27