next up previous contents
Next: XClust Up: Unsupervised Wrappers Previous: KMeans   Contents

TSplit

Requirements:

The TSplit algorithm is a top-down clustering algorithm developed by Prof. Ruye Wang over the summer of 2001. Because it works top-down, it runs much faster than the bottom-up XClust algorithm. However, since it assesses the data globally, it is very sensitive to overlap in high-variance datasets. This sensitivity comes from the fact that when clusters overlap, they tend to create a region of artificially high density which can be mistaken for an independent cluster.
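To see how overlap can manufacture a false density peak, consider a small sketch (plain Python, not part of compClust) that evaluates a two-Gaussian mixture. When the centers sit closer together than their spread, the density at the midpoint between them exceeds the density at either true center, so a density-driven method sees one cluster where there are really two.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture(x, centers, sigma):
    """Equal-weight mixture of Gaussians, one per center."""
    return sum(normal_pdf(x, c, sigma) for c in centers) / len(centers)

# two heavily overlapping clusters: centers only one sigma apart
centers, sigma = [-0.5, 0.5], 1.0
print(mixture(0.0, centers, sigma) > mixture(0.5, centers, sigma))
# True: the midpoint between the clusters looks denser than either true center
```

With well-separated centers (say -3.0 and 3.0) the inequality reverses, which is why TSplit behaves reasonably on low-variance data.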

This algorithm provides a useful counterpart to others in the suite, but on its own it is not robust enough to use without prior knowledge of the dataset. To illustrate these shortcomings, we will compare how well it performs on a low- and a high-variance dataset. For this example, we will need another dataset. Grab the dataset located at

/proj/cluster_gazing2/data/synthetic/datasets/ ... 
 synth_t_15c3_p_0750_d_03_v_2d0.txt.gz

and uncompress it in the same manner as the first. Now, on to clustering.

>>> from compClust.mlx.datasets import Dataset
>>> from compClust.mlx.wrapper.TSplit import TSplit
>>> low_variance = \
... Dataset('synth_t_15c3_p_0750_d_03_v_0d3.txt')
>>> high_variance = \
... Dataset('synth_t_15c3_p_0750_d_03_v_2d0.txt')
>>> parameters = {}
>>> parameters['k'] = 15
>>> parameters['distance_metric'] = 'euclidean'
>>> parameters['splitting_method'] = 'PCA'
>>> parameters['energy'] = 0.9
>>> parameters['agglomerate_method'] = 'clusterNumber'
>>> tsplit_low = TSplit(low_variance, parameters)
>>> tsplit_high = TSplit(high_variance, parameters)
>>> tsplit_low.validate()
1
>>> tsplit_high.validate()
1
>>> tsplit_low.run()
1
>>> tsplit_high.run()
1
>>> results_low = tsplit_low.getLabeling()
>>> results_high = tsplit_high.getLabeling()
>>> map(lambda x : len(results_low.getRowsByLabel(x)),
... results_low.getLabels())
[1, 1, 1, 1, 13, 15, 1, 1, 1, 1, 694, 1, 12, 6, 1]
>>> map(lambda x : len(results_high.getRowsByLabel(x)),
... results_high.getLabels())
[2, 1, 1, 735, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
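The map/lambda idiom above simply counts how many rows carry each label. In plain Python, the same per-label tally can be written with collections.Counter; the labels list here is hypothetical, standing in for whatever a labeling actually returns:

```python
from collections import Counter

# hypothetical per-row labels, standing in for a compClust labeling
labels = ['b', 'a', 'b', 'c', 'b', 'a']

# rows per label, in sorted label order
sizes = [n for _, n in sorted(Counter(labels).items())]
print(sizes)  # [2, 3, 1]
```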

Interestingly, even with a low variance ratio of 0.3, TSplit still has difficulty partitioning the dataset. When the variance ratio is increased to 2.0, TSplit fails completely, placing nearly all of the datapoints in a single, large cluster.

The clustering results here are so poor that it is hardly worth calculating the standard deviation: by inspection it is clearly much larger than that of any of the previous clustering algorithms (for the record, the standard deviations are 178.2 and 189.5, respectively). If we re-run the process with parameters['k'] = 45, we get the following results for the number of points per class.
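Those figures can be reproduced directly from the cluster sizes printed above. A plain-Python sketch (the sample standard deviation, with an n-1 denominator, matches the quoted numbers):

```python
import math

def cluster_size_stddev(sizes):
    """Sample standard deviation (n-1 denominator) of cluster sizes."""
    n = len(sizes)
    mean = sum(sizes) / float(n)
    var = sum((s - mean) ** 2 for s in sizes) / (n - 1)
    return math.sqrt(var)

low_sizes = [1, 1, 1, 1, 13, 15, 1, 1, 1, 1, 694, 1, 12, 6, 1]
high_sizes = [2, 1, 1, 735, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(round(cluster_size_stddev(low_sizes), 1))   # 178.2
print(round(cluster_size_stddev(high_sizes), 1))  # 189.5
```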

>>> map(lambda x : len(results_low.getRowsByLabel(x)),
... results_low.getLabels())
[3, 1, 3, 1, 1, 1, 1, 1, 10, 1, 1, 1, 3, 1, 6, 1, 32, 1, 
 1, 12, 1, 1, 1, 1, 1, 1, 13, 15, 1, 1, 1, 1, 495, 1, 1, 
 26, 1, 87, 2, 1, 1, 1, 2, 11, 1]
>>> map(lambda x : len(results_high.getRowsByLabel(x)),
... results_high.getLabels())
[1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
 1, 1, 1, 1, 1, 702, 1]

Again, marginal behavior for the variance ratio of 0.3, and degenerate behavior for a variance ratio of 2.0. The interested reader should try running TSplit on a dataset with many more points and a variance ratio of 0.1. It is instructive to see how the performance of TSplit falls off as a function of the number of datapoints and the variance ratio.
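For readers who want to experiment with the mechanism itself, the heart of a divisive PCA split (one step of the kind selected by splitting_method = 'PCA' above) can be sketched for 2-D data in plain Python. This is an illustrative sketch, not TSplit's actual implementation:

```python
import math

def pca_split(points):
    """One top-down step: project 2-D points onto the principal axis
    of their covariance and split at the mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # entries of the 2x2 covariance matrix
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # principal-axis angle of a 2x2 symmetric matrix: 0.5*atan2(2*sxy, sxx-syy)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    vx, vy = math.cos(theta), math.sin(theta)
    proj = lambda p: (p[0] - mx) * vx + (p[1] - my) * vy
    left = [p for p in points if proj(p) < 0]
    right = [p for p in points if proj(p) >= 0]
    return left, right

# two tight, well-separated clusters: a single split recovers them exactly
cluster_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
cluster_b = [(10, 10), (11, 10), (10, 11), (11, 11)]
left, right = pca_split(cluster_a + cluster_b)
print(sorted(left))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sorted(right))  # [(10, 10), (10, 11), (11, 10), (11, 11)]
```

Recursing on each half until some stopping criterion (such as an energy threshold or a target cluster count) is what makes such a splitter top-down, and it also shows why overlap hurts: once two true clusters blur together, no single hyperplane through the principal axis can separate them cleanly.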


Lucas Scharenbroich 2003-08-27