Confusion Hypercubes

The confusion matrix code is not limited to comparing just two labelings at a time. Any number of labelings may be packed together in a ConfusionHypercube. This related functionality completes the API for the ConfusionMatrix class. For reference, here is the list of all the methods implemented in the ConfusionMatrix class.

getNumElements()
getDimensionalLabeling()
projectConfusionHypercube(labels)
getAgreementList()
getStarburst(index)
getInverseStarburst(index)
getConfusionHypercubeCell(cellCoordinates)
removeCellFromHypercube(cellCoordinates)
removeIndexFromHypercube(index)
findCellCoordinates(index)
createConfusionHypercubeFromLabeling(labelings, dimensionLabels)
createConfusionMatrixFromLabeling(labeling1, labeling2)
createConfusionMatrixFromFile(clusteringFile1, clusteringFile2)
getHypercubeCounts()

Remember that the scoring methods (NMI, Linear Assignment) are undefined for hypercubes. If you need to produce scores, the projectConfusionHypercube() method lets you project the N-dimensional cube down to a two-dimensional confusion matrix.
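The idea of projecting can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the ConfusionMatrix implementation: the helper name project_counts, its keep_axes parameter, and the toy counts are all made up for the example, and the real projectConfusionHypercube() takes dimension labels rather than axis positions. The sketch assumes that projecting means summing the cell counts over every dimension that is dropped.

```python
from collections import Counter

def project_counts(cell_counts, keep_axes):
    """Hypothetical helper: sum cell counts over the axes not in keep_axes."""
    projected = Counter()
    for coords, count in cell_counts.items():
        # Keep only the requested coordinate positions, in the given order.
        reduced = tuple(coords[axis] for axis in keep_axes)
        projected[reduced] += count
    return projected

# Toy 3-way counts keyed by (truth, kmeans, fullem) labels; made-up numbers.
cell_counts = {
    (0, 1, 0): 2,
    (1, 0, 1): 2,
    (2, 0, 1): 1,
    (2, 2, 2): 1,
}

# Keep axes 1 and 2 to obtain an ordinary two-way table of counts.
matrix = project_counts(cell_counts, keep_axes=(1, 2))
print(dict(matrix))
```

In this toy data the cells (1, 0, 1) and (2, 0, 1) collapse into the single matrix cell (0, 1), because they differ only along the dropped axis.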

The most fundamental aspect of the ConfusionHypercube is the relation between indices and coordinates. When you create a hypercube from multiple files or labelings, it is assumed that each row of the inputs relates to the same item. That is, the labels for row 5 in two different labelings both refer to the same logical piece of data. These row numbers are the indices.

If one uses N labelings to create a hypercube, then each cell is referenced by an N-dimensional coordinate. However, most of these cells will be empty. Therefore, the class maintains an internal mapping from indices (row numbers) to hypercube coordinates. The findCellCoordinates() method converts an index to its coordinates.
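The bookkeeping between indices and coordinates can be sketched as follows. This is a minimal illustration, not the class's actual code: the helper name build_sparse_hypercube and the toy labelings are assumptions made for the example, and the sketch assumes that an item's coordinate is simply the tuple of its labels across the labelings. Only occupied cells are stored, which is why the empty cells cost nothing.

```python
from collections import defaultdict

def build_sparse_hypercube(labelings):
    """Hypothetical helper: map row indices to coordinates and back.

    labelings is a list of equal-length label sequences; row i of every
    sequence is assumed to describe the same item.
    """
    lengths = {len(labeling) for labeling in labelings}
    if len(lengths) != 1:
        raise ValueError("all labelings must have the same number of rows")

    index_to_coords = {}
    cells = defaultdict(list)  # only occupied cells are stored
    for index, coords in enumerate(zip(*labelings)):
        index_to_coords[index] = coords
        cells[coords].append(index)
    return index_to_coords, dict(cells)

# Toy data (made up for illustration): three labelings of six items.
truth = [0, 0, 1, 1, 2, 2]
kmeans = [1, 1, 0, 0, 0, 2]
fullem = [0, 0, 1, 1, 1, 2]

index_to_coords, cells = build_sparse_hypercube([truth, kmeans, fullem])
print(index_to_coords[4])  # coordinate of row 4
print(cells[(1, 0, 1)])    # rows that share one cell
print(len(cells))          # number of occupied cells
```

Looking up index_to_coords plays the role of findCellCoordinates(), and looking up cells plays the role of getConfusionHypercubeCell(), in this simplified picture.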

Let's try an example using the FullEM, KMeans, and ground truth labelings from the previous example.

>>> cm.createConfusionHypercubeFromLabeling([truth, 
... kmeans_labeling, fullem_labeling], ["Truth",
... "KMeans","FullEM"])
>>> cm.getNumElements()
750
>>> cm.getDimensionalLabeling()
['Truth', 'KMeans', 'FullEM']
>>> tmp = cm.projectConfusionHypercube(['KMeans','FullEM'])  
>>> tmp.printCounts()
140 78  0   0   3 
12  0   0   0   140 
0   0   0   105 0 
0   85  100 0   0 
0   0   87  0   0 

>>> cm.findCellCoordinates(1)
(23, 0, 0)
>>> cm.findCellCoordinates(7)
(13, 1, 4)
>>> # Find all the points strongly related to index 7
>>> cm.getConfusionHypercubeCell((13, 1, 4))
[7, 85, 240, 292, 392, 445, 469, 472, 486, 495, 500, 502, 
 504, 523, 555, 559, 582, 639, 684, 689, 706, 729]
>>> # Find all the points weakly related to index 7
>>> cm.getStarburst(7)
[5, 36, 61, 66, 122, 123, 140, 163, 241, 243, 257, 277, 336, 
 377, 535, 572, 609, 624, 636, 678, 120, 138, 193, 222, 343, 
 371, 406, 410, 485, 594, 625, 633, 642, 710, 37, 73, 83, 101, 
 144, 154, 303, 419, 482, 516, 579, 601, 608, 695, 702, 716, 
 721, 35, 55, 169, 187, 234, 276, 283, 307, 354, 363, 450, 
 454, 503, 529, 567, 580, 635, 650, 709, 724, 27, 49, 116, 
 173, 198, 210, 304, 311, 396, 398, 428, 703, 4, 161, 228, 
 350, 366, 444, 475, 593, 658, 671, 714, 23, 59, 63, 64, 79, 
 93, 195, 221, 301, 367, 378, 386, 389, 403, 412, 426, 484, 
 538, 565, 626, 637, 675, 686, 694]
>>> # Show the cells these points occupy
>>> unique(map(cm.findCellCoordinates, cm.getStarburst(7)))
[(30, 1, 4), (32, 1, 4), (31, 1, 4), (34, 1, 4), (33, 1, 4), 
 (14, 1, 4), (12, 1, 4)]

Notice that the confusion counts, when projected onto the KMeans, FullEM plane, are identical to the confusion matrix of just KMeans and FullEM. The starburst requires a bit of explanation. It is composed of the indices of all the points whose coordinates differ from those of the reference point in exactly one position. This is the group of points "loosely related" to the reference point.
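The starburst definition can be made concrete with a small sketch. It is an assumption-laden illustration rather than the library's getStarburst(): the function name starburst and the toy index-to-coordinate mapping are invented for the example, and "differ by one index" is read here as "differ in exactly one coordinate position". Points that share the reference point's cell are left out, which matches the transcript above, where the members of cell (13, 1, 4) do not appear in the starburst of index 7.

```python
def starburst(index_to_coords, reference_index):
    """Hypothetical helper: indices whose coordinates differ from the
    reference point's coordinates in exactly one position."""
    ref = index_to_coords[reference_index]
    result = []
    for index, coords in index_to_coords.items():
        # Count the coordinate positions where the two points disagree.
        differing = sum(1 for a, b in zip(ref, coords) if a != b)
        if differing == 1:
            result.append(index)
    return result

# Toy mapping from row index to (truth, kmeans, fullem) coordinate.
index_to_coords = {
    0: (0, 1, 0),
    1: (0, 1, 0),
    2: (1, 0, 1),
    3: (1, 0, 1),
    4: (2, 0, 1),
    5: (2, 2, 2),
}

print(starburst(index_to_coords, 4))
print(starburst(index_to_coords, 2))
```

Rows 2 and 3 sit one step away from row 4 along the first axis only, so they form its starburst; row 5 differs in two positions and is excluded.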

Lucas Scharenbroich 2003-08-27