Re^2: Making sense of data: Clustering OR A coding challenge

With 5 clusters and the dataset from Re^2: Making sense of data: Clustering OR A coding challenge, I get some nasty results, like:

0: Cluster 1: 18.27 ( 48 54 32 48 30 50 35 50 70 63 )

1: Cluster 2: 19.21 ( 20 22 18 14 30 24 60 35 40 36 )

2: Cluster 3: 19.76 ( 20 20 35 56 35 15 35 )

3: Cluster 4: 35.09 ( 80 100 58 100 105 150 86 100 99 60 86 )

4: Cluster 5: 509.63 ( 22 169 566 150 16 50 56 76 100 24 20 55 24 70 60 291 200 325 700 200 35 58 90 460 30 1226 950 67 35 68 200 67 24 300 50 54 770 90 750 450 24 60 46 20 12 280 154 20 128 20 600 42 1250 18 86 291 330 325 250 190 550 104 2000 64 )

With 5000 generations, 50 cullings and 20 spawn, it yields slightly more reasonable, though still not very helpful, results:

0: Cluster 1: 20.00 ( 770 750 )

1: Cluster 2: 2.00 ( 950 )

2: Cluster 3: 198.66 ( 700 64 63 48 20 42 18 20 58 30 200 55 90 70 22 60 200 54 104 18 60 22 291 20 14 154 291 35 20 76 100 35 54 50 86 67 36 325 24 48 35 20 35 50 100 280 56 30 58 60 60 24 20 460 99 32 24 30 24 86 56 600 40 16 20 150 105 15 566 169 300 86 450 330 50 80 128 325 190 550 35 35 150 50 68 100 35 250 67 100 24 200 90 70 12 46 )

3: Cluster 4: 24.00 ( 1226 1250 )

4: Cluster 5: 2.00 ( 2000 )

Given another order of magnitude or more runtime it might reach palatable results ;-)

-- In Bob We Trust, All Others Bring Data.


Re^3: Making sense of data: Clustering OR A coding challenge
by jdporter (Paladin) on Apr 06, 2006 at 20:23 UTC

Indeed, there were a number of other parameters that could be tweaked, and by doing so, I was able to get better results than that.

However, in the end, it turns out that your problem has a special property that allows much simpler and more effective solutions: your data points are one-dimensional. (I'm assuming they are.)

It means, for example, that (1 3),(2 4) is never an optimal clustering. Neither is (1 2),(1 2).

From these, we can define the following constraints on clusters:

- Clusters are simply segments within the ordered data set.
- All repetitions of a number must be kept together.

In the following solution, I'm using variance as a measure of the "coherence" or "binding strength" (or whatever you want to call it) within each cluster. You could use other measures; I wouldn't be surprised if variance turns out not to give the best results. I've seen discussions of clustering that talk about maximizing variance between clusters in addition to minimizing it within clusters. I doubt that would help in a simple one-dimensional problem like this one.
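As a rough illustration of the idea (a sketch in Python, not the code this post refers to), the segment constraint makes an exact search cheap: a small dynamic program over the sorted distinct values finds the split into k clusters with the lowest total cost. The function name `best_clustering`, the pluggable `cost` parameter, and the choice to add up per-cluster population variances are all assumptions made for the example.

```python
from collections import Counter
from statistics import pvariance


def best_clustering(data, k, cost=pvariance):
    """Split data into k contiguous clusters (duplicates kept together)
    so that the sum of cost(cluster) is minimal."""
    counts = Counter(data)
    values = sorted(counts)  # distinct values, ascending
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")

    def segment(lo, hi):
        # Expand distinct values lo..hi-1 back into the original points.
        return [v for v in values[lo:hi] for _ in range(counts[v])]

    inf = float("inf")
    # best[j][i]: lowest cost of putting the first i distinct values
    # into j clusters; cut[j][i]: where the last cluster starts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for m in range(j - 1, i):
                if best[j - 1][m] == inf:
                    continue
                total = best[j - 1][m] + cost(segment(m, i))
                if total < best[j][i]:
                    best[j][i] = total
                    cut[j][i] = m

    # Walk the recorded cut points back from the end.
    clusters, i = [], n
    for j in range(k, 0, -1):
        m = cut[j][i]
        clusters.append(segment(m, i))
        i = m
    clusters.reverse()
    return clusters


print(best_clustering([1, 2, 3, 10, 11, 12, 50], 3))
```

Because an unweighted sum of variances costs nothing for a singleton cluster, this scoring tends to peel off outliers first. Passing a different `cost`, for example the sum of squared deviations from the cluster mean, weights large clusters more heavily and may split the bulk of the data more evenly.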

