http://qs321.pair.com?node_id=541000

belg4mit has asked for the wisdom of the Perl Monks concerning the following question:

Often it is helpful or more instructive to examine aggregated or otherwise summarized data in lieu of the raw data set. However, the best means of doing so is not always evident, and the choice can strongly influence the outcome. For instance, given the rated maximum occupancies for a bunch of rooms, what would be the best way to divide the range of values into classes? Quantiles (equal number of members in each class)? Nice round or culturally meaningful numbers (12, 25, 50, 75, 100)? There are in fact several algorithmic means of addressing this problem, known as clustering. One of the more common and robust is K-means; its one-dimensional form is closely related to what cartographers call Jenks natural breaks. Outside of select circles K-means seems to be rather unheard of, which is surprising since it is so powerful and general.
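
For contrast, the quantile approach at least is easy to write down. A minimal sketch, assuming a nearest-rank style cut; the sub name quantile_breaks and the occupancy figures are made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split the values into $k classes of (nearly) equal
# size and return the upper bound of each class.
sub quantile_breaks {
    my ($values, $k) = @_;
    my @sorted = sort { $a <=> $b } @$values;
    my $n = @sorted;
    my @breaks;
    for my $i (1 .. $k) {
        # Index of the last member of class $i (rounded to nearest rank).
        my $idx = int($i * $n / $k + 0.5) - 1;
        $idx = 0 if $idx < 0;
        push @breaks, $sorted[$idx];
    }
    return @breaks;
}

# Invented sample data, not real room ratings.
my @occupancy = (8, 10, 12, 15, 18, 20, 25, 30, 50, 75, 100, 250);
print join(', ', quantile_breaks(\@occupancy, 4)), "\n";   # 12, 20, 50, 250
```

Each class holds three rooms, but the first spans 8 to 12 while the last spans 75 to 250, which is exactly the kind of distortion that makes me want breaks that follow the data instead.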

For the math monks, a formula and description of the algorithm are available over there. Alas, I'm not able to fully grok the description and have been unable to tackle implementing it in Perl*. I've come across a couple of Fortran and VB implementations, but neither language is very Perl-like, so they are not well suited to translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in Perl with a simple interface? I.e., accept a reference to (or a list of) the values to classify and the number of desired classes**, and spit back the classified values or the class divisions.
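
To make the interface I have in mind concrete, here is a rough sketch. The name classify and its return shape are my own invention. The body is only a naive Lloyd-style iteration in 1-D (assign each value to the nearest centre, recompute the means, repeat), seeded from evenly spaced positions in the sorted data. It can settle on a local optimum, so treat it as a placeholder for the proper algorithm described at the link, not as an implementation of it.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical interface: classify(\@values, $k) returns a list of array
# refs, one per non-empty class, in ascending order.
sub classify {
    my ($values, $k) = @_;
    my @sorted = sort { $a <=> $b } @$values;
    my $n = @sorted;
    die "need 1 <= k <= number of values\n" if $k < 1 || $k > $n;

    # Seed the centres from evenly spaced positions in the sorted data.
    my @centre = map { $sorted[ int(($_ + 0.5) * $n / $k) ] } 0 .. $k - 1;

    my @classes;
    for (1 .. 100) {    # arbitrary cap on the number of passes
        # Assignment step: each value joins its nearest centre.
        my @next = map { [] } 1 .. $k;
        for my $v (@sorted) {
            my $best = 0;
            for my $c (1 .. $k - 1) {
                $best = $c if abs($v - $centre[$c]) < abs($v - $centre[$best]);
            }
            push @{ $next[$best] }, $v;
        }
        # Update step: move each centre to the mean of its members.
        my $moved = 0;
        for my $c (0 .. $k - 1) {
            next unless @{ $next[$c] };
            my $mean = sum(@{ $next[$c] }) / @{ $next[$c] };
            $moved = 1 if $mean != $centre[$c];
            $centre[$c] = $mean;
        }
        @classes = @next;
        last unless $moved;    # assignments are stable, so stop
    }
    return grep { @$_ } @classes;
}

# Invented sample data, not real room ratings.
my @occupancy = (8, 10, 12, 15, 18, 20, 25, 30, 50, 75, 100, 250);
print join(' | ', map { "@$_" } classify(\@occupancy, 3)), "\n";
```

Returning the classes themselves keeps both outputs I asked for within reach: the class divisions are just the last element of each returned array.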

happy hacking!

P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran but the README can provide some interesting insights as well.

P.P.S. I inquired about this in the cb and discussed it with theorbtwo and atcroft. When I mentioned it in passing today, Limbic~Region urged me to post it as a potentially interesting diversion for some.

* There is in fact a wrapper for a C implementation; however, it lacks documentation, seems to require lots of unusual extras, and is oriented towards clustering 2-D data.

** The number of classes can influence the interpretation of the resulting analysis. However, at least in 1-D, there are relatively few meaningful values, so it is easy enough to test them by hand for bias. Typical values are 3-6, with many implementations defaulting to 5. There are two reasons for this range:

  1. for 2 classes it'd be easier to just split at the mean
  2. larger numbers of classes are difficult to handle visually. If you insist on 8+ classes you are probably better off with an even gradient of divisions.

--
In Bob We Trust, All Others Bring Data.