Here's an artificial test data generator, for those of you who are interested in doing your own benchmarking before publishing your methods.

use strict;
use warnings;
use Data::Dumper;   # needed for the optional dump at the end

my %Items;

sub build_test_data
{
    # Fixed seed, so every run is reproducible.
    srand(12345);

    # Sorted by prevalence.  Keyword 'kaa' is way more common than 'kzz'.
    my @Keywords = 'kaa' .. 'kzz';

    # Each node is associated with an asciibetical list of unique keywords.
    # We groom out the top keywords which are basically noise.
    for my $xx ('iaa' .. 'izz')
    {
        my $count = int(rand(8)) + 4;
        $Items{$xx}{$Keywords[ int(rand()*rand()*@Keywords) ]}++
            while $count--;
        delete $Items{$xx}{$_} for 'kaa'..'kab';
        $Items{$xx} = [ sort keys %{$Items{$xx}} ];
    }

    return unless @_;
    print Dumper \%Items; # lots of raw data!
}

# Pass any argument, e.g. build_test_data(1), to also dump the raw data.
build_test_data();
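The skew in the generator comes from the index expression int(rand()*rand()*@Keywords): the product of two uniform random numbers piles up near zero, so the early keywords get picked far more often than the late ones. Here's a quick way to see that for yourself. The skew_histogram name, the bucket count, and the sample size are my own picks for illustration, not part of the generator.

```perl
use strict;
use warnings;

# Histogram of int(rand()*rand()*$n), to show the bias toward low indices.
# $n and $samples are arbitrary illustration values.
sub skew_histogram
{
    my ($n, $samples) = @_;
    my @hits = (0) x $n;
    $hits[ int(rand() * rand() * $n) ]++ for 1 .. $samples;
    return @hits;
}

srand(12345);
my @hits = skew_histogram(10, 100_000);
printf "%2d %6d\n", $_, $hits[$_] for 0 .. $#hits;
```

With ten buckets, the first bucket should collect roughly a third of the samples and the last well under one percent, which is why the top keywords end up as noise and get groomed out.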

Update: Here's a useful results format:

tuples of 3:
6 kaa kdf kea
6 kab kaf kka
4 kad kfa kfg
 ...
tuples of 2:
9 kad kfa
8 kaj kda
8 kaj kda
 ...
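If you want something to compare your method against, one brute-force baseline is to enumerate every k-keyword combination inside each item and tally how many items share it. This is only a sketch, not one of the methods from the thread; combinations, count_tuples, and the tiny %items sample are made up for illustration.

```perl
use strict;
use warnings;

# All $k-element combinations of a list, as array refs, order preserved.
sub combinations
{
    my ($k, @list) = @_;
    return ([]) if $k == 0;
    my @out;
    while (@list >= $k)
    {
        my $head = shift @list;
        push @out, map { [ $head, @$_ ] } combinations($k - 1, @list);
    }
    return @out;
}

# Tally how many items contain each $k-keyword tuple.
sub count_tuples
{
    my ($k, $items) = @_;
    my %count;
    for my $keywords (values %$items)
    {
        $count{ join ' ', @$_ }++ for combinations($k, @$keywords);
    }
    return \%count;
}

# Hypothetical sample in the same shape as %Items (sorted keyword lists).
my %items = (
    iaa => [qw(kad kfa kfg)],
    iab => [qw(kad kfa kzz)],
    iac => [qw(kad kfa kfg)],
);

for my $k (3, 2)
{
    my $count = count_tuples($k, \%items);
    print "tuples of $k:\n";
    print "$count->{$_} $_\n"
        for sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
}
```

Each item from the generator holds at most 11 keywords, so full enumeration stays cheap for pairs and triples. It grows quickly with larger tuples or longer keyword lists, which is where a smarter method earns its keep.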

--
[ e d @ h a l l e y . c c ]

In reply to Re: algorithm for 'best subsets' by halley
in thread algorithm for 'best subsets' by halley
