Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re: algorithm for 'best subsets'

by halley (Prior)
on Mar 03, 2005 at 15:37 UTC ( [id://436250]=note: print w/replies, xml ) Need Help??


in reply to algorithm for 'best subsets'

Here's an artificial test data generator, for those of you who are interested in doing your own benchmarking before publishing your methods.
my %Items; sub build_test_data { # reproduceable case srand(12345); # Sorted by prevalence. Keyword 'kaa' is way more common than 'kz +z'. my @Keywords = 'kaa' ... 'kzz'; # Each node is associated with an asciibetical list of unique keyw +ords. # We groom out the top keywords which are basically noise. for my $xx ('iaa' .. 'izz') { my $count = int(rand(8)) + 4; $Items{$xx}{$Keywords[ int(rand()*rand()*@Keywords) ]}++ while $count--; delete $Items{$xx}{$_} for 'kaa'..'kab'; $Items{$xx} = [ sort keys %{$Items{$xx}} ]; } return unless @_; print Dumper \%Items; # lots of raw data! } build_test_data();
Update: Here's a useful results format:
tuples of 3: 6 kaa kdf kea 6 kab kaf kka 4 kad kfa kfg ... tuples of 2: 9 kad kfa 8 kaj kda 8 kaj kda ...

--
[ e d @ h a l l e y . c c ]

Replies are listed 'Best First'.
Re^2: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 16:36 UTC
    Of course, to satisfy the "tens of thousands of keywords", you really should be using 'kaaa'..'kzzz' and 'iaaa'..'izzz'.

    Of course, when I did that my method blew through all available memory when trying to compute 5-at-a-time. It did manage 4-at-a-time, though, and in under five minutes.

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Exactly. The test case should show that the functionality worked, and that different methods found the same subsets. It should not try to break your machine.

      In practice, I can slice the %Items and the %Keywords up a bit, and do smaller overlapping datasets. I can also farm the work out to multiple machines on these slices.

      More on what that is, tonight; I've been stuck in meetings today so haven't had the chance to really explain what all this is about.

      --
      [ e d @ h a l l e y . c c ]

        halley,
        For the record, my solution has a very small memory footprint even for really large test cases but it can be really slow to finish. A half hour with this test case for a target size of 3.

        Cheers - L~R

        A clear indication of what output/format/datastructure you are expecting would be cool.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://436250]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (6)
As of 2024-04-25 10:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found