PerlMonks  

Better algorithm or data structure?

by BrowserUk (Patriarch)
on Aug 29, 2010 at 21:57 UTC ( [id://857943] )

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone see how to improve the performance--I'd like an order of magnitude improvement if possible--of the following code?

Essentially, the state of the booleans in @bools will switch from false to true in some undefinable order. Each time one becomes true, it will affect the composite state of one or more sets in @sets. And if all the booleans referenced by any particular set are true, then I need to remove that set from @sets.

This code does a simple brute force search of all the sets and checks all the booleans referenced. I keep looking at this thinking that there should be some combination of hashes or heaps and/or references that would allow me to remove completed sets more efficiently, but I'm not seeing it.

Update: Corrected typos.

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use List::Util qw[ shuffle ];
use Time::HiRes qw[ time ];

our $N //= 1e3;
our $S //= 1e4;
our $Z //= 5;

my @bools = ( 0 ) x $N;
my @sets = map [ map int( rand $N ), 1 .. 1 + int( rand $Z ) ], 1 .. $S;
my @someOrder = shuffle 0 .. $N - 1;

my $start = time;
for my $next ( @someOrder ) {
    $bools[ $next ] = 1;
    for my $set ( reverse 0 .. $#sets ) {
        if( grep( $bools[ $_ ], @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
            splice @sets, $set, 1;
        }
    }
}
printf "Took %.6f\n", time() - $start;

__END__
C:\test>junk27
Took 11.623000

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Better algorithm or data structure?
by johngg (Canon) on Aug 29, 2010 at 23:37 UTC

    I'm not sure whether this would work, as I haven't thought about how to get the remaining "sets" back after culling, but here goes. How about using the garbage collection built into Perl? Turn your @bools into %bools, keyed by the range 1 .. $N, with the values being anonymous arrays containing references to any "sets" that contain the number in the key. By delete'ing any key in %bools that is in @someOrder, you are left with only references to "sets" that have at least one number remaining. The "sets" are array references that are now out of scope, so once all references to one are gone it is garbage-collected without the need to do a splice. You should then be able to recover the remaining "sets" by combing through %bools for unique array refs.

    This idea may be cock-eyed because I don't have time to fully test it now, but your code took about 25 seconds on my elderly laptop, whereas this code

    use strict;
    use warnings;

    use List::Util qw{ shuffle };
    use Time::HiRes qw{ time };

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my %bools = map { $_ => [] } 1 .. $N;
    for my $set ( 1 .. $S ) {
        my $conts = [];
        push @{ $conts }, int rand $N for 0 .. int rand $Z;
        push @{ $bools{ $_ } }, $conts for @{ $conts };
    }

    my @someOrder = shuffle 1 .. $N;

    my $start = time();
    delete @bools{ @someOrder };
    printf "Took %.6f\n", time() - $start;

    takes about 15ms, although more time would be taken recovering the sets.

    I hope I'm not just having a rush of blood and that this is actually helpful.

    Cheers,

    JohnGG

      I like your lateral thinking.

      Unfortunately, I need to find all (any) sets completed by the state change of each boolean, at the time I receive that boolean. @someOrder was intended to represent the program receiving the notification of the booleans changing state over time, in some unspecifiable order.

      That said, I think combining @sets into a hash keyed by the booleans is a very smart move. I think. I'm having trouble wrapping my head around how the sets, which will be referenced from multiple hash keys, will get removed, but maybe tomorrow...



        How about making each "set" an object rather than just an anonymous array, and storing references to the objects in the %bools hash? You could then iterate over @someOrder in a loop rather than hash-slicing, and use the object destructor to take some action when the final reference to a "set" is deleted from %bools.

        Just a thought that occurred to me while trying to get to sleep :-/

        Cheers,

        JohnGG
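A minimal sketch of that destructor idea, with hypothetical class and field names: the only references to each blessed "set" live in %bools, so deleting the last one fires DESTROY automatically.

```perl
use strict;
use warnings;

our %completed;    # records which sets have completed (for illustration only)

package Set;

sub new {
    my ( $class, $id, @members ) = @_;
    return bless { id => $id, members => \@members }, $class;
}

sub DESTROY {
    # Last reference gone: every boolean this set needed has been seen.
    my $self = shift;
    $main::completed{ $self->{id} }++;
    print "set $self->{id} complete\n";
}

package main;

my %bools = map { $_ => {} } 0 .. 4;

# One "set" needing booleans 1 and 3; %bools holds the only references.
{
    my $set = Set->new( 'A', 1, 3 );
    $bools{$_}{'A'} = $set for @{ $set->{members} };
}

delete $bools{1}{'A'};    # one reference remains: nothing happens
delete $bools{3}{'A'};    # last reference gone: DESTROY fires
```

The attraction is that there is no search step at all: reference counting does the bookkeeping.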

Re: Better algorithm or data structure?
by GrandFather (Saint) on Aug 30, 2010 at 05:44 UTC

    If it suits your application then turning @sets into %sets gives a useful performance gain:

    use strict;
    use warnings;
    use Time::HiRes qw[ time ];
    use List::Util qw[ shuffle ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my %flags = map {$_ => []} 1 .. $N;
    my %sets =
        map {$_ => [map 1 + int(rand $N), 1 .. 1 + int(rand $Z)]} 1 .. $S;

    my $start = time;
    for my $set (keys %sets) {
        next if ! @{$sets{$set}};
        push @{$flags{$_}}, $set for @{$sets{$set}};
    }
    printf "Setup: %.6fs\n", time() - $start;

    my @someOrder = shuffle 1 .. $N;

    $start = time();
    for my $flag (@someOrder) {
        for my $set (@{$flags{$flag}}) {
            $sets{$set} = [grep {$_ != $flag} @{$sets{$set}}];
            next if @{$sets{$set}};
            delete $sets{$set};
        }
    }
    printf "Took:  %.6fs\n", time() - $start;

    Prints:

    Setup: 0.037229s
    Took:  0.077418s
    True laziness is hard work
Re: Better algorithm or data structure?
by bart (Canon) on Aug 29, 2010 at 22:39 UTC
    You could turn this:
            if( grep( $_, @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
    into a use of all from List::MoreUtils:
    if( all { $_ } @{ $sets[ $set ] } ) {
    (Note that the overhead of a function in this module is larger than the overhead of a plain grep.)

    But I think you may get a better result by inverting the booleans: set the boolean for each item before you start, and clear it when the item's state changes. Testing whether an array contains no true value at all might be faster than testing whether enough of them are true. Anyway, I don't expect you can even double the speed this way...

    Also, you might consider using vec and storing the array of booleans in a single string (of bits). That way, testing every item in a set is a single test on a string, instead of a test on each item of a whole array.
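A sketch of that vec/bit-string idea, with made-up variable names: the global boolean state and each set's membership are both bit strings, so "are all of this set's booleans true?" becomes one string AND plus one comparison instead of a Perl-level loop.

```perl
use strict;
use warnings;

my $N = 16;
my $state = "\0" x ( $N / 8 );          # all booleans start false

# A set requiring booleans 1, 4 and 9, stored as a bit mask:
my $mask = "\0" x ( $N / 8 );
vec( $mask, $_, 1 ) = 1 for 1, 4, 9;

for my $next ( 4, 9, 1 ) {              # booleans becoming true over time
    vec( $state, $next, 1 ) = 1;
    if ( ( $state & $mask ) eq $mask ) {  # single test for the whole set
        print "set complete after boolean $next\n";
    }
}
```

The bitwise string AND works byte-at-a-time in C, so the per-set cost is independent of how many booleans the set references.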

Re: Better algorithm or data structure?
by salva (Canon) on Aug 30, 2010 at 07:43 UTC
    Reverse your data structures:
    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use List::Util qw[ shuffle ];
    use Time::HiRes qw[ time ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools = (0) x $N;
    my @sets = map [ map int( rand $N ), 1 .. 1 + int( rand $Z ) ], 1 .. $S;
    my @sets1 = map [@$_], @sets;
    my @someOrder = (shuffle 0 .. $N - 1);

    my $start = time;
    for my $next ( @someOrder ) {
        $bools[ $next ] = 1;
        for my $set ( reverse 0 .. $#sets ) {
            if( grep( $bools[ $_ ], @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
                splice @sets, $set, 1;
            }
        }
    }
    printf "Took %.6f\n", time() - $start;

    $start = time;
    my @sizes = map scalar(@$_), @sets1;
    my @revsets = map [], @bools;
    my %sets;
    for my $set_ix (0..$#sets1) {
        if (@{$sets1[$set_ix]}) {
            $sets{$set_ix} = $sets1[$set_ix];
            for my $bool (@{$sets1[$set_ix]}) {
                push @{$revsets[$bool]}, $set_ix;
            }
        }
    }
    for my $next ( @someOrder ) {
        for my $set_ix (@{$revsets[$next]}) {
            unless (--$sizes[$set_ix]) {
                delete $sets{$set_ix};
            }
        }
    }
    printf "Took %.6f\n", time() - $start;
    and it runs as...
    Took 10.198676
    Took 0.112004
Re: Better algorithm or data structure?
by repellent (Priest) on Aug 30, 2010 at 05:39 UTC
    Here's my stab at it, which goes along with what jethro and graff have already mentioned. %sets is kept as a hash for convenient deletes. By keeping track of which sets to update/delete once a resource ($next) becomes available, we only need to traverse a minimal amount of sets:
    use Data::Dump qw[ pp ];
    use List::Util qw[ shuffle ];
    use Time::HiRes qw[ time ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools;
    my %sets = map {
        my $s = $_;
        my $n = 1 + int( rand $Z );
        ++$bools[ $_ ]{ $s } for map int( rand $N ), 1 .. $n;
        $s => $n;
    } 0 .. $S - 1;

    my @someOrder = shuffle 0 .. $N - 1;

    my $start = time;
    for my $next ( @someOrder ) {
        for my $s (keys %{ $bools[ $next ] }) {
            $sets{ $s } -= delete $bools[ $next ]{ $s };
            delete $sets{ $s } unless $sets{ $s };
        }
    }
    printf "Took %.6f\n", time() - $start;

    pp \@bools;
    pp \%sets;

    __END__
    Took 0.089563
    [ {}, {}, ... ]
    {}

    Update: FWIW, the OP code took around 20 seconds on my virtual machine running FreeBSD.
Re: Better algorithm or data structure?
by graff (Chancellor) on Aug 30, 2010 at 00:10 UTC
    I'm more than a little baffled... First, I'm not sure how to interpret this description:
    Essentially, the state of the booleans in @bools will switch from false to true in some undefinable order. Each time one becomes true, it will affect the composite state of one or more sets in @sets. And if all the booleans referenced by any particular set are true, then I need to remove that set from @sets.

    So, is it important that a set be removed as soon as the related booleans all turn true? Or could it instead be a two-step process: stuff happens to the booleans, and when that's all done, then you check the sets to see which ones should be removed?

    I'm puzzled because the OP code seems to do nothing at all in terms of checking the status of the booleans that relate to a given set, as implied by the description. Instead, the code appears to be deleting elements from the @sets AoA when the nested (inner) array happens to contain no zero values. Now that I've seen the updated version of the OP, it all makes more sense -- thanks!

    Another question about the description: is it essential that elements of @sets be deleted, or would it suffice to keep the array a constant size, and "undef" elements when appropriate?

    If you can avoid using splice (and if I understand the goal, which is still in doubt), you could build an AoH map of where the various @bools indexes occur in @sets, and only scan the relevant sets as you alter the values of the bools. Something like this:

    my @sets = ...  # after values are loaded into the @sets AoA, build a map:

    my @boolmap;
    for my $set ( 0 .. $#sets ) {
        $boolmap[ $_ ]{ $set } = undef for ( @{ $sets[ $set ] } );
    }

    # now, as changes are made to bools, check just the relevant sets:
    for my $next ( @someOrder ) {
        $bools[ $next ]++;
        for my $set ( keys %{ $boolmap[ $next ] } ) {
            if( defined( $sets[ $set ] )
                and grep( { $bools[ $_ ] } @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
                $sets[ $set ] = undef;
            }
        }
    }
    (But I feel like I'm not really understanding your goal, because I'm pretty sure that you of all people would have figured this out.)
      is it essential that elements of @sets be deleted, or would it suffice to keep the array a constant size, and "undef" elements when appropriate?

      Yes. In use, new sets would be added to @sets on the fly. The mechanism is essentially a queue. New sets ("work items") are being continuously added asynchronously with the resources (represented by the booleans) becoming available. If I don't remove completed work items (sets for which all required resources have become available), then the queue would grow endlessly.

      But your notion about using hashes is a good one. You've used a hash to index those sets that use a boolean against each boolean. You're undefing rather than splicing to avoid screwing up the indexing. But, if I also use a hash for holding the sets, I can delete a set from %sets, without screwing up the index. And that greatly reduces the number of sets that need to be examined for each boolean.

      Thanks. I'll have a go at implementing that tomorrow.
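A minimal sketch of that combination, with names of my own choosing: %sets holds the sets keyed by an id, %index maps each boolean to the ids of the sets that reference it, and a completed set is simply delete()d from %sets, with stale ids in %index skipped via an exists check.

```perl
use strict;
use warnings;

my %sets = (
    s1 => [ 0, 2 ],
    s2 => [ 2 ],
);

# Build the boolean-to-set-ids index once, up front.
my %index;
for my $id ( keys %sets ) {
    push @{ $index{$_} }, $id for @{ $sets{$id} };
}

my @bools = ( 0 ) x 3;
for my $next ( 2, 1, 0 ) {                 # booleans arriving over time
    $bools[ $next ] = 1;
    for my $id ( @{ $index{ $next } || [] } ) {
        next unless exists $sets{ $id };   # already completed and deleted
        if ( grep( $bools[ $_ ], @{ $sets{ $id } } ) == @{ $sets{ $id } } ) {
            delete $sets{ $id };           # no splice; the index stays valid
        }
    }
}
```

Only the sets that reference the just-changed boolean are ever examined, which is where the bulk of the saving comes from.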


        Given this added detail:
        The mechanism is essentially a queue. New sets ("work items") are being continuously added asynchronously with the resources (represented by the booleans) becoming available.

        This poses another problem, because the "boolmap" structure will grow continuously. If the "sets" queue can grow to the point that the process blows up after a while, then the accumulation of hash keys in the "boolmap" structure is going to have the same effect. You'll need a little more code (and more execution time -- but hopefully not too much) to delete "boolmap" hash keys as the corresponding sets are deleted.

        And one more question, which might be unimportant, but the "boolmap" makes it relevant: When adding a new "set" to the queue, what if all the booleans in this set are already true? (Not that I'm curious about that, but it's something the implementation should take into account.)

Re: Better algorithm or data structure?
by JavaFan (Canon) on Aug 30, 2010 at 09:08 UTC
    What I did was to preprocess the data. I created a @bools2 that maps each boolean to the sets it is in. @sets2 contains the number of booleans in each set. And instead of splicing a set from @sets, I delete it:
    use strict;
    use warnings;
    use List::Util qw[shuffle];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools = (0) x $N;
    my @sets = map [map int(rand $N), 1 .. 1 + int(rand $Z)], 1 .. $S;
    my @someOrder = shuffle 0 .. $N - 1;

    #
    # Preprocessing
    #
    my @bool2;
    my @sets2;
    for (my $i = 0; $i < @sets; $i ++) {
        push @{$bool2[$_]}, $i for @{$sets[$i]};
        $sets2[$i] = @{$sets[$i]};
    }

    #
    # Main loop
    #
    foreach my $next (@someOrder) {
        foreach my $set_nr (@{$bool2[$next]}) {
            delete $sets[$set_nr] unless --$sets2[$set_nr];
        }
    }
    I think this algorithm is O(k), where k is the sum of the number of elements in each set.
Re: Better algorithm or data structure?
by jethro (Monsignor) on Aug 30, 2010 at 01:11 UTC

    Using johngg's hash-of-arrays, wouldn't it suffice to keep, for each set, a number representing the count of booleans it references? Just decrement this number for every set that contains the just-switched boolean (i.e. every set in the array corresponding to that boolean). Any set whose count reaches 0 has to be removed.

    Essentially this does the same job as the garbage collection, but since you do the "garbage collection" yourself, you know which sets are done.
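A compact sketch of that countdown idea, with hypothetical data: %remaining holds, per set, how many of its booleans are still false, and %watchers maps each boolean to the sets that reference it.

```perl
use strict;
use warnings;

# Set 'a' needs booleans 0 and 1; set 'b' needs only boolean 1.
my %remaining = ( a => 2, b => 1 );
my %watchers  = ( 0 => [ 'a' ], 1 => [ 'a', 'b' ] );

for my $next ( 1, 0 ) {                  # booleans turning true over time
    for my $set ( @{ $watchers{ $next } } ) {
        # A count hitting zero means every required boolean has arrived.
        print "set $set done\n" unless --$remaining{ $set };
    }
}
```

Each boolean event touches only the sets that actually reference it, so total work is proportional to the sum of the set sizes.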

Re: Better algorithm or data structure?
by aquarium (Curate) on Aug 29, 2010 at 23:45 UTC
    If your boolean sets are of pre-determined size, then you could improve performance by converting each to a binary number, with each bit representing a boolean. Testing for all-true or all-false is then a simple AND conditional.
    Another way to expedite the process concerns the order of the booleans within a set: if particular booleans are more likely to be false than others, check those earliest in the loop conditional and exit the loop as soon as a false is encountered. A related solution may be possible by ANDing all parts of the boolean set together, relying on short-circuiting so that the leftmost false leaves the compound conditional early.
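A sketch of the pack-into-an-integer variant, with made-up values: when a set references at most as many booleans as fit in a native integer, the set's requirement and the current state are both plain numbers, and the "all true?" test is one AND and one comparison.

```perl
use strict;
use warnings;

my $state = 0b0000;                 # all booleans currently false
my $need  = 0b1010;                 # this set needs booleans 1 and 3

for my $next ( 3, 0, 1 ) {          # booleans turning true over time
    $state |= 1 << $next;           # flip the corresponding bit on
    if ( ( $state & $need ) == $need ) {
        printf "complete after boolean %d (state %04b)\n", $next, $state;
        last;
    }
}
```

This is the integer cousin of the vec/bit-string approach elsewhere in the thread; it avoids string operations entirely but caps the set size at the machine word width.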
    the hardest line to type correctly is: stty erase ^H
Re: Better algorithm or data structure?
by Anonymous Monk on Aug 30, 2010 at 15:26 UTC

    Discard the notion of calling them “booleans.”   They are Work Items, which periodically change state.   And, for each one of these, there is a “cloud” of Interested Parties out there which are waiting for them to become what they are now not.   Those same Interested Parties also need to know when any Work Item ceases to be as it is now.

    So, when we introduce a new Interested Party, we associate with it a hash containing, for each Work Item, its last known state.   The Interested Party also contains a count (>= 0) of the number of Work Items it is waiting-upon to change state.   Each Work Item has a list of all of the Interested Parties who are now watching it.

    When a Work Item changes state, it notifies all of the Interested Parties, providing its unique-ID and new state.   The Interested Party retrieves the last known state and determines if it has changed.   The appropriate action is determined by an easy four-row state table:
    New State   Last State   Action
    True        True         Take no action.
    True        False        Signaled: update last-known-state, decrement
                             wait count. If zero, remove yourself from all
                             interested-parties lists and destroy yourself.
                             (If negative, die!)
    False       False        Take no action.
    False       True         Unsignaled: update last-known-state, increment
                             wait count.

    You have now completely eliminated the problem of “searching” for anything.

    Please notice that I stipulated that Work Items contain a list of references to Interested Parties, but that they provide the parties a unique-id.   This somewhat arbitrary arrangement avoids circular references which can cause headaches for a garbage collector.
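A rough sketch of that notification scheme, with hypothetical class and method names: each Interested Party keeps a last-known-state hash and a wait count, and its notify method implements the four-row state table above.

```perl
use strict;
use warnings;

package Party;

sub new {
    my ( $class, @item_ids ) = @_;
    return bless {
        last  => { map { $_ => 0 } @item_ids },  # last known state: false
        count => scalar @item_ids,               # items still awaited
    }, $class;
}

sub notify {
    my ( $self, $id, $new ) = @_;
    my $old = $self->{last}{$id};
    return if $old == $new;                      # no transition: no action
    $self->{last}{$id} = $new;
    $new ? $self->{count}-- : $self->{count}++;  # signaled / unsignaled
    print "party satisfied\n" if $self->{count} == 0;
}

package main;

my $party = Party->new( 'w1', 'w2' );
$party->notify( 'w1', 1 );    # count 2 -> 1
$party->notify( 'w1', 1 );    # duplicate notification: no action
$party->notify( 'w2', 1 );    # count 1 -> 0: the party is satisfied
```

The removal-from-all-lists and self-destruction steps from the table are omitted here for brevity; in a full implementation the satisfied party would unregister itself from each Work Item's watcher list.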

      Oh, fudge.   If you think the above posting is brilliant, then, “it came from me.”   If not, “I have no idea who just said that.”

      :-D

      As you can see, the Interested Parties simply want to know when the state of a Work Item has changed, and they decide what (if anything...) to do in response to that occurrence.

      The “unique IDs,” which can simply be consecutive arbitrary integers, simply serve as hash-keys.

      There are endless variations on this idea, but I think you get the idea.   And it will produce “orders of magnitude” speed improvements provided that virtual-memory thrashing is not an issue.

        You have now completely eliminated the problem of “searching” for anything.... And it will produce “orders of magnitude” speed improvements

        It's an interesting story, but the devil is in the detail. I've been toying with johngg's idea, which is essentially similar in that it uses objects that decide for themselves when something needs to be done. The reality is proving somewhat harder to implement.

        And so far, it is well shy of the two orders of magnitude improvement achieved by graff, repellent, salva, and GrandFather.

        At this point, I can say no better than: "Show me the code" :)


