PerlMonks  

Better algorithm or data structure?

by BrowserUk (Patriarch)
on Aug 29, 2010 at 21:57 UTC ( [id://857943] )

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone see how to improve the performance--I'd like an order of magnitude improvement if possible--of the following code?

Essentially, the state of the booleans in @bools will switch from false to true in some undefinable order. Each time one becomes true, it will affect the composite state of one or more sets in @sets. And if all the booleans referenced by any particular set are true, then I need to remove that set from @sets.

This code does a simple brute force search of all the sets and checks all the booleans referenced. I keep looking at this thinking that there should be some combination of hashes or heaps and/or references that would allow me to remove completed sets more efficiently, but I'm not seeing it.

Update: Corrected typos.

#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use List::Util qw[ shuffle ];
use Time::HiRes qw[ time ];

our $N //= 1e3;
our $S //= 1e4;
our $Z //= 5;

my @bools = ( 0 ) x $N;
my @sets = map [ map int( rand $N ), 1 .. 1 + int( rand $Z ) ], 1 .. $S;
my @someOrder = shuffle 0 .. $N - 1;

my $start = time;
for my $next ( @someOrder ) {
    $bools[ $next ] = 1;
    for my $set ( reverse 0 .. $#sets ) {
        if( grep( $bools[ $_ ], @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
            splice @sets, $set, 1;
        }
    }
}
printf "Took %.6f\n", time() - $start;

__END__
C:\test>junk27
Took 11.623000

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Better algorithm or data structure?
by johngg (Canon) on Aug 29, 2010 at 23:37 UTC

    I'm not sure whether this would work, as I haven't thought about how to get the remaining "sets" back after culling, but here goes. How about using the garbage collection built into Perl? Turn your @bools into %bools, keyed by the range 1 .. $N, with the values being anonymous arrays containing references to any "sets" that contain the number in the key. By delete'ing any key in %bools that is in @someOrder, you are left with only references to "sets" that have at least one number remaining. The "sets" are array references that are now out of scope, so once all references to one are gone it is garbage-collected without the need to do a splice. You should then be able to recover the remaining "sets" by combing through %bools for unique array refs.

    This idea may be cock-eyed because I don't have time to fully test it now, but your code took about 25 seconds on my elderly laptop, whereas this code

    use strict;
    use warnings;

    use List::Util qw{ shuffle };
    use Time::HiRes qw{ time };

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my %bools = map { $_ => [] } 1 .. $N;
    for my $set ( 1 .. $S ) {
        my $conts = [];
        push @{ $conts }, int rand $N for 0 .. int rand $Z;
        push @{ $bools{ $_ } }, $conts for @{ $conts };
    }

    my @someOrder = shuffle 1 .. $N;

    my $start = time();
    delete @bools{ @someOrder };
    printf "Took %.6f\n", time() - $start;

    takes about 15ms, although more time would be taken recovering the sets.

    I hope I'm not just having a rush of blood and that this is actually helpful.

    Cheers,

    JohnGG

      I like your lateral thinking.

      Unfortunately, I need to find all (any) sets completed by the state change of each boolean, at the time I receive that boolean. @someOrder was intended to represent the program receiving the notification of the booleans changing state over time, in some unspecifiable order.

      That said, I think combining @sets into a hash keyed by the booleans is a very smart move. I think. I'm having trouble wrapping my head around how the sets, which will be referenced from multiple hash keys, will get removed, but maybe tomorrow...



        How about making each "set" an object rather than just an anonymous array, and storing references to the objects in the %bools hash? You could then iterate over @someOrder in a loop rather than hash-slicing, and use the object destructor to take some action when the final reference to a "set" is deleted from %bools.

        Just a thought that occurred to me while trying to get to sleep :-/

        Cheers,

        JohnGG
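A minimal sketch of that destructor idea, with hypothetical class and field names: the only references to each blessed "set" live in %bools, so deleting the last one fires DESTROY automatically.

```perl
use strict;
use warnings;

our %completed;    # records which sets have completed (for illustration only)

package Set;

sub new {
    my ( $class, $id, @members ) = @_;
    return bless { id => $id, members => \@members }, $class;
}

sub DESTROY {
    # Last reference gone: every boolean this set needed has been seen.
    my $self = shift;
    $main::completed{ $self->{id} }++;
    print "set $self->{id} complete\n";
}

package main;

my %bools = map { $_ => {} } 0 .. 4;

# One "set" needing booleans 1 and 3; %bools holds the only references.
{
    my $set = Set->new( 'A', 1, 3 );
    $bools{$_}{'A'} = $set for @{ $set->{members} };
}

delete $bools{1}{'A'};    # one reference remains: nothing happens
delete $bools{3}{'A'};    # last reference gone: DESTROY fires
```

The attraction is that there is no search step at all: reference counting does the bookkeeping.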

Re: Better algorithm or data structure?
by GrandFather (Saint) on Aug 30, 2010 at 05:44 UTC

    If it suits your application then turning @sets into %sets gives a useful performance gain:

    use strict;
    use warnings;
    use Time::HiRes qw[ time ];
    use List::Util qw[ shuffle ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my %flags = map {$_ => []} 1 .. $N;
    my %sets =
        map {$_ => [map 1 + int(rand $N), 1 .. 1 + int(rand $Z)]} 1 .. $S;

    my $start = time;
    for my $set (keys %sets) {
        next if ! @{$sets{$set}};
        push @{$flags{$_}}, $set for @{$sets{$set}};
    }
    printf "Setup: %.6fs\n", time() - $start;

    my @someOrder = shuffle 1 .. $N;

    $start = time();
    for my $flag (@someOrder) {
        for my $set (@{$flags{$flag}}) {
            $sets{$set} = [grep {$_ != $flag} @{$sets{$set}}];
            next if @{$sets{$set}};
            delete $sets{$set};
        }
    }
    printf "Took:  %.6fs\n", time() - $start;

    Prints:

    Setup: 0.037229s
    Took:  0.077418s
    True laziness is hard work
Re: Better algorithm or data structure?
by bart (Canon) on Aug 29, 2010 at 22:39 UTC
    You could turn this:
            if( grep( $_, @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
    into a use of all from List::MoreUtils:
    if( all { $_ } @{ $sets[ $set ] } ) {
    (Note that the overhead of a function in this module is larger than the overhead of a plain grep.)

    But I think you may get a better result by inverting the booleans: set the boolean for each item before you start, and clear it when the item's state changes. Testing whether an array contains no true value at all might be faster than testing whether enough of them are true. Anyway, I don't expect you can even double the speed this way...

    Also, you might consider using vec and storing the array of booleans in a single string (of bits). That way, testing every item in a set is a single test on a string, instead of a test on each item of a whole array.
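A sketch of that vec/bit-string idea, with made-up variable names: the global boolean state and each set's membership are both bit strings, so "are all of this set's booleans true?" becomes one string AND plus one comparison instead of a Perl-level loop.

```perl
use strict;
use warnings;

my $N = 16;
my $state = "\0" x ( $N / 8 );          # all booleans start false

# A set requiring booleans 1, 4 and 9, stored as a bit mask:
my $mask = "\0" x ( $N / 8 );
vec( $mask, $_, 1 ) = 1 for 1, 4, 9;

for my $next ( 4, 9, 1 ) {              # booleans becoming true over time
    vec( $state, $next, 1 ) = 1;
    if ( ( $state & $mask ) eq $mask ) {  # single test for the whole set
        print "set complete after boolean $next\n";
    }
}
```

The bitwise string AND works byte-at-a-time in C, so the per-set cost is independent of how many booleans the set references.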

Re: Better algorithm or data structure?
by salva (Canon) on Aug 30, 2010 at 07:43 UTC
    Reverse your data structures:
    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use List::Util qw[ shuffle ];
    use Time::HiRes qw[ time ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools = (0) x $N;
    my @sets = map [ map int( rand $N ), 1 .. 1 + int( rand $Z ) ], 1 .. $S;
    my @sets1 = map [@$_], @sets;
    my @someOrder = (shuffle 0 .. $N - 1);

    my $start = time;
    for my $next ( @someOrder ) {
        $bools[ $next ] = 1;
        for my $set ( reverse 0 .. $#sets ) {
            if( grep( $bools[ $_ ], @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
                splice @sets, $set, 1;
            }
        }
    }
    printf "Took %.6f\n", time() - $start;

    $start = time;
    my @sizes = map scalar(@$_), @sets1;
    my @revsets = map [], @bools;
    my %sets;
    for my $set_ix (0..$#sets1) {
        if (@{$sets1[$set_ix]}) {
            $sets{$set_ix} = $sets1[$set_ix];
            for my $bool (@{$sets1[$set_ix]}) {
                push @{$revsets[$bool]}, $set_ix;
            }
        }
    }
    for my $next ( @someOrder ) {
        for my $set_ix (@{$revsets[$next]}) {
            unless (--$sizes[$set_ix]) {
                delete $sets{$set_ix};
            }
        }
    }
    printf "Took %.6f\n", time() - $start;
    and it runs as...
    Took 10.198676
    Took 0.112004
Re: Better algorithm or data structure?
by repellent (Priest) on Aug 30, 2010 at 05:39 UTC
    Here's my stab at it, which goes along with what jethro and graff have already mentioned. %sets is kept as a hash for convenient deletes. By keeping track of which sets to update/delete once a resource ($next) becomes available, we only need to traverse a minimal amount of sets:
    use Data::Dump qw[ pp ];
    use List::Util qw[ shuffle ];
    use Time::HiRes qw[ time ];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools;
    my %sets = map {
        my $s = $_;
        my $n = 1 + int( rand $Z );
        ++$bools[ $_ ]{ $s } for map int( rand $N ), 1 .. $n;
        $s => $n;
    } 0 .. $S - 1;

    my @someOrder = shuffle 0 .. $N - 1;

    my $start = time;
    for my $next ( @someOrder ) {
        for my $s (keys %{ $bools[ $next ] }) {
            $sets{ $s } -= delete $bools[ $next ]{ $s };
            delete $sets{ $s } unless $sets{ $s };
        }
    }
    printf "Took %.6f\n", time() - $start;

    pp \@bools;
    pp \%sets;

    __END__
    Took 0.089563
    [ {}, {}, ... ]
    {}

    Update: FWIW, the OP code took around 20 seconds on my virtual machine running FreeBSD.
Re: Better algorithm or data structure?
by graff (Chancellor) on Aug 30, 2010 at 00:10 UTC
    I'm more than a little baffled... First, I'm not sure how to interpret this description:
    Essentially, the state of the booleans in @bools will switch from false to true in some undefinable order. Each time one becomes true, it will affect the composite state of one or more sets in @sets. And if all the booleans referenced by any particular set are true, then I need to remove that set from @sets.

    So, is it important that a set be removed as soon as the related booleans all turn true? Or could it instead be a two-step process: stuff happens to the booleans, and when that's all done, then you check the sets to see which ones should be removed?

    I'm puzzled because the OP code seems to do nothing at all in terms of checking the status of the booleans that relate to a given set, as implied by the description. Instead, the code appears to be deleting elements from the @sets AoA when the nested (inner) array happens to contain no zero values. Now that I've seen the updated version of the OP, it all makes more sense -- thanks!

    Another question about the description: is it essential that elements of @sets be deleted, or would it suffice to keep the array a constant size, and "undef" elements when appropriate?

    If you can avoid using splice (and if I understand the goal, which is still in doubt), you could build an AoH map of where the various @bools indexes occur in @sets, and only scan the relevant sets as you alter the values of the bools. Something like this:

    my @sets = ...  # after values are loaded into the @sets AoA, build a map:

    my @boolmap;
    for my $set ( 0 .. $#sets ) {
        $boolmap[ $_ ]{ $set } = undef for ( @{ $sets[ $set ] } );
    }

    # now, as changes are made to bools, check just the relevant sets:
    for my $next ( @someOrder ) {
        $bools[ $next ]++;
        for my $set ( keys %{ $boolmap[ $next ] } ) {
            if( defined( $sets[ $set ] )
                and grep( { $bools[ $_ ] } @{ $sets[ $set ] } ) == @{ $sets[ $set ] } ) {
                $sets[ $set ] = undef;
            }
        }
    }
    (But I feel like I'm not really understanding your goal, because I'm pretty sure that you of all people would have figured this out.)
      is it essential that elements of @sets be deleted, or would it suffice to keep the array a constant size, and "undef" elements when appropriate?

      Yes. In use, new sets would be added to @sets on the fly. The mechanism is essentially a queue. New sets ("work items") are being continuously added asynchronously with the resources (represented by the booleans) becoming available. If I don't remove completed work items (sets for which all required resources have become available), then the queue would grow endlessly.

      But your notion about using hashes is a good one. You've used a hash to index those sets that use a boolean against each boolean. You're undefing rather than splicing to avoid screwing up the indexing. But, if I also use a hash for holding the sets, I can delete a set from %sets, without screwing up the index. And that greatly reduces the number of sets that need to be examined for each boolean.

      Thanks. I'll have a go at implementing that tomorrow.
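A minimal sketch of that combination, with names of my own choosing: %sets holds the sets keyed by an id, %index maps each boolean to the ids of the sets that reference it, and a completed set is simply delete()d from %sets, with stale ids in %index skipped via an exists check.

```perl
use strict;
use warnings;

my %sets = (
    s1 => [ 0, 2 ],
    s2 => [ 2 ],
);

# Build the boolean-to-set-ids index once, up front.
my %index;
for my $id ( keys %sets ) {
    push @{ $index{$_} }, $id for @{ $sets{$id} };
}

my @bools = ( 0 ) x 3;
for my $next ( 2, 1, 0 ) {                 # booleans arriving over time
    $bools[ $next ] = 1;
    for my $id ( @{ $index{ $next } || [] } ) {
        next unless exists $sets{ $id };   # already completed and deleted
        if ( grep( $bools[ $_ ], @{ $sets{ $id } } ) == @{ $sets{ $id } } ) {
            delete $sets{ $id };           # no splice; the index stays valid
        }
    }
}
```

Only the sets that reference the just-changed boolean are ever examined, which is where the bulk of the saving comes from.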


        Given this added detail:
        The mechanism is essentially a queue. New sets ("work items") are being continuously added asynchronously with the resources (represented by the booleans) becoming available.

        This poses another problem, because the "boolmap" structure will grow continuously. If the "sets" queue can grow to the point that the process blows up after a while, then the accumulation of hash keys in the "boolmap" structure is going to have the same effect. You'll need a little more code (and more execution time -- but hopefully not too much) to delete "boolmap" hash keys as the corresponding sets are deleted.

        And one more question, which might be unimportant, but the "boolmap" makes it relevant: When adding a new "set" to the queue, what if all the booleans in this set are already true? (Not that I'm curious about that, but it's something the implementation should take into account.)

Re: Better algorithm or data structure?
by JavaFan (Canon) on Aug 30, 2010 at 09:08 UTC
    What I did was to preprocess the data. I created a @bools2 that maps each boolean to the sets it is in. @sets2 contains the number of booleans in each set. And instead of splicing a set from @sets, I delete it:
    use strict;
    use warnings;
    use List::Util qw[shuffle];

    our $N //= 1e3;
    our $S //= 1e4;
    our $Z //= 5;

    my @bools = (0) x $N;
    my @sets = map [map int(rand $N), 1 .. 1 + int(rand $Z)], 1 .. $S;
    my @someOrder = shuffle 0 .. $N - 1;

    #
    # Preprocessing
    #
    my @bool2;
    my @sets2;
    for (my $i = 0; $i < @sets; $i ++) {
        push @{$bool2[$_]}, $i for @{$sets[$i]};
        $sets2[$i] = @{$sets[$i]};
    }

    #
    # Main loop
    #
    foreach my $next (@someOrder) {
        foreach my $set_nr (@{$bool2[$next]}) {
            delete $sets[$set_nr] unless --$sets2[$set_nr];
        }
    }
    I think this algorithm is O(k), where k is the sum of the number of elements in each set.
Re: Better algorithm or data structure?
by jethro (Monsignor) on Aug 30, 2010 at 01:11 UTC

    Using johngg's hash-of-arrays, wouldn't it suffice to keep, for each set, a number representing the count of booleans it references? Just decrement this number for every set that contains the just-switched boolean (i.e. every set in the array corresponding to that boolean). Any set whose count reaches 0 has to be removed.

    Essentially this does the same job as the garbage collection, but since you do the "garbage collection" yourself, you know which sets are done.
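A compact sketch of that countdown idea, with hypothetical data: %remaining holds, per set, how many of its booleans are still false, and %watchers maps each boolean to the sets that reference it.

```perl
use strict;
use warnings;

# Set 'a' needs booleans 0 and 1; set 'b' needs only boolean 1.
my %remaining = ( a => 2, b => 1 );
my %watchers  = ( 0 => [ 'a' ], 1 => [ 'a', 'b' ] );

for my $next ( 1, 0 ) {                  # booleans turning true over time
    for my $set ( @{ $watchers{ $next } } ) {
        # A count hitting zero means every required boolean has arrived.
        print "set $set done\n" unless --$remaining{ $set };
    }
}
```

Each boolean event touches only the sets that actually reference it, so total work is proportional to the sum of the set sizes.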

Re: Better algorithm or data structure?
by aquarium (Curate) on Aug 29, 2010 at 23:45 UTC
    If your boolean sets are of pre-determined size, then you could improve performance by converting each to a binary number, with each bit representing a boolean. Testing for all-true or all-false is then a simple AND conditional.
    Another way to expedite the process concerns the order of the booleans within a set: if particular booleans are more likely to be false than others, check those earliest in the loop conditional and exit the loop as soon as a false is encountered. A related solution may be possible by ANDing all parts of the boolean set together, relying on short-circuiting so that the leftmost false leaves the compound conditional early.
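A sketch of the pack-into-an-integer variant, with made-up values: when a set references at most as many booleans as fit in a native integer, the set's requirement and the current state are both plain numbers, and the "all true?" test is one AND and one comparison.

```perl
use strict;
use warnings;

my $state = 0b0000;                 # all booleans currently false
my $need  = 0b1010;                 # this set needs booleans 1 and 3

for my $next ( 3, 0, 1 ) {          # booleans turning true over time
    $state |= 1 << $next;           # flip the corresponding bit on
    if ( ( $state & $need ) == $need ) {
        printf "complete after boolean %d (state %04b)\n", $next, $state;
        last;
    }
}
```

This is the integer cousin of the vec/bit-string approach elsewhere in the thread; it avoids string operations entirely but caps the set size at the machine word width.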
    the hardest line to type correctly is: stty erase ^H
Re: Better algorithm or data structure?
by Anonymous Monk on Aug 30, 2010 at 15:26 UTC

    Discard the notion of calling them “booleans.”   They are Work Items, which periodically change state.   And, for each one of these, there is a “cloud” of Interested Parties out there which are waiting for them to become what they are now not.   Those same Interested Parties also need to know when any Work Item ceases to be as it is now.

    So, when we introduce a new Interested Party, we associate with it a hash containing, for each Work Item, its last known state.   The Interested Party also contains a count (>= 0) of the number of Work Items it is waiting-upon to change state.   Each Work Item has a list of all of the Interested Parties who are now watching it.

    When a Work Item changes state, it notifies all of the Interested Parties, providing its unique-ID and new state.   The Interested Party retrieves the last known state and determines if it has changed.   The appropriate action is determined by an easy four-row state table:
    New State   Last State   Action
    True        True         Take no action.
    True        False        Signaled: update last-known-state, decrement
                             wait count. If zero, remove yourself from all
                             interested-parties lists and destroy yourself.
                             (If negative, die!)
    False       False        Take no action.
    False       True         Unsignaled: update last-known-state, increment
                             wait count.

    You have now completely eliminated the problem of “searching” for anything.

    Please notice that I stipulated that Work Items contain a list of references to Interested Parties, but that they provide the parties a unique-id.   This somewhat arbitrary arrangement avoids circular references which can cause headaches for a garbage collector.
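A rough sketch of that notification scheme, with hypothetical class and method names: each Interested Party keeps a last-known-state hash and a wait count, and its notify method implements the four-row state table above.

```perl
use strict;
use warnings;

package Party;

sub new {
    my ( $class, @item_ids ) = @_;
    return bless {
        last  => { map { $_ => 0 } @item_ids },  # last known state: false
        count => scalar @item_ids,               # items still awaited
    }, $class;
}

sub notify {
    my ( $self, $id, $new ) = @_;
    my $old = $self->{last}{$id};
    return if $old == $new;                      # no transition: no action
    $self->{last}{$id} = $new;
    $new ? $self->{count}-- : $self->{count}++;  # signaled / unsignaled
    print "party satisfied\n" if $self->{count} == 0;
}

package main;

my $party = Party->new( 'w1', 'w2' );
$party->notify( 'w1', 1 );    # count 2 -> 1
$party->notify( 'w1', 1 );    # duplicate notification: no action
$party->notify( 'w2', 1 );    # count 1 -> 0: the party is satisfied
```

The removal-from-all-lists and self-destruction steps from the table are omitted here for brevity; in a full implementation the satisfied party would unregister itself from each Work Item's watcher list.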

      Oh, fudge.   If you think the above posting is brilliant, then, “it came from me.”   If not, “I have no idea who just said that.”

      :-D

      As you can see, the Interested Parties simply want to know when the state of a Work Item has changed, and they decide what (if anything...) to do in response to that occurrence.

      The “unique IDs,” which can simply be consecutive arbitrary integers, simply serve as hash-keys.

      There are endless variations on this idea, but I think you get the idea.   And it will produce “orders of magnitude” speed improvements provided that virtual-memory thrashing is not an issue.

        You have now completely eliminated the problem of “searching” for anything.... And it will produce “orders of magnitude” speed improvements

        It's an interesting story, but the devil is in the detail. I've been toying with johngg's idea, which is essentially similar in that it uses objects that decide for themselves when something needs to be done. The reality is proving somewhat harder to implement.

        And so far, it is well shy of the two orders of magnitude improvement achieved by graff, repellent, salva, and GrandFather.

        At this point, I can say no better than: "Show me the code" :)


