http://qs321.pair.com?node_id=906065

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to design a faster and more robust alternative to perlfaq: intersection,union, difference of two arrays.

One of the things I don't like is, that many such examples in the FAQ use the keys of a %count hash as results.

This is fundamentally wrong because keys are stringifications and any reference type will not be reproduced (just think about arrays of objects or a AoH or ...)

So I thought about using $set{STRINGIFICATION}=REALVALUE as as fundamental datastructure for set operations.

And to significantly speed up things I also wanted to avoid any loop constructs, restricting myself to hashslices and list flattening.

use Data::Dump qw(pp); my @array1=(1..5); my @array2=(3..7); my (%union,%intersection,%difference,@tmp); my (%set1,%set2); @set1{@array1}=@array1; @set2{@array2}=@array2; @tmp= @set1{ keys %set2 }; @intersection{@tmp}=@tmp; # warning: use of uninitialized values delete $intersection{""} ; # dirty hack %union=(%set1,%set2); %difference=%union; delete @difference{keys %intersection}; pp \(%set1,%set2,%union,%intersection,%difference);

prints

Use of uninitialized value $tmp[0] in hash slice at /home/lanx/B/PL/PM +/set.pl line 17. Use of uninitialized value $tmp[3] in hash slice at /home/lanx/B/PL/PM +/set.pl line 17. ( { 1 => 1, 2 => 2, 3 => 3, 4 => 4, 5 => 5 }, { 3 => 3, 4 => 4, 5 => 5, 6 => 6, 7 => 7 }, { 1 => 1, 2 => 2, 3 => 3, 4 => 4, 5 => 5, 6 => 6, 7 => 7 }, { 3 => 3, 4 => 4, 5 => 5 }, { 1 => 1, 2 => 2, 6 => 6, 7 => 7 }, )

my problem is the "dirty hack" because hashslice return undef for non-existant keys and undef is later stringified to an empty string. (muting the warnings isn't the problem)

Any elegant idea how to solve this?

Or do I have to fall back to something like

%intersection = map { exists $set1{$_} ? ($_=>$set1{$_}) : () }  keys %set2;

or

 @tmp= grep {defined} @set1{ keys %set2 };

???

Cheers Rolf

Replies are listed 'Best First'.
Re: Using hashes for set operations...
by ikegami (Patriarch) on May 21, 2011 at 17:00 UTC
    my %set1 = map { $_ => 1 } @array1; my %set2 = map { $_ => 1 } @array2; my %intersection = grep $set1{$_}, keys %set2;

    That assumes true values. You're using weird values, so you might have to use exists.

      well this my %intersection = grep $set1{$_}, keys %set2; is wrong, you're assigning a list of keys to a hash.

      and you missed so many points from my post that I assume that you only read the headings... or how do you want to intersect sets of objects like this?

      Cheers Rolf

        and [he] missed so many points from my post that I assume that you only read the headings
        I've noticed a trend for quantity at the expense of quality, in general.

        Sorry, missing map { $_ => 1 }

        my %set1 = map { $_ => 1 } @array1; my %set2 = map { $_ => 1 } @array2; my %intersection = map { $_ => 1 } grep $set1{$_}, keys %set2;

        and you missed so many points from my post that I assume that you only read the headings..

        Are you trying to imply that I should have told you you need to use a hash to get the original from the stringification even you already know that?

        Or are you referring to request to avoid loops? Yeah, I missed that bit of nonsense. It's impossible to find the intersection of two sets without looping over the two sets.

Re: Using hashes for set operations...
by John M. Dlugosz (Monsignor) on May 21, 2011 at 18:26 UTC
    I like the idea of starting with a master hash that gives the stringification of each complex data type. Making that the basic unit for manipulation, rather than a separate internment step, is interesting.

    But your simple elegant code is relying on implicit stringification, and repeated stringification of the same values. This is inefficient if it's a complex data structure, and needs an explicit call in the general code to handle ad-hoc data structures that are not blessed into a class with a stringify operator that's suitable for this purpose.

    As for getting a fake "" key you need to toss out, I don't think that's a problem if you make sure that the stringification never returns "" for a valid item in the set.

    Ultimately, I think the answer to "how do I find an intersection" isn't to repeat some incantation invented by someone else, but to simply call intersection. The algorithm should be canned in a CPAN module.

      > But your simple elegant code is relying on implicit stringification, and repeated stringification of the same values. This is inefficient if it's a complex data structure, and needs an explicit call in the general code to handle ad-hoc data structures that are not blessed into a class with a stringify operator that's suitable for this purpose.

      In Perl stringification of references has no performance penalty. It's just a string with the reference addressą and the type (this includes package name if blessed).

      perl -e '$p=[];print $p' ARRAY(0x928c880)

      (I think you are confusing with other language like JS˛, where the whole data is dumped)

      OTOH, there is a doubled memory consumption when using primitive scalars like long strings.

      >As for getting a fake "" key you need to toss out, I don't think that's a problem if you make sure that the stringification never returns "" for a valid item in the set.

      well I think a special case handling "" wouldn't cost too much performance...it's just not elegant anymore.

      > Ultimately, I think the answer to "how do I find an intersection" isn't to repeat some incantation invented by someone else, but to simply call intersection. The algorithm should be canned in a CPAN module.

      There is already Set::Object which seems to follow the same ideas, but it looks heavy and uses inline C...

      Anyway my point was first to improve the FAQ and then thinking about a module.

      Cheers Rolf

      1) Of course one should be careful when mixing strings and references in one set... :)

      2) e.g. javascript:a=[1,[2,3]];alert(a) shows 1,2,3

        In Perl stringification of references has no performance penalty. It's just a string with the reference addressą and the type (this includes package name if blessed).
        perl -e '$p=[];print $p'
        ARRAY(0x928c880)
        (I think you are confusing with other language like JS˛, where the whole data is dumped)
        No, I'm thinking that it is pointless to compare references since any two copies will test as unequal. Instead, you must manually write something that stringifies (or hashes, in the other sense of the word) to a canonical form in order to then test for equivalence.

        I guess that depends on what the user intends, so the FAQ should point out that using a reference (or object) as a hash key will stringify as you show, so do the same test as == against the reference itself (the address).

        I think intersection and friends should be like sort, in that they can take a piece of code that is used to determine what is meant by equivalence in this particular case. That's easy to call but can be inefficient; and just like you use the whatever maneuver with sort to cache the keys, you could do the same with intersection. But the eventual module can have that built-in, as your ideas directly incorporate that kind of keying. Then the user needs to provide code to produce a canonical key of one item, as opposed to comparing the equivalence of two parameters.

        But back to the underlying code: If I want two ad-hoc uses of [qw/1 2 3/] to be considered the same, stringifying the reference won't do it. It needs to call a function to generate the string key from the contents. And we suppose that this is expensive, so only call it once per value in each input list.

        The user wants to find the intersection of two lists, so he would be told to pass @set1 and @set2, and optionally a &func, which defaults to built-in stringification. Prepare your internal %set1 from @set1 and func(each element), and arrange the code (at least in the case where a func is passed -- it could have different implementations) to not need to call func again on some value but to always keep it with the key.

Re: Using hashes for set operations...
by jaredor (Priest) on May 22, 2011 at 04:52 UTC

    I don't know what underlying perl mechanics I'm potentially violating (update: I meant in generating union, everything else should be jake), but the following seems to work. There's no idea here that wasn't in your original post; in fact I referred to it frequently when writing this up.

    #!/usr/bin/env perl use strict; use warnings; my @array1=(1..5); my @array2=(3..7); my (%union, %sdiff, %sd1, %sd2, %inter); @union{(@array1,@array2)} = (@array1,@array2); %sd1 = %sd2 = %inter = %union; delete @sd1{@array2}; delete @sd2{@array1}; %sdiff = (%sd1,%sd2); delete @inter{keys %sdiff}; print "array1: ", join (" ", @array1), "\n"; print "array2: ", join (" ", @array2), "\n"; print "union: ", join (" ", keys %union), "\n"; print "sdiff: ", join (" ", keys %sdiff), "\n"; print "inter: ", join (" ", keys %inter), "\n";

    It felt like I was writing either haiku or APL, but I can't quite say which....

      Hmm interesting!

      I was sure that using De Morgan's laws would just relocate the problem and not solve it. (you're now only deleting hash-slices)

      But this relocation makes sense ... now it works for all cases where "undef" isn't allowed as part of an initial set. (undef deletes empty strings).

      But you shouldn't use print to inspect data structures:

      #!/usr/bin/env perl use strict; use warnings; use Data::Dump qw(pp); my @array1=(1..2,""); my @array2=(2..3,undef); my (%union, %sdiff, %sd1, %sd2, %inter); @union{(@array1,@array2)} = (@array1,@array2); %sd1 = %sd2 = %inter = %union; delete @sd1{@array2}; delete @sd2{@array1}; %sdiff = (%sd1,%sd2); delete @inter{keys %sdiff}; print "array1: ", pp(\@array1), "\n"; print "array2: ", pp(\@array2), "\n"; print "union: ", pp(\%union), "\n"; print "sdiff: ", pp(\%sdiff), "\n"; print "inter: ", pp(\%inter), "\n"; print "sd1: ", pp(\%sd1), "\n"; print "sd2: ", pp(\%sd2), "\n";

      prints

      Use of uninitialized value $array2[2] in hash slice at /home/lanx/B/PL +/PM/set2.pl line 12. Use of uninitialized value $array2[2] in delete at /home/lanx/B/PL/PM/ +set2.pl line 16. array1: [1, 2, ""] array2: [2, 3, undef] union: { "" => undef, 1 => 1, 2 => 2, 3 => 3 } sdiff: { 1 => 1, 3 => 3 } inter: { "" => undef, 2 => 2 } sd1: { 1 => 1 } sd2: { 3 => 3 }

      But while you found a mathematical solution, needing so many operations for an intersection is in practice a little disappointing ...

      Thanks anyway! :)

      Cheers Rolf

        Sir, to me "elegant" is equivalent to "mathematical"!

        (Even though I am a failed mathematician, it does sting a little that you felt the need to give me a link to De Morgan's laws.)

        True, I did treat this more as a little puzzle than a real world answer, but then what do you expect with only allowing the set operations, "union" and "subtraction"? As for that, you only get the union property as a side effect from hash key creation (which (I think) would not give even a squeak of protest if there were two key/value pairs with identical keys and different values, but by construction we avoid this issue).

        Although I've never used this behavior, apparently the delete function does return a list of keys values it's deleted, so delete(@A{keys B}) does return the intersection of A and B ... as well as an undef for every key value that was in B and not in A; so you're back to square 1 with ugly undefs in your answer. (Hence I still haven't used this behavior.) BTW, it just occurred to me that for the symmetric differences %sd1 and %sd2 don't have to be copies of the union, they only need to be the hash equivalents of @array1 and @array2, respectively.

        I love talking like a pedant, but in addition to this vice, I have to admit to a hypocrisy: I love using map and grep more, even to the point of using them for their side effects, which means a significant portion of my time in writing up my (few) posts on perlmonks is translating to while and for constructs. So I'm sincere when I say that your original post was elegant enough for me: Handling undefs in perl is just a fact of life, and if you can do it with just one swipe of the knife, then you are about as good as you're going to get without reformulating your approach.

        My last gratuitous squeak of protest is that print statements suffice if they work correctly for a code that is an illustration of a solution; in other words, it is not a data structure at that point so much as a formatted solution! (That being said, you've shamed me enough that I will try to start using Data::Dumper in my posts.)

        I've just gone through and removed all the smiley's from this post. So now I look smarter, but more crotchety. Anyone who reads this please note that I enjoy these kinds of "pure perl" puzzles and have enjoyed everyone's contribution to this query of LanX's. I spent years as a coder looking for good coders to critique my work with only modest success. Here at perlmonks you get high quality feedback quickly--and you still learn from the low quality feedback because you now have context in which to evaluate it. Good stuff, and thank you LanX.