http://qs321.pair.com?node_id=425139

nwkcmk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm a amateur in Perl. I saw from web that following code is used remove duplicates in a array
undef %saw; @out = grep(!$saw{$_}++, @in);
Looks like !$saw{$_}++ is too much for me to digest. Hopefully someone can explain to me what this expression means.

Replies are listed 'Best First'.
Re: What does !$saw{$_}++ means
by Corion (Patriarch) on Jan 26, 2005 at 08:39 UTC

    Let's split it up a bit:

    You have a hash, %saw. Individual elements of a hash are accessed via $hash{key}. If you append the postfix ++ operator to it, it looks like $saw{key}++, which increments the hash element by one.

    Now, what does that do?

    The grep iterates over the whole array @in, sets $_ to each element and then executes the code, in our case the expression !$saw{$_}++. If the expression returns a true value, grep keeps the array element in its result, otherwise it's discarded.

    Now, when a key in %saw does not already exists, the code sets $saw{key} to 1 (incrementing by one from undef), and then returns the negation of the previous hash value (undef is false, so it returns true). So, if the hash key did not yet exist in the hash, the array element is put into @out.

    The other case is that the hash key already exists in the hash. Then $saw{$_} returns a number greater than zero, which is interpreted as true, and the negation of that is false, so the (duplicate) array element in @in is discarded.

    This method is a nice and easy way (once you understand it) to get a unique list of elements in an array while retaining the order. There are other methods, like using the keys of %saw:

    undef %saw; $saw{$_}++ for @in; @out = keys %saw;

    This code puts the same elements into @out, but you lose the order.

    I don't remember if it was quicker, but there also is the non-looped version:

    undef %saw; @saw{@in} = (1) x @in; @out = keys %saw;
      returns the negation of the previous hash value (undef is false, so it returns true)
      Nit: post-increment returns 0 if the value was undef, not undef. This is occasionally criticised, defended, pointed out as inconsistent with post-decrement (which does return undef), and then dropped. A comment in pp.c refers one to http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00536.html for further info (where you can read the criticism, defense, etc.).
      I don't remember if it was quicker, but there also is the non-looped version:
      undef %saw; @saw{@in} = (1) x @in; @out = keys %saw;
      I haven't benchmarked it, but I've frequently heard it asserted that even faster is
      undef %saw; @saw{@in} = (); @out = keys %saw;
      That is, using undef as your values (so you don't need the increment or to create the list of ones).
      but there also is the non-looped version:

      Except that x is a loop operator, just like map. In fact, it not only loops for the number of duplications, it also implicitely loops over each element of the list on the LHS.

      @saw{@in} and keys(%saw) are also loops, but they are implicit unlike x.

      In fact, you can remove two of the four loops of your "non-looped version":

      undef %saw; undef(@saw{@in}); @out = keys(%saw);

      Update: undef %saw is also an implicit loop, on calls other than the first.

Re: What does !$saw{$_}++ means
by Zaxo (Archbishop) on Jan 26, 2005 at 08:49 UTC

    With no change to the idiom, I'd write that as,

    my %saw; my @out = grep { ! $saw{$_}++ } @in;
    %saw is a hash, initialized empty in the my declaration. I changed that to keep from eliminating and then automatically reproducing (autovivifying) a global %saw.

    The expression $saw{$_}++ adds one to the value associated with the key that is $_'s value. It returns the value $saw{$_} had before the addition. $_ is a variable grep sets in turn to each element of @in. "!" is logical "not", so the boolean value $saw{$_} had is inverted. That means grep only sees true when the hash hadn't seen the key yet. Thus, grep only passes along the first instance it sees of each element in @in.

    This is an idiom. It seems complicated at first, but will soon be second nature to you. Sometimes it is written to increment $saw{$_} in a loop over @in, and then set @out to keys(%saw), but that doesn't preserve the order of the elements. Yours does.

    After Compline,
    Zaxo

      With no change to the idiom, I'd write that as,
      my %saw; my @out = grep { ! $saw{$_}++ } @in;
      But that extends the lifetime of %saw. If later in the same (or an inner) block, you need to use the same construct, you have to use a different name for the hash, or do a %saw = () - which will then cause errors if you remove the first construct (until you my the newer construct).

      I like to write it as:

      my @out = do {my %saw; grep !$saw{$_}++, @in};
      which doesn't leak the name of the temporary array.
      With no change to the idiom, I'd write that as,
      my %saw; my @out = grep { ! $saw{$_}++ } @in;
      Personally I prefer to use the "EXPR-form" of grep() if possible. But this largely depends on the case under examination: in some cases while it could be possible to use that, still it is more terse to adopt the "BLOCK-form", as you did.

      Definitely good point about my %saw; instead. But after all the snippet posted by the OP is too small to really understand wether he's using non-strict code or if he's reusing a previously used %saw (I wouldn't do that, FWIW) or...

Re: What does !$saw{$_}++ means
by Hena (Friar) on Jan 26, 2005 at 08:39 UTC
    This is my conclusion of it. It can be wrong :).

    Well, it uses %saw hash to keep track on how many inputs have passed (NOTE the return before increment). Grep gets number when asking keep or lose. ! will reverse the ok/fail answer.

    So when first of duplicate inputs gets there, $saw{$_}++ returns 0 and increments. ! will reverse that to 1 and grep takes it in. second has already a value and $saw{$_}++ returns 1 (and increments) which ! reverses to 0 and grep drops.
Re: What does !$saw{$_}++ means
by ZlR (Chaplain) on Jan 26, 2005 at 08:46 UTC
    Hello,

    Here's my interpretation :

    %saw is a hash.
    $saw{$_} is the value in this hash associated with the key $_
    $saw{$_}++ increases this value by 1
    !$saw{$_} is a logical which means "there is no value associated with the key $_ in %saw" .

    Now, this is used inside a grep applied to @in : this means that $_ will take each of the value in @in . Let's take the first value of @in : obviously it's not yet in %saw so the conditional !$saw{$_} is TRUE (it's a double negation) . Therefore grep validates and this first value goes into @out.

    At this time a little magic happens : after evaluation of !$saw{$_} the ++ is applied . I'm only guessing this happens because of some precedence of ! over ++ .

    So what if the second element of @in is the same as the first ? Well, since ++ happened , $saw{$_} will have a value of 1 and therefore !$saw{$_} will be FALSE : you will not get this repetition in the final @out .

    Hope this helps,
    ZlR .

    Question : I just checked in the camel book:
    the ! opertor has an arity of 1 and is right associative
    the ++ operator also has an arity of 1 but is not associative.
    I'm not sure then why the !$saw{$_}++ is correctly evaluated since they have the same arity.

    Answer by ysth : nothing to do with arity , it's just that ! has higher precedence than ++ .

      !saw{$_} should have been !$saw{$_}

      (Maybe obvious, but since this is about deobfuscating an idiom, I thought I'd post anyway.)

        Yep ! Corrected :)

      I only have a Camel II, which doesn't list arity in the operator table, so I'm not sure exactly what you are seeing. However, arity has nothing to do with precedence; !$seen{$_}++ works because ++ has higher precedence than !.
        Yes, i checked again (camel 3) and this time i understood the table corectly : precedence is shown by the way the operators are ordered, while "arity" is the number of argument they can take. thx .
Re: What does !$saw{$_}++ means
by blazar (Canon) on Jan 26, 2005 at 09:00 UTC
    Note: parens around C<grep>'s args are not necessary, so it can be even more concise. ("Elegant", IMHO!)

    You should really read

    perldoc -f grep
    perldoc perlsyn
    perldoc perlop
    
    Once you know the elementary charachteristics of each subexpression of the above, combining them should not be too hard.

    However grep() will take all the elements from @in and evaluate the !$saw{$_}++ expression for them ($_ is an alias to the actual element). On the first encounter of an item $saw{$_} will be undef, so $saw{$_}++ will return 0 and store one in it. At this point !$saw{$_}++ will return 1 and the element under examination will be passed. A similar analysis for the case in which $saw{$_} already contains a (positive) number (i.e. on second, third, etc. encounter of an item) is left as an exercise to you.

Re: What does !$saw{$_}++ means
by ikegami (Patriarch) on Jan 26, 2005 at 16:54 UTC

    !$saw{$_}++
    is
    !$saw{$_}++
    is
    !($saw{$_}++)
    is
    !(do { my $previously_seen = $saw{$_}; $saw{$_} += 1; $previously_seen })
    is
    do { my $previously_seen = $saw{$_}; $saw{$_} += 1; !$previously_seen }
    is a simplification of:
    do { my $previously_seen = $saw{$_}; $saw{$_} = 1; !$previously_seen }
    means (when in context of the grep)
    If the string we're looking at is in %saw, return false. Otherwise, add that string to %saw and return true.
    means (when in context of the grep)
    Remove duplicates.

Re: What does !$saw{$_}++ means
by nwkcmk (Sexton) on Jan 27, 2005 at 02:34 UTC
    Hi all, Thanks for the explanation. They have been very helpful.