http://qs321.pair.com?node_id=1202179

dominick_t has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks— Below is a quite long post with a number of small points regarding which I would appreciate your insight, if anyone is interested in wading through everything. But in terms of getting a working script for the particular task I have in mind, I'll first just ask this question: Given a list whose elements are strings (containing only letters, as they are actually English-language words), how would one print that list only on the condition that at least one of the strings in the list contains no repeated letters? Many thanks— Dominick


LONG POST HERE:
Dear Monks— I hope to accomplish two goals with this post. One is to understand fully a piece of code that was posted at Perl's pearls, and the other is to modify this code in order to accomplish a similar but I believe more complicated task. The original poster of the node I just referenced uses a Perl script to find all anagrams within a word list. (Note this is not the same task as finding anagrams of a given word, but rather of finding all anagram sets within a given word list.)

I spent a good deal of time studying the script provided by the original poster. I understand now how most of it is working. When I first ran this script, there appeared to be a flaw in that it was not eliminating duplicate elements in the original word list. I think I may know why this is happening, and I would very much like (as a bonus) to get to the bottom of that issue as well, but the truth is that I was able to use a script provided by a different person in response to the OP. So understanding the duplicate issue is less important to me than understanding the script provided in response, which seems generally more accessible to me. Here is that script:

while (<DATA>) { chomp; my $key = lc $_; next if $seen{$key}++; my $signature = join "", sort split //, $key; push @{$words{$signature}}, $_; } for (sort keys %words) { my @list = sort @{$words{$_}}; next unless @list > 1; print "@list\n"; }


If I am understanding this correctly, in the while loop, when a new word is encountered in the input list, the newline at end is removed, and $key becomes the lowercase version of the word. Now I believe a hash called %seen is initialized just to keep track of whether or not a word has already been encountered, in which case we skip it. (This sounds right to me in principle, but the actual mechanics are still a little opaque to me. I’m expecting to see some sort of truth statement after “if” and I don’t know how to interpret what I see as a statement at all, though I guess it has to do with the number value of the value of the hash? Also not quite clear on how the ++ is working.)

Next, we introduce a new scalar $signature, which takes the current string in $key, turns its letters into a list of separate letters, sorts the letters, and then joins them back together into one string. Next I believe we initialize a new hash called words, but it’s a hash whose values are arrays. (How does Perl know this? Because “push” expects an array? Because the @ symbol lets Perl know?)

Because the original word in the input list is still stored in $_, then the push line adds the word to the array which is the value corresponding to the key of $signature in the %words hash. And then I believe it does this to the entire input list, so by the end of the while loop, our hash %words has values which are lists. In cases where an input word has no anagrams, that list is only one element long. In cases where an input word has at least one anagram, that list is two or more elements long. So the “for” iteration looks at each of these lists in turn, and prints it if it contains more than one element. (This part also confused me for a minute, only because I saw $_ in the “for” command, and I thought this would still be an input word from the list. But I suppose it takes on the various values of $signature as it moves through the hash?)

So, aside from the few things I’ve mentioned in parentheses, which I’d very much appreciate responses to, I understand how this script does its thing, and I can use it to find anagrams within word lists that I provide it. I also understand the script just well enough to have attempted a modification of it to accomplish a different task, but one which I think can be tackled with a similar approach. I am interested in finding groups of words not that are all anagrams of each other, but rather which all share a common “letter bank,” which is a word (in the list) with no duplicate letters. For example, given the input

ab aabb aaabbb xxyy xxxyyy
what I would like to print is

ab aabb aaabbb

“aabb” and “aaabbb” use only the letters of “ab” so we say “aabb” and “aaabbb” bank to “ab”. Now, “xxyy” and “xxxyyy” do bank to the same string “xy”, but “xy” is not in the original list, so we wouldn’t say that they share the same bank. (To use a more English-language example, and supposing we are using as input a standard dictionary . . . While “accomplice” and “polemical” use only the same set of letters (a, c, e, l, m, o, and p), they do not bank to an actual word, because those letters do not form a word without repeating at least one of the letters. The bank must have no duplicate letters and be in the list.)

Now, I’ve written a modification of the anagram script (copied below) which is basically doing the job, except that it is printing

ab aabb aaabbb xxyy xxxyyy


All I’ve done is taken the anagram script, and instead of making the signatures only the sorted letters of the input word, I’ve gone a step further and removed any duplicates from it. So now each word has a signature which is simply the letters which are in it, sorted. So I get why it’s printing the xxyy and xxxyyy even though I don’t want it to. Now, I guess one could check that the signature is an anagram of a word in the list before accepting it as a key in the hash, but this seems hard. I believe I can accomplish what I want at the end, by deciding to print an array value on two conditions: one is that it contains more than one element (just as with the anagrams script) and the other is that at least one element in that list contains no repeated letters. Seems like an easy enough condition to add. Thoughts on how to add it?

Here’s the modified script:

#!/usr/bin/perl while (<DATA>) { chomp; my $key = lc $_; next if $seen{$key}++; my @unique = (); my %seenletter = (); my @alphawithdupes = sort split //, $key; foreach my $elem ( @alphawithdupes ) { next if $seenletter { $elem }++; push @unique, $elem; } my $signature = join "", @unique; push @{$words{$signature}}, $_; } for (sort keys %words) { my @list = sort @{$words{$_}}; next unless @list > 1; print "@list\n"; } __END__ ab aabb aaabbb xxyy xxxyyy


Many thanks and all best— Dominick

Replies are listed 'Best First'.
Re: Anagrams & Letter Banks
by haukex (Archbishop) on Oct 28, 2017 at 12:59 UTC
    next if $seen{$key}++; push @{$words{$signature}}, $_;

    You've already received some excellent answers, I just wanted to touch on your questions about these two lines of code. Both of them involve the concept of autovivification, where elements of data structures are brought into existence simply by referring to them.

    The postincrement operator ++ first returns the old value of its argument and then increments it (the argument $seen{$key}, not the returned value). This means that if the key does not yet exist in the hash, it will return the undefined value, which in the boolean (scalar) context of if evaluates to false, then brings the hash entry into existence, with a value of 1 (Update 2: because undef in numeric context is 0, so 0+1=1, and in this special case there is no warning about using undef in a numeric operation). The next time this key is encountered, that means that the operator returns the previous value of 1, which is true, so the next is executed, thereby skipping duplicates. Since the increment is also executed, this also means that the values of the hash count how many times each key is seen.

    In the second case, the array dereference operation @{} tells Perl explicitly that you expect that hash element to contain an array reference, so if the hash element does not yet exist, it will be brought into existence as an empty anonymous array, so then it can be pushed onto.

    As for your questions about for and $_, I find that the first few paragraphs of Foreach Loops explains this fairly well.

    Update: I strongly recommend you Use strict and warnings, also have a look at the Basic debugging checklist.

    Update 2019-08-17: Updated the link to "Truth and Falsehood".

Re: Anagrams & Letter Banks
by Discipulus (Canon) on Oct 27, 2017 at 19:51 UTC
    Hi dominik_t

    > Given a list whose elements are strings (containing only letters, as they are actually English-language words), how would one print that list only on the condition that at least one of the strings in the list contains no repeated letters?

    Counting repetitions or not repetitions is a call for a hash. The following iterates over lists and, using the $has_unique switch, search if one word of the list is built with just different characters.

    This is done with the %chars hash: $chars{$_}++ for ($word =~ /./g) where the regex in list context returns all chars and the respective value in the hash is augmented by one.

    Then if the length of the $word is equal to the number of keys of the hash the word must be built with different characters, so the switch $has_unique is turned on and after all word in the list were processed the list is printed only if the switch is on.

    use strict; use warnings; my $good = [qw( allo mallo malo)]; my $bad = [qw( tillo sillo sallo)]; foreach my $list ($good,$bad){ my $has_unique; foreach my $word(@$list){ my %chars; $chars{$_}++ for ($word =~ /./g); if (scalar keys %chars == length $word){ # print "$word has no repeated letters"; $has_unique++; } } print +(join ' ', @$list),"\n" if $has_unique; } # output allo mallo malo

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thank you, Discipulus. I need clarification on at least a couple of things. If (allo mallo malo) is a list, why does it appear to be a scalar with my $good?

      Also, with regard to the %chars hash, are you saying that a new hash is made for each word in the list, and that the keys of that hash are the individual characters in the word? And after it's parsed a whole word, the value corresponding to a key is the number of times it occurs? (Does the ++ make that happen? I guess I'm not seeing how the values are added to the hash.) So then this works by looking at the number of keys, which would match the length of the word if and only if the word had no repeated letters.
        hello again dominick_t

        An explicit array is @arr = qw(a b x) and an anonymous one is $array_ref = [qw(c d e)]

        Use print and/or Data::Dumper or better Data::Dump to se in action:

         my $array_ref = [qw(c d e)]; print Dumper $array_ref;

        The doc covering this is perlref

        > %chars hash, are you saying that a new hash is made for each word in the list, and that the keys of that hash are the individual characters in the word?

        Yes! try it to see; print is your first debugging tool:

        use strict; use warnings; my $good = [qw( allo mallo malo)]; my $bad = [qw( tillo sillo sallo)]; foreach my $list ($good, $bad){ print "got list [@{$list}]\n"; my $has_unique; foreach my $word(@$list){ print "analizing word [$word]\n"; my %chars; foreach my $char ($word =~ /./g){ print "\tGot char [$char]\n"; $chars{$char}+= 1; } print "\tchars count is:\n"; print map{"\t$_ = $chars{$_} "}keys %chars; print "\n"; if (scalar keys %chars == length $word){ print "\t$word has no repeated letters (keys of \%chars + are equal to the length of \$word)\n"; $has_unique++; } print "\n"; } print "I print the whole list: ",(join ' ', @$list),"\n\n" if $has +_unique; } # output got list [allo mallo malo] analizing word [allo] Got char [a] Got char [l] Got char [l] Got char [o] chars count is: l = 2 a = 1 o = 1 analizing word [mallo] Got char [m] Got char [a] Got char [l] Got char [l] Got char [o] chars count is: l = 2 a = 1 m = 1 o = 1 analizing word [malo] Got char [m] Got char [a] Got char [l] Got char [o] chars count is: l = 1 a = 1 m = 1 o = 1 malo has no repeated letters (keys of %chars are equal to the +length of $word) I print the whole list: allo mallo malo ...
        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Anagrams & Letter Banks
by tybalt89 (Monsignor) on Oct 27, 2017 at 22:48 UTC

    I am interested in finding groups of words not that are all anagrams of each other, but rather which all share a common “letter bank,” which is a word (in the list) with no duplicate letters.

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1202179 use strict; use warnings; chomp( my @words = <DATA> ); # find the "banks" in the word list my %banks = map { $_, [ ] } grep !/(.).*\1/, @words; # find what a word "banks" to and save if "bank" exists for ( @words ) { my $banksto = tr///csr; exists $banks{$banksto} and push @{ $banks{$banksto} }, $_; } print "@$_\n" for values %banks; __DATA__ ab aabb aaabbb xxyy xxxyyy

      Perhaps I'm misunderstanding something but it seems to me that my $banksto = tr///csr; is only removing adjacent duplicate letters. Adapting your code:-

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1202179 use strict; use warnings; use Data::Dumper; chomp( my @words = <DATA> ); # find the "banks" in the word list my %banks = map { $_, [ ] } grep !/(.).*\1/, @words; print Data::Dumper->Dumpxs( [ \ @words, \ %banks ], [ qw{ *words *bank +s } ] ); # find what a word "banks" to and save if "bank" exists for ( @words ) { my $banksto = tr///csr; print qq{$_ -> $banksto\n}; exists $banks{$banksto} and push @{ $banks{$banksto} }, $_; } print "@$_\n" for values %banks; __DATA__ ab aabb aaabbb aaabbab xxyy xxxyyy

      produces the following:-

      @words = ( 'ab', 'aabb', 'aaabbb', 'aaabbab', 'xxyy', 'xxxyyy' ); %banks = ( 'ab' => [] ); ab -> ab aabb -> ab aaabbb -> ab aaabbab -> abab xxyy -> xy xxxyyy -> xy ab aabb aaabbb

      My reading of the OP's requirement would imply that the added 'aaabbab' should also bank to 'ab' and e.g., 'cabbage' banks to 'cabge' not 'cabage' which is what comes out of tr//csr. What am I misunderstanding?

      Update: Thinking about it further, I realised I was fixating on the tr//csr in tybalt89's post (which usage I had never encountered before) and ignoring the join "", sort split //, $key used by the OP. That is what I was misunderstanding :-/

      Cheers,

      JohnGG

Re: Anagrams & Letter Banks
by Laurent_R (Canon) on Oct 27, 2017 at 19:02 UTC
    The %seen hash is there to remove duplicate words from the input. If a word in $key has nos been seen, then $seen{$key} will not exist and return a false value, so that we will continue the process; at the same time, the hash entry will be created for the key and the value will be 1. If the word has been seen, then $seen{$key} will exist and return a true value (1, the first time you see a duplicate word, then 2, 3...), we will go back to the top of the loop starting with the next input word.

    The %word hash will have normalized words (i.e. with letters put in alphabetic orders) or signatures as keys, and for each such key, will contain an anonymous array of the original words that were translated into this signature. In other words, if an array in the hash or arrays has more than one entry, then it means that two or more words matched the normalized signature and that we have an anagram.

    I hope this makes it clearer.

      Thank you, Laurent_R. That definitely clears up how the %seen hash is doing its thing. The bit about the %word hash I already understood . . . It was just the use of $_ that threw me for a moment. If you have thoughts on the question asked at the beginning of the (edited) post, I'd appreciate that as well! Thanks very much for you've already written here.
        Basically, you have two loops in a row. In the first loop, each record is assigned to $_. When this loop is finished, $_ does not have any useful value. When you start the second loop, again, each value is assigned to $_.

        No problem here, but there are cases, such as nested loops, where is might become slightly dangerous (you might clobber the value of the outer $_ with the value of the inner $_); in the cases of nested loops, while loops and for don't do exactly the same thing, so it is somewhat unreliable and don't do it unless you really know what you're doing. The best, in the event of nested loops, is to use explicit variables rather than the default topical variable $_.

        On the questions related to your other project (not anagrams), you don't give enough details on what you need and how it is supposed to work. Please specify.

Re: Anagrams & Letter Banks
by tybalt89 (Monsignor) on Oct 27, 2017 at 20:17 UTC

    "Given a list whose elements are strings (containing only letters, as they are actually English-language words), how would one print that list only on the condition that at least one of the strings in the list contains no repeated letters?"

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1202179 use strict; use warnings; my $good = [qw( allo mallo malo)]; my $bad = [qw( tillo sillo sallo)]; grep !/(.).*\1/, @$_ and print "@$_\n" for $good, $bad;
      Thank you, tybalt89. It's nice to see this regex solution as well as the hash solution offered earlier. I was just reading about backreferences, and again it makes sense to me in principle, but the details remain opaque. What do the parentheses in (.) accomplish? And how does the backreference determine if something has been seen again?
        Hi dominick_t

        The parentheses capture one letter and look for a repeat, \1. He greps for a string which doesn't cause the regular expression to succed, (! /.....

        Update

        I should have been more thorough in my reply. When there is a parentheses around an expression, in this case '.', \1 refers to that captured value. If an 'a' was captured, then \1 would be a. If a 'd' was captured, then for \1 to match, it would need to find a 'd' somewhere further in the string.

        Perl will try to make the match succeed, so it will progress matching each letter until it finds a match. If it does, then the the regexp will succeed. Else, it will fail.

        If it fails then the grep succeeds!

        There would be a \2 if there were 2 sets of capturing parentheses. \1 would be what was captured in the first parens and \2 would be what was captured in the second set of parens. In this problem there is only one set of parens so only \1 would be involved.

Re: Anagrams & Letter Banks
by roboticus (Chancellor) on Oct 28, 2017 at 14:32 UTC

    dominick_t:

    Let's answer your last question first: How to print the line "ab aabb aaabbb" and not "xxyy xxxyyy": Since your key is the unique list of letters found in each word, your signatures are "ab" and "xy". You want to print the word list only if it contains a word that contains all the letters, but not any repeats.

    As you guess, this is a simple condition. You already know that all words contain all the letters in the list, so that condition is met automatically. Since you want a word that contains only those letters in your signature with no repeated letters, all you need to do is check the length: If any word in the list is the same length as the key, both conditions are met, and you can print the list, by adding two lines to your print loop, like so:

    ... for (sort keys %words) { my $siglen = length; next unless grep { $siglen == length } @{$words{$_}}; ...
    (How does Perl know this? Because “push” expects an array? Because the @ symbol lets Perl know?)

    Perl has a built-in feature called "autovivification" where if you add something to a data structure that doesn't exist, it tries to build it for you. I don't know exactly where/how it happens, but my mental model of the push @{$words{$signature}}, $_; statement goes something like this:

    . . . # push @{$words{$signature}, $_; $rTemp = $words{$signature} call push_op($rTemp, $_); . . . push_op: $rArray = get_array_ref($arg[0]); $rArray[$#rArray+1] = $_; return; get_array_ref: if (!defined $arg[0]) { return [ ]; } if ("ARRAY" ne ref $arg[0]) { die "Not an ARRAY reference"; } return $arg[0]; }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Anagrams & Letter Banks
by Laurent_R (Canon) on Oct 28, 2017 at 11:31 UTC
    Hi dominick_t,

    I looked again at the OP, especially at its original content ("long post") before you added the new introduction, and I think I finally understood what you were trying to do.

    In the code below, I first traverse the array of words looking for words without repeated letters and store them in the %signatures hash (storing them in anonymous arrays, because two different words may have the same signature if they are anagrams, such as "ab" and "ba" in my list of words).

    Then I read the array a second time, skip words without duplicate letters (already stored in the hash), build a signature from the de-duplicated letters; if this signature is found in the %signatures hash, then we've found a word using the same letters as one of the words with no repeated letters and can add it in the array corresponding to this hash entry.

    At the end, the program prints every hash entry for which there is more than one word in the anonymous array.

    #!/usr/bin/perl use strict; use warnings; my (%seen, %signatures); my @words = qw / ba ab aabb aaabbb xxyy xxxyyy baaab baba xxy xyz ABBA + dcdccd cd/; for my $word (@words) { chomp $word; # not needed here, but may be useful when reading f +rom a file or a DATA section my $word = lc $word; next if $seen{$word}++; next if $word =~ /(.).*\1/; # skip if the word has repeated lette +rs my $signature = join "", sort split //, $word; push @{$signatures{$signature}}, $word; } for my $word (@words) { chomp $word; my $word = lc $word; next unless $word =~ /(.).*\1/;; # skip words already stored (h +aving no duplicate letters) my %unique_letters = map { $_ => 1 } split //, $word; # could use + the uniqstr function of Lists::Util my $signature = join "", keys %unique_letters; push @{$signatures{$signature}}, $word if exists $signatures{$sign +ature}; } for my $key (keys %signatures) { if (scalar @{$signatures{$key}} > 1) { print "@{$signatures{$key}}\n" } }
    With the data in the @words array used in the code above, I obtain the following output:
    ba ab aabb aaabbb baaab baba abba cd dcdccd
    which seems to be what you're looking for.

    Note that is probably not how I would usually write this, my code would probably be quite a bit shorter, but I tried to avoid colloquial (too idiomatic) Perl and to keep it very simple, doing things step by step.

Re: Anagrams & Letter Banks
by BillKSmith (Monsignor) on Oct 28, 2017 at 11:20 UTC
    use of a utility module List::MoreUtils allows you to write code, for your first question, that looks much like your specification.
    use strict; use warnings; use List::MoreUtils qw(any); my @strings = qw(abcd abbd efgh); print @strings if any {/(\w)\1/} @strings;
    Bill