http://qs321.pair.com?node_id=1161597


in reply to Comparing Lines within a Word List

If you're looking for a way to catch any two words that differ by any single letter (e.g. report if input contains "fool/foot" and/or "foot/loot" and/or "loot/lout"), then it's a matter of building on Eily's idea using bitwise xor on pairs of strings that are the same length. Look at the elements of the array returned by the "split" statement, and if only one element of the array is non-zero, then the given pair of strings differ by only one character. (You'll want to organize your input into groups based on word length, and only do comparisons within each group.)

OTOH, if you're looking for words that contain a particular pair of characters, and differ only in terms of using one vs. the other of those two (e.g. you really just want "bare/base", etc., but not "foot/fool"), you would probably want to use a regex like this:

#!/usr/bin/perl use strict; use warnings; my @words = <DATA>; chomp @words; while ( @words >= 2 ) { my $model = my $regex = shift @words; if ( $regex =~ s/(.*?)[rs](.*?)/$1\[rs\]$2/ ) { my @hits = grep /^$regex$/, @words; if ( @hits ) { print join( " ", $model, "matches", @hits, "using", $regex +, "\n" ); } } } __DATA__ bare mare base case bust burt bent sat rat bat matter mattes pat
(update: fixed the "while" condition as per hippo's remark below -- also added anchors around $regex in the grep call, so that "bare" doesn't match "debased", etc.)

Replies are listed 'Best First'.
Re^2: Comparing Lines within a Word List
by hippo (Bishop) on Apr 27, 2016 at 08:53 UTC

    This is a good, clear solution (++). However it looks like there is an off-by-one error in the while condition which means that the last two entries of @words are not compared against each other. If we change it to

    while ( @words > 1 ) {

    then it seems to cover that case as well.

Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 27, 2016 at 03:57 UTC
    Thanks very much for the reply! At the moment, the problem I'd first like to solve is the second one you described . . . at least if I'm reading you correctly. I want pairs of words that are exactly the same except where one has an R, the other has an S. But I'm very much a newbie and am having a tough time parsing your example code so I cannot tell how it is working . . . would it be possible to walk me through it a bit? In general, are we starting with the first word in the list and testing it against the words below it for the R/S difference and reporting it if one is found, then moving to the next item on the list and testing it against all the ones below it, and so on? Again, many thanks!
      Yes, you've correctly described the approach, which uses the shift function to extract the word that is currently at the beginning of the array, and then, if that first word contains "r" or "s", a regex is created and used with the grep function to search for matches in all the remaining words in the array.

      One thing you didn't specify yet is what to do with sets like "cases / carer / caser / cares": Should the first one match all of the other three? Should the second one match both of the last two? Should the last two match each other? If the answer is "yes" on all points, then you'll want to create a different regex, which can be done using the split and map functions, and (my favorite from C) the "ternary" conditional operator:

      my $model = shift @words; my $regex = join( "", map{ ( /[rs]/ ) ? "[rs]" : $_ } split( /([rs +])/, $model )); next if ( $regex eq $model ); # skip if model has no "r" or "s" my @hits = grep /^$regex$/, @words; ...
      (BTW, maybe you already know, but /$regex/i (adding the "i" modifier at the end) does case-insensitive matches.)

      (updated to add a missing paren at the end of the second line in the snippet -- also added the anchors around $regex in the grep call)

        Thank you. This is extremely helpful. To answer your question about the tricky case you mention: 'cases' should NOT match 'carer', as for my purpose these words differ in two positions. (even though both positions involve an R/S swap, I still the need matches to differ in exactly one position). 'cases' SHOULD match 'caser' and 'cares'. For the same reason, 'carer' SHOULD match both of the remaining two ('caser' and 'cares') And finally, 'caser' should NOT match 'cares', again because these words differ at more than one position. Does that clarify things?
Re^2: Comparing Lines within a Word List
by Anonymous Monk on Apr 27, 2016 at 15:22 UTC

    Let perl regex do the looping for you.

    #!/usr/bin/perl -l use strict; use warnings; $_ = do{local $/; <DATA>}; /^((\w*) [rs] (\w*))$ # find a first word .* # skip non match lines ^(\2 [rs] \3)$ # second word same except rs (??{print "$1 $4"})/xms; # print and fail __DATA__ bare mare base case bust burt bent sat rat bat matter mattes pat
Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 27, 2016 at 15:33 UTC

    Thank you, hippo, for catching the one-line issue, and thank you graff for this enormously helpful code. I responded to your question about the specific case you raised below but I'll post it here as well.

    'cases' should NOT match 'carer', as for my purpose these words differ in two positions. (even though both positions involve an R/S swap, I still the need matches to differ in exactly one position). 'cases' SHOULD match 'caser' and 'cares'.

    For the same reason, 'carer' SHOULD match both of the remaining two ('caser' and 'cares').

    And finally, 'caser' should NOT match 'cares', again because these words differ at more than one position.

    With all that in mind, does this code still find the matches I need?

    Another question: am I right to say this program will accept as input a word list with words of different lengths and has a way of handling that? Earlier I had gotten the idea that I might have to split my word list into separate lists (words of length 2, words of length 3, and so on), which the program would then handle each in turn. But does this piece of code automatically handle that? I suppose this is solved in the regex line, but I'm still learning the ins and outs of regular expressions. Any further explanations as to how that regex line works would be greatly appreciated.

Re^2: Comparing Lines within a Word List
by dominick_t (Acolyte) on Apr 29, 2016 at 23:04 UTC
    I was able to run this code with my long word list. It appears to not be grabbing some matches that I need, but I do not know enough about regular expressions to be able to fix it. For example, when I run the code looking for a/b swap matches, 'lama' and 'lamb' should be a match but it is not showing up as such. I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?
      I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?

      Yes, and it's a trivial amendment. Just deploy the /g modifier:

      #!/usr/bin/perl use strict; use warnings; my @words = <DATA>; chomp @words; while ( @words >= 2 ) { my $model = my $regex = shift @words; if ( $regex =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g ) { my @hits = grep /^$regex$/, @words; if ( @hits ) { print join( " ", $model, "matches", @hits, "using", $regex +, "\n" ); } } } __DATA__ lama lamb

        Thanks hippo, I will give this a shot. I'm reading the documentation on the /g modifier but do not see yet how this will work.

        If the issue was that there were two a's in 'lama', what does it mean that when I ran the original code, 'aaron' successfully matched 'baron' as it should have? Guessing it's to do with the fact that in this latter case, the repeated instance of 'a' comes after rather than before the a/b swap position, unlike 'lama' and 'lamb'. How this plays into the regex though, I don't yet see.