PerlMonks  

Re: Comparing Lines within a Word List

by Eily (Monsignor)
on Apr 26, 2016 at 20:27 UTC ( [id://1161592]=note )


in reply to Comparing Lines within a Word List

Regular expressions are not actually the best tool for what you want. Not because your problem is impossible or even difficult to solve with regular expressions, but because there is a much better option. The bitwise xor operator "^", applied to two strings, yields a 0 bit wherever the two strings agree and a 1 for every bit that differs, so every position where the characters are equal comes out as a NUL byte.

    my $first = "Fool";
    my $second = "Foot";
    my $diff = ($first ^ $second);
    print unpack "B*", $diff;         # Print the binary representation of the difference
    my @diff_char = split //, $diff;  # get a char by char difference.

With that, and maybe the use of ord (you don't actually need it but it may help make things clearer) you should be able to do what you want.
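
As a minimal sketch of how `ord` could be combined with the xor result, the following reports where two strings differ. The helper name `diff_positions` is made up for this illustration, and the code assumes both strings have the same length and hold single-byte characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): returns one
# [position, char_from_first, char_from_second] entry per differing position.
# Assumes equal-length strings of single-byte characters.
sub diff_positions {
    my ($first, $second) = @_;

    # String xor: a zero byte wherever the characters agree.
    my @bytes = map { ord($_) } split //, $first ^ $second;

    my @found;
    for my $i (0 .. $#bytes) {
        next unless $bytes[$i];    # zero means "same character here"
        push @found, [ $i, substr($first, $i, 1), substr($second, $i, 1) ];
    }
    return @found;
}

for my $d (diff_positions('Fool', 'Foot')) {
    printf "position %d: '%s' vs '%s' (xor = %d)\n",
        $d->[0], $d->[1], $d->[2], ord($d->[1]) ^ ord($d->[2]);
}
# prints: position 3: 'l' vs 't' (xor = 24)
```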

Re^2: Comparing Lines within a Word List
by AnomalousMonk (Archbishop) on Apr 26, 2016 at 23:02 UTC

    Actually, bitwise-xor on strings and  tr/// (update: see Quote-Like Operators in perlop) go together quite nicely for something like this:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Data::Dump qw(pp);
     ;;
     for my $word (qw(Fool Foot Tool Toot Foal)) {
        my $diff = 'Fool' ^ $word;
        print qq{'$word': }, pp $diff;
        print qq{'Fool' and '$word' differ by 1 char}
            if 1 == $diff =~ tr/\x00//c;
        }
    "
    'Fool': "\0\0\0\0"
    'Foot': "\0\0\0\30"
    'Fool' and 'Foot' differ by 1 char
    'Tool': "\22\0\0\0"
    'Fool' and 'Tool' differ by 1 char
    'Toot': "\22\0\0\30"
    'Foal': "\0\0\16\0"
    'Fool' and 'Foal' differ by 1 char

    Update: Changed example code to use  tr/\x00//c (/c modifier: complement the search list).
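
    The xor-plus-tr/// idea can also be wrapped in a small subroutine. This is only a sketch: the name `count_diffs` is invented here, and the length check is an added assumption, included because string xor treats the shorter operand as if it were padded with zero bits, so 'Foo' and 'Fool' would otherwise look like a one-character difference.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): number of positions
# at which two equal-length strings differ, or undef if the lengths differ.
sub count_diffs {
    my ($x, $y) = @_;
    return undef if length($x) != length($y);

    my $diff = $x ^ $y;              # zero byte wherever the characters agree
    return $diff =~ tr/\x00//c;      # /c: count every byte that is NOT zero
}

for my $word (qw(Fool Foot Tool Toot Foal)) {
    my $n = count_diffs('Fool', $word);
    print "'Fool' and '$word' differ by 1 char\n" if $n == 1;
}
```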


    Give a man a fish:  <%-{-{-{-<

      Thank you both for the replies! I hope everyone in the thread can see this, and not just the author of the note on which I hit the reply button.

      Okay, so if I'm getting this right, it looks like in this example, you're taking the word 'fool' and comparing its characters to each of the five words in the array, and since 'fool' matches itself exactly, the return on that one is all zeros. Any place there is not a zero is a place where the words differ. (I'm not immediately sure why the "difference" between the character 'l' and 't' would be 30 but I'm sure it's easily explained.) So I see how this works in principle, to compare two given words and look for word pairings that yield a one-character difference.

      But then how might I use this to solve the problem that I have, which is to find -- from let's say a massive dictionary of English language words -- all pairs of words that are the same except for one letter, and in particular, for that character difference to be that one has an R while the other has an S? Again, many thanks.

        I hope everyone in the thread can see this, and not just the author of the note on which I hit the reply button.

        They can.

        I'm not immediately sure why the "difference" between the character 'l' and 't' would be 30 ...

        You're seeing the octal values resulting from the character-by-character bitwise-xor of two strings. So

        c:\@Work\Perl\monks>perl -wMstrict -le
        "printf qq{%#02o \n}, ord 'l';
         printf qq{%#02o \n}, ord 't';
         printf qq{%#02o \n}, 0154 ^ 0164;
        "
        0154
        0164
        030

        ... the problem ... [find] from ... a massive dictionary of English language words -- all pairs of words that are the same except for one letter, and in particular, for that character difference to be that one has an R while the other has an S [in the same character position] ...
        [please note the bracketed addition "in the same character position"]

        As to this much larger problem (as restated; please confirm this clarification — or may the differing characters be in any position? (Update: E.g., Is 'aSaa' a "match" for 'aaRa'?)): it's an interesting one, but I've no time right now to go into it in detail.
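
        In the meantime, here is one possible sketch for the restated problem, under loud assumptions: the differing R and S must sit in the same character position, matching is case-sensitive on lowercase 'r' and 's', and the whole word list fits in memory. It swaps in a hash lookup rather than xor-ing every pair of words; the name `rs_pairs` and the sample word list are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [r_word, s_word] pairs where s_word is r_word
# with exactly one 'r' replaced by 's' at the same position.
sub rs_pairs {
    my @words = @_;
    my %seen  = map { $_ => 1 } @words;    # fast "is this a word?" lookup

    my @pairs;
    for my $word (sort keys %seen) {
        my $pos = -1;
        # Visit every 'r' in the word, one position at a time.
        while (($pos = index($word, 'r', $pos + 1)) >= 0) {
            my $candidate = $word;
            substr($candidate, $pos, 1) = 's';
            push @pairs, [ $word, $candidate ] if $seen{$candidate};
        }
    }
    return @pairs;
}

# Made-up sample list standing in for a real dictionary file.
my @sample = qw(care case cart cast rat sat roar soar fool);
print "$_->[0] / $_->[1]\n" for rs_pairs(@sample);
```

        Each word costs one lookup per 'r' it contains, so the work grows with the size of the dictionary instead of with the number of word pairs. Xor alone would not be enough to pick out R/S pairs, since 'r' ^ 's' gives the same byte ("\x01") as, for example, 'p' ^ 'q'.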

        Update: Actually, the  '02' in the  '%#02o' format specifier used in the printfs above is unnecessary, although it does no harm. The same result (and the result I wanted) can be had with  '%#o' instead.
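
        A quick way to check that the two format specifiers agree for the values used in this thread (a small sketch; the variable names are arbitrary):

```perl
use strict;
use warnings;

my $xor = ord('l') ^ ord('t');             # 0154 ^ 0164

my $with_width    = sprintf '%#02o', $xor; # '#' adds the leading 0; width 2 is already exceeded
my $without_width = sprintf '%#o',   $xor;

print "$with_width $without_width\n";      # prints: 030 030
```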


        Give a man a fish:  <%-{-{-{-<

      I always forget about using tr/// for counting, thanks for the reminder :)
