PerlMonks  

Re^2: Comparing Lines within a Word List

by dominick_t (Acolyte)
on Apr 29, 2016 at 23:04 UTC


in reply to Re: Comparing Lines within a Word List
in thread Comparing Lines within a Word List

I was able to run this code with my long word list. It appears to not be grabbing some matches that I need, but I do not know enough about regular expressions to be able to fix it. For example, when I run the code looking for a/b swap matches, 'lama' and 'lamb' should be a match but it is not showing up as such. I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?

Re^3: Comparing Lines within a Word List
by hippo (Chancellor) on Apr 30, 2016 at 10:04 UTC
    I'm guessing it's to do with the fact that 'lama' has two a's? Is it possible to easily amend this code to handle this case?

    Yes, and it's a trivial amendment. Just deploy the /g modifier:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @words = <DATA>;
    chomp @words;
    while ( @words >= 2 ) {
        my $model = my $regex = shift @words;
        if ( $regex =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g ) {
            my @hits = grep /^$regex$/, @words;
            if ( @hits ) {
                print join( " ", $model, "matches", @hits, "using", $regex, "\n" );
            }
        }
    }
    __DATA__
    lama
    lamb

      Thanks hippo, I will give this a shot. I'm reading the documentation on the /g modifier but do not see yet how this will work.

      If the issue was that there were two a's in 'lama', what does it mean that when I ran the original code, 'aaron' successfully matched 'baron' as it should have? Guessing it's to do with the fact that in this latter case, the repeated instance of 'a' comes after rather than before the a/b swap position, unlike 'lama' and 'lamb'. How this plays into the regex though, I don't yet see.

        Note that hippo's addition of the g modifier brings up the issue that I raised earlier about having more than one character difference between two words. Here's what happens when I add a couple more examples to hippo's version:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my @words = <DATA>;
        chomp @words;
        while ( @words >= 2 ) {
            my $model = my $regex = shift @words;
            if ( $regex =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g ) {
                my @hits = grep /^$regex$/, @words;
                if ( @hits ) {
                    print join( " ", $model, "matches", @hits, "using", $regex, "\n" );
                }
            }
        }
        __DATA__
        lama
        lamb
        able
        bale
        Output:
        lama matches lamb using l[ab]m[ab]
        able matches bale using [ab][ab]le
        The output shows how the g modifier affects the creation of the regex to be used for searching the array; without it, the first regex would be l[ab]ma (which would not match "lamb"), and the next would be l[ab]mb (which would not match "lama" if it were to show up later in the list).

        But when using the g modifier, the search patterns for "able" and "bale" come out the same, and the two words match each other, because the regex [ab][ab]le allows up to two characters to differ.

        To solve that, you could compare the current "model" word against each of the matches from the array, using the tr/// operator as described in previous replies, to see how many characters differ in each pair of words, and keep only those matches that differ by a single character.
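As a rough sketch of that filtering step: the earlier replies are not quoted in this node, so the code below assumes the usual idiom of XOR-ing the two strings and counting the non-null bytes with tr///. The names diff_count and single_swaps are made up for illustration, and the word list is inlined rather than read from DATA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which two equal-length words differ.
# XOR-ing the strings gives "\0" wherever the bytes agree, so
# tr/\0//c counts the bytes that are NOT "\0".
sub diff_count {
    my ( $x, $y ) = @_;
    return -1 if length $x != length $y;    # guard: lengths must match
    my $count = ( $x ^ $y ) =~ tr/\0//c;
    return $count;
}

# Keep only the candidates that differ from $model in exactly one position.
sub single_swaps {
    my ( $model, @candidates ) = @_;
    return grep { diff_count( $model, $_ ) == 1 } @candidates;
}

my @words = qw( lama lamb able bale );
while ( @words >= 2 ) {
    my $model = my $regex = shift @words;
    if ( $regex =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g ) {
        # The regex finds candidates; the count rejects multi-character swaps.
        my @hits = single_swaps( $model, grep /^$regex$/, @words );
        print join( " ", $model, "matches", @hits ), "\n" if @hits;
    }
}
```

With this word list, "lama" still pairs with "lamb", while "able" and "bale" are rejected because they differ in two positions.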

        (UPDATE: It's also worth noting that using g this way is effectively equivalent to using "split", "map" and "join" to build the multi-match regex, like I showed in this previous reply, which just goes to show that "there's more than one way to do it.")
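To make that equivalence concrete, here is a small side-by-side sketch. It is not the code from the earlier reply (which is not quoted here); pattern_via_subst and pattern_via_split are hypothetical names, and the comparison assumes the words contain only plain letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the pattern with the global substitution used in this thread.
sub pattern_via_subst {
    my ($word) = @_;
    ( my $pattern = $word ) =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g;
    return $pattern;
}

# Build the same pattern one character at a time:
# every 'a' or 'b' becomes the class [ab], anything else is kept as-is.
sub pattern_via_split {
    my ($word) = @_;
    return join '', map { /^[ab]$/ ? '[ab]' : $_ } split //, $word;
}

# Print both results next to each other for a few sample words.
for my $word (qw( lama aaron able )) {
    printf "%-6s %-14s %s\n", $word,
        pattern_via_subst($word), pattern_via_split($word);
}
```

Both columns should come out identical for each word, which is the sense in which the two approaches are interchangeable here.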

        If you cannot work out in your head what the substitution actually does (and it's not an easy thing if you are new to all this), then give it a try in some code. The lack of boilerplate in Perl really helps when coding up trivial scripts for testing, e.g.:

        #!/usr/bin/env perl
        use strict;
        use warnings;

        for my $word ('lama', 'aaron') {
            print "Word is $word\n";
            print "without /g the regex becomes: ";
            my $r = $word;
            $r =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/;
            print "$r\n";
            print "with /g the regex becomes: ";
            $r = $word;
            $r =~ s/(.*?)[ab](.*?)/$1\[ab\]$2/g;
            print "$r\n";
        }

        Hopefully running this code will illustrate to you how the substitutions differ because of the /g modifier.
