Match a chunk

dhasaan has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem with matching a member of an array with a member of another array, the regex match statements in the following code is providing answers i am not clear about. Any suggestions.

#!/usr/bin/perl -w
use strict;
#Matching a member of texlines with a member of update*
#Last example, Fourth, is what needs to be done
my @updateloc = ("LakeOsouth", "LakeOeast", "LakeO2", "LakeO");
my @updateloc1 = ("LakeOsouth/stuf", "LakeOeast/stuf", "LakeO2/stuf", 
+"LakeO/stuf");
my @texlines = ("LakeOsouth", "LakeOeast", "LakeO2", "LakeO");
# Works fine
print "Zero\n";
foreach my $update (@updateloc){
    for my $lin (@texlines){
        if ($lin =~ /$update\b/){
            if ($update =~ /$lin/) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
}    }    }    }
# Getting 3 bad ones
print "First\n";
foreach my $update (@updateloc){
    for my $lin (@texlines){
        if ($lin =~ /$update/){
            if ($update =~ /$lin/) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
}    }    }    }
# Getting 2 bad ones; the digit on end of $update iso a char is ok?
print "Second\n";
foreach my $update (@updateloc){
    for my $lin (@texlines){
        if ($lin =~ /$update/g){
            if ($update =~ /$lin/g) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
}    }    }    }
# No good, anything to do with previous matches (the /g)?
print "Third\n";
foreach my $update (@updateloc){
    for my $lin (@texlines){
        if ($lin =~ /$update/g){
            if ($update =~ /$lin/g) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
}    }    }    }
# Back to number Zero, so Third is influenced by Second?
print "Third and a half\n";
foreach my $update (@updateloc){
    for my $lin (@texlines){
        if ($lin =~ /$update/){
            if ($update =~ /$lin/) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
}    }    }    }
#Finally the problem
# This is what i need: match 1 reverse, doesn't happen
print "Fourth\n";
foreach my $update (@updateloc1){
    for my $lin (@texlines){
        if ($lin =~ /$update/g){
            if ($update =~ /$lin/g) {
                print "match  1 reverse $lin $update\n";
            }else{
                print "bad match $update $lin\n";
            }
        }else{
            if ($update =~ /$lin/g) {
                print "\tmatch 2 reverse $lin $update\n";
            }else{
                print "\tno match $update $lin\n";
}    }    }    }
[download]

Comment on Match a chunk Download Code

Replies are listed 'Best First'.
Re: Match a chunk by pc88mxer (Vicar) on Jun 11, 2008 at 22:53 UTC
What are you trying to accomplish with the `/g` modifier? You are not using it in a very conventional way. When you use the `/g` modifier, you update a position within the string from which the next match attempt begins. This position is available by calling `pos()` on the string: `my $t = "cat hat bat mat"; print pos($t), "\n"; # -> (empty) $t =~ m/at/g; print pos($t), "\n"; # -> 3 $t =~ m/at/g; print pos($t), "\n"; # -> 7 $t =~ m/at/g; print pos($t), "\n"; # -> 11 $t =~ m/hat/g; print pos($t), "\n"; # -> (empty) $t =~ m/at/g; print pos($t), "\n"; # -> 3` [download] For the first three matches you see `pos()` advancing along the string. The match `m/hat/g` fails because `pos()` is already past `hat`. This failure also causes the next match to start at the beginning of the string.	[reply] [d/l] [select]
Re: Match a chunk by starbolin (Hermit) on Jun 12, 2008 at 02:41 UTC
`(1) foreach my $update (@updateloc1){ (2) for my $lin (@texlines){ (3) if ($lin =~ /$update/g){` [download] Line 3 doesn't make sense to me. Why the 'g' option? Do you expect there to be more that one instance of a member of @updateloc1 contained in any given member of @textlines? Also, it would seem, for your dataset, line 3 would always fail as $update would always be a longer string than $lin. I have to admit I don't understand what you are trying to do. What is your expected output? Is the data always in order like the example? Are there multiple matches in the array? s//----->\t/;$~="JAPH";s//\r<$~~/;{s\|~$~-\|-~$~\|\|\|s \|-$~~\|$~~-\|\|\|s,<$~~,<~$~,,s,~$~>,$~~>,, $\|=1,select$,,$,,$,,1e-1;print;redo}	[reply] [d/l]
Re^2: Match a chunk by dhasaan (Initiate) on Jun 12, 2008 at 10:06 UTC
Thanks for the responses, i realize the /g in updateloc1 should not have been added. I need to match each member of texlines exactly to the correct one in updateloc1, the problem occurs when its LakeO's turn and it should come back with the fourth member of updateloc1 (LakeO/stuf), it doesn't, (without /g or \b). What am i doing wrong?	[reply]
Re^3: Match a chunk by Anonymous Monk on Jun 12, 2008 at 11:25 UTC
One problem is that `'LakeO'` is a substring of every string in `@updateloc1`. The regex `/LakeO\b/` matches only with `'LakeO/stuf'` because the latter string has a forward-slash (`'/'`) in just the right position to match with the `\b` word-boundary metacharacter in the regex. Why it matches with the use of `/g` (if, indeed, it does) I will not be able to figure out without the ingestion of a lot more caffeine. What is your definition of exact match in the context of the example you have given?	[reply] [d/l] [select]
Re: Match a chunk by Anonymous Monk on Jun 12, 2008 at 10:46 UTC
I, too, cannot understand what you are trying to accomplish by using the `/g` regex modifier or, in a broader sense, what your ultimate goal may be. If you could give us a better idea of this goal, we might be better able to help. I suspect this may be an instance of an XY Problem. One thing strikes me about the 'reverse' regex approach you are using. If `$s1` and `$s2` are both 'pure' strings with no regex metacharacters (or with any metacharacters escaped), then the expression `$s1 =~ /$s2/` is true if `$s2` is a substring of `$s1`. If this is true and the match is reversed, `$s2 =~ /$s1/` is true if `$s2` equals `$s1` by string comparison (i.e., `$s2 eq $s1` is true). So why not just use index (to find a substring) and `eq` (see the string Equality Operators) instead of trying to deal with the arcana of the operation of `/g` across multiple regexes?	[reply] [d/l] [select]
Re^2: Match a chunk by dhasaan (Initiate) on Jun 13, 2008 at 13:36 UTC
Let me try to express the problem better, i did indeed confuse by including too much. Being less confused myself this morning, i have eliminated all except what i really need in the following and resolved it: `#!/usr/bin/perl -w use strict; my @updateloc1 = ("LakeOsouth/stuf", "LakeOeast/stuf", "LakeO2/stuf", +"LakeO/stuf"); my @texlines = ("LakeOsouth", "LakeOeast", "LakeO2", "LakeO"); for my $lin (@texlines){ for my $update (@updateloc1){ if ($update =~ /$lin/){ print "The table column $lin matches the directory name $u +pdate, so now do stuf in the subdir\n"; } } }` [download] The above code works, except that when it gets to table column $lin = LakeO it matches all the members in updateloc1, so my struggle has been to make $lin match $update exactly to the forward slash. Well, after re-arranging the loops, the solution is simple: $update =~ /$lin\//. Should have taken a break much sooner.	[reply] [d/l]
Re: Match a chunk by Anonymous Monk on Jun 13, 2008 at 06:54 UTC
In looking back over this question, and seeing that the forward-slash (`'/'`) nix path separator character figures in the examples and that there seems to be a concern with one string appearing at the beginning of another, it occurred to me that the real problem might be something like the following: Given a directory name and a path, is it true that the given directory is the first (i.e., top level) directory of the path? If so, the following code may be helpful. # /usr/bin/perl # for a given path and directory, determine if given directory # is the first (i.e., top-level) directory in the path. # accepts win32 or nix path separators. # accepts directory names with embedded spaces. use warnings; use strict; # '\' repeated 4 times in next string to compensate for # subsequent double-quote-like interpolation. my $seps = '\\\\/'; # possible path separator chars: win & *nix # regexes to recognize parts of a path. maybe a bit crude. my $rx_path_sep = qr{ [$seps] }xms; # path separator my $rx_dir_like = qr{ [^$seps]+ }xms; # directory name my @weird = ('', '\\', '/'); # extra weird strings for testing # first DATA line is possible top-level directory names for testing. my @dirs = (@weird, split ',', do { chomp(my $d = <DATA>); $d }); # all other DATA lines are paths for testing. my @paths = (@weird, map { chomp; split ',' } <DATA>); # pre-compile directory names to regexes recognizing them # as first (i.e., top-level) directory in a path. my %toplevel = map { $_ => # each possible dir paired with regex qr{ \A (?= $rx_dir_like) # dir-like string is first \Q$_\E # and is the given dir (?: $rx_path_sep \| \z) # followed by sep or end of string }xms } @dirs; # check all permutations. for my $path (@paths) { for my $dir (@dirs) { printf qq(%10s begins '%s' \n), qq('$dir'), $path if $path =~ $toplevel{ $dir }; # if $path =~ m{ \A (?= $rx_dir_like) \Q$dir\E # (?: $rx_path_sep \| \z) # }xms; } } __DATA__ abc,abcd,abcde,x,ab c,ab cd,a+b,foo abc,abc\,abc/x,abcd,abcd/,abcd\x,abcde,abcde/,abcde/x x,x/,x/abc,abc\abcd,abcd/abc,abc\abc,abc\abcd/abcde,abcde\abcd/abc ab c,ab c\,ab c/abc,ab cd,ab cd\,ab cd\x,ab c/abc ab cd\ab c,a+b,a+b/,a+b\abc,bar,bar\zot,\,/,\abc,/abc,\x,/x /\,\/,/\x,\/x [download] Output: 'abc' begins 'abc' 'abc' begins 'abc\' 'abc' begins 'abc/x' 'abcd' begins 'abcd' 'abcd' begins 'abcd/' 'abcd' begins 'abcd\x' 'abcde' begins 'abcde' 'abcde' begins 'abcde/' 'abcde' begins 'abcde/x' 'x' begins 'x' 'x' begins 'x/' 'x' begins 'x/abc' 'abc' begins 'abc\abcd' 'abcd' begins 'abcd/abc' 'abc' begins 'abc\abc' 'abc' begins 'abc\abcd/abcde' 'abcde' begins 'abcde\abcd/abc' 'ab c' begins 'ab c' 'ab c' begins 'ab c\' 'ab c' begins 'ab c/abc' 'ab cd' begins 'ab cd' 'ab cd' begins 'ab cd\' 'ab cd' begins 'ab cd\x' 'ab c' begins 'ab c/abc' 'ab cd' begins 'ab cd\ab c' 'a+b' begins 'a+b' 'a+b' begins 'a+b/' 'a+b' begins 'a+b\abc' [download]	[reply] [d/l] [select]


Think about Loose Coupling
	PerlMonks