^x* vs x*$

Carl-Joseph has asked for the wisdom of the Perl Monks concerning the following question:

My friends and I have been playing around with regular expressions, and we ran into something that we haven't been able to explain.

Here is some test code that shows the "phenomenon".

#! G:\Perl\bin\perl.exe

$test_string="x"; 

#
# This pattern matches once, and it matches
# after the "x", not before.
# 
@test = ($test_string =~ /^x*/g ); 
$NumMatch=@test;

print "Test String\t>$test_string<\n";
print "Prematch\t>$`<\n";
print "Match\t\t>$&<\n";
print "Postmatch\t>$'<\n";
print "Num Matches\t>$NumMatch<\n";
print "Match Arrary:\n";
foreach $m (@test) {
   print ">$m<\n";
}


print "\n";
print  "-" x 10;
print "\n\n";


# 
# This pattern matches twice, and 
# the second match is after the "x"
# 
@test = ($test_string =~ /x*$/g ); 
$NumMatch=@test;

print "Test String\t>$test_string<\n";
print "Prematch\t>$`<\n";
print "Match\t\t>$&<\n";
print "Postmatch\t>$'<\n";
print "Num Matches\t>$NumMatch<\n";
print "Match Arrary:\n";
foreach $m (@test) {
   print ">$m<\n";
}

__END__

The above code produces the following output.


Test String     >x<
Prematch        >x<
Match           ><
Postmatch       ><
Num Matches     >1<
Match Arrary:
>x<

----------

Test String     >x<
Prematch        >x<
Match           ><
Postmatch       ><
Num Matches     >2<
Match Arrary:
>x<
><

Here is why we are confused:

Why does the first pattern match after the "x"? I understand that "x*" is able to match nothing, but shouldn't it match the nothing before the "x", rather than the nothing after the "x"?

The reason I think it should match before the "x" is that I use the "^" assertion. Also, Camel2, p61 says:

"... any regular expression that can match the null string is guaranteed to match at the leftmost position in the string."

Also, why does the second pattern "x*$" match twice when the first pattern matched only once? It seems as though they should either both match once or both match twice.

Thanks,

Carl-Joseph


Replies are listed 'Best First'.
Re: ^x* vs x*$ by tilly (Archbishop) on Aug 19, 2000 at 17:08 UTC
Congratulations! You are exactly correct in your analysis, and correct to be unhappy with what Perl is doing. :-(

The first is a bug in 5.6.0. It doesn't happen in 5.005_03. It likely has been fixed by Hugo already. Anyone who wants to check that can follow my advice in Getting current versions of Perl and see if it is still there with more current patches.

The second likewise looks to me like a bug. It has been around longer though. (It appears in both 5.005_03 and 5.6.0.) You match the first time and pos() is set to the end of the string. The second time you go back, start from pos(), and find that you can match at the end of the string. The first time through, the engine needs to record that it actually matched the end of the string, so that it does not match there again the next time.

At this point you should run "perlbug" with your code, and toss in my observation that the first behaved differently in Perl 5.005_03. But first I would clean it up as follows:

&re_test("x", '^x*');
&re_test("x", 'x*$');

sub re_test {
  my $str = shift;
  my $re_desc = shift;
  my $re = qr/$re_desc/;
  my @matches = ($str =~ /$re/g);
  my $num_match = @matches;

  print "Test String\t>$str<\n",
        "Test Regexp\t>$re_desc<\n",
        "Prematch\t>$`<\n",
        "Match\t\t>$&<\n",
        "Postmatch\t>$'<\n",
        "Num Matches\t>$num_match<\n",
        "Match Arrary:\n",
        map {"\t\t>$_<\n"} @matches;
  print "\n\n";
}

Also Jeffrey Friedl (jfriedl@yahoo-inc.com) is in the process of rewriting his Mastering Regular Expressions book and has been tracking down all of the RE bugs he can. I would toss this at him. He likely will want to check whether equivalents of the second bug appear in other RE tools.

I would do this for you, but I think it is good to encourage people to get involved in the process. :-)
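To see exactly where each match starts and ends, rather than relying on the prematch and postmatch variables after a list-context match, something like the following rough sketch may help. It walks the string with a scalar-context //g loop and records the offsets from @- and @+. The helper name match_spans is made up for illustration and is not part of the original test script; what it reports may differ between perl versions, which is rather the point of running it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): return one
# [start, end, text] triple for every match of $re against $str.
sub match_spans {
    my ($str, $re) = @_;
    my @spans;
    # In scalar context each //g match resumes from pos($str).
    while ($str =~ /$re/g) {
        my ($from, $to) = ($-[0], $+[0]);
        push @spans, [ $from, $to, substr($str, $from, $to - $from) ];
    }
    return @spans;
}

for my $pat ('^x*', 'x*$') {
    my @spans = match_spans('x', qr/$pat/);
    printf "%-4s matched %d time(s)\n", $pat, scalar @spans;
    printf "    [%d,%d) >%s<\n", @$_ for @spans;
}
```

A zero-width match shows up as a span whose start and end offsets are equal, so the "extra" match at the end of the string is easy to spot.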
RE: Re: ^x* vs x*$ by tye (Sage) on Aug 19, 2000 at 22:02 UTC
When this second item has come up before, it has been defended as being the correct behavior. The more general case is that when a regex can match a zero-width string, it is possible for multiple matches to end at the same point. Another example is:

$str= "ababa"; $str =~ s/a*/x/g; print "$str\n"

which produces

xxbxxbxx

This is because we start at position 0 and match "a", leaving us at position 1. At position 1 we match "", leaving us at position 2 (we've already started at position 1 so we don't start there again, even though our match ended at position 1). At pos 2 we match "a", at pos 3 we match "", etc.

But this is a bit counterintuitive. In fact, sed doesn't have this "quirk". So it might be a good idea to disallow zero-width matches that start (and therefore end) at the point where the previous match ended. But that raises the ugly spectre of backward compatibility...

My current feeling is that "we" should "fix" this but provide a way to get the old behavior to ease the burden of backward compatibility (though no suitable syntax/feature for doing that springs to mind). I suspect a lack of to-its will cause the current behavior to remain until someone feels strongly enough about it to champion its cause.

- tye (but my friends call me "Tye")
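The walk through "ababa" can be checked with a small sketch. It assumes the substitution in question is s/a*/x/g (the pattern that actually yields xxbxxbxx), and the helper name star_spans is invented here for illustration. It lists the start and end offset of every match that /a*/g finds, then runs the substitution on a copy.

```perl
use strict;
use warnings;

# Hypothetical helper: "start-end" offsets for every match of /a*/g.
sub star_spans {
    my ($str) = @_;
    my @spans;
    while ($str =~ /a*/g) {
        push @spans, sprintf("%d-%d", $-[0], $+[0]);
    }
    return @spans;
}

my $str = "ababa";
print join(" ", star_spans($str)), "\n";

# Same pattern used as a global substitution, on a copy of the string.
(my $copy = $str) =~ s/a*/x/g;
print "$copy\n";
```

Each span with equal start and end offsets is one of the empty matches; there is one after every "a", which is where the doubled x's come from.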
RE (tilly) 3: ^x* vs x*$ by tilly (Archbishop) on Aug 20, 2000 at 06:16 UTC
Very interesting. I can believe that happened. Still looks to me like a bug.

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Where did the second return come from? At the least after matching $ you should not match a zero-width assertion at that point again. IMHO and all that.

I will send that bug report in shortly.
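For anyone who just wants the line ending normalized exactly once, one possible workaround (my assumption, not something proposed in this thread) is to drop the /g and anchor with \z, so there is only a single substitution at the true end of the string. The sub names below are made up for illustration.

```perl
use strict;
use warnings;

# The substitution from the one-liner above: /g lets a second, empty
# match succeed at the very end of the string.
sub normalize_g {
    my ($s) = @_;
    $s =~ s/\r?\n?$/\n/g;
    return $s;
}

# Hypothetical alternative: no /g, and \z matches only at the true end
# of the string, so at most one substitution happens.
sub normalize_once {
    my ($s) = @_;
    $s =~ s/\r?\n?\z/\n/;
    return $s;
}

for my $in ("Hello World\n", "Hello World\r\n", "Hello World") {
    my $with_g = normalize_g($in);
    my $once   = normalize_once($in);
    printf "trailing newlines: /g gives %d, single \\z gives %d\n",
        ($with_g =~ tr/\n//), ($once =~ tr/\n//);
}
```

This sidesteps the question of whether the repeated end-of-string match is a bug; it only avoids triggering it.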
Re: ^x* vs x*$ by Abigail-II (Bishop) on Sep 18, 2003 at 15:52 UTC
"The first is a bug in 5.6.0. It doesn't happen in 5.005_03. It likely has been fixed by Hugo already. Anyone who wants to check that can follow my advice in Getting current versions of Perl and see if it is still there with more current patches."

We're three years later. 5.8.1-RC4 and bleadperl still have this bug.

Abigail
Re^2: ^x* vs x*$ by tilly (Archbishop) on Oct 10, 2004 at 07:00 UTC
And finally, four years later, I got around to filing it with p5p.
Re^3: ^x* vs x*$ by Steve_p (Priest) on Nov 16, 2007 at 16:26 UTC
Re^4: ^x* vs x*$ by tilly (Archbishop) on Nov 16, 2007 at 17:28 UTC

