Regex Bug?

mt2k has asked for the wisdom of the Perl Monks concerning the following question:

Okay, this has just been annoying me for the last half hour here. Say I have a URL with a query string attached:

http://www.domain.com/hi.html?a=b&c=d

Now say I want to remove everything from the '?' on, to get:

http://www.domain.com/hi.html

After enough frustration, I tested the following regex on the same URL, but with an HREF name (a '#' anchor) at the end instead of a query string. Yep, I made it look SO obvious as to what I was trying to do. I didn't use anything fancy, and this could look a lot better. I even added the /gs for extra strength!

$url =~ s/(.*?)#(.*?)/$1/gs;

Now guess what? That works fine... when it is a '#' instead of a '?'. Now, if that above code works, shouldn't the following regex work for a URL with a query string attached?:

$url =~ s/(.*?)\?(.*?)/$1/gs;

See look, I even backslashed the '?' there... and guess what? This doesn't work now! I also tried removing the backslash, which (obviously) did not help the situation one bit.

So am I missing something here or is this some kind of issue with the regex engine? Many thanks ahead of time...


Replies are listed 'Best First'.
Re: Regex Bug? by chromatic (Archbishop) on May 26, 2002 at 06:34 UTC
It's working perfectly, doing exactly what you ask. The second `.*?` minimally matches exactly zero characters after the question mark. Remove the minimal operator and it'll do as you intend: `$url =~ s/(.*?)\?.*/$1/;`
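As a minimal sketch of the point above, the following compares a lazy tail with a greedy tail on the URL from the question. The variable names `$lazy` and `$greedy` are illustrative and do not come from the original posts.

```perl
use strict;
use warnings;

my $url = 'http://www.domain.com/hi.html?a=b&c=d';

# Lazy tail: the second (.*?) is satisfied by the empty string, so the
# match ends right after the '?' and only that much is replaced by $1.
(my $lazy = $url) =~ s/(.*?)\?(.*?)/$1/gs;

# Greedy tail: .* consumes the rest of the string, so the whole query
# string is removed along with the '?'.
(my $greedy = $url) =~ s/(.*?)\?.*/$1/s;

print "lazy:   $lazy\n";    # only the '?' itself disappears
print "greedy: $greedy\n";  # the query string is gone
```

The lazy version leaves `a=b&c=d` glued onto the path, which is the same output graff reports in the reply further down.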
(jeffa) Re: Regex Bug? by jeffa (Bishop) on May 26, 2002 at 07:27 UTC
Your question has been answered, but consider this instead:

`$url =~ s/\?.*$//;`

Instead of trying to capture everything UP to the question mark, just get rid of the question mark and everything AFTER it. Also, consider the URI CPAN module:

`use URI; my $uri = URI->new($url); print $uri->scheme, '://', $uri->host, $uri->path, "\n";`

TIMTOWTDI! ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: Regex Bug? by graff (Chancellor) on May 26, 2002 at 05:58 UTC
Well, this does seem curious:

`$ perl -v`
`This is perl, v5.6.1 built for i586-linux ...`
`$ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; $u=~s/(.*?)\?(.*?)/$1/gs; print $u,$/;'`

produces:

`http://www.domain.com/hi.html?a=b&c=d`
`http://www.domain.com/hi.htmla=b&c=d`

But you said that you only wanted the first part (before the "?"), so why did you put parens around the second part? It does work as desired this way:

`$ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; $u=~s/(.*?)\?.*/$1/; print $u,$/;'`
`http://www.domain.com/hi.html?a=b&c=d`
`http://www.domain.com/hi.html`

For that matter, whether or not you use the following part as well, why not split:

`$ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; ($ub,$ue)=split(/\?/,$u,2); print "$ub :: $ue",$/;'`
`http://www.domain.com/hi.html?a=b&c=d`
`http://www.domain.com/hi.html :: a=b&c=d`

Still, the initial example is baffling, and I hope someone can explain why it should behave the way it did, just for our communal peace of mind. The final "gs" is of course superfluous for this example -- the behavior is the same with or without those qualifiers on the regex. (And "extra strength" is not really an appropriate reason for using them, anyway; check their descriptions in perlre to see what their proper, intended functions are.)
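The `split` suggestion above can be sketched as a small script rather than a one-liner. The helper name `split_url` is hypothetical and used only for illustration; the limit of 2 means that any later '?' stays inside the query part.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the part before the first '?' and the
# query string after it (undef when the URL contains no '?').
sub split_url {
    my ($url) = @_;
    my ($base, $query) = split /\?/, $url, 2;
    return ($base, $query);
}

my ($base, $query) = split_url('http://www.domain.com/hi.html?a=b&c=d');
print "$base :: $query\n";
```

Unlike the substitution, this keeps the query string available in case it is needed later.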
Re: Regex Bug? by I0 (Priest) on May 26, 2002 at 15:20 UTC
`.*?` is a minimal match, try `$url =~ s/(.*?)\?(.*)/$1/gs;` or just `$url =~ s/\?.*//gs;`
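Since the question involves both a '#' name and a '?' query string, one possible way to handle either with a single substitution is a character class. This is a sketch under that assumption, not something proposed in the thread, and the helper name `strip_suffix` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: removes everything from the first '?' or '#'
# to the end of the string. Inside a character class both characters
# are literal, so no backslash is needed.
sub strip_suffix {
    my ($url) = @_;
    $url =~ s/[?#].*//s;
    return $url;
}

print strip_suffix('http://www.domain.com/hi.html?a=b&c=d'), "\n";
print strip_suffix('http://www.domain.com/hi.html#top'), "\n";
```

A URL with neither character is returned unchanged, because the substitution simply fails to match.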
