http://qs321.pair.com?node_id=169337

mt2k has asked for the wisdom of the Perl Monks concerning the following question:

Okay, this has just been annoying me for the last half hour here. Say I have a URL with a query string attached:

http://www.domain.com/hi.html?a=b&c=d

Now say I want to remove everything from the '?' on, to get:

http://www.domain.com/hi.html

After enough frustration, I tested the following regex on the same URL, but with an anchor name (#name) ending the URL instead of a query string. Yep, I made it SO obvious what I was trying to do. I didn't use anything fancy, and this could look a lot better. I even added the /gs for extra strength!

$url =~ s/(.*?)#(.*?)/$1/gs;

Now guess what? That works fine... when it is a '#' instead of a '?'. Now, if that above code works, shouldn't the following regex work for a URL with a query string attached?:

$url =~ s/(.*?)\?(.*?)/$1/gs;

See look, I even backslashed the '?' there... and guess what? This doesn't work now! I also tried removing the backslash, which (obviously) did not help the situation one bit.

So am I missing something here or is this some kind of issue with the regex engine? Many thanks ahead of time...

Re: Regex Bug?
by chromatic (Archbishop) on May 26, 2002 at 06:34 UTC
    It's working perfectly, doing exactly what you ask. Nothing follows the second .*? in the pattern, so it minimally matches exactly zero characters after the question mark, and only the '?' itself gets replaced. Remove the minimal operator and it'll do as you intend:

    $url =~ s/(.*?)\?.*/$1/;
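
The zero-length match can be seen directly with a small sketch like the one below. The variable names ($lazy, $greedy, $before, $after) are illustrative only and do not come from the original posts.

```perl
use strict;
use warnings;

my $url = 'http://www.domain.com/hi.html?a=b&c=d';

# Non-greedy second group: it is satisfied with zero characters,
# so only the '?' itself is removed.
(my $lazy = $url) =~ s/(.*?)\?(.*?)/$1/;

# Greedy tail: .* runs to the end of the string, so the whole
# query string goes away along with the '?'.
(my $greedy = $url) =~ s/(.*?)\?.*/$1/;

# Capture both groups to see what the lazy second group grabbed.
my ($before, $after) = $url =~ /(.*?)\?(.*?)/;

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
print "second group captured ", length($after), " characters\n";
```

The lazy version should print http://www.domain.com/hi.htmla=b&c=d, the greedy one http://www.domain.com/hi.html, and the second group should report 0 characters.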

(jeffa) Re: Regex Bug?
by jeffa (Bishop) on May 26, 2002 at 07:27 UTC
    Your question has been answered, but consider this instead:
    $url =~ s/\?.*$//;
    Instead of trying to capture everything UP to the question mark, just get rid of the question mark and everything AFTER it.

    Also, consider the URI CPAN module:

    use URI;
    my $uri = URI->new($url);
    print $uri->scheme, '://', $uri->host, $uri->path, "\n";
    TIMTOWTDI! ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Regex Bug?
by graff (Chancellor) on May 26, 2002 at 05:58 UTC
    Well, this does seem curious:
    $ perl -v
    This is perl, v5.6.1 built for i586-linux
    ...
    $ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; $u=~s/(.*?)\?(.*?)/$1/gs; print $u,$/;'
    produces:
    http://www.domain.com/hi.html?a=b&c=d
    http://www.domain.com/hi.htmla=b&c=d
    But you said that you only wanted the first part (before the "?"), so why did you put parens around the second part? It does work as desired this way:
    $ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; $u=~s/(.*?)\?.*/$1/; print $u,$/;'
    http://www.domain.com/hi.html?a=b&c=d
    http://www.domain.com/hi.html
    For that matter, whether or not you use the following part as well, why not split:
    $ perl -e '$u="http://www.domain.com/hi.html?a=b&c=d"; print $u,$/; ($ub,$ue)=split(/\?/,$u,2); print "$ub :: $ue",$/;'
    http://www.domain.com/hi.html?a=b&c=d
    http://www.domain.com/hi.html :: a=b&c=d
    Still, the initial example is baffling, and I hope someone can explain why it should behave the way it did, just for our communal peace of mind. The final "gs" is of course superfluous for this example -- the behavior is the same with or without those qualifiers on the regex. (And "extra strength" is not really an appropriate reason for using them, anyway; check their descriptions in perlre to see what their proper, intended functions are.)
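
For reference, /s only changes whether . matches a newline, and /g only repeats the match across the string. A quick sketch with made-up strings (none of these values come from the thread) shows the difference:

```perl
use strict;
use warnings;

# /s lets '.' match a newline; without it, '.*' stops at the end of the line.
my $text = "page?first\nsecond";
(my $no_s   = $text) =~ s/\?.*//;    # leaves "page\nsecond"
(my $with_s = $text) =~ s/\?.*//s;   # leaves "page"

# /g repeats the substitution for every match instead of only the first.
my $marks = 'a?b?c';
(my $once = $marks) =~ s/\?//;       # leaves "ab?c"
(my $all  = $marks) =~ s/\?//g;      # leaves "abc"

print "without /s: [$no_s]\n";
print "with /s:    [$with_s]\n";
print "without /g: [$once]\n";
print "with /g:    [$all]\n";
```

Neither modifier matters for a single-line URL containing one '?', which is why the result is the same with or without them.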
Re: Regex Bug?
by I0 (Priest) on May 26, 2002 at 15:20 UTC
    .*? is a minimal match. Try $url =~ s/(.*?)\?(.*)/$1/gs; or just $url =~ s/\?.*//gs;
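
Since the original question covers both the '#' and the '?' case, a character class can handle either ending with one pattern. This is only a sketch; strip_tail is a hypothetical helper name, not something from the thread or from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: drop everything from the first '?' or '#' onward.
sub strip_tail {
    my ($url) = @_;
    $url =~ s/[?#].*//s;   # greedy .* removes the rest of the string
    return $url;
}

print strip_tail('http://www.domain.com/hi.html?a=b&c=d'), "\n";
print strip_tail('http://www.domain.com/hi.html#top'), "\n";
print strip_tail('http://www.domain.com/hi.html'), "\n";
```

All three calls should print http://www.domain.com/hi.html; a URL with neither character is returned unchanged because the substitution simply does not match.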