http://qs321.pair.com?node_id=406651

fourmi has asked for the wisdom of the Perl Monks concerning the following question:

s/<body.+?>/<body class="theBody">/is insists on changing
<body><!--msnavigation-->
to
<body class="theBody">
Any ideas why it is grabbing the comment as well?

thanks
ant

2004-11-11 Edited by Arunbear: Changed title from 'Pattern Matching', as per Monastery guidelines

Re: Non-greedy regex behaves greedily
by busunsl (Vicar) on Nov 10, 2004 at 12:09 UTC
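    Because '.+?' insists on at least one character after 'body', and in <body> there isn't one before the '>'. The '>' itself gets consumed as that character, and the match then runs on to the next '>', which is the one that closes the comment.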

    Change it to s/<body.*?>/<body class="theBody">/ and it will work.
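To make the difference concrete, here is a minimal sketch that runs both substitutions on the string from the question. The variable names ($html, $plus, $star) are only for illustration.

```perl
use strict;
use warnings;

my $html = '<body><!--msnavigation-->';

# .+? needs at least one character before the closing '>', so the first
# '>' is consumed by the dot and the match runs on to the next '>'.
(my $plus = $html) =~ s/<body.+?>/<body class="theBody">/is;

# .*? may match zero characters, so the match stops at the first '>'.
(my $star = $html) =~ s/<body.*?>/<body class="theBody">/is;

print "$plus\n";   # <body class="theBody">
print "$star\n";   # <body class="theBody"><!--msnavigation-->
```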

      Oh ok, I always went with .+? as an antigreedifier; I had assumed that the ? made the .+ optional.

      That works, but does that mean that my antigreedifier ideas are wrong?
        Possibly, I don't know what your exact ideas are :-) At least .+? is not an optional match for one or more things.

        But I've also noticed that a lot of people use non-greedy matching as if it's some sort of magical negative lookahead. It isn't. It will do whatever it takes to match, even if that means also eating one or more of the characters that directly follow the non-greedy match (and exiting the repeat at some later instance of those characters instead). If your regex has only one match of a non-fixed length you can usually get away with this, but especially if there is more than one of them, the semantics can get unexpected.

        If you don't want to match something, it's almost always better to just say that. So in your case that would be:

        s/<body[^>]+>/.../
        (also notice that greedy or non-greedy becomes irrelevant if you write it like this) It would of course still not work as expected (the + should be a * since matching 0 times should be allowed too), but at least the problem now will be not doing anything at all instead of changing too much. And it will keep working as expected even if you make your regex more complicated.
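A small sketch of that point, assuming the `*` form of the negated class. The second sample string with a bgcolor attribute is made up for illustration.

```perl
use strict;
use warnings;

my @samples = (
    '<body><!--msnavigation-->',
    '<body bgcolor="#ffffff"><!--msnavigation-->',   # hypothetical extra sample
);

# [^>]* can never run past the first '>', greedy or not.
my @results;
for my $html (@samples) {
    (my $copy = $html) =~ s/<body[^>]*>/<body class="theBody">/i;
    push @results, $copy;
    print "$copy\n";
}

# With + instead of *, the bare <body> simply fails to match and the
# string is left alone, rather than having the comment eaten.
(my $unchanged = $samples[0]) =~ s/<body[^>]+>/<body class="theBody">/i;
print "$unchanged\n";
```

Both samples come out as <body class="theBody"><!--msnavigation-->, and the `+` variant prints the first sample unmodified.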
        Yes, yes it does.

        ? doesn't make it any more or less optional, just non-greedy.

        It means the belief that an "antigreedifier" makes the aforementioned "at least one or more" optional is wrong.

        "at least one or more" /.+/ still fails if there is nothing to match.

        Cheers, Sören

Re: Non-greedy regex behaves greedily
by pelagic (Priest) on Nov 10, 2004 at 12:06 UTC
    try: s/<body[^>]*?>/<body class="theBody">/
    Update: I changed +? to *?; now it does match.

    pelagic
      Now it doesn't match <body> at all
        yup, core error on my part, thanks a lot!
Re: Non-greedy regex behaves greedily
by tphyahoo (Vicar) on Nov 10, 2004 at 19:20 UTC
    Yes, the left hand side (except for the last closing bracket)
    <body.+?
    matches out to
    <body>
    and then continues until it hits the next closing bracket. The problem is that + matches AT LEAST one character. You want the star * here:
    s/<body.*?>/<body class="theBody">/is
    * finds 0 or more characters. That should solve it! :)
      indeedy, thanks a lot
Re: Non-greedy regex behaves greedily
by tall_man (Parson) on Nov 10, 2004 at 20:28 UTC
    The regular expression solutions will work in the simple cases presented here, but for advanced fixing of HTML tags you may want to look at Text::Balanced and HTML::Parser. For example:
    use strict;
    use Text::Balanced qw(extract_bracketed);

    my $line = "<body label=\"<<HI>>\"><!--msnavigation-->";

    # This will match the brackets properly.
    my ($match, $remainder) = extract_bracketed($line,"<");
    if ($match) {
        $match =~ s/body/body class=\"theBody\"/;
        print $match, $remainder,"\n";
    }

    # The regular expression will mess things up.
    $line =~ s/<body.*?>/<body class="theBody">/is;
    print $line,"\n";
Re: Non-greedy regex behaves greedily
by Anonymous Monk on Jul 27, 2008 at 16:49 UTC

    I would like to reopen this with a similar question.

    Target string:

    Back to STATES Menu</font></a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>

    regex:

    </a>.*?$

    or

    </a>.*?\$

    matches:

    </a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>

    I expect it to match:

    </a></p> </body> </html>

    Both Perl and Regex Coach seem to concur, so I must be missing something.

      The regex engine matches "leftmost longest" and not "longest leftmost". The non-greedy modifier changes "longest" to "shortest" but doesn't change "leftmost" nor make "leftmost" no longer trump "longest"/"shortest". ("leftmost" refers to how close to the start of the string is the beginning of the matched substring.)

      The long match is indeed the leftmost possible match. The ? would change the quantifier so that you got the shortest of the leftmost possible matches instead of the longest of the leftmost possible matches.

      You can read about sexeger (or search for more threads on it) to see how sometimes it can be useful or at least fun to reverse your string and your regex so that you get the substring with the "rightmost" ending point and can choose between longest/shortest as the regex engine moves leftward (with respect to the original string).

      For your particular case, I'd just use rindex and then substr.
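A minimal sketch of the rindex/substr approach on the target string from the question, with the non-greedy regex alongside for contrast. The variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $target = 'Back to STATES Menu</font></a></h3> <p align="center">'
           . '<a href="index.htm"><img src="home2.gif" alt="Home" border="0" '
           . 'width="106" height="30"></a></p> </body> </html>';

# The regex starts at the leftmost '</a>'; '?' only affects how far the
# match extends from there, and '$' forces it to the end of the string.
my ($regex_match) = $target =~ m{(</a>.*?)$};

# rindex finds the last '</a>' directly; substr takes everything from it.
my $pos  = rindex($target, '</a>');
my $tail = $pos >= 0 ? substr($target, $pos) : undef;

print "$regex_match\n";
print "$tail\n" if defined $tail;   # </a></p> </body> </html>
```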

      - tye        

      Your regex doesn't force a non-greedy behaviour.

      I'll try to explain with a simplified text example:

      my $text = <<TEXT;
      000ABCDEFABCGHI
      TEXT
      if ( $text =~ m{(ABC.*?)$} ) {
          print $1, $/;
      }

      The engine reads $text from left to right and makes its first attempt starting at the first "ABC", extending the non-greedy .*? one character at a time until $ can match at the end of the line. As that's exactly what the regex requested, this result is returned. There's no condition which forces the engine to search for a shorter result. There will be no second run which checks whether the current result may contain a shorter one.

      The first valid match will be returned; this isn't always the best match.
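As a sketch of one way to steer the engine toward the last "ABC" in the simplified example, a greedy .* can be put in front of the capture so that backtracking hands back the rightmost occurrence first. This is just one option, not necessarily the best fit for every input.

```perl
use strict;
use warnings;

my $text = "000ABCDEFABCGHI";

# Leftmost wins: the match starts at the first "ABC" and '$' forces it
# to extend to the end of the string.
my ($first) = $text =~ m{(ABC.*?)$};

# A leading greedy .* grabs as much as it can and backtracks only as far
# as needed, so the capture begins at the last "ABC".
my ($last) = $text =~ m{.*(ABC.*?)$};

print "$first\n";   # ABCDEFABCGHI
print "$last\n";    # ABCGHI
```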

        Ok, is there a nice detailed description of the engine that would fill in what causes a second run and what the "?" does exactly?

        This behavior isn't very intuitive.

        Also, is there another way to get the desired result other than the ugly hack I posted below?
Re: Non-greedy regex behaves greedily
by kovacsbv (Novice) on Jul 27, 2008 at 17:23 UTC
    Hi, it's the real me now that I got the logon to work! I got it to work with:

    </a>([^<]|<[^/]|</[^a]|</a[^>])*$

    But why the one above didn't work is still a mystery.
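For comparison, here is a sketch of a shorter pattern that should behave the same way on this input: a negative lookahead that refuses to step over another "</a>". It is checked against the alternation above on the target string from the earlier post only; the variable names are illustrative.

```perl
use strict;
use warnings;

my $target = 'Back to STATES Menu</font></a></h3> <p align="center">'
           . '<a href="index.htm"><img src="home2.gif" alt="Home" border="0" '
           . 'width="106" height="30"></a></p> </body> </html>';

# Each '.' is only allowed where '</a>' does not start, so an attempt
# beginning at an earlier '</a>' can never reach '$' and fails.
my ($lookahead) = $target =~ m{(</a>(?:(?!</a>).)*)$}s;

# The hand-written alternation from the post, for comparison.
my ($hack) = $target =~ m{(</a>(?:[^<]|<[^/]|</[^a]|</a[^>])*)$};

print "$lookahead\n";   # </a></p> </body> </html>
print "$hack\n";        # </a></p> </body> </html>
```

The reason the original </a>.*?$ did not work is the same leftmost rule described in the replies above: nothing in .*? forbids it from crossing a later "</a>", so the first starting position that can reach the end of the string wins.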