Non-greedy regex behaves greedily

fourmi has asked for the wisdom of the Perl Monks concerning the following question:

Re: Non-greedy regex behaves greedily by busunsl (Vicar) on Nov 10, 2004 at 12:09 UTC
Because you insist on a character after 'body' that isn't there when you use '.+?'. Change it to `s/<body.*?>/<body class="theBody">/` and it will work.
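A minimal sketch of the difference, assuming a made-up input string (the original question's input is not shown in this thread, so `$html` here is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical input: a bare <body> tag followed directly by another tag.
my $html = '<body><!--msnavigation-->';

# .+? needs at least one character, so it swallows the '>' of <body>
# and keeps going until the next '>'.
(my $plus = $html) =~ s/<body.+?>/<body class="theBody">/;

# .*? may match zero characters, so it can stop at the first '>'.
(my $star = $html) =~ s/<body.*?>/<body class="theBody">/;

print "$plus\n";   # <body class="theBody">
print "$star\n";   # <body class="theBody"><!--msnavigation-->
```

With `.+?` the comment after the tag is eaten as well, which is the "greedy-looking" behaviour the title complains about.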
Re^2: Non-greedy regex behaves greedily by fourmi (Scribe) on Nov 10, 2004 at 12:19 UTC
Oh ok, I always went with .+? as an antigreedifier, and had assumed that the ? made the .+ optional. That works though, but does that mean that my antigreedifier ideas are wrong?
Re^3: Non-greedy regex behaves greedily by thospel (Hermit) on Nov 10, 2004 at 12:32 UTC
Possibly, I don't know what your exact ideas are :-) At least .+? is not an optional match for one or more things. But I've also noticed that a lot of people use non-greedy matching as if it's some sort of magical negative lookahead. It isn't. It will do whatever it takes to match, even if that means also eating one or more of the character(s) that directly follow the non-greedy match (and exiting the repeat at some later instance of those character(s) instead). If your regex has only one match of a non-fixed length you can usually get away with this, but especially if there is more than one of them, the semantics can get unexpected. If you don't want to match something, it's almost always better to just say that. So in your case that would be: `s/<body[^>]+>/.../` (also notice that greedy or non-greedy becomes irrelevant if you write it like this). It would of course still not work as expected (the + should be a * since matching 0 times should be allowed too), but at least the problem now will be not doing anything at all instead of changing too much. And it will keep working as expected even if you make your regex more complicated.
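Both points can be sketched on made-up inputs. The sample strings and the helper names `with_plus` and `with_star` are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Negated character class with + : requires at least one non-'>' character.
sub with_plus {
    my ($s) = @_;
    $s =~ s/<body[^>]+>/<body class="theBody">/;
    return $s;
}

# Negated character class with * : also accepts a bare <body>.
sub with_star {
    my ($s) = @_;
    $s =~ s/<body[^>]*>/<body class="theBody">/;
    return $s;
}

print with_plus('<body>'), "\n";   # unchanged: <body>
print with_star('<body>'), "\n";   # <body class="theBody">
print with_star('<body bgcolor="white"><p>'), "\n";   # <body class="theBody"><p>

# Non-greedy is not a negative lookahead: .*? happily crosses a ';'
# if that is the only way to reach ';end'.
my $line = 'key=a;other=b;end';
my ($lazy)   = $line =~ /key=(.*?);end/;     # captures 'a;other=b'
my ($strict) = $line =~ /key=([^;]*);end/;   # no match, so undef

print defined $lazy   ? $lazy   : '(no match)', "\n";
print defined $strict ? $strict : '(no match)', "\n";
```

The `[^>]+` version fails safe (it changes nothing), while the lazy dot in the second half quietly matches more than intended.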
Re^4: Non-greedy regex behaves greedily by fourmi (Scribe) on Nov 10, 2004 at 12:44 UTC
Re^3: Non-greedy regex behaves greedily by Jasper (Chaplain) on Nov 10, 2004 at 12:24 UTC
Yes, yes it does. ? doesn't make it any more or less optional, just non-greedy.
Re^3: Non-greedy regex behaves greedily by Happy-the-monk (Canon) on Nov 10, 2004 at 12:27 UTC
It means the belief that the "antigreedifier" makes the aforementioned "at least one or more" optional is wrong. "At least one or more" `/.+/` still fails if there is nothing. Cheers, Sören
Re: Non-greedy regex behaves greedily by pelagic (Priest) on Nov 10, 2004 at 12:06 UTC
try: `s/<body[^>]*?>/<body class="theBody">/` Update: I changed +? to *?, now it does match. pelagic
Re^2: Non-greedy regex behaves greedily by fourmi (Scribe) on Nov 10, 2004 at 12:13 UTC
Now it doesn't match <body> at all
Re^3: Non-greedy regex behaves greedily by fourmi (Scribe) on Nov 10, 2004 at 12:21 UTC
yup, core error on my part, thanks a lot!
Re: Non-greedy regex behaves greedily by tphyahoo (Vicar) on Nov 10, 2004 at 19:20 UTC
Yes, the left-hand side (except for the last closing bracket), `<body.+?`, matches out to `<body>` and then continues until it hits the next closing bracket. The problem is that + matches AT LEAST one character. You want the star * here: `s/<body.*?>/<body class="theBody">/is` matches 0 or more characters. That should solve it! :)
Re^2: Non-greedy regex behaves greedily by fourmi (Scribe) on Nov 11, 2004 at 13:56 UTC
indeedy, thanks a lot
Re: Non-greedy regex behaves greedily by tall_man (Parson) on Nov 10, 2004 at 20:28 UTC
The regular expression solutions will work in the simple cases presented here, but for advanced fixing of HTML tags you may want to look at Text::Balanced and HTML::Parser. For example:

```perl
use strict;
use Text::Balanced qw(extract_bracketed);

my $line = "<body label=\"<<HI>>\"><!--msnavigation-->";

# This will match the brackets properly.
my ($match, $remainder) = extract_bracketed($line,"<");
if ($match) {
    $match =~ s/body/body class=\"theBody\"/;
    print $match, $remainder,"\n";
}

# The regular expression will mess things up.
$line =~ s/<body.*?>/<body class="theBody">/is;
print $line,"\n";
```
Re: Non-greedy regex behaves greedily by Anonymous Monk on Jul 27, 2008 at 16:49 UTC
I would like to reopen this with a similar question. Target string: `Back to STATES Menu</font></a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>` Regex: `</a>.*?$` or `</a>.*?\$` Matches: `</a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>` I expect it to match: `</a></p> </body> </html>` Both Perl and Regex Coach seem to concur, so I must be missing something.
Re^2: Non-greedy regex behaves greedily (leftmost) by tye (Sage) on Jul 28, 2008 at 06:09 UTC
The regex engine matches "leftmost longest" and not "longest leftmost". The non-greedy modifier changes "longest" to "shortest" but doesn't change "leftmost" nor make "leftmost" no longer trump "longest"/"shortest". ("Leftmost" refers to how close to the start of the string the beginning of the matched substring is.) The long match is indeed the leftmost possible match. The ? changes the quantifier so that you get the shortest of the leftmost possible matches instead of the longest of the leftmost possible matches. You can read about sexeger (or search for more threads on sexeger) to see how sometimes it can be useful or at least fun to reverse your string and your regex so that you get the substring with the "rightmost" ending point and can choose between longest/shortest as the regex engine moves leftward (with respect to the original string). For your particular case, I'd just use rindex and then substr. - tye
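A minimal sketch of the rindex-then-substr approach, next to the lazy regex for comparison. The target string is a shortened stand-in for the one in the question, so treat `$html` as an assumption:

```perl
use strict;
use warnings;

# Shortened stand-in for the target string from the question.
my $html = 'Menu</font></a></h3> <p><a href="index.htm"><img src="home2.gif"></a></p> </body> </html>';

# Leftmost wins: the match starts at the FIRST </a>, lazy or not,
# and .*? then has to stretch all the way to the end of the string.
my ($regex_tail) = $html =~ m{(</a>.*?)$}s;

# rindex finds the LAST </a>; substr takes everything from there on.
my $pos  = rindex($html, '</a>');
my $tail = $pos >= 0 ? substr($html, $pos) : '';

print "$regex_tail\n";
print "$tail\n";   # </a></p> </body> </html>
```

No regex is involved in the second half, so the leftmost rule never comes into play.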
Re^2: Non-greedy regex behaves greedily by linuxer (Curate) on Jul 27, 2008 at 17:43 UTC
Your regex doesn't force a non-greedy behaviour. I'll try to explain with a simplified text example:

```perl
my $text = <<TEXT;
000ABCDEFABCGHI
TEXT

if ( $text =~ m{(ABC.*?)$} ) {
    print $1, $/;
}
```

The engine reads $text from left to right and will try starting at the first "ABC", using the complete following string until end of line. As that's exactly what the regex requested, this result is returned. There's no condition which forces the engine to search for a shorter result. There will be no second run which checks whether the current result may contain a shorter result. The first valid match will be returned; this isn't always the best match.
Re^3: Non-greedy regex behaves greedily by kovacsbv (Novice) on Jul 27, 2008 at 23:36 UTC
Ok, is there a nice detailed description of the engine that would fill in what causes a second run and what the "?" does exactly? This behavior isn't very intuitive. Also, is there another way to get the desired result other than the ugly hack I posted below?
Re^4: Non-greedy regex behaves greedily by ysth (Canon) on Jul 28, 2008 at 10:03 UTC
Re: Non-greedy regex behaves greedily by kovacsbv (Novice) on Jul 27, 2008 at 17:23 UTC
Hi, it's the real me now that I got the logon to work! I got it to work with: `</a>([^<]|<[^/]|</[^a]|</a[^>])*$` But why the one above didn't work is still a mystery.
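Two shorter regex alternatives to the hand-written alternation, offered as a sketch and not as anything proposed in this thread: a negative lookahead that refuses to step over another `</a>`, and a greedy prefix that pushes the match as far right as possible. The target string is again a shortened, hypothetical stand-in:

```perl
use strict;
use warnings;

# Shortened stand-in for the target string from the question.
my $html = 'Menu</font></a></h3> <p><a href="index.htm"><img src="home2.gif"></a></p> </body> </html>';

# Option 1: after </a>, consume one character at a time, but only while
# the next characters do not start another </a>. Starting at the first
# </a> fails to reach \z, so the engine moves on to the last one.
my ($by_lookahead) = $html =~ m{(</a>(?:(?!</a>).)*)\z}s;

# Option 2: a greedy .* in front grabs as much as it can, so the
# captured </a> is the last one in the string.
my ($by_greedy) = $html =~ m{.*(</a>.*)\z}s;

print "$by_lookahead\n";   # </a></p> </body> </html>
print "$by_greedy\n";      # </a></p> </body> </html>
```

Both rely on the `/s` flag so that `.` also matches newlines, which matters if the real input spans several lines.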

