PerlMonks  

Re: Non-greedy regex behaves greedily

by Anonymous Monk
on Jul 27, 2008 at 16:49 UTC


in reply to Non-greedy regex behaves greedily

I would like to reopen this with a similar question.

Target string:

Back to STATES Menu</font></a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>

regex:

</a>.*?$

or

</a>.*?\$

matches:

</a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>

I expect it to match:

</a></p> </body> </html>

Both Perl and Regex Coach seem to concur, so I must be missing something.

Replies are listed 'Best First'.
Re^2: Non-greedy regex behaves greedily (leftmost)
by tye (Sage) on Jul 28, 2008 at 06:09 UTC

    The regex engine matches "leftmost longest" and not "longest leftmost". The non-greedy modifier changes "longest" to "shortest", but it neither changes "leftmost" nor stops "leftmost" from trumping "longest"/"shortest". ("leftmost" refers to how close the beginning of the matched substring is to the start of the string.)

    The long match is indeed the leftmost possible match. The ? changes the quantifier so that you get the shortest of the leftmost possible matches instead of the longest of the leftmost possible matches.
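
    A minimal sketch of that distinction, assuming a made-up sample string and a hypothetical helper match_info (neither is from the thread). Both patterns start matching at the same leftmost offset; the ? only changes where the match ends:

```perl
use strict;
use warnings;

# Hypothetical sample string with two possible closing '>' characters.
my $sample = 'a<b>c<d>e';

# Return [start offset, matched text] for a pattern, or undef if it fails.
sub match_info {
    my ($str, $re) = @_;
    return undef unless $str =~ $re;
    return [ $-[0], substr($str, $-[0], $+[0] - $-[0]) ];
}

my $greedy = match_info($sample, qr/<.*>/);   # longest of the leftmost matches
my $lazy   = match_info($sample, qr/<.*?>/);  # shortest of the leftmost matches

printf "greedy: offset %d, text %s\n", @$greedy;  # offset 1, text <b>c<d>
printf "lazy:   offset %d, text %s\n", @$lazy;    # offset 1, text <b>
```

    Neither pattern ever considers starting at the second "<", because a match is already available from the first one.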

    You can read about sexeger (or search for more threads on sexeger) to see how sometimes it can be useful or at least fun to reverse your string and your regex so that you get the substring with the "rightmost" ending point and can choose between longest/shortest as the regex engine moves leftward (with respect to the original string).
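
    A rough sketch of the reversal idea, assuming an abbreviated version of the target string and a hypothetical helper last_tail (both are illustrative, not taken from the thread). The string and the marker are reversed, a non-greedy match runs from the start of the reversed string, and the capture is reversed back:

```perl
use strict;
use warnings;

# Return the text from the LAST occurrence of $marker to the end of $str
# by matching non-greedily against the reversed string.
sub last_tail {
    my ($str, $marker) = @_;
    my $rev_str    = scalar reverse $str;
    my $rev_marker = scalar reverse $marker;

    # In the reversed string the last marker comes first, so "leftmost"
    # now works in our favour. /s lets . match newlines as well.
    return undef unless $rev_str =~ /^(.*?\Q$rev_marker\E)/s;
    return scalar reverse $1;
}

# Abbreviated, hypothetical stand-in for the target string.
my $html = '</font></a></h3> <p><a href="index.htm"><img src="home2.gif"></a></p> </body> </html>';

print last_tail($html, '</a>'), "\n";   # </a></p> </body> </html>
```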

    For your particular case, I'd just use rindex and then substr.
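
    One way that suggestion might look, as a sketch only. The helper name tail_from_last and the abbreviated string are assumptions for illustration; rindex and substr are the Perl built-ins:

```perl
use strict;
use warnings;

# Return everything from the last occurrence of $marker onward,
# or undef if $marker does not occur in $str.
sub tail_from_last {
    my ($str, $marker) = @_;
    my $pos = rindex($str, $marker);   # -1 when the marker is absent
    return $pos < 0 ? undef : substr($str, $pos);
}

# Abbreviated, hypothetical stand-in for the target string.
my $page = '</font></a></h3> <p><a href="index.htm"><img src="home2.gif"></a></p> </body> </html>';

my $tail = tail_from_last($page, '</a>');
print defined $tail ? "$tail\n" : "no </a> found\n";   # </a></p> </body> </html>
```

    No regex is involved, so there is no leftmost rule to work around.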

    - tye        

Re^2: Non-greedy regex behaves greedily
by linuxer (Curate) on Jul 27, 2008 at 17:43 UTC

    Your regex doesn't force a non-greedy behaviour.

    I'll try to explain with a simplified text example:

    my $text = "000ABCDEFABCGHI\n";

    # Capture from "ABC" to the end of the line, non-greedily.
    if ( $text =~ m{(ABC.*?)$} ) {
        print $1, $/;
    }

    The engine scans $text from left to right and makes its first attempt at the first "ABC". From there, .*? has to keep expanding until $ can match, so it ends up consuming everything up to the end of the line. As that's exactly what the regex requested, this result is returned. No condition forces the engine to look for a shorter result, and there is no second pass that checks whether the current result contains a shorter match.

    The first valid match will be returned; this isn't always the best match.
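
    A small sketch along the same lines, reusing the simplified text (the variable names are mine). With $ anchoring the end, the greedy and non-greedy quantifiers are forced to the same result from the first "ABC":

```perl
use strict;
use warnings;

my $text = "000ABCDEFABCGHI\n";

# Both attempts begin at the first "ABC"; the $ anchor then dictates
# how far the quantifier must reach, greedy or not.
my ($greedy) = $text =~ /(ABC.*)$/;
my ($lazy)   = $text =~ /(ABC.*?)$/;

print "greedy: $greedy\n";   # greedy: ABCDEFABCGHI
print "lazy:   $lazy\n";     # lazy:   ABCDEFABCGHI
```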

      Ok, is there a nice detailed description of the engine that would fill in what causes a second run and what the "?" does exactly?

      This behavior isn't very intuitive.

      Also, is there another way to get the desired result other than the ugly hack I posted below?
        You can force the regex engine to start looking for </a> at the end of the string and work backwards by first consuming the whole string with a greedy .* and then backtracking character by character:
        my $string = q!Back to STATES Menu</font></a></h3> <p align="center"><a href="index.htm"><img src="home2.gif" alt="Home" border="0" width="106" height="30"></a></p> </body> </html>!;

        if ( $string =~ m!^.*(</a>.*?)$! ) {
            print "got $1\n";
        }
        but there's usually a better way to get what you want done.
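
        One possible alternative, offered as a sketch rather than something proposed in this thread: a negative lookahead that refuses to step over another </a>, so only the last one can reach the end of the string. The helper name last_anchor_tail and the abbreviated string are assumptions for illustration:

```perl
use strict;
use warnings;

# Match </a> followed only by characters that do not begin another </a>,
# all the way to the end of the string. Earlier </a> occurrences fail
# because the lookahead stops them before the next </a>.
sub last_anchor_tail {
    my ($str) = @_;
    return $str =~ m{(</a>(?:(?!</a>).)*)$}s ? $1 : undef;
}

# Abbreviated, hypothetical stand-in for the target string.
my $doc = '</font></a></h3> <p><a href="index.htm"><img src="home2.gif"></a></p> </body> </html>';

my $found = last_anchor_tail($doc);
print defined $found ? "got $found\n" : "no match\n";   # got </a></p> </body> </html>
```

        The engine still tries the leftmost </a> first; it simply cannot succeed there, so it moves on until it reaches the last one.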
