Re: negative Lookahead

First, this code produces the following output for me:

string: <a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>
match:  HIT<b>MUST MATCH</b></a>

string: <a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>
NO match!

This seems like the result you desire.

However, I would advise against using regexes to parse HTML, because it is very, very hard to make a regex do it correctly (especially since there can be >s or <s in comments or quoted attribute values). It would be better to use HTML::Parser or HTML::TokeParser. Searching this site should give you many examples of their use.


Re: Re: negative Lookahead by eddmund (Initiate) on Nov 12, 2001 at 08:05 UTC
Thanks for your suggestion, wog. But I would like to know what I did wrong in my regex. When I tried to post my code, the two strings were too long, so I cut them down a little, including the dummy characters after "HIT". Add a "-" or use the original definition `my @strings = ( '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>', '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>' );` and the second string does not match anymore. Any help is greatly appreciated.
Re: Re: Re: negative Lookahead by wog (Curate) on Nov 12, 2001 at 08:15 UTC
What appears to be happening with the second string is this: You match `HIT`. Then you try to match `[^<]*?`. The `*?` first tries to match the minimal number of characters, so it gets 0. And the negative lookahead does not backtrack to make that `*?` match more so that its own pattern can match. (Update: because, if it did that, then the lookahead would be doing more than just looking ahead in the string.) Thus the pattern inside the negative lookahead finds no match at that position, the negative lookahead succeeds, and so the positive lookahead succeeds as well.
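To make the position issue concrete, here is a minimal sketch. It assumes the quantifier under discussion is the lazy `[^<]*?`, and it uses a cut-down pattern rather than the full regex from the original post; `probe` is a hypothetical helper name. It reports how many characters the lazy part consumed and which two characters the negative lookahead was tested in front of.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how much the lazy part consumed and which
# two characters the negative lookahead was tested in front of.
sub probe {
    my ($string) = @_;
    return 'no match'
        unless $string =~ / HIT ( [^<]*? ) (?! <a[^>]*> ) /x;
    # $+[0] is the end offset of the whole match, i.e. where the
    # zero-width lookahead was evaluated.
    return sprintf "lazy part took %d chars, lookahead tested in front of '%s'",
        length $1, substr( $string, $+[0], 2 );
}

print probe($_), "\n" for (
    '<a href="#1">--HIT<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
    '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
);
```

With `HIT` directly followed by `<a ...>`, the lookahead sees the tag and the match fails. With `HIT-`, the lazy part takes zero characters, so the lookahead is tested in front of the `-` and never sees the `<a ...>` tag.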
Re: Re: Re: Re: negative Lookahead by eddmund (Initiate) on Nov 12, 2001 at 21:00 UTC
OK, thanks. Switching from non-greedy to greedy gives the regex `$string =~ / ( HIT (?= [^<]* (?! <a[^>]*> ) .*?<\/a> ) .*?<\/a> ) /x;` but it still produces the same results. Given that the $string `'<a href="#1">--HIT<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'` correctly does not match while `'<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'` "falsely" matches, I don't understand why the problem lies in the fact that a negative lookahead does not backtrack.
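For anyone following along, the sketch below compares the greedy lookahead with a variant that wraps `[^<]*` in an atomic group `(?> ... )`. The atomic variant is not from this thread; it is one possible way to stop the engine from handing characters back. The patterns are cut down to the lookahead part, and the names `$greedy` and `$atomic` are illustrative only.

```perl
use strict;
use warnings;

# Greedy: [^<]* first takes "-", the negative lookahead fails in front of
# "<a name=...>", so the engine backs off to zero characters and retests
# the lookahead in front of "-", where it succeeds.
my $greedy = qr/ HIT (?= [^<]* (?! <a[^>]*> ) .*? <\/a> ) /x;

# Atomic (an assumption, not from the thread): (?> ... ) keeps whatever
# [^<]* took, so the lookahead is only ever tested in front of the next "<".
my $atomic = qr/ HIT (?= (?> [^<]* ) (?! <a[^>]*> ) .*? <\/a> ) /x;

my @strings = (
    '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
);

for my $string (@strings) {
    printf "greedy: %-8s atomic: %-8s %s\n",
        ( $string =~ $greedy ? 'match' : 'no match' ),
        ( $string =~ $atomic ? 'match' : 'no match' ),
        $string;
}
```

The greedy pattern matches both strings because the quantifier, not the lookahead, does the backtracking. The atomic variant rejects the second string, though it only checks the first tag after `HIT`, which is one more reason to prefer a real HTML parser.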

