PerlMonks  

Re: negative Lookahead or

by wog (Curate)
on Nov 12, 2001 at 07:39 UTC (#124731)


in reply to negative Lookahead or

First, this code produces the following output for me:

string: <a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>
match: HIT<b>MUST MATCH</b></a>
string: <a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>
NO match!

This seems like the result you desire.

However, I would advise against using regexes to parse HTML, because it is very, very hard to make a regex do it correctly (especially since there can be >s or <s in comments or quoted attribute values). It would be better to use HTML::Parser or HTML::TokeParser. Searching this site should give you many examples of their use.
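As a rough illustration of the parser-based route, here is a minimal sketch using HTML::TokeParser (a CPAN module, so it has to be installed). The helper name `link_with_hit` and the rule it applies (a link counts only if it contains "HIT" and is closed before another `<a>` opens) are assumptions inferred from the sample strings in this thread, not the original poster's code.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the inner markup of the first <a> element that
# contains "HIT" and is closed before another <a> tag opens, or undef.
sub link_with_hit {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    while ($parser->get_tag('a')) {
        my ($content, $nested) = ('', 0);
        while (my $token = $parser->get_token) {
            my $type = $token->[0];
            last if $type eq 'E' && $token->[1] eq 'a';
            if ($type eq 'S' && $token->[1] eq 'a') {
                # A new <a> opened before </a>: reject this link and hand the
                # tag back so the outer loop can examine it as its own link.
                $parser->unget_token($token);
                $nested = 1;
                last;
            }
            # Rebuild the source text from start tags, end tags and text.
            $content .= $type eq 'S' ? $token->[4]
                      : $type eq 'E' ? $token->[2]
                      : $type eq 'T' ? $token->[1]
                      :                '';
        }
        return $content if !$nested && $content =~ /HIT/;
    }
    return;
}

for my $html (
    '<a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>',
) {
    my $found = link_with_hit($html);
    print defined $found ? "match: $found\n" : "NO match!\n";
}
```

Because the parser deals with quoting and comments itself, the check reduces to walking the tokens between one `<a>` and the next `</a>`.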

Re: Re: negative Lookahead
by eddmund (Initiate) on Nov 12, 2001 at 08:05 UTC
    Thanks for your suggestion, wog. But I would like to know what I did wrong in my regex. When I tried to post my code, the 2 strings were too long, so I cut them down a little, which removed the dummy characters after "HIT". Add a "-" or use the original definition
    my @strings = (
        '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
        '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
    );
    and the second string is no longer rejected: it now matches, although it should not. Any help is greatly appreciated.
      What appears to be happening with the second string is this:

      You match HIT. Then the engine goes on to match [^<]*?. The *? first tries to match the minimal number of characters, so it matches 0. The negative look-ahead is then tested right there, at that position; it does not ask for backtracking to make the *? match more so that it would reach the <a. (update: because, if it did that, then the look-ahead would be doing more than just looking ahead in the string.) Thus, it doesn't find <a[^>]*> at that position and the negative lookahead succeeds, so the positive lookahead succeeds.
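The behaviour described above can be reproduced with a short script. This is only a sketch: the non-greedy pattern is an assumed reconstruction (the `[^<]*?` comes from the explanation, the rest mirrors the regex quoted in this thread), and `first_hit` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed form of the pattern under discussion (non-greedy [^<]*?).
my $re = qr/ ( HIT (?= [^<]*? (?! <a[^>]*> ) .*?<\/a> ) .*?<\/a> ) /x;

# Hypothetical helper: return the captured text, or undef if nothing matches.
sub first_hit {
    my ($string) = @_;
    return $string =~ $re ? $1 : undef;
}

my @strings = (
    '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">--HIT<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
    '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
);

for my $string (@strings) {
    my $match = first_hit($string);
    print "string: $string\n";
    print defined $match ? "match: $match\n" : "NO match!\n";
}
```

In the third string the "-" after HIT gives the negative look-ahead a position where `<a` does not start, so the look-ahead succeeds there and the string matches. In the second string the `<a` follows HIT directly, so there is no such position.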

        OK, thanks. Taking that into account and switching from non-greedy to greedy results in the regex
        $string =~ / ( HIT (?= [^<]* (?! <a[^>]*> ) .*?<\/a> ) .*?<\/a> ) /x;
        but it still gives the same results. If the $string
        '<a href="#1">--HIT<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'
        properly does not match and
        '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'
        does "falsely" match, I don't understand why the problem lies in the fact that a negative lookahead does not backtrack.
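One way to look at this, offered as a sketch rather than a definitive answer: the greedy `[^<]*` first consumes the "-", the negative look-ahead then fails at `<a`, and the engine backtracks `[^<]*` to zero characters, where the look-ahead succeeds. The variant below wraps `[^<]*` in an atomic group `(?> ... )` so it cannot give characters back. That variant is an assumption of this example, not something proposed in the thread, and `matches` is a hypothetical helper.

```perl
use strict;
use warnings;

# The greedy regex as quoted above.
my $greedy = qr/ ( HIT (?= [^<]* (?! <a[^>]*> ) .*?<\/a> ) .*?<\/a> ) /x;

# Assumed variant: (?> ... ) stops the engine from backtracking into [^<]*.
my $atomic = qr/ ( HIT (?= (?> [^<]* ) (?! <a[^>]*> ) .*?<\/a> ) .*?<\/a> ) /x;

# Hypothetical helper: 1 if $string matches $re, otherwise 0.
sub matches {
    my ($re, $string) = @_;
    return $string =~ $re ? 1 : 0;
}

my $good = '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>';
my $bad  = '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>';

printf "greedy: good=%d bad=%d\n", matches($greedy, $good), matches($greedy, $bad);
printf "atomic: good=%d bad=%d\n", matches($atomic, $good), matches($atomic, $bad);
```

Note that the atomic variant only inspects the first tag after HIT. An `<a>` that appears later, after some other tag, would still get through, which is part of why a real HTML parser is the safer tool here.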
