PerlMonks  

negative Lookahead or

by eddmund (Initiate)
on Nov 12, 2001 at 07:11 UTC ( [id://124727]=perlquestion )

eddmund has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a problem that sounds very simple: if you find a hit within a webpage's source, how can you find out whether this hit is part of a link or, more generally, within a tag? I tried the following code, which finds the first string correctly, but also the second one, which should not match.
#!/usr/bin/perl -w
use strict;

my @strings = (
    '<a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>'
);

foreach my $string (@strings) {
    "" =~ /()/;    # set $1 to false
    $string =~ /
        (                        # the whole match should be captured as $1
            HIT                  # start at the first found "HIT"
            (?=                  # begin of positive lookahead
                [^<]*?           # zero or more characters other than "<"
                (?!              # start a negative lookahead
                    <a[^>]*?>    # an opening a tag is forbidden
                )
                .*?<\/a>         # any characters until the closing a tag
            )                    # end of positive lookahead
            .*?                  # catch all chars after HIT,
            <\/a>                # non-greedy until the first closing a tag
        )
    /x;
    print "\nstring:\t$string\n";
    if ($1) { print "match:\t$1\n" }
    else    { print "NO match!\n" }
}
I think the problem is related to my use of the negative lookahead. Could anybody point me in the right direction? Thanks, rob.

Replies are listed 'Best First'.
Re: negative Lookahead or
by wog (Curate) on Nov 12, 2001 at 07:39 UTC
    First, this code produces the following output for me:

    string: <a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>
    match:  HIT<b>MUST MATCH</b></a>

    string: <a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>
    NO match!

    This seems like the result you desire.

    However, I would advise you not to use regexes to parse HTML, because it is very, very hard to make a regex do it correctly (especially since there can be >s or <s in comments or quoted attribute values). It would be better to use HTML::Parser or HTML::TokeParser. Searches of this site should give you many examples of their use.
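
    As a rough illustration of the parser-based approach suggested above, here is a minimal sketch using HTML::TokeParser (a CPAN module from the HTML-Parser distribution, so it has to be installed). The helper name hit_inside_link is made up for this example, and "inside a link" is assumed to mean "in a text token that occurs while at least one <a> element is open"; that is a simplification, not necessarily the exact rule the original poster wants.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): returns 1 if $needle occurs in
# text that appears while at least one <a> element is open, otherwise 0.
sub hit_inside_link {
    my ($html, $needle) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    my $depth = 0;    # number of currently open <a> elements

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] eq 'a') {
            $depth++;                      # start tag <a ...>
        }
        elsif ($type eq 'E' && $token->[1] eq 'a') {
            $depth-- if $depth > 0;        # end tag </a>
        }
        elsif ($type eq 'T' && $depth > 0) {
            # text token: $token->[1] is the raw text between tags
            return 1 if index($token->[1], $needle) >= 0;
        }
    }
    return 0;
}

for my $html ('<a href="#1">HIT</a>', '<p>HIT</p>', '<a name="HIT">x</a>') {
    printf "%-24s %s\n", $html,
        hit_inside_link($html, 'HIT') ? 'in link text' : 'not in link text';
}
```

    Because the parser hands back attribute values separately from text, a hit inside an attribute such as name="HIT" is never mistaken for link text, which is the case that is awkward to exclude with a single regex.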

      Thanks for your suggestion, wog. But I would like to know what I did wrong in my regex. When I tried to post my code, the two strings were too long, so I cut them down a little bit and removed the dummy characters after "HIT". Add a "-" after "HIT", or use the original definition
      my @strings = (
          '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
          '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'
      );
      and the second string now matches, although it should not. Any help is greatly appreciated.
        What appears to be happening with the second string is this:

        You match HIT. Then the engine tries to match [^<]*?. The *? first tries to match the minimal number of characters, so it matches 0. The negative look-ahead is therefore tested directly after HIT, where the next character is "-" and not the start of an <a ...> tag, and it does not try to do backtracking to make that *? match more so that the forbidden tag could be found. (update: because, if it did that, then the look-ahead would be doing more than just looking ahead in the string.) Thus, it doesn't find the forbidden tag at that position and the negative lookahead succeeds, so the positive lookahead succeeds.
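
        The behaviour described above can be reproduced with the sketch below. It compares the lazy form from the question with one possible variant that wraps [^<]* in an atomic group (?>...), so that the negative look-ahead is only ever tested directly in front of the next "<". The variable names $lazy and $atomic and the helper first_hit are invented for this example, and the atomic-group variant is a suggestion of this sketch, not something proposed in the thread.

```perl
use strict;
use warnings;

# Lazy form, as in the question: [^<]*? is happy with zero characters,
# so the negative look-ahead is tested right after HIT.
my $lazy = qr{
    ( HIT
      (?= [^<]*? (?! <a[^>]*> ) .*? </a> )
      .*? </a>
    )
}x;

# Assumed variant: the atomic group consumes everything up to the next "<"
# and cannot give characters back, so the look-ahead sees the next tag.
my $atomic = qr{
    ( HIT
      (?= (?> [^<]* ) (?! <a[^>]*> ) .*? </a> )
      .*? </a>
    )
}x;

# Hypothetical helper: returns the captured hit, or undef if nothing matched.
sub first_hit {
    my ($string, $re) = @_;
    my ($hit) = $string =~ $re;
    return $hit;
}

my @strings = (
    '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
);

for my $string (@strings) {
    print "string: $string\n";
    for my $case ([lazy => $lazy], [atomic => $atomic]) {
        my ($name, $re) = @$case;
        my $hit = first_hit($string, $re);
        printf "  %-7s %s\n", "$name:", defined $hit ? $hit : 'NO match!';
    }
}
```

        A plain greedy [^<]* would not be enough here: when the look-ahead fails in front of the <a ...> tag, the engine would backtrack to zero characters and succeed in the same way as the lazy form. This is still a regex over HTML, so the earlier caveats about comments and quoted attribute values apply unchanged.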
