negative Lookahead or

eddmund has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a problem that sounds very simple: once you have found a hit within a webpage's source, how can you find out whether this hit is part of a link, or more generally, within a tag? I tried the following code, which matches the first string correctly, but also matches the second one, which should not match.

#!/usr/bin/perl -w
use strict;

my @strings = (
'<a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>',
'<a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>'
); 
 
foreach my $string (@strings) {
"" =~ /()/;        # set $1 to false
$string =~ /
(                        # the whole match should be captured as $1
    HIT                    # start at the first found "HIT"
    (?=                    # begin of positive Lookahead
        [^<]*?            # zero or more characters other than "<" 
        (?!                # start a negative Lookahead
            <a[^>]*?>    # an opening a tag is forbidden
        )                
        .*?<\/a>        # any characters until the closing a tag
    )                    # end of positive Lookahead
    .*?                    # catch all chars after HIT,
    <\/a>                # non-greedy until the first closing a tag
)
/x;

print "\nstring:\t$string\n";
if ($1) { print "match:\t$1\n" }
else { print "NO match!\n" }
}

I think the problem is related to my use of the negative lookahead. Could anybody point me in the right direction? Thanks, rob.


Re: negative Lookahead or by wog (Curate) on Nov 12, 2001 at 07:39 UTC
First, this code produces the following output for me:

string:	<a href="#1">HIT<b>MUST MATCH</b></a> <a href="#2">2</a>
match:	HIT<b>MUST MATCH</b></a>

string:	<a href="#1">HIT<a name="MUST NOT MATCH">-</a> <a href="#2">2</a>
NO match!

This seems to be the result you want. However, I would advise against using regexes to parse HTML, because it is very hard to make a regex do it correctly (especially since `>` or `<` can appear in comments or quoted attribute values). It would be better to use HTML::Parser or HTML::TokeParser. Searching this site should turn up many examples of their use.
Re: Re: negative Lookahead by eddmund (Initiate) on Nov 12, 2001 at 08:05 UTC
Thanks for your suggestion, wog, but I would like to know what I did wrong in my regex. When I tried to post my code, the two strings were too long, so I cut them down a little, which also removed the dummy characters after "HIT". Add a "-" after "HIT", or use the original definition

my @strings = (
'<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
'<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>'
);

and the second string matches as well, although it should not. Any help is greatly appreciated.
Re: Re: Re: negative Lookahead by wog (Curate) on Nov 12, 2001 at 08:15 UTC
What appears to be happening with the second string is this: You match `HIT`. Then you try to match `[^<]*?`. The `*?` first tries to match the minimal number of characters, so it matches zero. At that position the next character is the `-` after `HIT`, so the pattern inside the negative lookahead (an opening a tag) does not match there. The regex engine does not backtrack to make the `*?` match more just so that the pattern inside the negative lookahead can match. (Update: because, if it did that, the lookahead would be doing more than just looking ahead in the string.) Thus the negative lookahead succeeds right away, the following `.*?<\/a>` is free to run past the opening a tag, and so the positive lookahead succeeds.
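If you do want to stay with a regex, one possible workaround is to apply the negative lookahead before every single character consumed between `HIT` and the closing tag, instead of testing it at one position only. The following is a minimal sketch under that assumption, not a robust HTML parser: `hit_in_link` is a hypothetical helper name, and the pattern only recognizes lowercase `<a` openers, so the caveats about comments and quoted attribute values still apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hit_in_link() is a hypothetical helper name, not something from the thread.
# It returns the text from "HIT" up to the first closing a tag, or undef
# when an opening a tag appears in between.
sub hit_in_link {
    my ($string) = @_;
    return $string =~ m{
        (
            HIT
            (?:
                (?! <a\b )   # the next character must not begin an opening a tag
                .            # then consume exactly one character
            )*?
            </a>             # stop at the first closing a tag
        )
    }xs ? $1 : undef;
}

my @strings = (
    '<a href="#1">--HIT-<b>MUST MATCH</b></a> <a href="#2">2</a>',
    '<a href="#1">--HIT-<a name="MUST NOT MATCH">--</a> <a href="#2">2</a>',
);

foreach my $string (@strings) {
    my $match = hit_in_link($string);
    print "string:\t$string\n";
    print defined $match ? "match:\t$match\n" : "NO match!\n";
}
```

Because the lookahead is now checked at every step, the lazy quantifier can no longer satisfy it once and then let the rest of the pattern skip over an opening a tag. The `\b` keeps tags such as `<abbr>` from being mistaken for a link.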
Re: Re: Re: Re: negative Lookahead by eddmund (Initiate) on Nov 12, 2001 at 21:00 UTC