http://qs321.pair.com?node_id=667062

the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

I just can't seem to get these look-ahead/look-behind assertions down pat like some of you. I have this code...

$contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";

if ($contents =~ /<p>(.*?\[tab\].*?)<\/p>/) {
    print "yes: ${1}\n";
}
else {
    print "no\n";
}

if ($contents =~ /<p>(?=\[tab\])<\/p/) {
    print "yes: ${1}\n";
}
else {
    print "no\n";
}

What I want here is to say only match the set of paragraph markers with the actual string [tab] inside.

Notice that the first one DOES match. However, since I am using .*?, the first para marker matches and then the .*? takes over and matches all the way to the close para marker that follows the [tab]. I want the first para marker to fail because there is no [tab] inside that pair of paragraph markers.

The second one is a look-ahead, but notice the [tab] would have to be immediately after the para marker. I can't figure out how to tell it the possibility of some text, then the [tab], then the possibility of more text.

Thanks for any assistance you may provide...

Replies are listed 'Best First'.
Re: Matching set of paragraph tags with string inside.
by Roy Johnson (Monsignor) on Feb 08, 2008 at 21:13 UTC
    my $re = qr/
      <p>                # paragraph-open
      (                  # Capture
        (?:              # group
          (?!<\/p>)      # Make sure we aren't at a paragraph-close
          (?!\[tab\])    # and we're not at [tab]
          .              # consume a char
        )*               # any number of times
        \[tab\]          # consume [tab] (hooray!)
        .*?              # anything up to paragraph-close
      )
      <\/p>
    /x;
    Update: to not capture paragraph tags

    Caution: Contents may have been coded under pressure.
      Thanks a lot for the code and explanation. I think this is exactly what I need. I'll mess with it some more and see if I can break it.
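
To see how the pattern above behaves on the sample string, here is a minimal sketch. It assumes the same approach (step forward one character at a time, never stepping over </p> or [tab]) and adds a second [tab] paragraph to the data purely for illustration. The @bodies name is hypothetical, and m{}-style delimiters are used only to avoid escaping the slashes.

```perl
use strict;
use warnings;

# Sample data from the question, plus one extra [tab] paragraph (an
# assumption added here so the global match has two hits to show).
my $contents = '<p>no tabs here</p><p>column 1[tab]column 2</p>'
             . '<p>no tabs here</p><p>a[tab]b</p>';

my $re = qr{
    <p>
    (                                   # capture the paragraph body only
        (?: (?!</p>) (?!\[tab\]) . )*   # step forward, never over a close tag or [tab]
        \[tab\]                         # the literal marker must appear
        .*?                             # then anything up to the nearest close tag
    )
    </p>
}x;

# In list context, a /g match returns every captured body.
my @bodies = $contents =~ /$re/g;
print "$_\n" for @bodies;
```

A paragraph without [tab] fails at its own <p>, because the inner group stops at </p> and the required \[tab\] cannot match there, so the engine moves on to the next <p> instead of running across the boundary.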
Re: Matching set of paragraph tags with string inside.
by Tanktalus (Canon) on Feb 08, 2008 at 21:10 UTC

    If your HTML is actually XHTML-compliant, you could use XML::Twig to parse it, and then do something like this:

    my @tagged_paragraphs = $twig->get_xpath('//p[string()=~/\[tab\]/]');
    my @texts = map { $_->text() } @tagged_paragraphs;
    Note that if you have p's in p's (e.g., "<p>some text<p>inner [tab] stuff</p>outer</p>"), this may give you problems (you'll get both "some textinner [tab] stuffouter" and "inner [tab] stuff", I believe).

Re: Matching set of paragraph tags with string inside.
by jepri (Parson) on Feb 08, 2008 at 21:11 UTC
    I try to avoid getting too tricky with regexes, because I am a bear of little brain. I'd break the string up then inspect it in pieces, perhaps like this:

    my @arr = split /<\/p>/, $string;
    my @matches = grep { /\[tab\]/ } @arr;

    One extra line, but much easier for me. If I were concerned about it looking neat, I'd put the code inside a subroutine called match_tabs().

    ___________________
    Jeremy
    I didn't believe in evil until I dated it.

      One minor difference is that splitting this way removes the close-paragraph, but not the open-paragraph, so you need to remove it.
      my @arr = grep {/\[tab\]/ and s/^<p>//} split /<\/p>/, $contents;
      If you wanted to retain both tags, you could do so by splitting on a lookbehind expression:
      my @arr = grep /\[tab\]/, split /(?<=<\/p>)/, $contents;
      You'd also retain the paragraph-close if you used it for $/ and read the string as a file.
      {
          local $/ = '</p>';
          open (STR, '<', \$contents) or die "Opening string: $!\n";
          @arr = grep /^<p>/ && /\[tab\]/, <STR>;
          print "read $_\n" for @arr;
      }
      But now I'm just getting silly.

      Caution: Contents may have been coded under pressure.
        You seem to be a bit focussed on using big regexes :P

        my @matches = map { s/<p>//; $_ } grep { /\[tab\]/ } split /<\/p>/, $string;

        ___________________
        Jeremy
        I didn't believe in evil until I dated it.

      Hmmm, that actually makes sense. Luckily, most of the time I will be looking for exactly what I originally showed in my example. This should work OK. I'll have to mess with it and see if there's anything I didn't think of that may pop up.

      Thanks.
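
The split-and-grep idea from this sub-thread can be wrapped up as the match_tabs() subroutine suggested above. This is only a sketch: the sub name comes from the reply, but its signature and return value are assumptions made here, and it strips the open tag from a copy so the caller's data is left alone.

```perl
use strict;
use warnings;

# match_tabs() is the helper name floated in the reply above; the
# argument and return conventions are assumptions for this sketch.
# It returns the body of every chunk that contains the literal [tab].
sub match_tabs {
    my ($string) = @_;
    return map  { my $p = $_; $p =~ s/^<p>//; $p }  # drop the open tag from a copy
           grep { /\[tab\]/ }                       # keep chunks containing [tab]
           split /<\/p>/, $string;                  # split consumes the close tags
}

my $contents = '<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>';
my @matches  = match_tabs($contents);
print "$_\n" for @matches;
```

Note that this treats everything between two close tags as one chunk, so stray text containing [tab] outside any paragraph would also be returned. For strings shaped like the example, that should not matter.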
Re: Matching set of paragraph tags with string inside.
by moritz (Cardinal) on Feb 08, 2008 at 20:54 UTC
    In the second example you're trying to match <p></p> directly, with an additional assertion. Try /<p>(?=.*?\[tab\]).*?<\/p>/ instead. But that doesn't check whether the [tab] occurs before the </p>, so don't try to mess with lookarounds; use your first pattern instead.

    Update: OK, that doesn't fix your real problem. I'll have to think a bit more about it. In this simple case you can just use [^<>] instead of a dot everywhere between <p> and </p>.

    The old truth that HTML shouldn't be parsed with regexes still holds.

      My problem with the first match is it matches too much...
      original: "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>"
      result of first regex: no tabs here</p><p>column 1[tab]column 2
      What I want it to match is...
      column 1[tab]column 2
      Since that is the set of para's with the string [tab].
        My first shot was too fast, here's a working solution:
        my $contents = "<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>";
        if ($contents =~ m{(<p>[^<>]*\[tab\][^<>]*</p>)} ) {
            print "Matched '$1'\n";
        }

      I understand your concern of parsing html with regexes, but this string will never be a full set of html. It will have some html (of course with the para's) but using a parser would be overkill for this situation.
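
For the simple case described in this sub-thread, the [^<>] approach can also return just the paragraph body, which is what the original question asked for. The following is a sketch under that assumption: it moves the capturing parentheses inside the tags and adds /g, neither of which appears in the replies above, and the @bodies name is hypothetical.

```perl
use strict;
use warnings;

my $contents = '<p>no tabs here</p><p>column 1[tab]column 2</p><p>no tabs here</p>';

# [^<>] cannot step over a tag, so the match cannot leak into a
# neighbouring paragraph. The parentheses sit inside the tags so that
# only the paragraph body is captured.
my @bodies = $contents =~ m{<p>([^<>]*\[tab\][^<>]*)</p>}g;
print "Matched '$_'\n" for @bodies;
```

The trade-off is that a paragraph containing any nested tag (for example <b>) will not match at all, so this only suits fragments as plain as the one in the question.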