Regexp: Match anything except a certain word

by fraktalisman (Hermit)
on Apr 21, 2006 at 16:56 UTC ( [id://544944]=perlquestion )

fraktalisman has asked for the wisdom of the Perl Monks concerning the following question:

I want to modify a string of HTML code, using a simple regex, so that all links (<a ...) with an href that starts with "http" are given a target="_blank" attribute, unless they already have a target.

The regex I've come up with so far produces the desired output, but it is awfully slow and far from elegant. It looks for links, extracts link targets and remaining attributes, and checks that the remaining parts don't already contain the word target:

$test=~s/<a ([^t|>]*?[^a|>]*?[^r|>]*?[^g|>]*?[^e|>]*?[^t|>]*?)href=(" +??http:[^"]*?" ??)([^t|>]*?[^a|>]*?[^r|>]*?[^g|>]*?[^e|>]*?[^t|>]*?)> +/<a $1 href=$2$3 target="_blank">/gosi;

I want to improve my regex but I wonder how. What I couldn't find anywhere in the documentation or tutorials is how to match anything (like .*?) unless it contains a certain expression (target). Maybe I'd better use HTML::Parser in this case, but I still want to know how to optimize the expression.

Replies are listed 'Best First'.
Re: Regexp: Match anything except a certain word
by wfsp (Abbot) on Apr 21, 2006 at 18:14 UTC
    I agree with santonegro that a parser makes this a lot simpler. Let someone else do the heavy lifting. :-)
    #!/bin/perl5
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Slurp the sample HTML from the DATA section.
    my $html;
    {
        local $/;
        $html = <DATA>;
    }

    my $tp = HTML::TokeParser::Simple->new(\$html)
        or die "Couldn't parse string: $!";

    while (my $t = $tp->get_token) {
        # Only external links without a target get target="_blank".
        # The || '' avoids an "uninitialized" warning for <a> tags without href.
        if (    $t->is_start_tag('a')
            and ( $t->get_attr('href') || '' ) =~ /^http/
            and not $t->get_attr('target') )
        {
            $t->set_attr('target', '_blank');
        }
        print $t->as_is;
    }

    __DATA__
    <a href="http://here.com" target="_blank">here</a>
    <a href="http://there.com">there</a>
    <a href="http://everywhere.com" target="foo">everywhere</a>
    <a href="local.html">local</a>
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" parse_.pl
    <a href="http://here.com" target="_blank">here</a>
    <a href="http://there.com" target="_blank">there</a>
    <a href="http://everywhere.com" target="foo">everywhere</a>
    <a href="local.html">local</a>
    > Terminated with exit code 0.
    Update: Covered the "unless they already have a target" condition as pointed out by ikegami below. I would argue that fixing/changing this script is easier than fixing a regex (but I would say that wouldn't I :-))
      Close. You missed the "unless they already have a target" condition specified by the OP. For example,
      <a href="http://example.com" target="foo">here</a>
      shouldn't change, but it becomes
      <a href="http://example.com" target="_blank">here</a>

      Update: Added example.

Re: Regexp: Match anything except a certain word
by Roy Johnson (Monsignor) on Apr 21, 2006 at 17:15 UTC
    See the "Matching a pattern that doesn't include another pattern" section of the Using Look-ahead and Look-behind tutorial.

    Caution: Contents may have been coded under pressure.
Re: Regexp: Match anything except a certain word
by santonegro (Scribe) on Apr 21, 2006 at 17:54 UTC
    I want to modify a string of HTML code
    use HTML::Tree
Re: Regexp: Match anything except a certain word
by TedPride (Priest) on Apr 21, 2006 at 18:25 UTC
    The following seems to work:
    $_ = join '', <DATA>;
    s/(<a href="https?:\/\/(?:(?!target=).)*?)>/$1 target="_blank">/igs;
    print;
    __DATA__
    <a href="dsfsdf">sdfdsf</a>
    <a href="http://dsfsdf">sdfdsf</a>
    <a href="https://dsfsdf">sdfdsf</a>
    <a href="http://dsfsdf" target="_blank">sdfdsf</a>
    <a href="http://dsfsdf" >sdfdsf</a>

      You're assuming the target will come after the href, but the OP allowed for any order.

      See my earlier post for the fix.

Re: Regexp: Match anything except a certain word
by ikegami (Patriarch) on Apr 21, 2006 at 17:06 UTC
    $test =~ s/(<a\s(?:(?!target=)(?!>).)*)(?=>)/$1 target="_blank"/sgi;

    Tested.

      Thanks! This looks good, although it does not check for http, so it also adds the target to local links. I should be able to put the pieces together now.

        Oops, I read the problem too fast. Fix:
        $test =~ s{
           (
              <a\s
              (?:(?!target=|>).)*
              href="http://
              (?:(?!target=|>).)*
           )
           (?=>)
        }{$1 target="_blank"}xsgi;

        Alternatively, the following *might* be faster, especially considering href is usually the first attribute:

        $test =~ s{
           (
              <a\s
              (?:(?!href=|target=|>).)*
              href="http://
              (?:(?!target=|>).)*
           )
           (?=>)
        }{$1 target="_blank"}xsgi;

        Tested.

        Update: Collapsed (?!..re1..)(?!..re2..) into (?!..re1..|..re2..)
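
        A minimal sketch for checking the second substitution above against attributes in different orders. The helper name add_blank_target and the sample links are made up for illustration; only the pattern itself comes from this post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): applies the substitution
# to a copy of the input and returns the modified copy.
sub add_blank_target {
    my ($html) = @_;
    $html =~ s{
        (
            <a\s
            (?:(?!href=|target=|>).)*   # attributes before href, none of them target
            href="http://
            (?:(?!target=|>).)*         # rest of the tag, still without target
        )
        (?=>)                           # succeed only right before the closing '>'
    }{$1 target="_blank"}xsgi;
    return $html;
}

# Made-up sample links covering the cases discussed in the thread.
my @samples = (
    '<a href="http://example.com">plain</a>',
    '<a class="x" href="http://example.com">href not first</a>',
    '<a href="http://example.com" target="foo">target after href</a>',
    '<a target="foo" href="http://example.com">target before href</a>',
    '<a href="local.html">local</a>',
);

print add_blank_target($_), "\n" for @samples;
```

        Only the first two samples should gain a target; the other three should come back unchanged. Note that the pattern looks for the literal text target= anywhere in the tag, so an href whose URL happens to contain target= would also be skipped.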

Re: Regexp: Match anything except a certain word
by SamCG (Hermit) on Apr 21, 2006 at 17:07 UTC
    Sounds like you want a negative look-ahead? Aren't those expressed something like $test=~/<a.+?(?!target).+?>/;?

    Sort of guessing at the form, but I think that's what you're after...

    update: this has a problem because of the .+.


    -----------------
    s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;

      That's not how you use the negative lookahead. It won't work. For example,

      '<a href="..." target="...">' =~ m{
         <a           # Matches '<a'.
         .+?          # Matches ' '.
         (?!target)   # Matches. (/\Gtarget/ doesn't match.)
         .+?          # Matches 'href="..." target="..."'.
         >            # Matches '>'.
      }isx;
      and
      '<a target="..." href="...">' =~ m{
         <a           # Matches '<a'.
         .+?          # Matches ' t'.
         (?!target)   # Matches. (/\Gtarget/ doesn't match.)
         .+?          # Matches 'arget="..." href="..."'.
         >            # Matches '>'.
      }isx;

      The common use of a negative lookahead is

      /(?:(?!$re).)*/

      See my earlier post for the fix.
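
      As a minimal sketch of that idiom in isolation: the helper name lacks_word and the sample tags below are invented for illustration and are not from this thread. The lookahead is checked before every single character, so the anchored match can only reach the end of the string if the word never starts anywhere in it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text contains no occurrence of $word,
# 0 otherwise, using the /(?:(?!$re).)*/ idiom anchored at both ends.
sub lacks_word {
    my ($text, $word) = @_;
    my $re = quotemeta $word;    # treat the word as literal text
    # Each '.' is consumed only if $re does not start at that position.
    return $text =~ /\A(?:(?!$re).)*\z/s ? 1 : 0;
}

for my $tag (
    '<a href="x">',
    '<a href="x" target="y">',
    '<a target="y" href="x">',
) {
    printf "%-26s %s\n", $tag,
        lacks_word($tag, 'target') ? 'no target' : 'has target';
}
```

      Unlike a single (?!target) placed between two .+? groups, this version rejects the word wherever it appears, before or after the href.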

Re: Regexp: Match anything except a certain word
by fraktalisman (Hermit) on May 09, 2006 at 13:44 UTC

    And finally, what did I do? I used HTML::Parser, and learned some more about regular expressions and how to understand HTML::Parser better.

Node Type: perlquestion [id://544944]
Approved by Tanalis