Regexp: Match anything except a certain word

fraktalisman has asked for the wisdom of the Perl Monks concerning the following question:

I want to modify a string of HTML code, using a simple regex, so that every link (<a ...) whose href starts with "http" is given a target="_blank" attribute, unless it already has a target.

The regex I've come up with so far produces the desired output but it is awfully slow ... and it looks far from being elegant. It looks for links, extracts link targets and remaining attributes, and checks that the remaining parts don't already contain the word target:

 $test=~s/<a ([^t|>]*?[^a|>]*?[^r|>]*?[^g|>]*?[^e|>]*?[^t|>]*?)href=("??http:[^"]*?" ??)([^t|>]*?[^a|>]*?[^r|>]*?[^g|>]*?[^e|>]*?[^t|>]*?)>/<a $1 href=$2$3 target="_blank">/gosi;

I want to improve my regex but I wonder how. What I didn't find anywhere in documentation and tutorials is how to match anything (like .*?) unless it contains a certain expression (target). Maybe I'd better use HTML::Parser in this case, but I still want to know how it's possible to optimize the expression.

fraktalisman keeps rolling


Re: Regexp: Match anything except a certain word by wfsp (Abbot) on Apr 21, 2006 at 18:14 UTC
I agree with santonegro that a parser makes this a lot simpler. Let someone else do the heavy lifting. :-)

    #!/bin/perl5
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html;
    {
        local $/;
        $html = <DATA>;
    }

    my $tp = HTML::TokeParser::Simple->new(\$html)
        or die "Couldn't parse string: $!";

    while (my $t = $tp->get_token) {
        if (
            $t->is_start_tag('a') and
            $t->get_attr('href') =~ /^http/ and
            not $t->get_attr('target')
        ) {
            $t->set_attr('target', '_blank');
        }
        print $t->as_is;
    }
    __DATA__
    <a href="http://here.com" target="_blank">here</a>
    <a href="http://there.com">there</a>
    <a href="http://everywhere.com" target="foo">everywhere</a>
    <a href="local.html">local</a>

output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" parse_.pl
    <a href="http://here.com" target="_blank">here</a>
    <a href="http://there.com" target="_blank">there</a>
    <a href="http://everywhere.com" target="foo">everywhere</a>
    <a href="local.html">local</a>
    > Terminated with exit code 0.

Update: Covered the "unless they already have a target" condition as pointed out by ikegami below. I would argue that fixing/changing this script is easier than fixing a regex (but I would say that, wouldn't I :-))
Re^2: Regexp: Match anything except a certain word by ikegami (Patriarch) on Apr 21, 2006 at 19:21 UTC
Close. You missed the "unless they already have a target" condition specified by the OP. For example, `<a href="http://example.com" target="foo">here</a>` shouldn't change, but it becomes `<a href="http://example.com" target="_blank">here</a>`.

Update: Added example.
Re: Regexp: Match anything except a certain word by Roy Johnson (Monsignor) on Apr 21, 2006 at 17:15 UTC
See the "Matching a pattern that doesn't include another pattern" section of the Using Look-ahead and Look-behind tutorial.

Caution: Contents may have been coded under pressure.
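For readers who do not have the tutorial at hand, here is a minimal sketch of the idiom that section title refers to, assuming the usual form: check at every position that the unwanted word does not start there, then consume one character. The helper name `lacks_word` is made up for this illustration and does not come from the tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text contains no occurrence of $word, else 0.
sub lacks_word {
    my ($text, $word) = @_;
    # (?!\Q$word\E) asserts that $word does not start at the current position;
    # the following "." then consumes one character. \A and \z force the
    # whole string to be made up of such "safe" characters.
    return $text =~ /\A(?:(?!\Q$word\E).)*\z/s ? 1 : 0;
}

print lacks_word('href="x.html" class="ext"',  'target'), "\n";  # prints 1
print lacks_word('href="x.html" target="foo"', 'target'), "\n";  # prints 0
```

The `\Q...\E` quoting is there so that a word containing regex metacharacters is still treated literally.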
Re: Regexp: Match anything except a certain word by santonegro (Scribe) on Apr 21, 2006 at 17:54 UTC
"I want to modify a string of HTML code"

use HTML::Tree
Re: Regexp: Match anything except a certain word by TedPride (Priest) on Apr 21, 2006 at 18:25 UTC
The following seems to work:

    $_ = join '', <DATA>;
    s/(<a href="https?:\/\/(?:(?!target=).)*?)>/$1 target="_blank">/igs;
    print;
    __DATA__
    <a href="dsfsdf">sdfdsf</a>
    <a href="http://dsfsdf">sdfdsf</a>
    <a href="https://dsfsdf">sdfdsf</a>
    <a href="http://dsfsdf" target="_blank">sdfdsf</a>
    <a href="http://dsfsdf" >sdfdsf</a>
Re^2: Regexp: Match anything except a certain word by ikegami (Patriarch) on Apr 21, 2006 at 19:23 UTC
You're assuming the target will come after the href, but the OP allowed for any order. See my earlier post for the fix.
Re: Regexp: Match anything except a certain word by ikegami (Patriarch) on Apr 21, 2006 at 17:06 UTC
    $test =~ s/(<a\s(?:(?!target=)(?!>).)*)(?=>)/$1 target="_blank"/sgi;

Tested.
Re^2: Regexp: Match anything except a certain word by fraktalisman (Hermit) on Apr 21, 2006 at 17:19 UTC
Thanks! This looks good, although it does not check for `http`, so it also adds the target to local links. But I should be able to put the pieces together now.

fraktalisman keeps rolling
Re^3: Regexp: Match anything except a certain word by ikegami (Patriarch) on Apr 21, 2006 at 17:26 UTC
Oops, I read the problem too fast. Fix:

    $test =~ s{
        (
            <a\s
            (?:(?!target=|>).)*
            href="http://
            (?:(?!target=|>).)*
        )
        (?=>)
    }{$1 target="_blank"}xsgi;

Alternatively, the following might be faster, especially considering `href` is usually the first attribute:

    $test =~ s{
        (
            <a\s
            (?:(?!href=|target=|>).)*
            href="http://
            (?:(?!target=|>).)*
        )
        (?=>)
    }{$1 target="_blank"}xsgi;

Tested.

Update: Collapsed `(?!..re1..)(?!..re2..)` into `(?!..re1..|..re2..)`
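To see how the second substitution behaves on the cases discussed in this thread, here is a small sketch that wraps it in a sub and runs it over a few sample tags. The sub name `add_blank_target` and the sample URLs are made up for the illustration; the pattern itself is the one posted above, with comments added.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post above.
sub add_blank_target {
    my ($html) = @_;
    $html =~ s{
        (
            <a\s
            (?:(?!href=|target=|>).)*   # attributes before href, none of them a target
            href="http://
            (?:(?!target=|>).)*         # rest of the tag, still without a target
        )
        (?=>)                           # stop right before the closing bracket
    }{$1 target="_blank"}xsgi;
    return $html;
}

my @samples = (
    '<a href="http://there.example">there</a>',               # gets a target
    '<a class="ext" href="http://there.example">there</a>',   # href not first: gets a target
    '<a href="http://there.example" target="foo">there</a>',  # target after href: unchanged
    '<a target="foo" href="http://there.example">there</a>',  # target before href: unchanged
    '<a href="local.html">local</a>',                          # local link: unchanged
);

print add_blank_target($_), "\n" for @samples;
```

Note that the pattern as posted requires a literal `http://` in double quotes, so `https://` links and single-quoted or unquoted href values are left alone; whether that is wanted depends on the input.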
Re: Regexp: Match anything except a certain word by SamCG (Hermit) on Apr 21, 2006 at 17:07 UTC
Sounds like you want a negative look-ahead? Aren't those expressed something like `$test=~/<a.+?(?!target).+?>/;`? Sort of guessing at the form, but I think that's what you're after...

Update: this has a problem because of the `.+`.

-----------------
s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;
Re^2: Regexp: Match anything except a certain word by ikegami (Patriarch) on Apr 21, 2006 at 17:15 UTC
That's not how you use the negative lookahead. It won't work. For example,

    '<a href="..." target="...">' =~ m{
        <a          # Matches '<a'.
        .+?         # Matches ' '.
        (?!target)  # Matches. (/\Gtarget/ doesn't match.)
        .+?         # Matches 'href="..." target="..."'.
        >           # Matches '>'.
    }isx;

and

    '<a target="..." href="...">' =~ m{
        <a          # Matches '<a'.
        .+?         # Matches ' t'.
        (?!target)  # Matches. (/\Gtarget/ doesn't match.)
        .+?         # Matches 'arget="..." href="..."'.
        >           # Matches '>'.
    }isx;

The common use of a negative lookahead is

    /(?:(?!$re).)*/

See my earlier post for the fix.
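The difference can be checked directly. The sketch below compares the pattern from SamCG's post with one built on the `(?:(?!$re).)*` idiom; the variable names `$single_check` and `$per_char` and the sub `tag_matches` are invented for this illustration.

```perl
use strict;
use warnings;

# A single lookahead checks only one position, so it is easy to satisfy.
my $single_check = qr/<a.+?(?!target).+?>/is;

# The lookahead is repeated before every character up to the closing bracket.
my $per_char = qr/<a\s(?:(?!target=|>).)*>/is;

# Hypothetical helper: returns 1 if $tag matches $re, else 0.
sub tag_matches {
    my ($tag, $re) = @_;
    return $tag =~ $re ? 1 : 0;
}

my $with_target    = '<a href="..." target="...">';
my $without_target = '<a href="...">';

print tag_matches($with_target,    $single_check), "\n";  # prints 1: target not rejected
print tag_matches($with_target,    $per_char),     "\n";  # prints 0: target rejected
print tag_matches($without_target, $per_char),     "\n";  # prints 1
```

Only the second pattern treats "contains target" as a reason to fail, because the lookahead has to succeed at every position it walks over, not just at one.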
Re: Regexp: Match anything except a certain word by fraktalisman (Hermit) on May 09, 2006 at 13:44 UTC
And finally, what did I do? `Use HTML::Parser`. And learn some more about regular expressions and how to understand HTML::Parser better.

fraktalisman keeps rolling

