Find and replace with regex

JayBee has asked for the wisdom of the Perl Monks concerning the following question:

I've just started using a WYSIWYG tool and the images are broken links unless I include the "http://domain.com/" before them. I've been using the short-cut <img src="directory/image.jpg" /> method.

So to fix the bug I tried a simple regex during the edit: $content=~s{(img src=")}{$1$httpURL}g; and planned to remove the prefix again after the form is posted/submitted.
Later I realized that remote URLs might be used as well, and I don't want to put two or more http URLs back to back (they don't belong together).

I basically need a good way to search and replace only image sources that don't contain "http://" already. Here's a couple of things I've tried:

my $http='http://domain.com/';
my $sample='<img src="pics/local.jpg" alt="" />
<img src="http://remote.com/pics/remote.jpg" />
more confusion, just in case: 3:00pm 12/12/12
other urls <a href="http://fake.com">
http://fake.com</a>';

#-- THIS -------------------
$sample=~s{(img src=")(\w?)[^:]}{$1$http$2}g;
# something from my book, POSIX ? not very clear
#-- OR ---------------------
$sample=~s/(img src=")(\w.+?)(:?)/"$1$http$2"if(!$3)/eg;
#-- END --------------------

I also failed on other discarded variations, so I gave up and need your help.
Any help appreciated in advance.


Replies are listed 'Best First'.
Re: Find and replace with regex by wfsp (Abbot) on Dec 31, 2007 at 10:14 UTC
I agree with FunkyMonk, parsing html with a reg ex is fraught with danger. Using a parser can take a lot of the pain out of it. However much "confusion" there is in the html (and there is a lot of it about!) this snippet may help ease the way.

#!/bin/perl5
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(\*DATA);
my $http = q{http://domain.com};

while (my $t = $p->get_token){
    if (
        $t->is_start_tag(q{img}) and
        $t->get_attr(q{src})
    ) {
        my $src = $t->get_attr(q{src});
        if ($src !~ m|^http://|){
            $src = join '/', $http, $src;
            $t->set_attr(src => $src);
        }
    }
    print $t->as_is;
}
__DATA__
<img src="pics/local.jpg" alt="" />
<img src="http://remote.com/pics/remote.jpg" />
more confusion, just in case: 3:00pm 12/12/12
other urls <a href="http://fake.com">
http://fake.com</a>

output:

<img src="http://domain.com/pics/local.jpg" alt="" />
<img src="http://remote.com/pics/remote.jpg" />
more confusion, just in case: 3:00pm 12/12/12
other urls <a href="http://fake.com">
http://fake.com</a>
Re: Find and replace with regex by FunkyMonk (Chancellor) on Dec 31, 2007 at 10:04 UTC
You could use a negative look-ahead assertion (see perlre)

$sample =~ s{(img src=")(?!http)}{$1$http}g;

which gives the following output:

<img src="http://domain.com/pics/local.jpg" alt="" />
<img src="http://remote.com/pics/remote.jpg" />
more confusion, just in case: 3:00pm 12/12/12
other urls <a href="http://fake.com">
http://fake.com</a>

But it's probably fraught with danger.
Re^2: Find and replace with regex by JayBee (Scribe) on Jan 09, 2008 at 12:05 UTC
Looks good to me. What dangers do you see, besides having a folder or file named "httplaces" for example? I think this solution works best. :)
Re^3: Find and replace with regex by FunkyMonk (Chancellor) on Jan 09, 2008 at 15:04 UTC
It's just too fragile. Consider `<img src="..." />` as just one simple example that breaks the previous regexp.
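To make the fragility concrete, here is a small sketch that runs the look-ahead substitution over a few variations of the same tag. The sub name add_prefix and the test inputs are made up for illustration; only the substitution itself comes from the thread.

```perl
use strict;
use warnings;

my $http = 'http://domain.com/';

# The look-ahead substitution from earlier in the thread, wrapped in a sub.
# add_prefix is a hypothetical name used only for this demonstration.
sub add_prefix {
    my ($html) = @_;
    $html =~ s{(img src=")(?!http)}{$1$http}g;
    return $html;
}

# Hypothetical inputs: each one is a relative image that ought to be fixed.
my @cases = (
    '<img src="pics/a.jpg" />',         # the exact form the regex expects
    '<img  src="pics/a.jpg" />',        # two spaces after img
    '<img alt="" src="pics/a.jpg" />',  # another attribute before src
    "<img src='pics/a.jpg' />",         # single quotes
    '<IMG SRC="pics/a.jpg" />',         # upper-case tag and attribute
);

for my $case (@cases) {
    my $fixed = add_prefix($case);
    printf "%-8s %s\n", ($fixed eq $case ? 'MISSED' : 'fixed'), $case;
}
```

Only the first case is rewritten; the other four are all valid HTML that the pattern silently skips. A relative path that happens to start with "http" (the "httplaces" folder mentioned above) would be skipped as well.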
Re: Find and replace with regex by ysth (Canon) on Dec 31, 2007 at 10:10 UTC
It's not clear to me if this is a one-time fix or something that will be applied to incoming HTML data on an ongoing basis. If it's one time, `s{<img src="(?!/)(?!http:)}{<img src="http://domain.com/}g` sounds like what you want. (Suppress replacement for both src="http:... and src="/...)

Do you really want the full http://domain.com/ there? Or would an absolute path beginning / be good enough? The latter might save you headaches if you start using multiple domains or secure pages (where some browsers will complain if there are non-secure references on the page).
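A minimal sketch of that double look-ahead applied to a few tags. The sub name absolutize_img_src and the variable $base are invented here, and the https case is an added assumption that is not in the one-liner above.

```perl
use strict;
use warnings;

# Assumed site prefix, taken from the example domain used in the thread.
my $base = 'http://domain.com/';

# absolutize_img_src is a hypothetical helper name.
sub absolutize_img_src {
    my ($html) = @_;
    # Leave src values alone if they start with "/", "http:" or "https:".
    # Matching https: as well is an assumption added for this sketch.
    $html =~ s{<img src="(?!/)(?!https?:)}{<img src="$base}g;
    return $html;
}

print absolutize_img_src(<<'HTML');
<img src="pics/local.jpg" alt="" />
<img src="/pics/rooted.jpg" />
<img src="http://remote.com/pics/remote.jpg" />
HTML
```

Only the first tag gets the prefix; the root-relative and remote sources pass through unchanged. Swapping $base for a plain "/" would give the root-relative variant suggested above, provided the relative paths are relative to the site root in the first place.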
Re: Find and replace with regex by snopal (Pilgrim) on Dec 31, 2007 at 14:53 UTC
A simpler option: you don't need the full domain at all if you use a path that starts at the site root. Your example shows <img src="directory/image.jpg" />, which is a relative path and so breaks when the page moves to a different level of the page hierarchy. The simple addition of a leading slash, as in "/directory/image.jpg", may be all the change you need.
Re^2: Find and replace with regex by JayBee (Scribe) on Jan 09, 2008 at 12:11 UTC
Yes, I'm familiar with that style. If only it were that simple; the WYSIWYG didn't like it either, so no go. ++ to you anyway :)
Re: Find and replace with regex by Anonymous Monk on Jan 04, 2008 at 10:07 UTC
`$domain="http://www.google.com"; s|(<img [^>]*?src=")(?!$domain)(.+?)"|$1$domain/$2"|g;`
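One way the pattern above might be tightened, offered as a sketch only: the look-ahead here skips any http(s) URL or root-relative path rather than just the one domain, and the capture stops at the closing quote. The sub name prefix_img_src is made up, and the look-ahead change is an assumption about what was intended.

```perl
use strict;
use warnings;

my $domain = 'http://www.google.com';   # placeholder domain from the post above

# prefix_img_src is a hypothetical helper name.
sub prefix_img_src {
    my ($html) = @_;
    # [^>]*? allows other attributes between "<img" and src=.
    # (?!https?://|/) skips remote URLs and root-relative paths.
    # [^"]+ captures the src value up to its closing quote.
    $html =~ s{(<img\s[^>]*?src=")(?!https?://|/)([^"]+)"}{$1$domain/$2"}gi;
    return $html;
}

print prefix_img_src(qq{<img alt="" src="pics/local.jpg" />\n});
print prefix_img_src(qq{<img src="http://remote.com/pics/remote.jpg" />\n});
```

This still only handles double-quoted attributes, and an attribute whose name merely ends in src (data-src, say) can still be matched, so the parser approach remains the safer route for HTML that is not under your control.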

