
Find and replace with regex

by JayBee (Scribe)
on Dec 31, 2007 at 09:52 UTC

JayBee has asked for the wisdom of the Perl Monks concerning the following question:

I've just started using a WYSIWYG tool and the images are broken links unless I include the "http://domain.com/" before them. I've been using the short-cut <img src="directory/image.jpg" /> method.

So to fix this I tried a simple regex during the edit: $content=~s{(img src=")}{$1$httpURL}g; and planned to remove the prefix again after the form is submitted.
Later I realized that remote URLs might be used as well, and I don't want to put two or more http URLs back to back (as they don't belong).

I basically need a good way to search and replace only image sources that don't contain "http://" already. Here are a couple of things I've tried:

my $http='http://domain.com/';
my $sample='<img src="pics/local.jpg" alt="" /> <img src="http://remote.com/pics/remote.jpg" /> more confusion, just in case: 3:00pm 12/12/12 other urls <a href="http://fake.com"> http://fake.com</a>';

#-- THIS -------------------
$sample=~s{(img src=")(\w?)[^:]}{$1$http$2}g; # something from my book, POSIX ? not very clear
#-- OR ---------------------
$sample=~s/(img src=")(\w.+?)(:?)/"$1$http$2"if(!$3)/eg;
#-- END --------------------

I also failed on other discarded variations, so I gave up and need your help.
Any help appreciated in advance.

Replies are listed 'Best First'.
Re: Find and replace with regex
by wfsp (Abbot) on Dec 31, 2007 at 10:14 UTC
    I agree with FunkyMonk: parsing HTML with a regex is fraught with danger.

    Using a parser can take a lot of the pain out of it. However much "confusion" there is in the HTML (and there is a lot of it about!), this snippet may help ease the way.

    #!/bin/perl5
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(\*DATA);
    my $http = q{http://domain.com};

    while (my $t = $p->get_token){
      if ( $t->is_start_tag(q{img}) and $t->get_attr(q{src}) ) {
        my $src = $t->get_attr(q{src});
        if ($src !~ m|^http://|){
          $src = join '/', $http, $src;
          $t->set_attr(src => $src);
        }
      }
      print $t->as_is;
    }
    __DATA__
    <img src="pics/local.jpg" alt="" /> <img src="http://remote.com/pics/remote.jpg" /> more confusion, just in case: 3:00pm 12/12/12 other urls <a href="http://fake.com"> http://fake.com</a>
    output:
    <img src="http://domain.com/pics/local.jpg" alt="" /> <img src="http://remote.com/pics/remote.jpg" /> more confusion, just in case: 3:00pm 12/12/12 other urls <a href="http://fake.com"> http://fake.com</a>
Re: Find and replace with regex
by FunkyMonk (Chancellor) on Dec 31, 2007 at 10:04 UTC
    You could use a negative look-ahead assertion (see perlre):

    $sample =~ s{(img src=")(?!http)}{$1$http}g;

    which gives the following output:
    <img src="http://domain.com/pics/local.jpg" alt="" /> <img src="http://remote.com/pics/remote.jpg" /> more confusion, just in case: 3:00pm 12/12/12 other urls <a href="http://fake.com"> http://fake.com</a>

    But it's probably fraught with danger.

      Looks good to me. What dangers do you see, besides having a folder or file named "httplaces" for example?
      I think this solution works best. :)
        It's just too fragile. Consider
        <img src="..." />

        as just one simple example that breaks the previous regexp.
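To make the fragility concrete, here is a minimal sketch that runs the look-ahead regexp from above against a few variants of the same tag. The variants are hypothetical illustrations, not the exact markup from the example above; only the first one is in the shape the regexp expects.

```perl
use strict;
use warnings;

my $http = 'http://domain.com/';

# Hypothetical variants of the same local image tag (assumptions, not from the thread).
my @variants = (
    '<img src="pics/local.jpg" />',           # the form the regexp expects
    '<img  src="pics/local.jpg" />',          # two spaces after "img"
    '<img alt="" src="pics/local.jpg" />',    # another attribute before src
    q{<img src='pics/local.jpg' />},          # single-quoted attribute value
    '<IMG SRC="pics/local.jpg" />',           # upper-case tag and attribute
);

my @results;
for my $tag (@variants) {
    # Same substitution as suggested above, applied to a copy of the tag.
    (my $fixed = $tag) =~ s{(img src=")(?!http)}{$1$http}g;
    push @results, $fixed;
    printf "%-8s %s\n", ($fixed eq $tag ? 'MISSED' : 'fixed'), $fixed;
}
```

Only the first variant is rewritten; the other four pass through unchanged, which is the kind of silent miss a parser-based approach avoids.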

Re: Find and replace with regex
by ysth (Canon) on Dec 31, 2007 at 10:10 UTC
    It's not clear to me if this is a one-time fix or something that will be applied to incoming HTML data on an ongoing basis.

    If it's one time, s{<img src="(?!/)(?!http:)}{<img src="http://domain.com/}g sounds like what you want. (Suppress replacement for both src="http:... and src="/...)

    Do you really want the full http://domain.com/ there? Or would an absolute path beginning / be good enough? The latter might save you headaches if you start using multiple domains or secure pages (where some browsers will complain if there are non-secure references on the page).
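A small sketch of both options, assuming a sample string with one relative, one root-relative and one remote image (the /pics/rooted.jpg entry is a hypothetical addition for illustration). The first substitution is the one quoted above; the second uses the same look-aheads but prefixes only a leading slash.

```perl
use strict;
use warnings;

# Sample input; the root-relative image is an assumption added for illustration.
my $sample = '<img src="pics/local.jpg" alt="" /> '
           . '<img src="/pics/rooted.jpg" /> '
           . '<img src="http://remote.com/pics/remote.jpg" />';

# Full-URL prefix, skipping src values that start with "/" or "http:".
(my $full = $sample) =~ s{<img src="(?!/)(?!http:)}{<img src="http://domain.com/}g;

# Root-relative alternative: same look-aheads, but only a leading slash is added.
(my $rooted = $sample) =~ s{<img src="(?!/)(?!http:)}{<img src="/}g;

print "$full\n$rooted\n";
```

In both cases the root-relative and remote images are left alone; only the bare relative path is touched.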

Re: Find and replace with regex
by snopal (Pilgrim) on Dec 31, 2007 at 14:53 UTC

    A simpler option: you don't need the full domain in the path if you reference images from the site root. Your example shows <img src="directory/image.jpg" />, which is a relative path and is not portable between page hierarchies. Simply adding a leading slash, as in "/directory/image.jpg", may be all the change you need.

      Yes, I'm familiar with that style. If only it were that simple; however, the WYSIWYG didn't like it either, so no go. ++ to you anyway :)
Re: Find and replace with regex
by Anonymous Monk on Jan 04, 2008 at 10:07 UTC
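    $domain="http://www.google.com";
    # note: (?!$domain) only skips URLs already on $domain; other http:// URLs would still be prefixed
    s|(<img [^>]*?src=")(?!$domain)(.+?)"|$1$domain/$2"|g;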
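Combining the ideas from this thread, a somewhat more tolerant pattern might look like the sketch below. It assumes double-quoted src values and uses http://domain.com as the prefix; it allows other attributes before src, ignores case, and skips any value that already starts with a URL scheme or a slash. It is still a regex, so the parser approach shown earlier remains the safer choice for arbitrary HTML.

```perl
use strict;
use warnings;

my $domain = 'http://domain.com';
my $sample = '<img src="pics/local.jpg" alt="" /> <img src="http://remote.com/pics/remote.jpg" /> more confusion, just in case: 3:00pm 12/12/12 other urls <a href="http://fake.com"> http://fake.com</a>';

(my $fixed = $sample) =~ s{
    (<img\b[^>]*?\bsrc=")     # capture: tag start up to the opening quote of src
    (?![a-z][a-z0-9+.-]*:)    # skip values that already carry a URL scheme
    (?!/)                     # skip root-relative paths as well
}{$1$domain/}gix;

print "$fixed\n";
```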
