
Re^2: Pulling a Page with LWP::UserAgent and fixing URLs?

by MrForsythExeter (Novice)
on Nov 09, 2004 at 12:03 UTC


in reply to Re: Pulling a Page with LWP::UserAgent and fixing URLs?
in thread Pulling a Page with LWP::UserAgent and fixing URLs?

Yeah, sorry. I'm trying to make them absolute, so for example src="http://www.xxxxx.co.uk/images/uploads/blah.gif", src="../../blah.gif" and src="/images/uploads/blah.gif" all become src="http://www.xxxx.com/images/uploads/blah.gif". Hope that helps you understand what I mean.

Replies are listed 'Best First'.
Re^3: Pulling a Page with LWP::UserAgent and fixing URLs?
by Mutant (Priest) on Nov 09, 2004 at 12:09 UTC
    Here's some code to do some of that:
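    use HTML::TokeParser::Simple;

    # $html (the fetched page) and $new_url (the absolute prefix to
    # splice in) are assumed to be set elsewhere.
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $new_html;
    while ( my $token = $parser->get_token ) {
        for my $attr ( 'src', 'href' ) {
            my $value = $token->get_attr($attr);
            next unless $value;
            # Only touch links to images and Flash files.
            next unless $value =~ /\.(gif|jpe?g|png|swf)$/;
            # Swap the last "/" before the file name for $new_url.
            $value =~ s/\/([\.[:word:]\-]+?)$/$new_url$1/;
            $token->set_attr($attr, $value);
        }
        $new_html .= $token->as_is;
    }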
    Then your result is in $new_html. Of course, this won't handle everything, since you could have references to images etc. inside JavaScript, for example.
Re^3: Pulling a Page with LWP::UserAgent and fixing URLs?
by teabag (Pilgrim) on Nov 09, 2004 at 12:11 UTC
    ok then

    use URI::URL;

    Teabag

    -- Siggy Played Guitar
    Sure there's more than one way, but one just needs one anyway - Teabag
      URI::URL is only there for old stuff (backward compatibility and all that); it looks like URI is the one to use. But with URI, are you saying I should parse out all the URLs, fix each one, and put them back in? Or could I do a regexp with /e on the end and do it all in one line?
        I mean using it to fix the URLs, i.e. to convert relative paths to absolute ones. Just bring in TokeParser as suggested and you're there.

        something like this:
        (btw, not my code; it's from http://perl.com):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP;
        use URI;

        my $browser = LWP::UserAgent->new;
        my $url = 'http://www.cpan.org/RECENT.html';
        my $response = $browser->get($url);
        die "Can't get $url -- ", $response->status_line
            unless $response->is_success;

        my $html = $response->content;
        while ( $html =~ m/<A HREF=\"(.*?)\"/g ) {
            print URI->new_abs( $1, $response->base ), "\n";
        }
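
On the "regexp with /e" question: that can work for simple pages. Here is a minimal sketch, assuming every src/href value is double-quoted. The base URL and the sample markup are made up for illustration; in the script above you would pass $response->base instead. A real parser such as HTML::TokeParser::Simple will cope with messier HTML than this regex does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical base: the URL the page was fetched from.
my $base = 'http://www.example.com/images/uploads/index.html';

# Hypothetical sample markup: absolute, parent-relative,
# root-relative and plain relative links.
my $html = join "\n",
    '<img src="http://www.example.com/images/uploads/blah.gif">',
    '<img src="../../blah.gif">',
    '<img src="/images/uploads/blah.gif">',
    '<a href="page2.html">next</a>';

# Rewrite every double-quoted src/href value.
# /e evaluates the replacement as Perl code, /g repeats the
# substitution, /i makes SRC and HREF match as well.
(my $fixed = $html) =~ s{\b(src|href)="([^"]*)"}
                        { $1 . '="' . URI->new_abs($2, $base) . '"' }gie;

print "$fixed\n";
```

Note that "../../blah.gif" is resolved against wherever the page actually lives, so with this base it ends up at the site root rather than under /images/uploads/.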

        teabag

        -- Siggy Played Guitar
        Sure there's more than one way, but one just needs one anyway - Teabag
Re: Pulling a Page with LWP::UserAgent and fixing URLs?
by b10m (Vicar) on Nov 09, 2004 at 19:07 UTC

    As suggested before, take a look at URI, and if you do, please please please don't overlook the nifty -yet annoying ;)- <base... /> tag.
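
    To show why <base> matters, here is a rough sketch that picks the base to resolve against: the <base href> if the page has one, otherwise the URL the page came from. The helper name effective_base, the URLs and the sample HTML are all made up, and the regex assumes a quoted href. As far as I know, $response->base already accounts for a <base> in the head when the page is fetched through LWP, so this is mostly relevant for HTML you obtained some other way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Return the URL that relative links should be resolved against.
# $page_url is the address the HTML was fetched from (assumed known).
sub effective_base {
    my ($html, $page_url) = @_;
    # Naive match for <base href="..."> with a quoted value.
    if ( $html =~ m{<base\b[^>]*\bhref\s*=\s*(["'])(.*?)\1}is ) {
        # The base href may itself be relative, so resolve it
        # against the page URL first.
        return URI->new_abs($2, $page_url)->as_string;
    }
    return $page_url;
}

my $page_url     = 'http://www.example.com/a/b/page.html';   # hypothetical
my $with_base    = '<html><head><base href="http://cdn.example.com/static/" /></head></html>';
my $without_base = '<html><head><title>x</title></head></html>';

for my $html ( $with_base, $without_base ) {
    my $base = effective_base($html, $page_url);
    my $abs  = URI->new_abs('img/logo.gif', $base);
    print "$abs\n";
}
```

    The same relative link lands on a different host depending on whether the page declares a <base>, which is exactly the annoying part.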

    --
    b10m

    All code is usually tested, but rarely trusted.
