PerlMonks  

Pulling a Page with LWP::UserAgent and fixing URLs?

by MrForsythExeter (Novice)
on Nov 09, 2004 at 11:22 UTC

MrForsythExeter has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. As you can see from my 2 points of experience, I am a rookie in your world, even though I must have been working with Perl for about 3 years now. Anyway, down to business.

I'm currently writing a content management system; not a problem there, as JavaScript and HTMLPad are easy. The problem is that I would like people to put in a URL and then have the software pull that page (easy), but then sort out all the links for images, background tags etc. I have already handled the style sheet by parsing for it, then collecting it with UA and inserting it into the page contents with the correct tags.

My problems start when someone puts in a URL like http://www.xxxxxx.co.uk/newsletter.htm. For a start, the page contains links like src="http://www.xxxxxx.com", which is really the same site under a different domain name; or the page doesn't contain the http://www.xxxxx.co.uk at all, just ../../blah.gif or /blah/blah/funny.gif.

Here is what I have already. As I know your wisdom is better in regexps than mine, I'm hoping you can help or point me in the direction of a Perl module. As I'm a rookie, I'm sure you will find loads wrong with the code from the word go, but here is a snippet:
#Pull the page and sort it out.. then display edit window
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
$ua->agent("");
my $content = $ua->get($fields{'url'})->content();
$fields{'url'} =~ s/http:\/\/(.*?)\/.*/$1/ig;
$content =~ s/src="/src="http:\/\/$fields{'url'}\//sig;

#Handle Styles
$content =~ m/<link href="(.*?)"/ig;
my $styleurl = $1;
my $styles;
if ($styleurl ne ''){
    $styles = $ua->get($fields{'url'}.'/'.$styleurl)->content();
}
$styles = '<style type="text/css"><!--'.$styles.'--></style>';
$content =~ s/<\/head>/<\/head>$styles/sig;
$tpl_inner = &gettpl($skindir,'pointblank_templateadd2.htm');
$tpl_inner =~ s/<!-- Content -->/$content/ig;

Replies are listed 'Best First'.
Re: Pulling a Page with LWP::UserAgent and fixing URLs?
by Mutant (Priest) on Nov 09, 2004 at 11:53 UTC

    Firstly, don't try to parse HTML yourself. Use one of the many CPAN modules available. I prefer HTML::TokeParser::Simple.

    I'm not exactly sure what you're trying to do with the image tags? Are you trying to fix broken links, or make them absolute instead of relative?

      Yeah, sorry. I'm trying to make them absolute, so for example
      src="http://www.xxxxx.co.uk/images/uploads/blah.gif"
      src="../../blah.gif"
      src="/images/uploads/blah.gif"
      all become
      src="http://www.xxxx.com/images/uploads/blah.gif"
      Hope that helps you understand me.
        Here's some code to do some of that:
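        use HTML::TokeParser::Simple;

        # $html (the page source) and $new_url (the replacement prefix) are
        # not defined in this snippet; set them before running it.
        my $parser = HTML::TokeParser::Simple->new(string => $html);
        my $new_html;
        while ( my $token = $parser->get_token ) {
            for my $attr ( 'src', 'href' ) {
                # Skip tokens that do not carry this attribute
                my $value = $token->get_attr($attr) or next;
                # Only touch links that point at image or Flash files
                next unless $value =~ /\.(gif|jpe?g|png|swf)$/;
                # Swap the final "/" before the file name for $new_url
                $value =~ s/\/([\.[:word:]\-]+?)$/$new_url$1/;
                $token->set_attr($attr,$value);
            }
            # Append every token, changed or not, to the output
            $new_html .= $token->as_is;
        }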
        Then your result is in $new_html. Of course, this won't handle everything, since you could have references to images, etc in Javascript, for example.
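
For the relative-path cases raised in the question (../../blah.gif and /blah/blah/funny.gif), a possible variation on the loop above is to let the URI module do the resolving instead of a regex. This is only a sketch under assumptions: the sub name absolutize_html, the example.com addresses and the choice of the src, href and background attributes are illustrative, not something from the thread. It rewrites every such attribute, not only image links, and like the code above it will not see URLs buried in JavaScript.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Hypothetical helper: rewrite src/href/background attributes so that
# they are absolute. $html is the page source, $base is the URL the
# page was fetched from.
sub absolutize_html {
    my ( $html, $base ) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    my $out    = '';
    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag ) {
            for my $attr (qw(src href background)) {
                my $value = $token->get_attr($attr);
                next unless defined $value && length $value;
                # new_abs resolves relative links against $base and
                # leaves links that are already absolute alone
                my $abs = URI->new_abs( $value, $base )->as_string;
                $token->set_attr( $attr, $abs );
            }
        }
        $out .= $token->as_is;
    }
    return $out;
}

# Example call with made-up input
print absolutize_html(
    '<body background="/bg.gif"><img src="../../blah.gif"></body>',
    'http://www.example.com/a/b/page.htm',
), "\n";
```

Mapping the .co.uk and .com names onto a single host is a separate step that URI does not do for you; it would need its own rule for which host names count as the same site.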
        ok then

        use URI::URL;
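
To show what that suggestion buys you, here is a minimal sketch using the URI module (URI::URL is its older, backwards-compatible interface). The base address and the sub name absolutize are made up for illustration; the relative links are the ones from the question.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve one link against the page's own URL.
sub absolutize {
    my ( $link, $base ) = @_;
    return URI->new_abs( $link, $base )->as_string;
}

# Assumed example address of the fetched page
my $base = 'http://www.example.com/news/2004/newsletter.htm';

for my $link ( '../../blah.gif', '/blah/blah/funny.gif', 'pic.gif',
    'http://www.example.com/logo.gif' )
{
    print "$link => ", absolutize( $link, $base ), "\n";
}
```

Links that are already absolute come back unchanged, so the same call can be applied to every link on the page.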

        Teabag

        -- Siggy Played Guitar
        Sure there's more than one way, but one just needs one anyway - Teabag

        As suggested before, take a look at URI, and if you do, please please please don't overlook the nifty (yet annoying ;)) <base... /> tag.
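
The reason <base> matters: when a page declares one, relative links are meant to be resolved against it rather than against the address the page was fetched from. A rough sketch of picking the right base, assuming HTML::TokeParser::Simple and URI are installed; effective_base is a hypothetical name, and the example.com/example.org addresses are placeholders.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Hypothetical helper: return the URL that relative links should be
# resolved against. That is the document's own <base href="..."> if
# it has one, otherwise the URL the page was fetched from.
sub effective_base {
    my ( $html, $fetched_from ) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('base');
        my $href = $token->get_attr('href');
        # A <base> href may itself be relative, so resolve it too
        return URI->new_abs( $href, $fetched_from )->as_string
            if defined $href && length $href;
    }
    return $fetched_from;
}

print effective_base(
    '<html><head><base href="http://www.example.org/x/"></head></html>',
    'http://www.example.com/p.htm',
), "\n";
```

HTTP::Response also offers a base method that can serve the same purpose when the page comes from LWP; the explicit scan here just keeps the sketch independent of how the page was fetched.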

        --
        b10m

        All code is usually tested, but rarely trusted.
Re: Pulling a Page with LWP::UserAgent and fixing URLs?
by Popcorn Dave (Abbot) on Nov 09, 2004 at 21:16 UTC
    I would definitely use HTML::TokeParser. I used it for parsing news headlines and it made life so much simpler. This node is the parser that I wrote to dump a page into tokens. Hopefully that will get you going in the direction you're after.
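
The linked node is not reproduced here, but a token dump along those lines might look roughly like the following sketch. It assumes HTML::TokeParser is installed; dump_tokens and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return one line per start tag, end tag and
# non-blank text token found in $html.
sub dump_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my @lines;
    while ( my $token = $parser->get_token ) {
        my ( $type, @rest ) = @$token;
        if ( $type eq 'S' ) {    # start tag: tag name, attribute hash
            my ( $tag, $attr ) = @rest;
            my $attrs = join ' ',
                map { "$_=$attr->{$_}" } sort keys %$attr;
            push @lines, "start <$tag> $attrs";
        }
        elsif ( $type eq 'E' ) {    # end tag: tag name
            push @lines, "end </$rest[0]>";
        }
        elsif ( $type eq 'T' ) {    # text: skip whitespace-only runs
            push @lines, "text $rest[0]" if $rest[0] =~ /\S/;
        }
    }
    return @lines;
}

print "$_\n" for dump_tokens('<p><img src="../a.gif" alt="x">Hi</p>');
```

Seeing the start tags with their attributes laid out like this makes it easier to decide which ones (src, href, background and so on) need their URLs fixed.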

    Hope that helps!

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
