Pulling a Page with LWP::UserAgent and fixing URLs?

MrForsythExeter has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, as you can see from my 2 points of experience, I am a rookie in your world, even though I must have been working with Perl for about 3 years now. Anyway, down to business. I'm currently writing a content management system; no problem there, as JavaScript and HTMLPad are easy. The problem is that I would like people to put in a URL and have the software pull that page (easy), but then sort out all the links for images, background tags, etc. I have already handled the style sheet by parsing for it, then collecting it with UA and inserting it into the page contents with the correct tags.

My problems start when someone puts in a URL like http://www.xxxxxx.co.uk/newsletter.htm. For a start, the page contains links like src="http://www.xxxxxx.com", which is really the same site under a different domain name; or the page doesn't contain the http://www.xxxxx.co.uk at all, just ../../blah.gif or /blah/blah/funny.gif.

Here is what I have already. As I know your wisdom is better in regexps than mine, I'm hoping you can help or point me in the direction of a Perl module. As I'm a rookie, I'm sure you will find loads wrong with the code from the word go, but here is a snippet:

#Pull the page and sort it out.. then display edit window
            use LWP::UserAgent;
            
            my $ua = LWP::UserAgent->new();
            $ua->agent("");
                    
            my $content =  $ua->get($fields{'url'})->content();
            
            $fields{'url'} =~ s/http:\/\/(.*?)\/.*/$1/ig;
            
            $content =~ s/src="/src="http:\/\/$fields{'url'}\//sig;
                        
            #Handle Styles
            $content =~ m/<link href="(.*?)"/ig;
            my $styleurl = $1;
            my $styles;
            if ($styleurl ne ''){
                $styles =  $ua->get($fields{'url'}.'/'.$styleurl)->content();
            }
            $styles = '<style type="text/css"><!--'.$styles.'--></style>';
            
            $content =~ s/<\/head>/<\/head>$styles/sig;
            
            $tpl_inner = &gettpl($skindir,'pointblank_templateadd2.htm');
            $tpl_inner =~ s/<!-- Content -->/$content/ig;

Replies are listed 'Best First'.
Re: Pulling a Page with LWP::UserAgent and fixing URLs? by Mutant (Priest) on Nov 09, 2004 at 11:53 UTC
Firstly, don't try to parse HTML yourself. Use one of the many CPAN modules available. I prefer HTML::TokeParser::Simple. I'm not exactly sure what you're trying to do with the image tags. Are you trying to fix broken links, or make them absolute instead of relative?
Re^2: Pulling a Page with LWP::UserAgent and fixing URLs? by MrForsythExeter (Novice) on Nov 09, 2004 at 12:03 UTC
Yeah, sorry, I'm trying to make them absolute, so for example src="http://www.xxxxx.co.uk/images/uploads/blah.gif", src="../../blah.gif" and src="/images/uploads/blah.gif" all become src="http://www.xxxx.com/images/uploads/blah.gif". Hope that helps you understand me.
Re^3: Pulling a Page with LWP::UserAgent and fixing URLs? by Mutant (Priest) on Nov 09, 2004 at 12:09 UTC
Here's some code to do some of that:

            use HTML::TokeParser::Simple;

            # $html (the fetched page) and $new_url (the prefix to put in
            # front of each file name) are not defined in this snippet
            my $parser = HTML::TokeParser::Simple->new(string => $html);
            my $new_html;
            while ( my $token = $parser->get_token ) {
                for ( 'src', 'href' ) {
                    my $attr = $_;
                    my $value;
                    next unless $value = $token->get_attr($attr);
                    next unless $value =~ /\.(gif|jpe?g|png|swf)$/;
                    $value =~ s/\/([\.[:word:]\-]+?)$/$new_url$1/;
                    $token->set_attr($attr,$value);
                }
                $new_html .= $token->as_is;
            }

Then your result is in $new_html. Of course, this won't handle everything, since you could have references to images, etc. in JavaScript, for example.
Re^3: Pulling a Page with LWP::UserAgent and fixing URLs? by teabag (Pilgrim) on Nov 09, 2004 at 12:11 UTC
OK then: use URI::URL;

Teabag -- Siggy Played Guitar
Sure there's more than one way, but one just needs one anyway - Teabag
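
To spell out what that hint points at: a minimal sketch of resolving links against the page URL with the URI module (URI::URL is the older interface to the same distribution). The base URL, the sample links, the helper name `absolutize` and the alias table `%canonical_host` are all made up for illustration, since the thread only shows masked domains.

```perl
use strict;
use warnings;
use URI;

# Assumed alias table (hypothetical): hosts that serve the same site,
# mapped to the one host name all links should end up on.
my %canonical_host = ( 'www.example.co.uk' => 'www.example.com' );

# Resolve $link against $base_url, then swap an aliased host for the
# preferred one. Non-http(s) links such as mailto: are returned untouched.
sub absolutize {
    my ($link, $base_url) = @_;
    my $uri    = URI->new_abs($link, $base_url);   # handles ../ and leading /
    my $scheme = $uri->scheme || '';
    if ($scheme =~ /^https?$/) {
        my $preferred = $canonical_host{ lc $uri->host };
        $uri->host($preferred) if defined $preferred;
    }
    return $uri->as_string;
}

my $base = 'http://www.example.co.uk/news/2004/newsletter.htm';
print absolutize($_, $base), "\n" for (
    '../../blah.gif',                                    # http://www.example.com/blah.gif
    '/images/uploads/blah.gif',                          # http://www.example.com/images/uploads/blah.gif
    'http://www.example.co.uk/images/uploads/blah.gif',  # http://www.example.com/images/uploads/blah.gif
);
```

The alias step is optional: without it, `URI->new_abs` alone already turns all three forms into absolute URLs on whatever host the page was fetched from.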
Re^4: Pulling a Page with LWP::UserAgent and fixing URLs? by MrForsythExeter (Novice) on Nov 09, 2004 at 15:08 UTC
Re^5: Pulling a Page with LWP::UserAgent and fixing URLs? by teabag (Pilgrim) on Nov 10, 2004 at 14:30 UTC
Re: Pulling a Page with LWP::UserAgent and fixing URLs? by b10m (Vicar) on Nov 09, 2004 at 19:07 UTC
As suggested before, take a look at URI, and if you do, please please please don't overlook the nifty -yet annoying ;)- <base ... /> tag.

-- b10m
All code is usually tested, but rarely trusted.
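
A rough sketch of how the two suggestions in this thread might fit together: walk the page with HTML::TokeParser::Simple, pick up a `<base href>` if the page declares one, and resolve every `src`, `href` and `background` attribute with URI. The function name `absolutize_html`, the list of attributes and the sample page are assumptions for illustration, not something from the original posts.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Return a copy of $html in which src/href/background attributes are
# absolute. $page_url is the URL the page was fetched from; a <base href>
# in the page overrides it for everything that follows.
sub absolutize_html {
    my ($html, $page_url) = @_;
    my $base   = $page_url;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $out    = '';

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('base')) {
            my $href = $token->get_attr('href');
            # The base href may itself be relative to the page URL.
            $base = URI->new_abs($href, $page_url)->as_string
                if defined $href and length $href;
        }
        elsif ($token->is_start_tag) {
            for my $attr (qw(src href background)) {
                my $value = $token->get_attr($attr);
                next unless defined $value and length $value;
                next if $value =~ /^#/;    # leave same-page anchors alone
                $token->set_attr($attr, URI->new_abs($value, $base)->as_string);
            }
        }
        $out .= $token->as_is;             # untouched tokens pass through verbatim
    }
    return $out;
}

my $page = '<html><head><base href="http://www.example.com/news/"></head>'
         . '<body><img src="../blah.gif"><a href="/x/y.htm">y</a></body></html>';
print absolutize_html($page, 'http://www.example.co.uk/a/b/newsletter.htm'), "\n";
```

Tags that get rewritten come back in the module's normalised form (double-quoted attribute values), so the output is not byte-for-byte the original markup. URLs inside JavaScript or CSS are not touched by this approach.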
Re: Pulling a Page with LWP::UserAgent and fixing URLs? by Popcorn Dave (Abbot) on Nov 09, 2004 at 21:16 UTC
I would definitely use HTML::TokeParser. I used it for parsing news headlines and it made life so much simpler. This node is the parser that I wrote to dump a page into tokens. Hopefully that will get you going in the direction you're after. Hope that helps!

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
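
The node linked above is not reproduced here, so the following is only a guess at the general idea: a small token dumper built on HTML::TokeParser, useful for seeing which tags and attributes a fetched page actually contains before rewriting them. The function name `dump_tokens` and the output format are invented for this sketch.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return one line per token: the token type (S = start tag, E = end tag,
# T = text, C = comment, D = declaration, PI = processing instruction),
# followed by the tag name or text, plus attributes for start tags.
sub dump_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);   # scalar ref = parse this string
    my @lines;
    while (my $token = $parser->get_token) {
        my ($type, $what) = @$token;
        if ($type eq 'S') {
            my ($attr, $attrseq) = @{$token}[2, 3];
            my @pairs = map { sprintf '%s=%s', $_, $attr->{$_} } @$attrseq;
            $what = join ' ', $what, @pairs;
        }
        $what =~ s/\s+/ /g;                        # keep each token on one line
        push @lines, "$type $what";
    }
    return @lines;
}

print "$_\n" for dump_tokens('<p class="x">Hi <img src="../blah.gif"></p>');
```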

