Re^2: Pulling a Page with LWP::UserAgent and fixing URLs?

Re^3: Pulling a Page with LWP::UserAgent and fixing URLs? by Mutant (Priest) on Nov 09, 2004 at 12:09 UTC
Here's some code to do some of that: `my $parser = HTML::TokeParser::Simple->new(string => $html); my $new_html; while ( my $token = $parser->get_token ) { for my $attr ( 'src', 'href' ) { my $value = $token->get_attr($attr); next unless defined $value; next unless $value =~ /\.(gif|jpe?g|png|swf)$/; $value =~ s/\/([\.[:word:]\-]+?)$/$new_url$1/; $token->set_attr($attr,$value); } $new_html .= $token->as_is; }` Then your result is in `$new_html` (the code assumes `$new_url` already holds the new URL prefix). Of course, this won't handle everything, since you could have references to images, etc. in JavaScript, for example.
Re^3: Pulling a Page with LWP::UserAgent and fixing URLs? by teabag (Pilgrim) on Nov 09, 2004 at 12:11 UTC
OK, then `use URI::URL;` Teabag -- Siggy Played Guitar. Sure there's more than one way, but one just needs one anyway - Teabag
Re^4: Pulling a Page with LWP::UserAgent and fixing URLs? by MrForsythExeter (Novice) on Nov 09, 2004 at 15:08 UTC
URI::URL is only kept around for old code (backward compatibility and all that). It looks like URI is the one to use. However, are you saying I should parse out all the URLs, fix them with this, and put them back in? Or could I do a regexp with /e on the end and do it all in one line?
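On the one-line `/e` question, here is a minimal sketch of what that could look like. It is not code from this thread: the `$base` URL, the sample markup, and the `absolutize` helper are all made up for illustration. It only handles double-quoted `href` and `src` attributes, so a real parser is still the safer route.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL and markup, for illustration only.
my $base = 'http://www.example.com/dir/page.html';
my $html = '<a href="../up.html">up</a> <img src="pic.gif">';

# Hypothetical helper: rebuild the attribute with an absolute URL.
sub absolutize {
    my ( $attr, $url, $base_url ) = @_;
    return sprintf '%s="%s"', $attr, URI->new_abs( $url, $base_url );
}

# /e evaluates the replacement as Perl code, /g repeats the substitution,
# /i matches HREF as well as href.
$html =~ s/\b(href|src)="([^"]*)"/absolutize($1, $2, $base)/gie;

print "$html\n";
```

Single-quoted or unquoted attribute values, and URLs inside scripts or stylesheets, slip straight past this pattern.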
Re^5: Pulling a Page with LWP::UserAgent and fixing URLs? by teabag (Pilgrim) on Nov 10, 2004 at 14:30 UTC
I mean fixing the URLs by converting relative paths to absolute ones. Just include TokeParser as suggested and you're there. Something like this (by the way, this is not my code; it's from http://perl.com): `use strict; use warnings; use LWP::UserAgent; use URI; my $browser = LWP::UserAgent->new; my $url = 'http://www.cpan.org/RECENT.html'; my $response = $browser->get($url); die "Can't get $url -- ", $response->status_line unless $response->is_success; my $html = $response->content; while ( $html =~ m/<a href="(.*?)"/gi ) { print URI->new_abs( $1, $response->base ), "\n"; }` teabag -- Siggy Played Guitar. Sure there's more than one way, but one just needs one anyway - Teabag
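Putting the two suggestions together (TokeParser for the parsing, `URI->new_abs` for the fixing) might look roughly like the sketch below. The `$base` URL and the sample markup are assumptions for illustration. In a real script `$base` would more likely come from the `$response->base` call used in the reply above.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Hypothetical base URL and markup, for illustration only.
my $base = 'http://www.example.com/docs/index.html';
my $html = '<p><a href="next.html">next</a> <img src="/img/logo.png" alt="logo"></p>';

my $parser   = HTML::TokeParser::Simple->new( string => $html );
my $new_html = '';

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag ) {
        for my $attr (qw(href src)) {
            my $value = $token->get_attr($attr);
            next unless defined $value && length $value;

            # Resolve the (possibly relative) value against the base URL.
            $token->set_attr( $attr, URI->new_abs( $value, $base )->as_string );
        }
    }

    # Untouched tokens are emitted as they appeared; modified start tags
    # are emitted in rewritten form.
    $new_html .= $token->as_is;
}

print "$new_html\n";
```

Unlike the image-only filter earlier in the thread, this rewrites every `href` and `src` it finds, which may or may not be what you want.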
Re: Pulling a Page with LWP::UserAgent and fixing URLs? by b10m (Vicar) on Nov 09, 2004 at 19:07 UTC
As suggested before, take a look at URI, and if you do, please please please don't overlook the nifty -yet annoying ;)- `<base... />` tag. -- b10m All code is usually tested, but rarely trusted.
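To illustrate why the base tag matters: when a page declares one, relative links resolve against it rather than against the page's own URL. As far as I know, `HTTP::Response`'s `base` method is documented to take an embedded base into account when LWP has parsed the document head, so `$response->base` usually covers this already. If you only have the HTML string, a rough sketch of doing it by hand follows. The `effective_base` helper, the URLs, and the markup are all hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Hypothetical helper: return the URL that relative links should be
# resolved against, honoring the first <base href="..."> if there is one.
sub effective_base {
    my ( $html, $page_url ) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('base');
        my $href = $token->get_attr('href');
        next unless defined $href && length $href;

        # A relative base href is itself resolved against the page URL.
        return URI->new_abs( $href, $page_url );
    }
    return URI->new($page_url);    # no base tag: fall back to the page URL
}

# Hypothetical page URL and markup, for illustration only.
my $page_url = 'http://www.example.com/a/b/page.html';
my $html     = '<html><head><base href="http://static.example.com/assets/"></head>'
             . '<body><img src="logo.png"></body></html>';

my $base = effective_base( $html, $page_url );
print URI->new_abs( 'logo.png', $base ), "\n";
```

Without the base tag, the same `logo.png` would resolve under `http://www.example.com/a/b/` instead.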

