Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

Re: Getting the contents of any 'url'

by thatguy (Parson)
on Oct 03, 2001 at 10:29 UTC ( #116369=note: print w/replies, xml ) Need Help??

in reply to Getting the contents of any 'url' URGENT!!!!!!!

I am not sure what you mean by 'bad links', whether you mean incorrect html or links with a 404 on the other side. if you are looking for 404's, you could try this.. it only checks the links directly off of the root page ($url)

#!/usr/bin/perl -w use LWP::Simple; use HTML::TokeParser; use strict; my $url=""; my $content = get("$url"); my $parse = HTML::TokeParser->new(\$content); my $testlink; while (my $token = $parse->get_tag("a")) { # put the link and the text into variables my $link = $token->[1]{href} || "-"; my $text=$parse->get_trimmed_text("/a"); # if $link is fully qualified url if ($link=~ m/http\:\/\//i){ # use LWP to get the link $testlink=get("$link"); if ($testlink){ # parse the title returned from the testlink for 404 or # not found errs my $testparse=HTML::TokeParser->new(\$testlink); if ($testparse->get_tag("title")) { my $title = $testparse->get_trimmed_text; if (($title=~ m/not found/i) || ($title=~ m/404/)) { print "* $link ($text) is a bad link\n"; } else { print "$link ($text) seems to be a good link\n"; } } } } else { # guess at qualifiny url by adding $url to the front.. $testlink=get("$url/$link"); if ($testlink) { my $testparse=HTML::TokeParser->new(\$testlink); if ($testparse->get_tag("title")) { my $title = $testparse->get_trimmed_text; if (($title=~ m/not found/i) || ($title=~ m/404/)) { print "* $url/$link ($text) is a bad link\n"; } else { print "$url/$link ($text) seems to be a good link\n"; } } } } } exit;

you may also want to check out How to ask questions the smart way..

Replies are listed 'Best First'.
Re: Re: Getting the contents of any 'url'
by tachyon (Chancellor) on Oct 04, 2001 at 02:29 UTC

    Hi this is not the best way to do this. Some significant problems with your script are that you load the contern of the entire link and then parse this when you only want the header which will contain the HTTP response code in the status line. You might find this a bit better, it is much faster.

    #!/usr/bin/perl -w use strict; use HTML::TokeParser; use LWP::UserAgent; my $ua = LWP::UserAgent->new; my $proxy = ''; my $netloc = ''; my $realm = ''; my $uname = ''; my $pass = ''; $ua->credentials($netloc, $realm, $uname, $pass); $ua->proxy( 'http', $proxy ); my $url = ''; $url .= '/' if $url =~ m|http://[^/]+$|i; # add trailing / if forgott +en my ($root) = $url =~ m|(http://.*/)|i; my $content = get( $url ); print "Checking links....\n"; my $parse = HTML::TokeParser->new( \$content ); while (my $token = $parse->get_tag('a')) { my $link = $token->[1]{href} || next; my $text=$parse->get_trimmed_text('/a'); if ($link =~ m|^\s*mailto|i) { next; } elsif ($link=~ m|^\s*http://|i){ print &test( $link, $text ); } else { $link =~ s|^\s*/||; print &test( $root.$link, $text ); } } sub get { my $url = shift; print "Getting $url...."; my $request = HTTP::Request->new( 'GET', $url ); my $content = $ua->request( $request ); print "$$content{_msg} $$content{_rc}\n"; return $$content{_content}; } sub test { my ( $url, $text ) = @_; my $request = HTTP::Request->new( 'HEAD', $url ); my $content = $ua->request( $request ); return "$$content{_msg} $$content{_rc} ($text) $url\n"; } __END__ Getting 200 Checking links.... OK 200 (Frank) +id=131 OK 200 (perlmonks) OK 200 (login) +d=131 [snip]




Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://116369]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2020-05-26 18:24 GMT
Find Nodes?
    Voting Booth?
    If programming languages were movie genres, Perl would be:

    Results (150 votes). Check out past polls.