Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re: Merlyns Web Link checker

by dvergin (Monsignor)
on Feb 27, 2002 at 16:55 UTC ( [id://147963]=note: print w/replies, xml ) Need Help??


in reply to Merlyns Web Link checker

It's not clear why you added the tilde and the angle brackets to the values assigned to @CHECK, but the script runs better if you change
@CHECK = # list of initial starting points qw(~http://www.sun.com/ <http://www.sun.com/>);
to:
@CHECK = # list of initial starting points qw(http://www.sun.com/);
Also, you made a mistake in typing the line near the bottom that you have as:    $request = HTTP::Request ($method,$thisurl); It should read:    $request = new HTTP::Request ($method,$thisurl); You will also need to adjust the values in sub PARSE to properly screen for the range of URLs you are interested in. The script you have at present checks for pages at www.sun.com that satisfy a regex in sub PARSE that is screening for pages in Merlyn's site. Since there are no pages that satisfy both conditions, the script finishes up rather quickly.

For those interested in pursing this further, here is the listing and the column that discusses it.

Update: Hi, Merlyn!
 

Replies are listed 'Best First'.
Re: Re: Merlyns Web Link checker
by Anonymous Monk on Feb 27, 2002 at 18:36 UTC
    I still got same error message:
    C:\Perl\bin>we3.pl verifying http://www.domain/mysite.com Cannot fetch http://www.domain/mysite.com (status 501 Protocol scheme +'' is not supported)
    I have tried several different URL's that I have access to and still get the same message when I run this script. Any other suggestions??? I made changes as suggested:
    #!/usr/bin/perl use lib "/perl/bin"; use LWP::UserAgent; use HTML::Parser; use URI::URL; ## begin configure @CHECK = # list of initial starting points qw(http://www.domain/mysite.com); sub PARSE { # verify existance, parse for further URLs ## $_[0] is the absolute URL $_[0] =~ m!^<http://www>\.(domain)\.com/! and not $_[0] =~ /refindex/; } sub PING { # verify existence, but don't parse ## $_[0] is the absolute URL $_[0] =~ m!^(http):!; } ## end configure { package ParseLink; @ISA = qw(HTML::Parser); sub start { # called by parse my $this = shift; my ($tag, $attr) = @_; if ($tag eq "a") { $this->{links}{$attr->{href}}++; } elsif ($tag eq "img") { $this->{links}{$attr->{src}}++; } } sub get_links { my $this = shift; sort keys %{$this->{links}}; } } # end of ParseLink $ua = new LWP::UserAgent; $ua->agent("hverify/1.0"); $ua->env_proxy; $| = 1; MAINLOOP: while ($thisurl = shift @CHECK) { $thisurl =~ s/%7e/~/ig; # ugh :-) next if $did{$thisurl}++; if (PARSE $thisurl) { warn "fetching $thisurl\n"; $request = HTTP::Request('GET',$thisurl); $response = $ua->request($request); # fetch! unless ($response->is_success) { warn "Cannot fetch $thisurl (status ", $response->code, " ", $response->message,")\n"; next MAINLOOP; } next MAINLOOP unless $response->content_type =~ /text\/html/i; $base = $response->base; my $p = ParseLink->new; $p->parse($response->content); $p->parse(undef); for $link ($p->get_links) { $abs = url($link, $base)->abs; warn "... $link => $abs\n"; push(@CHECK, $abs); } next MAINLOOP; } if (PING $thisurl) { warn "verifying $thisurl\n"; for $method (qw(HEAD GET)) { $request = new HTTP::Request($method,$thisurl); $response = $ua->request($request); # fetch! next MAINLOOP if $response->is_success; # ok } warn "Cannot fetch $thisurl (status ", $response->code, " ", $response->message,")\n"; next MAINLOOP; } warn "[skipping $thisurl]\n"; }
      Unrelated to your immediate question, but I really doubt that you need to specify use lib '/perl/bin'; unless you've done some funky stuff to your Perl config; as that is the base directory that Perl looks for by default.

      ..Guv

      Not that this really matters, but in line 1 of your code you have:
      #!/usr/bin/perl the shebang line is used to tell perl where to look for perl on your system. On a windows system, mine is: #!/d:/perl -w The -w is added to turn on warnings. It can be VERY helpful.

      "The social dynamics of the net are a direct consequence of the fact that nobody has yet developed a Remote Strangulation Protocol." -- Larry Wall

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://147963]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-04-25 22:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found