Checking "incomplete" URLs

nop has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Checking "incomplete" URLs by rob_au (Abbot) on Feb 18, 2002 at 23:54 UTC
This is fairly straightforward to fix. Try changing your `validURL` subroutine to read thus: `sub validURL { my ($self, $url) = @_; my $req = HTTP::Request->new(HEAD => $url); my $res = $self->request($req); my $content = $res->content; return 0 unless $res->is_success; return 0 if $content =~ /the page you have requested cannot be found/i; return 1; }` Note that I have changed the request method from `POST` to `HEAD`. The `POST` method will not be allowed for most URLs (thereby generating your false-negative results), and while this could be changed to a `GET` request, the `HEAD` request method will be more successful for all "valid" URLs, irrespective of the preferred request method. `perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'`
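One caveat with the `HEAD` version: a `HEAD` response carries no body, so the "cannot be found" pattern can only ever match on a `GET`. Here is a minimal sketch of one way to combine the two, assuming a plain LWP::UserAgent object rather than the subclass used above. The names `response_looks_valid`, `check_url`, and `$NOT_FOUND_RE` are made up for illustration and are not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Soft-404 text taken from the thread; adjust it for the sites you check.
my $NOT_FOUND_RE = qr/the page you have requested cannot be found/i;

# Decide validity from a response object alone (no network access needed).
sub response_looks_valid {
    my ($res) = @_;
    return 0 unless $res->is_success;
    # An empty body (as with HEAD) simply never matches the pattern.
    return 0 if $res->content =~ $NOT_FOUND_RE;
    return 1;
}

# Cheap HEAD first; only fetch the body with GET when HEAD succeeded.
sub check_url {
    my ($ua, $url) = @_;
    my $head = $ua->head($url);
    return 0 unless $head->is_success;
    return response_looks_valid($ua->get($url));
}

# Usage (hypothetical URL):
#   my $ua = LWP::UserAgent->new(timeout => 10);
#   print check_url($ua, 'http://www.example.com/') ? "ok\n" : "bad\n";
```

The extra `GET` costs a second round trip, so drop it if the sites you check return honest 404 status codes.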
Re: Checking "incomplete" URLs by BlueLines (Hermit) on Feb 19, 2002 at 02:53 UTC
"My question is how do I get LWP useragent to act like a browser and find the default page in a directory?"

It has nothing to do with your browser, and everything to do with your web server. I tested your example on a site I had control of (running Apache). Here's what happened:

[jon@valium jon]$ telnet divisionbyzero.com 80
Trying 168.103.109.84...
Connected to divisionbyzero.com.
Escape character is '^]'.
GET /decss HTTP/1.0

HTTP/1.1 301 Moved Permanently
Date: Tue, 19 Feb 2002 02:47:50 GMT
Server: Apache/1.3.22 (Unix) (Red-Hat/Linux) mod_ssl/2.8.5 OpenSSL/0.9.6b mod_perl/1.24_01
Location: http://www.divisionbyzero.com/decss/
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>301 Moved Permanently</TITLE>
</HEAD><BODY>
<H1>Moved Permanently</H1>
The document has moved <A HREF="http://www.divisionbyzero.com/decss/">here</A>.<P>
<HR>
<ADDRESS>Apache/1.3.22 Server at www.divisionbyzero.com Port 80</ADDRESS>
</BODY></HTML>
Connection closed by foreign host.

The web server sent me a 301 since `/decss` wasn't an actual file, but rather a directory. My web browser followed that redirect automatically, which is what browsers are supposed to do when the HTTP method used is `GET` or `HEAD`. I suspect your troubles are caused by your use of the `POST` method, for which a client is explicitly forbidden to follow a redirect without confirmation from the user.

BlueLines

Disclaimer: This post may contain inaccurate information, be habit forming, cause atomic warfare between peaceful countries, speed up male pattern baldness, interfere with your cable reception, exile you from certain third world countries, ruin your marriage, and generally spoil your day. No batteries included, no strings attached, your mileage may vary.
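The same 301 can be seen from Perl without telnet. This is a rough sketch that relies on `simple_request`, which LWP::UserAgent documents as sending a single request without following redirects. The helper names `first_hop` and `redirect_target` are made up here for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Return the Location header of a redirect response, or undef otherwise.
sub redirect_target {
    my ($res) = @_;
    return $res->is_redirect ? scalar $res->header('Location') : undef;
}

# Send exactly one request; unlike request(), this does not follow redirects.
sub first_hop {
    my ($ua, $method, $url) = @_;
    my $req = HTTP::Request->new($method => $url);
    return $ua->simple_request($req);
}

# Usage, with the URL from the transcript above:
#   my $res = first_hop(LWP::UserAgent->new, GET => 'http://divisionbyzero.com/decss');
#   my $to  = redirect_target($res);
#   print $res->status_line, defined $to ? " -> $to" : '', "\n";
```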
Re: Re: Checking "incomplete" URLs by chipmunk (Parson) on Feb 19, 2002 at 04:29 UTC
By default, LWP::UserAgent automatically follows redirects for any request except a POST. The redirect_ok() method controls this behavior: `$ua->redirect_ok This method is called by request() before it tries to do any redirects. It should return a true value if a redirect is allowed to be performed. Subclasses might want to override this. The default implementation will return FALSE for POST request and TRUE for all others.` Recently I had to write a script which posted a form on a remote site, and then checked the text of the resulting page to make sure the post succeeded. Unfortunately, there was a redirect to that page. First I tried making a subclass with a new redirect_ok() that always returned 1. Unfortunately, LWP::UserAgent used a POST request for the redirect, and the remote server returned a 405 error. I ended up writing a redirect_ok() which replaced the POST request object in @_ with a new one that did a GET instead. Ugly, but it worked!
Re: Re: Re: Checking "incomplete" URLs by IlyaM (Parson) on Feb 19, 2002 at 11:16 UTC
You could upgrade to the latest libwww and just use the `requests_redirectable` method of LWP::UserAgent: `$ua->requests_redirectable( ); # to read $ua->requests_redirectable( \@requests ); # to set This reads or sets the object's list of request names that "$ua->redirect_ok(...)" will allow redirection for. By default, this is "['GET', 'HEAD']", as per RFC 2068. To change to include 'POST', consider: push @{ $ua->requests_redirectable }, 'POST';` -- Ilya Martynov (http://martynov.org/)
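Wrapped up as a small helper, the documented one-liner might look like the sketch below. The name `ua_following_post_redirects` is made up for illustration. Whether the follow-up request goes out as a `POST` or a `GET` depends on the status code and your libwww version, so check the final response rather than assuming.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that also follows redirects sent in reply to a POST.
sub ua_following_post_redirects {
    my $ua = LWP::UserAgent->new;
    # requests_redirectable returns an array ref (default: GET and HEAD).
    push @{ $ua->requests_redirectable }, 'POST';
    return $ua;
}

# Usage (hypothetical form URL and fields):
#   my $ua  = ua_following_post_redirects();
#   my $res = $ua->post('http://www.example.com/form', { name => 'nop' });
#   print $res->request->uri, "\n";   # URI of the request that produced $res
```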
Re: Re: Checking "incomplete" URLs by nop (Hermit) on Feb 19, 2002 at 03:08 UTC
Hurrah! GET (vs. POST) solved it. Many thanks, BlueLines! ++ `sub validURL { my ($self, $url) = @_; my $req = HTTP::Request->new(GET => $url); my $res = $self->request($req); my $content = $res->content; return 0 if $content =~ /the page you have requested cannot be found/i; return 0 unless $content =~ /\S/; return 1; }`
Re: Checking "incomplete" URLs by Anonymous Monk on Feb 19, 2002 at 04:30 UTC
Finding the default page is done by the server, not the browser. The client sends whatever it wants, and the server decides what to send back.
Re: Checking "incomplete" URLs by erikharrison (Deacon) on Feb 20, 2002 at 05:03 UTC
When you request a directory on a web server, the server gets to decide what to send you: usually index.html, but not necessarily. This is straight out of lwpcook.pod: "If you just want to check if a document is present (i.e. the URL is valid) try to run code that looks like this: `use LWP::Simple; if (head($url)) { # ok document exists }` ..." which is the "canonical" way to make sure a URL is valid. Cheers, Erik


	PerlMonks