HTTP GET without LWP

bbfu has asked for the wisdom of the Perl Monks concerning the following question:

Okay, maybe not really a Perl question, per se... But here goes:

I have a small program (gethttp) to make simple HTTP requests and print the response to stdout. Unfortunately, LWP is not available so it had to be done manually using IO::Socket. I copied the program from one of the Perl man pages and modified it only slightly.

For the most part, this program works just great. Every once in a while, however, I find a page (usually a CGI) that doesn't seem to work. What I get back is a 404 error, though I know the page is there because I can access it using my web browser.

I figure there must be something going on that I'm just not getting. I can't find any documents anywhere explaining a different syntax for the HTTP GET, and I can't see anything wrong with my Perl code. I really just want to understand why it's not working and what's going on, though it might have a practical application in a project I'm working on if I can get it to work.

I'm including the code from my program below, as well as a URL that it doesn't work on and the response I get from it. (I don't know if everyone can get to the URL, since it might be set up as private to UF. Let me know if you have problems.)

I would appreciate any help very much!

The Program...

#!/usr/bin/perl -w
use IO::Socket;
unless (@ARGV) { die "usage: $0 URL\n" }
$EOL = "\015\012";
$BLANK = $EOL x 2;
$sep = (@ARGV > 1) ? "-------------------\n" : "";
foreach $url ( @ARGV ) {
    unless($url =~ m{^http://(.*?)/}) { print "$0: invalid url: $url\n"; next }
    $host = $1;
    $remote = IO::Socket::INET->new( Proto     => "tcp",
                                     PeerAddr  => $host,
                                     PeerPort  => "http(80)",
                                    );
    unless ($remote) { die "Cannot connect to http daemon on $host\n" }
    $remote->autoflush(1);
    print $remote "GET $url HTTP/1.0" . $BLANK;
    while ( <$remote> ) { print }
    print "\n$sep";
    close $remote;
}

The Response

$ ./gethttp 'http://login.gatorlink.ufl.edu/authenticate.cgi'
HTTP/1.0 404 Not Found
Date: Fri, 12 Jan 2001 22:57:21 GMT
Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not Found</TITLE>
</HEAD><BODY>
<H1>Not Found</H1>
The requested URL http://login.gatorlink.ufl.edu/authenticate.cgi was not found on this server.<P>
</BODY></HTML>


Replies are listed 'Best First'.
Re: HTTP GET without LWP by sutch (Curate) on Jan 13, 2001 at 04:34 UTC
Update: Because of isotope's response to this posting, I've discovered that the solution is not HTTP/1.1 itself but providing the Host: header, which allows the server to respond with a redirect. Without LWP, you're safer and have less work if you stick to HTTP/1.0 combined with the Host: header.

The page is not available, for whatever reason (probably because of authentication). Request http://login.gatorlink.ufl.edu/authenticate.cgi in a browser and notice that you are redirected to http://login.gatorlink.ufl.edu/retry.cgi? .

You're making an HTTP request using HTTP/1.0, so the server responds with the "404 Not Found" page. Change your request to HTTP/1.1 and you will receive a redirect as the response:

telnet login.gatorlink.ufl.edu 80
Trying 128.227.128.87...
Connected to dir2fe1.server.ufl.edu.
Escape character is '^]'.
GET /authenticate.cgi HTTP/1.1
Host: login.gatorlink.ufl.edu

HTTP/1.1 302 Found
Date: Fri, 12 Jan 2001 23:30:14 GMT
Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
URI: retry.cgi?
Set-Cookie: UF_GatorLinkState=none; path=/; domain=.ufl.edu;
Location: retry.cgi?
Transfer-Encoding: chunked
Content-Type: text/html

be
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved <A HREF="retry.cgi?">here</A>.<P>
</BODY></HTML>

0
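To act on a redirect like the one in the telnet session above, a script has to pick the status code and the Location: header out of the response. The following is only a minimal sketch: `parse_response_head` is a hypothetical helper name, not something from the thread or from any module, and it keeps only the last value when a header is repeated. The sample response is a shortened copy of the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): parse the head of a
# raw HTTP response into (status code, hashref of lower-cased header names).
sub parse_response_head {
    my ($raw) = @_;
    # The head ends at the first blank line; tolerate bare LF as well as CRLF.
    my ($head) = split /\015?\012\015?\012/, $raw, 2;
    return unless defined $head;
    my @lines = split /\015?\012/, $head;
    my $status_line = shift @lines;
    return unless defined $status_line;
    my ($code) = $status_line =~ m{^HTTP/\d+\.\d+\s+(\d{3})};
    return unless defined $code;
    my %header;
    for my $line (@lines) {
        next unless $line =~ /^([^:\s]+):\s*(.*?)\s*$/;
        $header{lc $1} = $2;    # a repeated header overwrites the earlier one
    }
    return ($code, \%header);
}

# Shortened copy of the 302 response from the telnet session.
my $raw = join "\015\012",
    'HTTP/1.1 302 Found',
    'Date: Fri, 12 Jan 2001 23:30:14 GMT',
    'Location: retry.cgi?',
    'Content-Type: text/html',
    '',
    'The document has moved';

my ($code, $header) = parse_response_head($raw);
if ($code == 302 && defined $header->{location}) {
    print "Redirected to: $header->{location}\n";
}
```

Note that the Location: value in this response is relative (retry.cgi?), so a real client would still have to resolve it against the URL it requested; the sketch does not attempt that.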
Re: Re: HTTP GET without LWP by isotope (Deacon) on Jan 13, 2001 at 04:49 UTC
Don't send HTTP/1.1 unless you're prepared to implement it properly. If you do, the server will expect to keep the connection open. The `Host:` header is supported just fine with HTTP/1.0, which will drop the connection as soon as the transfer is complete.

--isotope
http://www.skylab.org/~isotope/
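One part of implementing HTTP/1.1 properly is handling "Transfer-Encoding: chunked" bodies, like the one in sutch's telnet session (the "be" and "0" lines there are chunk sizes in hex). As a rough illustration only, here is a sketch of a decoder; `decode_chunked` is a hypothetical name, the input is assumed to be the complete body already read into a string, and trailers after the final chunk are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): decode a complete
# chunked body. Each chunk is "<hex size>[;extensions]CRLF<data>CRLF",
# and a chunk of size 0 ends the body.
sub decode_chunked {
    my ($body) = @_;
    my $out = '';
    my $pos = 0;
    while (1) {
        my $eol = index($body, "\015\012", $pos);
        die "truncated chunk header\n" if $eol < 0;
        my $size_line = substr($body, $pos, $eol - $pos);
        my ($hex) = $size_line =~ /^\s*([0-9A-Fa-f]+)/
            or die "bad chunk size: $size_line\n";
        my $size = hex $hex;
        $pos = $eol + 2;
        last if $size == 0;              # trailers, if any, are ignored
        die "truncated chunk data\n" if $pos + $size > length $body;
        $out .= substr($body, $pos, $size);
        $pos += $size + 2;               # skip the data and its trailing CRLF
    }
    return $out;
}

my $chunked = "5\015\012Hello\015\012"
            . "7\015\012, world\015\012"
            . "0\015\012\015\012";
print decode_chunked($chunked), "\n";
```

Sticking to HTTP/1.0 as suggested above avoids needing any of this, since the body is simply everything up to the point where the server closes the connection.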
Re: HTTP GET without LWP by isotope (Deacon) on Jan 13, 2001 at 04:25 UTC
Many servers won't like getting a request that includes the 'http://hostname' part of the URL. Typically only proxy servers actually accept that. You might have better luck if you change:

m{^http://(.*?)/}
#to
m{^http://(.*?)(/.*)$}

...and then set `$url = $2` so you only request the URI (/authenticate.cgi).

Before anyone else lays into me, this is a very rough solution and doesn't necessarily take everything into account. This is really what LWP is designed for, as it will use RFC-compliant methods to parse the URL instead of this quick and dirty stuff.

Update: It may be a virtual server, in which case you also need to send the Host: header in your request, like this:

print $remote "GET $url HTTP/1.0" . $EOL . "Host: $host" . $BLANK;

I strongly suggest splitting the URI, too.

--isotope
http://www.skylab.org/~isotope/
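The split suggested above can be sketched a little more defensively. This is only an illustration under stated assumptions: `split_http_url` and `build_request` are hypothetical names, the regex handles plain http:// URLs with an optional port and is not an RFC-compliant parser, and a URL with a query string but no path (http://host?x=1) is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): split an http:// URL
# into host, port and path. Returns an empty list if the URL does not match.
sub split_http_url {
    my ($url) = @_;
    return unless $url =~ m{^http://([^/:?#]+)(?::(\d+))?(/[^#]*)?(?:#.*)?$}i;
    my ($host, $port, $path) = ($1, $2, $3);
    $port = 80  unless defined $port;
    $path = '/' unless defined $path;    # "http://host" means "http://host/"
    return ($host, $port, $path);
}

# Build an HTTP/1.0 request that names the virtual host in a Host: header.
# Lines end in CRLF and a blank line terminates the request.
sub build_request {
    my ($host, $path) = @_;
    my $EOL = "\015\012";
    return "GET $path HTTP/1.0$EOL" . "Host: $host$EOL" . $EOL;
}

my ($host, $port, $path) =
    split_http_url('http://login.gatorlink.ufl.edu/authenticate.cgi');
print build_request($host, $path);
```

The request line then carries only /authenticate.cgi, while the host name travels in the Host: header, which is what a server hosting several sites on one address needs in order to pick the right one.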
(unfortunately...) Re (2): HTTP GET without LWP by mwp (Hermit) on Jan 13, 2001 at 04:30 UTC
That was my first thought, but breaking the URL into host and URI didn't solve the issue. If I figure anything else out, I'll post it here.

Update: sutch seems to have hit the nail on the head. If you send "GET /authenticate.cgi HTTP/1.0" alone it errors out. The key is attaching "Host: login.gatorlink.ufl.edu" to the end of the request, before your `$BLANK` variable.

while (my $url = shift @ARGV) {
    unless ($url =~ m{^http://([A-Za-z0-9\.\-]+)/(.*)$}) {
        print "$0: invalid url: $url\n";
        next;
    }
    my ($host, $uri) = ($1, $2);
    my $remote = IO::Socket::INET->new(Proto    => "tcp",
                                       PeerAddr => $host,
                                       PeerPort => "http(80)");
    unless ($remote) { die "Cannot connect to http daemon on $host\n" }
    $remote->autoflush(1);
    # $EOL, $BLANK and $sep are defined as in the original program
    print $remote "GET /$uri HTTP/1.0" . $EOL . "Host: $host" . $BLANK;
    print while (<$remote>);
    print "\n$sep";
    close $remote;
}

Your end result might look something like that. You really should just use LWP. =)

'kaboo
Servers not liking 'http://hostname' by bbfu (Curate) on Jan 13, 2001 at 04:32 UTC
Yes, I'd thought of that and tried it but it doesn't seem to work any better. =( All the pages I've tried it on accept the full URL but since you say many won't like it, I'll change it. It seems to work both ways for the ones that work. Unfortunately:

$ ./gethttp 'http://login.gatorlink.ufl.edu/authenticate.cgi'
HTTP/1.1 404 Not Found
Date: Fri, 12 Jan 2001 23:27:55 GMT
Server: Apache/1.3.6 (Unix) mod_perl/1.19 mod_ssl/2.2.8 OpenSSL/0.9.2b
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>404 Not Found</TITLE>
</HEAD><BODY>
<H1>Not Found</H1>
The requested URL /authenticate.cgi was not found on this server.<P>
</BODY></HTML>

Thanks for your help, though!
Re: HTTP GET without LWP by dws (Chancellor) on Jan 13, 2001 at 04:41 UTC
It's possible that the web server is configured to internally redirect the request based on HTTP request headers that your web browser is sending, but your script isn't. There are several possibilities:

- Your browser is probably using HTTP/1.1, and is thus sending a `Host:` header.
- Your browser is sending a `User-Agent:` header. (A likely culprit.)
- Your browser is sending an `Accept:` header. (A less likely culprit.)

Try adding these headers to your HTTP request.

Update: The correct answer (HTTP/1.1 + Host:) snuck in while I was typing this.

Update^2: An invaluable reference to have, whether you're using LWP or not, is Webmaster in a Nutshell (O'Reilly), http://www.oreilly.com/catalog/webmaster2/. It includes a complete overview of HTTP, including request and response headers.
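Adding the headers listed above is just a matter of putting more "Name: value" lines between the request line and the terminating blank line. A minimal sketch, assuming a hypothetical `build_get` helper; the User-Agent string gethttp/0.1 is made up for illustration and the Accept value is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): build an HTTP/1.0
# GET request with a Host: header plus any extra headers passed as pairs.
sub build_get {
    my ($host, $path, %extra) = @_;
    my $EOL = "\015\012";
    my @lines = ("GET $path HTTP/1.0", "Host: $host");
    # Sort the names so the output is stable; header order is not significant.
    push @lines, "$_: $extra{$_}" for sort keys %extra;
    return join($EOL, @lines) . $EOL x 2;
}

print build_get('login.gatorlink.ufl.edu', '/authenticate.cgi',
    'User-Agent' => 'gethttp/0.1',    # made-up identifier
    'Accept'     => '*/*',
);
```

Trying the request with one extra header at a time is a simple way to find out which of them, if any, the server actually keys on.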
Thanks, everyone! by bbfu (Curate) on Jan 13, 2001 at 04:50 UTC
I appreciate all the help from everyone! The problem was pretty much exactly what sutch said. I've updated the request to use HTTP/1.1 and the Host: field. I might take the advice and use the User-Agent: field as well when I read more about it.

What I've got now (below) works. Though I realize this is a pretty primitive hack, I just don't have access to LWP (at the moment).

Again, thanks for all the help!

#!/usr/bin/perl -w
use IO::Socket;
unless (@ARGV) { die "usage: $0 URL\n" }
$EOL = "\015\012";
$BLANK = $EOL x 2;
$sep = (@ARGV > 1) ? "-------------------\n" : "";
foreach $url ( @ARGV ) {
    unless($url =~ m{^http://(.*?)/(.*)$}) { print "$0: invalid url: $url\n"; next }
    $host = $1;
    $rest = $2;
    $remote = IO::Socket::INET->new( Proto     => "tcp",
                                     PeerAddr  => $host,
                                     PeerPort  => "http(80)",
                                    );
    unless ($remote) { die "Cannot connect to http daemon on $host\n" }
    $remote->autoflush(1);
    print $remote "GET /$rest HTTP/1.1" . $EOL . "Host: $host" . $BLANK;
    while ( <$remote> ) { print }
    print "\n$sep";
    close $remote;
}
Re: HTTP GET without LWP by strredwolf (Chaplain) on Jan 13, 2001 at 10:30 UTC
You may find my WolfSkunk Proxy "wsproxy" program helpful in this regard. There's a bit of regexp code to parse a URL into its right components. Worth taking a look over.

--
$Stalag99{"URL"}="http://stalag99.keenspace.com";

