PerlMonks  

Re: Grabbing a Web Page

by reyjrar (Hermit)
on Aug 28, 2000 at 03:27 UTC (#29938)


in reply to Grabbing a Web Page

Or, if you want to learn how things work, you can communicate over raw sockets and "re-invent the wheel", which I find is FAR more educationally valuable than using modules:
#!/usr/bin/perl
use Socket;
use strict;

my $line;
my $URL = "http://www.yahoo.com";
$URL =~ s/http\:\/\///;
my ($HOST,@temppage) = split('/', $URL);
my $PAGE = join('/', @temppage);
if(!$PAGE) { $PAGE = "/"; }
$PAGE = "/$PAGE";

open(OUTFILE, ">html.out");
socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
connect(HTML, sockaddr_in(80,inet_aton($HOST)));

my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
send(HTML, $REQUEST, '');

while(<HTML>) {
    print OUTFILE;
}
close HTML;
close OUTFILE;
And like I said, it's more time-efficient to use the LWP module.. but this way you're using just perl, and not relying on some machine having lynx or LWP installed.. and it's fun! :)
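
For what it's worth, the host/page split in the script above ends up with a doubled slash for a bare host ($PAGE becomes "/" and then "//"). A minimal sketch of one way around that, assuming a single regex is acceptable; split_url is a made-up helper name, not something from the script:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): split a URL
# into host and path, defaulting the path to "/".
sub split_url {
    my ($url) = @_;
    $url =~ s{^http://}{}i;                  # drop the scheme if present
    my ($host, $path) = $url =~ m{^([^/]+)(/.*)?$}
        or die "Cannot parse URL: $url\n";
    $path = '/' unless defined $path;        # bare host means the root page
    return ($host, $path);
}

my ($host, $page) = split_url('http://www.yahoo.com');
print "$host $page\n";
```

Because the path is captured with its slashes intact, a trailing slash such as www.foo.com/dir/ should survive without any extra bookkeeping.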

Replies are listed 'Best First'.
RE: Re: Grabbing a Web Page
by bobby (Sexton) on Aug 28, 2000 at 06:09 UTC
    definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP (smiley)

    i modified this program so you can say:
    www.foo.com/ instead of www.foo.com/index.html
    and it reads the url from the command line and just prints the page to stdout
    just in case anyone cared
    #!/usr/bin/perl
    use Socket;
    use strict;

                                #i don't know what $line is
    my $line;                   #but i left it in anyway
    my $trailingslash;
    my $URL = $ARGV[0];         #get URL from command line
    $URL =~ s/http\:\/\///;     #get rid of "http://" if it's there

    if ($URL =~ m/\/$/) {           #check for trailing slash
        $trailingslash = 'true';    #(i.e. get /index.foo)
    } else {
        $trailingslash = 0;
    }

    my ($HOST,@temppage) = split('/', $URL);
    my $PAGE = join('/', @temppage);

    if (($trailingslash) && ($PAGE)) {
        $PAGE = "/$PAGE/";          #reattach the trailing slash
    } else {
        $PAGE = "/$PAGE";
    }

    socket(HTML, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die $!;
    connect(HTML, sockaddr_in(80,inet_aton($HOST)));

    my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
    send(HTML, $REQUEST, '');

    while(<HTML>) {
        print;                      #to STDOUT
    }
    close HTML;

    of course, we could just make the program respond to 301 Moved Permanently. ha.
    -b
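
On the 301 idea: a rough sketch of what "responding to 301" might start with, assuming the whole response has already been read into one string. redirect_target is a hypothetical helper, the sample response is made up, and actually re-requesting the new URL is left out:

```perl
use strict;
use warnings;

# Hypothetical helper: given the raw text of an HTTP response, return the
# Location target if the status is 301 or 302; otherwise return nothing.
sub redirect_target {
    my ($response) = @_;
    # The header block ends at the first blank line.
    my ($head) = split /\r?\n\r?\n/, $response, 2;
    return unless defined $head;
    my ($status_line, @headers) = split /\r?\n/, $head;
    return unless defined $status_line
        && $status_line =~ m{^HTTP/\d\.\d\s+(\d{3})};
    return unless $1 == 301 || $1 == 302;
    for my $header (@headers) {
        return $1 if $header =~ /^Location:\s*(\S+)/i;
    }
    return;
}

# Made-up sample response, for illustration only.
my $sample = "HTTP/1.0 301 Moved Permanently\r\n"
           . "Location: http://www.foo.com/index.html\r\n"
           . "\r\n"
           . "<html>moved</html>\n";
my $target = redirect_target($sample);
print defined $target ? "redirect to $target\n" : "no redirect\n";
```

A real version would want a cap on how many redirects it follows, so two servers pointing at each other can't loop forever.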
      > definitely fun, especially after spending all afternoon installing various modules prerequisite to LWP (smiley)

      Isn't that what the CPAN module is for ;-)

      perl -MCPAN -e shell
      cpan> install LWP
      Then sit back and relax!

      Make sure you install the latest version of the CPAN module first though so it doesn't try to upgrade your perl to 5.6.0...

      I thought I tested everything, but I was wrong.. good call.. uhm.. I believe if we make this change it'll work too:
      my $REQUEST = "GET $PAGE HTTP/1.0\n\n";
      goes to:
      my $REQUEST = "GET $PAGE\n\n";
      I tested it on my apache server and it seemed to work fine.. and I recall from past experience with Squid that it will work. lemme know if you find differently..
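
A note on the change suggested above: dropping the version token makes the request look like the older HTTP/0.9-style "simple request", and as far as I know a server that accepts it sends back only the body, with no status line or headers, so a 301 would not be visible that way. Here is a sketch of building either form, assuming CRLF line endings and a Host header are wanted on the full request; build_request is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: build the request text for a host and page.
# Pass a version such as '1.0' for a full request, or omit it for the
# bare one-line "simple request".
sub build_request {
    my ($host, $page, $version) = @_;
    my $crlf = "\015\012";                      # CRLF, the same on every platform
    return "GET $page$crlf" unless $version;    # no version, no headers
    return "GET $page HTTP/$version$crlf"
         . "Host: $host$crlf"                   # optional in HTTP/1.0
         . $crlf;                               # blank line ends the headers
}

print build_request('www.foo.com', '/', '1.0');
```

The Host line is not required by HTTP/1.0, but servers that host several sites on one address generally need it to pick the right one.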
RE: Re: Grabbing a Web Page
by turnstep (Parson) on Aug 28, 2000 at 14:38 UTC

    > ...if you want to learn how things work.. and
    >communicate with raw sockets and "re-invent the
    >wheel" which I find is FAR more educationally
    >valuable than modules, you can:

    Or, for even MORE of an education, take out that

    use Socket;
    line. Then try and get it to run on different platforms! :) At the very least you'll gain an appreciation for the Socket module.

    P.S. The open in the code above should have an "or die..." after it.
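
A sketch of what that "or die" might look like, extended to the resolve and connect steps, which the posted code also leaves unchecked. The helper names and the deliberately bad path are made up for illustration, and the lexical filehandles assume perl 5.6 or later:

```perl
use strict;
use warnings;
use Socket;

# Hypothetical helper: open the output file, dying with the OS error on failure.
sub open_output {
    my ($path) = @_;
    open(my $fh, '>', $path) or die "Cannot write $path: $!\n";
    return $fh;
}

# Hypothetical helper: resolve a host and connect on port 80, checking each step.
sub connect_http {
    my ($host) = @_;
    my $addr = inet_aton($host) or die "Cannot resolve $host\n";
    socket(my $sock, PF_INET, SOCK_STREAM, getprotobyname('tcp'))
        or die "socket: $!\n";
    connect($sock, sockaddr_in(80, $addr)) or die "connect to $host: $!\n";
    return $sock;
}

# Assumes /no/such/dir does not exist, so the open should fail loudly.
my $ok = eval { open_output('/no/such/dir/html.out'); 1 };
print "open failed as expected: $@" unless $ok;
```

connect_http is defined but not called here, so the example runs without touching the network.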
