
Someone on clpmisc recently asked about a layer over LWP that would allow line-by-line reading from a GET or POST request. I replied to him (not to the newsgroup) with code I'd developed. I didn't send it to the NG because I didn't want to be flamed for reinventing the wheel, and I'm sure my code is less than desirable.

I'm a bit irritated LWP isn't built to allow line-by-line reading of the response -- and it is not easily sub-classed, due to the tremendous amount of code. That's why I had to come up with what I post below. It also has not been too rigorously tested.

The use of the module is as follows:
    use LWP::FileHandle;
    lwpopen URL, $method, $url, $query;
where $method is either 'GET' or 'POST', $url is a FULL URL (like "http://www.server.com/path"), and $query is a string, array reference, or hash reference that holds the key-value pairs of the HTTP request. If you use an array or hash reference, the data MUST NOT be encoded yet -- if you use a string, the data MUST be encoded already. Sample usages are:
    lwpopen URL, GET => $url, 'this=that';
    lwpopen URL, GET => "$url?this=that";    # can append QS to URL
    lwpopen URL, POST => $url, [ this => 'that' ];
    lwpopen URL, POST => $url, { this => 'that' };
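The encoding rule described above (references get encoded for you, strings must arrive already encoded) can be sketched roughly as follows. This is only an illustration, not the module's code: escape_component and build_query are hypothetical names, and escape_component is a hand-rolled stand-in for URI::Escape's uri_escape so the snippet has no non-core dependencies. It assumes byte strings, and it sorts hash keys only to make the output predictable.

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape::uri_escape: percent-encode every
# byte that is not an unreserved character.
sub escape_component {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# Hypothetical helper: turn a string, array ref, or hash ref into a query string.
sub build_query {
    my ($data) = @_;

    # A plain string is taken as-is: the caller must have encoded it already.
    return $data unless ref $data;

    # Flatten a hash ref into a key/value list (sorted for stable output);
    # an array ref is assumed to be a flat key/value list already.
    my @pairs = ref($data) eq 'HASH'
        ? (map { ($_, $data->{$_}) } sort keys %$data)
        : @$data;

    my @encoded;
    while (my ($key, $value) = splice @pairs, 0, 2) {
        push @encoded, escape_component($key) . '=' . escape_component($value);
    }
    return join '&', @encoded;
}

print build_query([ this => 'that', name => 'a b&c' ]), "\n";
print build_query({ q => 'x=y' }), "\n";
print build_query('this=that'), "\n";
```

The point of the split is that a string such as 'a=b&c=d' cannot be re-encoded safely, because the separators would be escaped too, whereas a reference keeps keys and values apart so each can be escaped on its own.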
Then you can read from URL as if it were a regular filehandle:
    use LWP::FileHandle;

    lwpopen JAPHY, GET => 'http://www.crusoe.net/~jeffp/';
    while (<JAPHY>) {
        print;
    }
    lwpclose JAPHY;
You can turn off the returning of the HTTP response headers by setting $LWP::FileHandle::HEADERS to 0. I think that about covers it for the module... oh, it doesn't handle redirects. Redirect handling could be added (a bit more code, but it can be done).
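For anyone wondering what skipping the headers amounts to: an HTTP response separates headers from body with an empty line, so the reader just discards lines until it sees a bare CRLF. Here is a small sketch of that idea, run against an in-memory filehandle rather than a socket. The skip_headers name and the canned response are made up for the example.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";

# Hypothetical helper: consume lines up to and including the blank line
# that ends the headers, and return the header lines that were skipped.
sub skip_headers {
    my ($fh) = @_;
    my @headers;
    while (defined(my $line = <$fh>)) {
        last if $line eq $CRLF;
        push @headers, $line;
    }
    return @headers;
}

# A canned response standing in for what a server would send.
my $raw = join $CRLF,
    'HTTP/1.0 200 OK',
    'Content-Type: text/plain',
    '',
    "line one\nline two\n";

open my $fh, '<', \$raw or die "cannot open in-memory handle: $!";
my @headers = skip_headers($fh);
my @body    = <$fh>;    # whatever is left is the body

print scalar(@headers), " header lines skipped\n";
print @body;
```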

Is this a bad thing for me to have done/written? I don't mean to incite a flame war or a cargo cult in my honor but I felt this functionality warranted creation.
package LWP::FileHandle;

use strict;
use vars qw( @ISA @EXPORT $HEADERS );

use Carp;
use IO::Socket;
use Socket ();
use Symbol qw( qualify_to_ref );
use URI::Escape;

require Exporter;
@ISA    = qw( Exporter );
@EXPORT = qw( lwpopen lwpclose );

$HEADERS = 1;                # set to 0 to skip the HTTP response headers
my $CRLF = $Socket::CRLF;    # file-scoped, so the Tie package below sees it too

sub lwpopen (*@) {
    my ($fh, $mode, $url, $qs1) = @_;

    # $qs2 is any query string already appended to the URL, without its '?'
    my ($host, $path, $qs2) = $url =~ m!(?:http://)?([^/?]+)([^?]*)\??(.*)!;
    my $query = '';

    $mode = uc $mode;
    croak "HTTP mode must be 'GET' or 'POST'"
        if $mode ne 'GET' and $mode ne 'POST';

    if (UNIVERSAL::isa($qs1, 'ARRAY')) {
        for (my $i = 0; $i < @$qs1; $i += 2) {
            $query .= '&' if length $query;
            $query .= join '=', uri_escape($qs1->[$i]), uri_escape($qs1->[$i+1]);
        }
    }
    elsif (UNIVERSAL::isa($qs1, 'HASH')) {
        while (my ($k, $v) = each %$qs1) {
            $query .= '&' if length $query;
            $query .= join '=', uri_escape($k), uri_escape($v);
        }
    }
    elsif ($qs1 and not ref $qs1) {
        $query = $qs1;
    }
    elsif ($qs1) {
        croak "HTTP request must be array ref, hash ref, or string";
    }

    # Merge in the query string from the URL; the '?' is added only when
    # the GET request line is built, so a POST body never starts with '?'
    $query .= '&' if length $query and length $qs2;
    $query .= $qs2;

    $path ||= '/';
    $host .= ':80' if $host !~ /:\d+$/;

    # Don't tie the handle if the connection failed
    my $socket = IO::Socket::INET->new($host) or return 0;

    # A bareword handle arrives as a plain string, so qualify it into the
    # caller's package before tying
    tie *{ qualify_to_ref($fh, scalar caller) },
        'LWP::FileHandle::Tie', $socket, $host, $path, $query, $mode;
    return 1;
}

sub lwpclose (*) {
    untie *{ qualify_to_ref($_[0], scalar caller) };
}

package LWP::FileHandle::Tie;

sub TIEHANDLE {
    my ($class, $socket, $host, $path, $query, $mode) = @_;
    bless {
        SOCKET   => $socket,
        READFROM => 0,
        PATH     => $path,
        QUERY    => $query,
        MODE     => $mode,
    }, $class;
}

sub READLINE {
    my $self   = shift;
    my $socket = $self->{SOCKET};

    # The request is sent lazily, on the first read
    if (!$self->{READFROM}++) {
        my ($path, $query) = @{$self}{qw( PATH QUERY )};
        if ($self->{MODE} eq 'GET') {
            $path .= "?$query" if length $query;
            $socket->print("GET $path HTTP/1.0$CRLF$CRLF");
        }
        else {
            my $enctype = "application/x-www-form-urlencoded";
            my $len     = length $query;
            $socket->print("POST $path HTTP/1.0$CRLF");
            $socket->print("Content-type: $enctype$CRLF");
            $socket->print("Content-length: $len$CRLF$CRLF");
            $socket->print($query);
        }
        if (!$LWP::FileHandle::HEADERS) {
            # Discard everything up to the blank line that ends the headers
            while (defined(my $line = <$socket>)) {
                last if $line eq $CRLF;
            }
        }
    }
    <$socket>;
}

sub DESTROY { $_[0]{SOCKET}->close }


$_="goto+F.print+chop;\n=yhpaj";F1:eval

RE: to post, or not to post...
by merlyn (Sage) on Oct 12, 2000 at 06:22 UTC
    Why not just use the "callback" parameter of request or simple_request, and grab the data as it comes back?

    Seems like you've reinvented a pretty big wheel. {grin}

    From perldoc LWP::UserAgent...

    The subroutine variant requires a reference to a callback routine as the second argument to request(), and it can also take an optional chunk size as the third argument. This variant can be used to construct "pipe-lined" processing, where processing of received chunks can begin before the complete data has arrived. The callback function is called with 3 arguments: the data received this time, a reference to the response object and a reference to the protocol object. The response object returned from request() will have empty content. If the request fails, then the callback routine is not called, and the response->content might not be empty.

    The request can be aborted by calling die() in the callback routine. The die message will be available as the "X-Died" special response header field.

    The library also allows you to use a subroutine reference as content in the request object. This subroutine should return the content (possibly in pieces) when called. It should return an empty string when there is no more content.

    -- Randal L. Schwartz, Perl hacker
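
    For readers who want to see roughly what the callback route looks like, here is a minimal sketch. It is not code from the thread: make_line_splitter and fetch_lines are hypothetical names. The callback receives arbitrary chunks, so the sketch buffers them and hands complete lines to a caller-supplied sub. LWP is loaded only inside fetch_lines, so the splitter can be tried without it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two closures. $feed takes a raw chunk (it is
# usable directly as an LWP content callback) and $flush emits whatever
# partial line remains once the response is complete.
sub make_line_splitter {
    my ($on_line) = @_;
    my $buffer = '';

    my $feed = sub {
        my ($chunk) = @_;    # LWP also passes the response and protocol objects
        $buffer .= $chunk;
        # Hand over each complete line; keep the unfinished tail buffered.
        while ($buffer =~ s/\A(.*?\n)//) {
            $on_line->($1);
        }
        return;
    };

    my $flush = sub {
        $on_line->($buffer) if length $buffer;
        $buffer = '';
        return;
    };

    return ($feed, $flush);
}

# Hypothetical wrapper: fetch $url and call $on_line once per line.
sub fetch_lines {
    my ($url, $on_line) = @_;
    require LWP::UserAgent;
    require HTTP::Request;

    my ($feed, $flush) = make_line_splitter($on_line);
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->request(HTTP::Request->new(GET => $url), $feed);
    $flush->();
    return $response;    # headers and status live here; content is empty on success
}
```

    Usage would be something like fetch_lines('http://www.crusoe.net/~jeffp/', sub { print $_[0] }). Because the request goes through LWP::UserAgent, redirects and proxies are handled by LWP rather than by hand.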

RE (tilly) 1: to post, or not to post...
by tilly (Archbishop) on Oct 12, 2000 at 06:37 UTC
    Well merlyn gave the best answer, but you can actually subclass off of LWP if you need to. What you can do is create a new protocol. The package-name needs to be of the form LWP::Protocol::foo, and it needs to inherit from LWP::Protocol. To get a sample, take a look at how LWP::Protocol::https is implemented, subclassing off of http.

    None of which you really need to know unless you are trying to figure out how LWP works, or are trying to write a new scheme.

    ObRandomThought: Having read the code, I am wondering why the heck they have the hack to pass news to nntp in the implementor method of LWP::Protocol. Why not just have a trivial LWP::Protocol::news module that just inherits from LWP::Protocol::nntp?

RE: to post, or not to post...
by Anonymous Monk on Oct 12, 2000 at 10:50 UTC
    Every comment here is so true, but one thing japhy did which I don't think anyone can disapprove of is writing it all on his own and learning from it.

    True, it's re-inventing the wheel, not-so-smart to put this effort on CPAN, could be done using LWP's callback-thingies, but at least he now knows how to write such things.

    Jouke.
RE: to post, or not to post...
by AgentM (Curate) on Oct 12, 2000 at 06:39 UTC
    Are you kidding? Coming up with a new module is cool, not embarrassing. I have the feeling that this may be very useful to many Perl people everywhere, not just the perlmonks, so be sure to submit this to CPAN. If your code does have bugs, you might get a lot of feedback by email. That's great! This promotes the evolution of your code and great improvements. If everyone kept their code under the roof just because he/she thought it had bugs, we wouldn't have CPAN, the wonderful MySQL dbengine, or Linux (not to mention perlmonks)! In short, you shouldn't be apologizing; you should be bragging about your cool new module that you submitted to CPAN! Have fun dude! Nifty mod! Congrats! Anyone that does "flame" has obviously never contributed anything to the world of GPL.
    AgentM Systems or Nasca Enterprises is not responsible for the comments made by AgentM- anywhere.
      But you shouldn't build on prior art until you understand why the prior art is there. I suspect (but am waiting for confirmation) that japhy was not aware of the callback parameter, which would have reduced his program to about half its size, and then it would have handled proxy servers and all protocols, not just HTTP.

      The problem with reinventing things is that you now have two or more incompatible mechanisms to accomplish a task, which dilutes the effort, and can confuse the marketplace (should I use X or Y interface, since both seem to be in the CPAN?).

      Witness News::NNTPClient, an interesting implementation that predated Net::NNTP, but is now abandoned. I wrote quite a few programs using the old interface, but have finally just recently succeeded in cutting them all over. And I did that because I know that Graham Barr will be around to update Net::*, but who knows about any efforts to pick up the other package.

      Similarly, a solution built on top of LWP will very likely continue to work and get bug fixes, because I know that LWP will continue to be maintained for some time to come. But if japhy submits his independent code to the CPAN, will he notice when Gisle updates LWP for a security fix, and re-implement that? I think not.

      So, the code doesn't do as much as LWP, won't be maintained the same way as LWP, and doesn't leverage off the existing code in any big way. Do you think I can recommend this little code for any production work? Hardly.

      Now, if japhy writes code that uses the LWP callback, the code size will be reduced, everything will play well with proxies and new underlying protocols, and the chances that the code will be maintained will be greatly increased. Everyone wins. I encourage japhy to rewrite the code to use the callback, and submit it. I would not encourage the current code to be submitted.

      -- Randal L. Schwartz, Perl hacker

        I will be voting this up. I honestly believe that this post makes one of the most fundamental points about good programming technique that I have seen merlyn make. If you read books on programming you will hear it echoed again and again in many different ways.

        The point is that the key to maintainability is to have each thing done in one place. You want to avoid having two parallel efforts. Not only is it duplicated effort, both originally and ongoing, but it is a significant maintenance problem.

        The easiest route to code-sharing in the short-term is called cut-and-paste. However if you stay aware of where your energy is going and work to get to code-sharing through having any particular task done once in your code-base, you will find that decision paying you back time and time again.

        For more on this I recommend virtually any classic, starting with The Pragmatic Programmer because it is not only classic, but also mentions Perl quite a few times. :-)

        There are many issues, but I think that if a few extra lines or a new method will actually in some way promote server efficiency (processing time), then they should be added without question.

        Efficient coding doesn't always mean less code. Consider the case where a fundamental method is for some reason missing from the object.

        I added a new method to Lady /TM specifically for updating a page counter. To make everything simpler, if the counter wasn't found during a pass through the index, it was added automatically. The previous method of updating a counter in a hash of counters was too slow, requiring an extra recursion through the index. Once to read it. Another to read it again and write the change.

        The new method specially developed for this task reads the counter table and holds the entire table, closes the file. Then quickly reopens it for a write access and prints the updated file.

        The method is much more efficient than would be an MS-Access ODBC implementation that would require that the large MS-Access driver is loaded. I "reinvented" Lady /TM because I had uses in mind that seemed simple. It seemed obvious that ODBC and MS-Access was over-kill for my immediate application which needed speed.

        When I sit next to my server and listen to the MS-Access driver load every now and then I am truly gratified for having "re-invented" Lady /TM.

        -Steeeeeve

RE: to post, or not to post...
by wardk (Deacon) on Oct 12, 2000 at 17:40 UTC

    japhy,

    I have experienced similar situations, needing to process/strip the contents from another web site. In my case it was getting info from an "internal" site to be displayed on an "external" site.

    In our case, I also wrote the site being called and reformatted, so I was able to put markers in the source output that I could easily strip with a straightforward regex. I didn't need to get it a line at a time, so I didn't need to re-write any LWP type functionality, so I got off easy.
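
    As an illustration of the marker approach (not wardk's actual code), the sketch below pulls out whatever sits between a pair of markers with a single regex. The marker strings, the sample page, and the extract_marked name are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between $start and $end markers,
# trimmed of surrounding whitespace, or undef if the pair is not found.
sub extract_marked {
    my ($page, $start, $end) = @_;

    # \Q...\E quotes the markers so characters like '<' and '-' are literal;
    # /s lets '.' span newlines, and '.*?' stops at the first end marker.
    my ($inner) = $page =~ /\Q$start\E(.*?)\Q$end\E/s;
    return undef unless defined $inner;

    $inner =~ s/\A\s+|\s+\z//g;
    return $inner;
}

# Invented sample page, as the producing site might emit it.
my $page = "<html><body>\n"
         . "<!-- BEGIN DATA -->\nACME Corp|42.50\n<!-- END DATA -->\n"
         . "</body></html>\n";

my $data = extract_marked($page, '<!-- BEGIN DATA -->', '<!-- END DATA -->');
my ($name, $amount) = split /\|/, $data;
print "$name owes $amount\n";
```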

    In another similar case, I used a behind-the-firewall server to produce delimited data from an internal financials database (thus absolutely no access from the 'outside' allowed). Again I used markers to isolate the customer data, then created a page from that. Luckily, it was a single row of data, so I didn't need to process by line (although I could have split on "yet another set of markers"). So while my situation differed a bit from yours, I think you did the right thing... which is you solved a problem in a way that you were able to using your own skills and insight.

    I don't think that you should be worried about either doing this or posting that you did this. I suspect your customer (even if it's just your own project, thus you) is looking for end results. And while I don't subscribe to "the end justifies the means", I do subscribe to TIMTOWTDI.

    Thanks for sharing your experience, and posting useful code.