http://qs321.pair.com?node_id=118835

I've been looking at improving PMSI - Perl Monks Snippets Index, namely by fetching each individual snippet to acquire its date of creation, in order to create an index page of snippets per month or per year, as per stefan k's suggestion. So I whipped up a quick script using an LWP::UserAgent object with a callback that sends the received content to be parsed on the fly by an HTML::Parser object.

It became apparent pretty quickly that the information I needed was in the first returned chunk. For the remaining chunks there was nothing left to do. (Of course, a future enhancement could be to count the number of follow-ups, but that's another story). It seemed to me that this was pretty inefficient, and an unnecessary drag on the sorely overloaded Monastery server.

So I started pondering how I could interrupt the download once I had received the information I needed. I wasn't sure that it was possible, but at least I had the source to hack in a solution if need be. I had visions of plumbing the depths of socket wizardry with a kluge of a global variable to take down the connection, and... um...

After spelunking around for a few minutes (by tracing where my callback was being passed), I came across LWP::Protocol::http, which contains a sub named collect that does the deed of fetching the bytes (at least to as low a level as I cared about). There I found the following code (which I've roughly paraphrased):

if ($cb) {
    while ($content = &$collector, length $$content) {
        eval {
            &$cb($$content, $response, $self);
        };
        if ($@) {
            chomp($@);
            $response->header('X-Died' => $@);
            last;
        }
    }
}

There it was: all I had to do was die in my callback, and the connection would be cancelled. I hacked up the following code in about 10 minutes just to prove to myself that this was the case:

#! /usr/bin/perl -w
# bloat.cgi
use strict;

print <<HEAD;
Content-Type: text/html

<html><head><title>bloat.cgi -- a humungous web page</title></head><body bgcolor="#ffffff">
HEAD

print qq{<p class="foobar" align="right" name="$_">$_</p>\n} for( 1 .. 10000 );
print '</body></html>';

__END__

(Note: I made up that really long <p> tag to see whether it would be broken across chunk boundaries. If it is, HTML::Parser appears to hide that ugliness -- more power to it if it does.) And then I read that back with the following script (note how I die in the callback):

#! /usr/bin/perl -w
use strict;

use LWP::UserAgent;
use HTTP::Request;
use HTML::Parser;

my $chunk = 0;

my $p = HTML::Parser->new(
    start_h   => [ \&begin,   'tagname,attr' ],
    default_h => [ \&content, 'text' ],
    end_h     => [ \&end,     'tagname' ],
);

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://localhost/cgi-bin/bloat.cgi' );
my $res = $ua->request($req, \&cb);
$p->eof;

sub cb {
    my $received = shift;
    ++$chunk;
    $p->parse( $received );
}

sub begin {
    my $element = shift;
    my $r       = shift;
    print "received <$element";
    print qq{ $_="$r->{$_}"} foreach keys %$r;
    print "> at chunk $chunk\n";
}

sub content {
    my $content = shift;
    print "received [$content] at chunk $chunk\n";
    ###########################
    die if $content eq '123'; #
    ###########################
}

sub end {
    my $element = shift;
    print "received </$element> at chunk $chunk\n";
}

__END__

That seems to work pretty well. Checking the web server logs, I see the following lines appear:

127.0.0.1 - - [15/Oct/2001:10:40:14 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 527879 "-" "lwp-request/1.39"
127.0.0.1 - - [15/Oct/2001:10:40:18 +0200] "GET /cgi-bin/bloat.cgi HTTP/1.0" 200 116807 "-" "libwww-perl/5.53"
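(An aside, not part of the test above: since the collect loop quoted earlier stashes the die message in an X-Died response header, the calling code can also tell an aborted fetch from a complete one after the fact. A rough sketch, reusing the $ua, $req and \&cb from the script above:)

# Sketch only -- after the request returns, the X-Died header set by
# LWP's collect loop tells us whether the callback bailed out early.
my $res = $ua->request($req, \&cb);
if ( defined( my $died = $res->header('X-Died') ) ) {
    print "fetch cut short: $died\n";    # the message from our die()
}
else {
    print "fetch ran to completion\n";
}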

Quod erat demonstrandum. Don't let them take my Open Source away.

--
g r i n d e r

Re: LWP::UserAgent and HTML::Parser and the joys of Open Source
by merlyn (Sage) on Oct 15, 2001 at 18:18 UTC
    For a more robust solution, you only want to push an HTML parser onto a stream that has announced itself as MIME-type text/html. I did that for a client once, have talked about the code in a recent Usenet article, and hope to have it published soon. In there, I said:
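
    (Not code from that proxy, just a rough sketch of the check itself: the response object handed to an LWP content callback already carries the announced Content-Type, so the parser can be engaged only for text/html. The handler names here are made up for the example.)

    # Sketch only: feed chunks to HTML::Parser just for text/html responses.
    my $parser;    # created lazily, only when we actually see HTML

    sub cb {
        my ($chunk, $response) = @_;
        if ( $response->content_type eq 'text/html' ) {
            $parser ||= HTML::Parser->new( default_h => [ \&text_handler, 'text' ] );
            $parser->parse($chunk);
        }
        # anything else just streams through untouched
    }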
    I have (unpublished) a dynamic-pre-forking Apache-style web streaming proxy server in about 300 lines of pure Perl (using HTTP::Daemon and the other LWP items, of course). It takes the same parameters as Apache child management:
    ### configuration
    my $HOST = 'www.stonehenge.com';
    my $PORT = 42001;                  # 0 = pick next available user-port
    my $START_SERVERS = 4;             # start this many, and don't go below
    my $MAX_CLIENTS = 12;              # don't go above
    my $MAX_REQUESTS_PER_CHILD = 250;  # just in case there's a leak
    my $MIN_SPARE_SERVERS = 1;         # minimum idle (if 0, never start new)
    my $MAX_SPARE_SERVERS = 12;        # maximum idle (should be "single browser max")
    And acts accordingly, using a simple scoreboarding mechanism similar to the Apache method.
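
    (That proxy is unpublished; purely as an illustration of the HTTP::Daemon plumbing such a server sits on, a stripped-down, single-process sketch might look like the following. No pre-forking, scoreboarding, CONNECT handling or streaming rewrite here, and everything beyond the port number from the configuration above is my own assumption.)

    #!/usr/bin/perl -w
    # Bare-bones illustration only: accept proxy-style requests and replay
    # them against the origin server with LWP, one client at a time.
    use strict;
    use HTTP::Daemon;
    use LWP::UserAgent;

    my $d  = HTTP::Daemon->new( LocalPort => 42001 ) or die "can't listen: $!";
    my $ua = LWP::UserAgent->new;

    while ( my $c = $d->accept ) {
        while ( my $req = $c->get_request ) {
            # a browser configured to use this as its proxy sends absolute URIs,
            # so the request object can be handed to LWP::UserAgent as-is
            my $res = $ua->request($req);
            $c->send_response($res);
        }
        $c->close;
    }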

    Using this code, the apache-benchmark program shows that I'm only half as fast as Apache, and have one quarter the footprint!

    The best part is that in those 300 lines, I handle full SSL streaming (the CONNECT call), full content streaming (I was watching live-feed quicktime movies through the proxy), and if the content-type is text/html, an HTML parser in token mode is inserted, allowing real-time rewriting. For example, I could insert <font color=blue> tags around all <a href=> links, while not impeding the stream of the rest of the HTML... there'd just be a hiccup while the <a href=> was being noticed.
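
    (Again, not the proxy code itself -- just a sketch of that kind of pass-through rewriting with HTML::Parser, re-emitting every token as it arrives and decorating only the anchors. Everything here beyond the HTML::Parser API is invented for the example.)

    use HTML::Parser;

    # Echo every token straight back out; wrap <a ...> ... </a> in a font tag.
    my $rewriter = HTML::Parser->new(
        start_h   => [ sub {
                           my ($tagname, $text) = @_;
                           print '<font color=blue>' if $tagname eq 'a';
                           print $text;
                       }, 'tagname,text' ],
        end_h     => [ sub {
                           my ($tagname, $text) = @_;
                           print $text;
                           print '</font>' if $tagname eq 'a';
                       }, 'tagname,text' ],
        default_h => [ sub { print shift }, 'text' ],
    );

    # feed it chunks as they come off the wire, e.g. from an LWP callback:
    # $rewriter->parse($chunk);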

    The code was originally written as a work-for-hire for a client who had intended my work to become open source. But the client dot-bombed, so I'm still trying to get clarification on whether I can release the code under my own copyright. As soon as that clears up, expect a WebTechniques column or two on it. :)

    -- Randal L. Schwartz, Perl hacker