Re:x2 Scraping HTML: orthodoxy and reality

in reply to Re: Scraping HTML: orthodoxy and reality
in thread Scraping HTML: orthodoxy and reality

Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.

There are 6 printers today, and we'll probably be adding another 4 or so in the future.

As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason is that I want to be able to call this from mod_perl, so every tenth of a second counts (people notice lag when a page loads or renders). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.

I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).
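For what it's worth, Plan B could look roughly like the sketch below. It assumes the core Storable module is acceptable for the cache; the names save_status and load_status and the idea of passing the cache file path around are made up for illustration and are not part of the script further down. The cron job would call save_status with the hash of hashes, and the page handler would call load_status and use the returned age to label the data.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Called from cron: save the status data together with the fetch time.
# $file is a hypothetical cache path that both sides agree on.
sub save_status {
    my ($file, $status) = @_;
    store( { fetched => time, status => $status }, $file );
}

# Called from the page handler: returns the cached status and its age
# in seconds, or an empty list if nothing has been cached yet.
sub load_status {
    my ($file) = @_;
    return unless -e $file;
    my $cached = retrieve($file);
    return ( $cached->{status}, time - $cached->{fetched} );
}
```

Note that store writes the file in place, so a reader could in principle catch it half-written; writing to a temporary file and renaming it would avoid that.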

I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.

update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes: C, M, Y and K are the colour cartridges, and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.

This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!

#! /usr/bin/perl -w

use strict;
use LWP::UserAgent;

my @cartridge = qw/ name part percent remain coverage low serial printed /;
my @kit       = qw/ name part percent remain /;

for my $host( @ARGV ) {
    my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=html&cat=0&pos=2};
    my $response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ));
    if( !$response->is_success ) {
        warn "$host: couldn't get $url: ", $response->status_line, "\n";
        next;
    }
    $_ = $response->content;

    my (@s) = grep { defined $_ } m{
        (?:
            >         # closing tag
            ([^<]+)   # text (name of part, e.g. q/BLACK CARTRIDGE/)
            <br>
            ([^<]+)   # part number (e.g. q/HP Part Number:     HP C9724A/)
            </font>\s+</td>\s*<td[^>]+><font[^>]+>
            (\d+)     # percent remaining
        )
        |
        (?:
            (?:
                (?:
                    Pages\sRemaining # different text values
                    | Low\sReached
                    | Serial\sNumber
                    | Pages\sprinted\swith\sthis\ssupply
                )
                    :
                    \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font[^>]*>\s* # separated by this
                |
                Based\son\shistorical\s\S+\spage\scoverage\sof\s # or just this, within a <td>
            )
            (\w+) # and the value we want
        )
    }gx;

    my %res;
    @{$res{K}}{@cartridge} = @s[ 0.. 7];
    @{$res{X}}{@kit}       = @s[ 8..11];
    @{$res{C}}{@cartridge} = @s[12..19];
    @{$res{F}}{@kit}       = @s[20..23];
    @{$res{M}}{@cartridge} = @s[24..31];
    @{$res{Y}}{@cartridge} = @s[32..39];

    print <<END_STATS;
$host
    Xfer $res{X}{percent}% $res{X}{remain}
    Fuse $res{F}{percent}% $res{F}{remain}
    C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} printed=$res{C}{printed}
    M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} printed=$res{M}{printed}
    Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} printed=$res{Y}{printed}
    K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} printed=$res{K}{printed}
END_STATS
}
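
To see why the grep for definedness matters, here is a cut-down sketch of the same idea. The sample string and the two-alternative pattern are invented for illustration; they are not taken from the printer page. In list context with /g, each match returns one slot per capture group in the whole pattern, and the groups belonging to the alternative that did not match come back as undef.

```perl
use strict;
use warnings;

# Invented sample data, standing in for the printer's HTML.
my $text = 'name=fuser; pages: 1200; name=black; pages: 300';

# Two alternatives with one capture group each, so every match
# yields two slots, one of which is undef.
my @raw = $text =~ m{ name=(\w+) | pages:\s*(\d+) }gx;

# Dropping the undefs leaves the values in document order.
my @values = grep { defined } @raw;

# The flat list can then be carved up with hash slices.
my %kit;
@kit{qw/ name pages /} = @values[ 0 .. 1 ];

print scalar(@raw), " raw slots, ", scalar(@values), " values\n";
print "$kit{name}: $kit{pages}\n";
```

The catch is that the slices rely on every field being present and in the expected order; if one match goes missing, everything after it shifts by one.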

_____________________________________________
Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.
