Re: Scraping HTML: orthodoxy and reality

I am of course biased, but...

So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information in scattered in different elements and collecting and collating it made for some pretty ugly code.

From a logical standpoint, HTML::TableExtract seems to be a perfect choice for this. It might not be a perfect choice for efficiency, for which you seem to have a requirement (though I'm not sure what total run time you were shooting for...how often is this tool going to be run?)

Could you give an example HTML page and some numbers such as how many of them you are expected to handle and how often? For purposes of discussion, let's say your parallel fetch more or less delivers all of the pages simultaneously.

(Despite my bias, I am not automatically anti-regexp parsing and I see both sides of that particular scuffle)

Matt

Comment on Re: Scraping HTML: orthodoxy and reality

Replies are listed 'Best First'.
Re:x2 Scraping HTML: orthodoxy and reality by grinder (Bishop) on Jul 08, 2003 at 16:50 UTC
Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/. There are 6 printers today, and we'll probably be adding another 4 or so in the future. As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason being is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources. I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented). I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial. update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge. This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa! #! /usr/bin/perl -w use strict; use LWP::UserAgent; my @cartridge = qw/ name part percent remain coverage low serial print +ed /; my @kit = qw/ name part percent remain /; for my $host( @ARGV ) { my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=htm +l&cat=0&pos=2}; my $response = LWP::UserAgent->new->request( HTTP::Request->new( G +ET => $url )); if( !$response->is_success ) { warn "$host: couldn't get $url: ", $response->status_line, "\n +"; next; } $_ = $response->content; my (@s) = grep { defined $_ } m{ (?: > # closing tag ([^<]+) # text (name of part, e.g. q/BLACK CARTRIDGE/) <br> ([^<]+) # part number (e.g. q/HP Part Number: HP C97 +24A/) </font>\s+</td>\s<td[^>]+><font[^>]+> (\d+) # percent remaining ) \| (?: (?: (?: Pages\sRemaining # different text values \| Low\sReached \| Serial\sNumber \| Pages\sprinted\swith\sthis\ssupply ) : \s</font></p>\s</td>\s<td[^>]>\s<p[^>]><font +[^>]>\s* # separated by this \| Based\son\shistorical\s\S+\spage\scoverage\sof\s # or +just this, within a <td> ) (\w+) # and the value we want ) }gx; my %res; @{$res{K}}{@cartridge} = @s[ 0.. 7]; @{$res{X}}{@kit} = @s[ 8..11]; @{$res{C}}{@cartridge} = @s[12..19]; @{$res{F}}{@kit} = @s[20..23]; @{$res{M}}{@cartridge} = @s[24..31]; @{$res{Y}}{@cartridge} = @s[32..39]; print <<END_STATS; $host Xfer $res{X}{percent}% $res{X}{remain} Fuse $res{F}{percent}% $res{F}{remain} C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} +printed=$res{C}{printed} M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} +printed=$res{M}{printed} Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} +printed=$res{Y}{printed} K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} +printed=$res{K}{printed} END_STATS } [download] _____________________________________________ Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.	[reply] [d/l]
Re: Re:x2 Scraping HTML: orthodoxy and reality by mojotoad (Monsignor) on Jul 08, 2003 at 19:54 UTC
Here's a quick example, just to give you an idea. I apologize for the crufty code. This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract because of this reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain. HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware. I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree. Enjoy, Matt #!/usr/bin/perl -w use strict; my $depth = 0; my $count = 0; my $ddepth = 3; use LWP::Simple; my $html = get('http://grinder.perlmonk.org/hp4600/'); my %Device; use Data::Dumper; use HTML::TableExtract; my $te = HTML::TableExtract->new; $te->parse($html); foreach my $ts ($te->table_states) { &process_detail($ts) if ($ts->depth == $ddepth); &process_main($ts) if ($ts->depth == $depth && $ts->count == $coun +t); } # Clean up the empty spots @{$Device{stats}} = grep(defined, @{$Device{stats}}); print Dumper(\%Device); exit; sub process_main { my $ts = shift; my($host, $model) = _scrub(($ts->rows)[1]); $Device{host} = $host; $Device{model} = $model; } sub process_detail { $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_); } sub _proc_detail_name { my $ts = shift; my($name, $part, $pct) = _scrub(($ts->rows)[0]); $part =~ s/.:\s+//; $Device{stats}[$ts->count] = { name => $name, part => $part, pct => $pct }; } sub _proc_detail_stats { my $ts = shift; my @stats = map(_scrub($_), $ts->rows); my $i = $ts->count - 1; @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printe +d)} = (map(_scrub($_), $ts->rows))[1,2,4,6,8]; } sub _scrub { grep(!/^\s$/s, map(split(/(^M\|\n)+/,$_), @{shift()})); } [download]	[reply] [d/l]


Keep It Simple, Stupid
	PerlMonks