Re: Scraping HTML: orthodoxy and reality
by BrowserUk (Patriarch) on Jul 08, 2003 at 11:13 UTC
Amen grinder++. I wholeheartedly agree.
I came to much the same conclusion in Being a heretic and going against the party line, after having been using Perl for only a relatively short time. My experiences since have done little to change my mind.
Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.
The dictionary definition of parse is
- To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
- To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
- To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
- To make sense of; comprehend: I simply couldn't parse what you just said.
Whilst the dictionary definition of extract is:
- To draw or pull out, often with great force or effort: extract a wisdom tooth; used tweezers to extract the splinter.
- To obtain despite resistance: extract a promise.
- To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
- To remove for separate consideration or publication; excerpt.
- To derive or obtain (information, for example) from a source.
- To deduce (a principle or doctrine); construe (a meaning).
- To derive (pleasure or comfort) from an experience.
- Mathematics. To determine or calculate (the root of a number).
From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
My final, and strongest, argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages just to extract the bits I needed.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
Ironically the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.
Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text is going to give much more robust solutions. For instance you avoid different kinds of data being mistaken for each other.
The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize it). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be, rather than with regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.
However the two kinds of problems are different and do overlap. Where they do overlap, it isn't necessarily obvious which is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...
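A tiny example of the overlap going wrong: a pattern-matching pass happily captures data that a structure-aware pass would know is inert. The <price> element and its values are invented here purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The second <price> lives inside an HTML comment, so structurally it
# is not data at all -- but a naive pattern can't tell.
my $html = '<price>42</price> <!-- <price>99</price> -->';

# Pure pattern matching: grab every <price> in sight.
my @naive = $html =~ m{<price>(\d+)</price>}g;

# A crude nod to structure: remove comments first, then match.
(my $stripped = $html) =~ s{<!--.*?-->}{}gs;
my @aware = $stripped =~ m{<price>(\d+)</price>}g;

print "naive: @naive\n";   # naive: 42 99
print "aware: @aware\n";   # aware: 42
```

A real parser generalizes that comment-stripping step to every structural rule at once, which is exactly the robustness described above.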
| [reply] |
Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.
Parsing typically has two phases though: the first is tokenization and the second parse tree generation (I'm sure there is a better term but I forget what it is). These phases more often than not occur in sync, but they need not. Either way, regexes are perfectly suited to tokenization.
I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser tho. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally the whole idea was to be able to use regexes to specify and generate random test data.)
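The tokenization half of that split really is a natural fit for regexes: a single \G-anchored alternation peels off one token per iteration. This is a minimal sketch over an invented toy arithmetic grammar, not demerphq's actual regex tokenizer:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tokenize a tiny arithmetic language with one anchored regex.
sub tokenize {
    my $src = shift;
    my @tokens;
    while ($src =~ m{ \G \s* (?:
                          (\d+)         # NUM
                        | ([a-z]\w*)    # IDENT
                        | ([-+*/()])    # OP
                      ) }gcx) {
        push @tokens,
              defined $1 ? [ NUM   => $1 ]
            : defined $2 ? [ IDENT => $2 ]
            :              [ OP    => $3 ];
    }
    return @tokens;
}

my @t = tokenize('foo + 42 * (bar - 1)');
print join(' ', map { "$_->[0]:$_->[1]" } @t), "\n";
# IDENT:foo OP:+ NUM:42 OP:* OP:( IDENT:bar OP:- NUM:1 OP:)
```

The /c modifier keeps pos() where the last match left it, so an unrecognized character simply ends the loop rather than restarting the scan.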
---
demerphq
<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by PodMaster (Abbot) on Jul 08, 2003 at 08:08 UTC
After seeing
HP200LX:: on cpan,
I suggest you stick it in HP::4600::Status(Scrape)? (or something like HP::Printer::4600, or whatever somewhat corresponds to the HP naming convention ;) and
suggest to the author of HP200LX:: to rename his HP::200:: yada yada.
As for your notes on HTML scraping reality, check out YAPE::HTML; it's regex based.
MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | ** The third rule of perl club is a statement of fact: pod is sexy. |
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by gjb (Vicar) on Jul 08, 2003 at 13:37 UTC
Although I agree that choosing between a regexp approach and a context-free grammar approach depends on the problem at hand, I'd like to stress that halley made a very important point:
Rules are meant to be broken, but you have to understand them before you can break them... safely.
Although a lot of Monks will know the distinction between a regular language and a context-free language (and I'm sure grinder and BrowserUK do), I'm rather sure that some don't. In the latter case, unfortunately, those Monks simply don't know the rules and have lots of opportunity to mess up.
I'd like to paraphrase: "a little thinking is a dangerous thing" if the process is not supported by a proper amount of background knowledge.
It is possible to approximate a context-free grammar with a regular expression; a nice survey article about that has been written by Mark-Jan Nederhof. There are several good books about formal languages, but I'd particularly recommend Sipser's since it is well written and nice to read.
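The flavour of such approximations is easy to show in Perl: iterate a parenthesis pattern out to a fixed depth and you get a regular expression that accepts nesting up to that depth and nothing deeper. This is only a sketch of the general trick, not Nederhof's actual construction:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a regex matching balanced parens nested at most $d levels deep.
sub depth_re {
    my $d  = shift;
    my $re = qr/[^()]*/;                              # depth 0: no parens
    $re = qr/[^()]*(?:\($re\)[^()]*)*/ for 1 .. $d;   # wrap once per level
    return qr/^$re$/;
}

my $depth2 = depth_re(2);

print "(a(b)c)"    =~ $depth2 ? "ok\n" : "no\n";   # nested 2 deep: ok
print "(a(b(c))d)" =~ $depth2 ? "ok\n" : "no\n";   # nested 3 deep: no
```

A true regular language cannot count unbounded nesting, so every extra level costs another wrap of the pattern; that is the trade-off between the two language classes in miniature.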
Conclusion: even if you know the rules, but don't understand them, don't try and break them. More importantly: try and understand the rules you're following.
Just my 2 cents, -gjb-
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by halley (Prior) on Jul 08, 2003 at 13:03 UTC
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by mojotoad (Monsignor) on Jul 08, 2003 at 15:22 UTC
I am of course biased, but...
So I brought HTML::TreeBuilder to bear on the task. It wasn't quite as easy. It was no simple matter to find a reliable part in the tree from whence to direct my search. The HTML contains deeply nested tables, with a high degree of repetition for each kit and cartridge. The various pieces of information are scattered in different elements, and collecting and collating them made for some pretty ugly code.
From a logical standpoint, HTML::TableExtract seems to be a perfect choice for this. It might not be a perfect choice for efficiency, for which you seem to have a requirement (though I'm not sure what total run time you were shooting for...how often is this tool going to be run?)
Could you give an example HTML page and some numbers such as how many of them you are expected to handle and how often? For purposes of discussion, let's say your parallel fetch more or less delivers all of the pages simultaneously.
(Despite my bias, I am not automatically anti-regexp parsing and I see both sides of that particular scuffle)
Matt | [reply] |
Well, having never used it, I'd be very interested in seeing how you'd do this with HTML::TableExtract. Here's an example page: http://grinder.perlmonk.org/hp4600/.
There are 6 printers today, and we'll probably be adding another 4 or so in the future.
As a general rule I really don't care about performance, but this is a rare case where I have to do something about it. The reason is that I want to be able to call this from mod_perl, so every tenth of a second is vital (in terms of human perception noticing lag in loading/rendering a page). It's for a small population of users (5 or so), and mod_perl is reverse proxied through lightweight Apache processes, so I'm not worried about machine resources.
I can't do anything about the time the printer takes to respond, but I do need the extraction to be as fast as possible to make up lost ground. There is always Plan B, which would be to cache the results via cron once or twice an hour; it's not as if the users drain one cartridge per day. I already do this for other status pages where the information is very expensive to calculate. People know the data aren't always fresh up to the minute but they can deal with that (especially since I always label the age of the information being presented).
I'll be very interested in seeing what you come up with. And if someone wants to show what a sub-classed HTML::Parser solution looks like, I think we'd have a really good real-life tutorial.
update: here's the proof-of-concept code as it stands today, as a yardstick to go by. The end result is a hash of hashes, C M Y and K are the colour cartridges and X and F are the transfer and fuser kits, respectively. These will mutate into something like HP::4600::Kit and HP::4600::Cartridge.
This code implements jeffa's observation of grepping the array for definedness, which indeed simplifies the problem considerably. Thanks jeffa!
#! /usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

my @cartridge = qw/ name part percent remain coverage low serial printed /;
my @kit       = qw/ name part percent remain /;

for my $host( @ARGV ) {
    my $url = qq{http://$host/hp/device/this.LCDispatcher?dispatch=html&cat=0&pos=2};
    my $response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ));
    if( !$response->is_success ) {
        warn "$host: couldn't get $url: ", $response->status_line, "\n";
        next;
    }
    $_ = $response->content;
    my (@s) = grep { defined $_ } m{
        (?:
            >           # closing tag
            ([^<]+)     # text (name of part, e.g. q/BLACK CARTRIDGE/)
            <br>
            ([^<]+)     # part number (e.g. q/HP Part Number: HP C9724A/)
            </font>\s+</td>\s*<td[^>]+><font[^>]+>
            (\d+)       # percent remaining
        )
        |
        (?:
            (?:
                (?:
                    Pages\sRemaining    # different text values
                  | Low\sReached
                  | Serial\sNumber
                  | Pages\sprinted\swith\sthis\ssupply
                )
                :
                \s*</font></p>\s*</td>\s*<td[^>]*>\s*<p[^>]*><font[^>]*>\s*   # separated by this
              |
                Based\son\shistorical\s\S+\spage\scoverage\sof\s   # or just this, within a <td>
            )
            (\w+)       # and the value we want
        )
    }gx;

    my %res;
    @{$res{K}}{@cartridge} = @s[ 0.. 7];
    @{$res{X}}{@kit}       = @s[ 8..11];
    @{$res{C}}{@cartridge} = @s[12..19];
    @{$res{F}}{@kit}       = @s[20..23];
    @{$res{M}}{@cartridge} = @s[24..31];
    @{$res{Y}}{@cartridge} = @s[32..39];

    print <<END_STATS;
$host
Xfer $res{X}{percent}% $res{X}{remain}
Fuse $res{F}{percent}% $res{F}{remain}
C $res{C}{percent}% cover=$res{C}{coverage}% left=$res{C}{remain} printed=$res{C}{printed}
M $res{M}{percent}% cover=$res{M}{coverage}% left=$res{M}{remain} printed=$res{M}{printed}
Y $res{Y}{percent}% cover=$res{Y}{coverage}% left=$res{Y}{remain} printed=$res{Y}{printed}
K $res{K}{percent}% cover=$res{K}{coverage}% left=$res{K}{remain} printed=$res{K}{printed}
END_STATS
}
_____________________________________________ Come to YAPC::Europe 2003 in Paris, 23-25 July 2003.
| [reply] [d/l] |
Here's a quick example, just to give you an idea. I apologize for the crufty code.
This solution is still vulnerable to layout changes from the printer manufacturer. I really don't like having to use depth and count with HTML::TableExtract for this very reason -- if the HTML tables had some nice, labeled columns it would be another story entirely. With that in mind you may well be better off with your solution in the long run, though I daresay the regexp solution might be more difficult to maintain.
HTML::TableExtract is a subclass of HTML::Parser, in case you were unaware.
I'm pretty sure HTML::Parser slows things down compared to your solution, but I'm curious to what degree.
Enjoy,
Matt
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use Data::Dumper;
use HTML::TableExtract;

my $depth  = 0;
my $count  = 0;
my $ddepth = 3;

my $html = get('http://grinder.perlmonk.org/hp4600/');
my %Device;

my $te = HTML::TableExtract->new;
$te->parse($html);
foreach my $ts ($te->table_states) {
    process_detail($ts) if $ts->depth == $ddepth;
    process_main($ts)   if $ts->depth == $depth && $ts->count == $count;
}

# Clean up the empty spots
@{$Device{stats}} = grep(defined, @{$Device{stats}});
print Dumper(\%Device);
exit;

sub process_main {
    my $ts = shift;
    my($host, $model) = _scrub(($ts->rows)[1]);
    $Device{host}  = $host;
    $Device{model} = $model;
}

sub process_detail {
    $_[0]->count % 2 ? _proc_detail_stats(@_) : _proc_detail_name(@_);
}

sub _proc_detail_name {
    my $ts = shift;
    my($name, $part, $pct) = _scrub(($ts->rows)[0]);
    $part =~ s/.*:\s+//;
    $Device{stats}[$ts->count] =
        { name => $name, part => $part, pct => $pct };
}

sub _proc_detail_stats {
    my $ts = shift;
    my $i  = $ts->count - 1;
    @{$Device{stats}[$i]}{qw(pages_left hist low serial_num pages_printed)}
        = (map(_scrub($_), $ts->rows))[1,2,4,6,8];
}

sub _scrub {
    grep(!/^\s*$/s, map(split(/(\r|\n)+/, $_), @{shift()}));
}
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by hsmyers (Canon) on Jul 08, 2003 at 13:12 UTC
In almost any situation, there are those who prefer maxims to thinking---ignore them. With specific regard to the odd web scrape, there are a couple of things that can be a problem. - The need to handle nesting.
- The need to survive arbitrary changes in the source.
The first can be handled by tight expression bounding or by using something like Text::DelimMatch or Text::Balanced. You manage the second by vigilance---patch it when needed.
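For the nesting case, Text::Balanced (a core module since Perl 5.8) is worth a concrete look: its extract_tagged walks past nested occurrences of the same tag, which is precisely where a simple regex gives up. The <div> fragment here is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $html = '<div>outer <div>inner</div> text</div><p>after</p>';

# extract_tagged returns the balanced match and the remainder.
my ($match, $remainder) = extract_tagged($html, '<div>', '</div>');

print "match:     $match\n";
print "remainder: $remainder\n";
```

A naive m{<div>(.*?)</div>} against the same string would stop at the first </div>, splitting the outer element in half.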
As for those who knee-jerk instead of thinking: since many of them have no experience writing parsers (formal or ad hoc), they fail to see that the regex approach is a form of parsing, just without the overhead of dealing with things that don't matter.
--hsm
"Never try to teach a pig to sing...it wastes your time and it annoys the pig." | [reply] |
(jeffa) Re: Scraping HTML: orthodoxy and reality
by jeffa (Bishop) on Jul 08, 2003 at 14:27 UTC
my (@s) = grep $_, m{
big regex here
}gx;
Sure, it would be better not to have undefs in there in
the first place, and you said you know where they will be,
but why worry when grep is at your disposal. Just a
thought, and thanks for the post. :)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
| [reply] [d/l] |
Re: Scraping HTML: orthodoxy and reality
by chunlou (Curate) on Jul 08, 2003 at 19:17 UTC
"Parse" vs "extract" or "regular language" vs "context free," etc. are indeed important distinctions to be made, as pointed out by some monks. Parsing data is a (more or less) mechanical process; extracting info is a human (A.I.) process.
Suppose you want to extract info by paragraph. Consider the following text fragment:
________________________________________
Look at the table below...
Could you behold the secret this unfolds?
A bit more, a bit more, irrelevant thought, a new paragraph...
________________________________________
You might see either two or three paragraphs (if you consider "Look... unfolds?" as one paragraph). Now, let's look at the html of the above text fragment:
<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>
A bit more, a bit more, irrelevant thought, a new paragraph...</p>
A parser might only see one paragraph between the <p> and </p> tags. There is a <p></p> pair in the table. Is it a paragraph? A parser might ask.
Suppose the parser takes into consideration that some people use <br><br> to denote the end of a paragraph. "Look..." and "Could..." might be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs?
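To make the ambiguity concrete, here is one syntactic heuristic in Perl -- drop tables, then treat <p> boundaries and <br><br> runs as breaks -- applied to the fragment above. It is only one defensible reading, which is rather the point:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = <<'HTML';
<p>Look at the table below...
<table border="1"><tr><td><p>ho ho ho...</p></td></tr></table><br><br>
Could you behold the secret this unfolds?<br><br>
A bit more, a bit more, irrelevant thought, a new paragraph...</p>
HTML

$html =~ s{<table.*?</table>}{}gis;      # a table is not a paragraph (says who?)
my @paras = grep { /\S/ }
            split m{</?p[^>]*>|(?:<br\s*/?>\s*){2,}}i, $html;
s/^\s+|\s+$//g for @paras;

print scalar @paras, " paragraphs\n";    # this heuristic says 3
```

Change either rule -- count the table's <p></p>, or stop honouring <br><br> -- and you get a different paragraph count from the same markup, which is exactly the semantic gap being described.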
Humans can read semantically; machines mostly syntactically. That's why extracting info is not the same problem as parsing data. | [reply] [d/l] [select] |
I applaud your attempt to clarify, though I don't necessarily agree with your example. For me, the difference between parsing something and extracting something is that when I parse something, I am interested in the whole something. When I extract something, I am only interested in a small part of the whole.
That's basically why using an HTML parser when all I want is an extraction, is so wasteful. I go to all the trouble of analysing (sometimes validating) and capturing the structure of the HTML only to throw all that effort away as soon as I have captured the bit I want.
It's a bit like carefully dismantling, labelling and boxing an entire stately home, brick by brick, when the only part of value is the Adam fireplace in the drawing room. If the rest of the place is simply going to be discarded, there really isn't any point in doing the extra work unless some conservation or restoration is intended. In code terms, that means I am going to modify, reconstruct or otherwise utilise the structure I've spent the effort capturing. In a large majority of screen-scraping tasks, the structure is simply discarded.
The argument for using a parser is that semantic knowledge gained from the structure is useful when used to locate the bits of data required, and that using the structural clues is more reliable (less brittle) when the page is updated than a dumb match by regex. The problem I have found is that when the page changes, the structure is just as likely to change in ways that require the script to be re-worked as is a match-by-context regex.
For every example one way, there is a counter example the other, and I would never criticise anyone who chooses to use a parser for this purpose. I just wish that they would give my (and others') informed and cognisant choice to do otherwise the same respect.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
What you said was logically sound. Apparently, "parse" and "extract" mean something very specific to you, whereas I merely used them in a loosey-goosey way.
But I do make distinction between "data" and "info." This is data: "1I2N! 2r2U 1E! 1R 1f"; this is info: "FIRE! RUN!".
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by ff (Hermit) on Jul 09, 2003 at 03:18 UTC
Gosh, I wish I even knew the difference between those HTML:: modules and how to put them to work! Given the examples in this thread, soon I'll have more than a hammer to do my scraping. :-)
In the meantime: when I go to the web page from the link via my IE browser and do a Ctl-A and Ctl-C and then paste the text into a Notepad screen, this particular output is quite comprehensible to my HTML-untrained eye (vs the HTML stuff), e.g.
impse400 (I3C) / 172.17.8.182
hp color LaserJet 4600
Information
<snip much miscellaneous info>
For highest print quality always use genuine Hewlett-Packard supplies.
BLACK CARTRIDGE
HP Part Number: HP C9720A 73%
Estimated Pages Remaining:
11025
(Based on historical black page coverage of 2%)
Low Reached:
NO
Serial Number:
35860
Pages printed with this supply:
4078
TRANSFER KIT
HP Part Number: HP C9724A 87%
Estimated Pages Remaining:
103856
Etc.
With my regex sledgehammer it would be straightforward to process this data. Oftentimes, when I look at the "pure text" version of a web page there aren't nearly as many nice hooks for sorting things out. But this is THIS case, and my question is: might there be a tool which emulates this action of select/copy/paste of a web page to automate the production of such text for follow-on regex processing? | [reply] [d/l] |
You could probably automate the C&P from your favorite browser (under Win32 at least) or use one of the console browsers (Lynx etc.) under *nix.
The question is, what would you have achieved? Not only would you have used a parser (the one built into the browser), but you would have also used its rendering engine, spawned a new process and gone through some form of IPC, whether it's a pipe or the clipboard. And you would still need to apply a regex to the result.
If you're going to use a parser, then you might as well use one of the many available to you via CPAN and avoid all that additional overhead. :)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] |
For sure, this approach is expensive cpu-wise, etc., but if I need a solution that works right away then "module fetches/renders HTML into text", combined with regex processing that at least I know how to do, IS a solution. Sure, per RBFuller, "... if the solution is not beautiful, I know it is wrong" but if those cycles won't be used for anything else, who cares? This bear of little brain would have his program done.
So, assuming that efficiency doesn't matter, I'm still fishing for something like building the $html object via a LWP 'get' as above and then turning it into text that I can examine with regexen. (However, since this is turning a golden object into lead, I'll do some more digging as you suggest, like re-reading this thread's Data::Dumper/HTML::TableExtract example! :-)
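One CPAN route to that "rendered text" without driving a browser is HTML::TreeBuilder to parse plus HTML::FormatText to render, then regexen over the result. Both are CPAN modules, not core; the tag soup below is an invented stand-in for the printer page, and the regex is illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN dist: HTML-Tree
use HTML::FormatText;    # CPAN dist: HTML-Format

# Invented stand-in for the printer's status page.
my $html = <<'HTML';
<table><tr>
  <td><font>BLACK CARTRIDGE<br>HP Part Number: HP C9720A</font></td>
  <td><font>73</font></td>
</tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 78)
                           ->format($tree);
$tree->delete;

# The rendered text resembles the Ctrl-A/Ctrl-C paste, so a plain
# regex can pick out the interesting bits.
my ($part, $pct) = $text =~ /HP Part Number:\s+(\S+\s+\S+).*?(\d+)/s;
print "$part at $pct%\n";
```

For the real page you would feed new_from_content the result of an LWP get, exactly as in the grep/Dumper example elsewhere in this thread.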
| [reply] |
Well, /s?he/ had the right idea. One wonders why HP can't just produce simpler HTML, or even provide a port for text output... ok, may be that's a little silly, but all of this to get a few numbers.
<grumble>Isn't the real problem here the obsession that some designer at HP has with producing beautiful output, good HTML practice be damned? Why do Dreamweaver jockeys have to make my life hard!!! AHHHHH!</grumble>
10POKE53280,A:POKE53281,A
20?"C64 RULES ";
30A=A+1:IFA=16THENA=0:GOTO10
| [reply] |
Re: Scraping HTML: orthodoxy and reality
by John M. Dlugosz (Monsignor) on Jul 10, 2003 at 18:21 UTC
OK, you convinced me to use regex instead of a parser for my program. This avoids the problem of re-formatting the parse tree to resemble the original input (I can modify the found lines in-place easily), and I can live with "parser" limitations and simply not write goofy stuff in my HTML.
—John | [reply] |