I'd like to implement a generic module for scraping HTML pages. I had
an idea that when extracting the important bits from a web page - the
result of a search engine query, for example, or just about any list
of links - one often extracts the same bits from multiple items. In
these cases, it should be possible to extract just the repeating bits
and get everything one is looking for, in a much simpler and more
regular structure - and the extraction algorithm is the same for all
input pages.
I've implemented HTML::ListScraper (named by analogy with Text::Scraper
as well as HTML::Parser, which my module extends), and it works as
designed, but not as well as
I'd like. HTML::ListScraper looks for repeating tag sequences rather
than just repeating subtrees, because the module should handle tag
soup, too. The implementation is reasonably obvious:
- Construct the tag sequence for the whole document.
- Scan it to find all tag pairs and where they occur.
- Throw out those that occur only once.
- Extend the remaining sequences by their adjacent tags.
- Repeat the previous 2 steps until there are no sequences to extend.
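
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the module's actual Perl code; the names TagSequencer, tag_sequence and repeated_sequences are made up for the example. It assumes that only tag names matter (attributes and text are dropped), and it makes no attempt to deal with overlapping occurrences of a sequence.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagSequencer(HTMLParser):
    """Collect start and end tags in document order; text is ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


def tag_sequence(html):
    """Step 1: construct the tag sequence for the whole document."""
    parser = TagSequencer()
    parser.feed(html)
    parser.close()
    return parser.tags


def repeated_sequences(tags):
    """Return {sequence: [start positions]} for the longest repeats found."""
    # Step 2: find every adjacent tag pair and where it occurs.
    found = defaultdict(list)
    for i in range(len(tags) - 1):
        found[tuple(tags[i:i + 2])].append(i)

    best = {}
    while True:
        # Step 3: throw out sequences that occur only once.
        found = {seq: pos for seq, pos in found.items() if len(pos) > 1}
        if not found:
            # Nothing left to extend: the previous round was the longest.
            return best
        best = found
        # Step 4: extend each surviving occurrence by the tag that follows it.
        longer = defaultdict(list)
        for seq, positions in found.items():
            for i in positions:
                end = i + len(seq)
                if end < len(tags):
                    longer[seq + (tags[end],)].append(i)
        found = longer
```

For a two-item list such as `<ul><li><a href='1'>a</a></li><li><a href='2'>b</a></li></ul>`, calling `repeated_sequences(tag_sequence(html))` yields the single sequence `<li> <a> </a> </li>`, starting at positions 1 and 5 of the tag sequence.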
But, to be recognized as repeats, all these sequences have to be
exactly the same. In practice, that often doesn't happen. Text content
can have different tags - a bolded word here, a paragraph there. Of
course one can ignore such "inline" tags, but are they the same for
all HTML::ListScraper users? Worse, some parts of the tag sequence can
be optional. Say I'm scraping Google results: most have the "Cached"
and "Similar pages" links, some don't. For a specific site, obviously
one can construct specific queries - but that's exactly what I wanted
to avoid... Could my module tell the calling application that "there's
a sequence there but these parts are optional"? How would it find such
an amorphous structure - and even if it did, wouldn't it be just too
complicated to use?
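
One possible answer to the "inline" tag question, at least, is to let the caller decide which tags to ignore. The sketch below is Python rather than the module's Perl, and everything in it (DEFAULT_INLINE, FilteredTagSequencer, the ignore parameter) is hypothetical, not part of HTML::ListScraper. It does nothing about optional parts of a sequence.

```python
from html.parser import HTMLParser

# Hypothetical default set of "inline" tags; callers can pass their own.
DEFAULT_INLINE = frozenset({"b", "i", "em", "strong", "span", "font", "br"})


class FilteredTagSequencer(HTMLParser):
    """Collect tags in document order, skipping those in the ignore set."""

    def __init__(self, ignore=DEFAULT_INLINE):
        super().__init__()
        self.ignore = ignore
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ignore:
            self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in self.ignore:
            self.tags.append("</%s>" % tag)


def tag_sequence(html, ignore=DEFAULT_INLINE):
    """Return the document's tag sequence without the ignored tags."""
    parser = FilteredTagSequencer(ignore)
    parser.feed(html)
    parser.close()
    return parser.tags
```

With the default set, an item containing a bolded word produces the same tag sequence as one without it, so the two can still be recognized as repeats; passing `ignore=frozenset()` restores the exact-match behaviour.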
So, I've decided to release HTML::ListScraper early and often and
solicit some feedback: Do you think it's practically usable as it
stands? Does it fail for you in interesting ways? Where would you take
it, if you had an urge to take it somewhere?