Extracting arbitrary data from HTML

vbfg has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This isn’t really a Perl question as such I suppose, more a question on how best to proceed.

My problem is this. For the last few months I’ve been using Google News to search for news stories about Rugby League and Rugby League clubs. The intention has always been to collect stories and publish them on a Rugby League site in RDF format but before I go live with it I want to take Google News out of the loop and harvest news stories from the sites directly.

Google News has been incredibly useful. I have a large body of documents that I can use to train AI::Categorizer into recognising valid Rugby League stories. I have a huge collection of sites that actively publish stories about Rugby League. Most important of all so far, at least as far as any future users of this are concerned, is that Google gives me consistent HTML to parse and understand irrespective of which site the story came from. The headline of the news story is the link to the story so I simply find the headline that will ultimately be the link in the RDF file that way.

Here’s the problem. How do I extract such information from the raw HTML of the source site? I cannot use the tag that contains the information since it may not be unique within the document. Use regexes you say? Well, I can’t do that really either. The reason is the HTML of one of the source sites will change at the least opportune moment which will require someone to sit and work out a new regex to extract the information. I don’t want that person to me. It could be absolutely anyone involved in running our Rugby League site. It’s a community effort and experience could range from IT project manager to interested school kid looking for experience or who just wants to help.

My plan thus far is as follows:

Using Perl/TK I’ve built a small app which downloads a page and displays the HTML in its tree structure, a bit like the DOM Inspector in Mozilla would. I can look at the page directly in a browser and copy the text of the headline into my application. The application then searches down through the tree from the root to find the relevant node.

Thus far that’s all this helper application does. What I want to do next is, using that node as a starting point, discover what it is about the parent, siblings and children in the local part of the tree that makes it unique within the document. When that is done I can analyse that particular chunk of tree and devise a rule to describe it, possibly just dumping it out in Newick format or something like. I then can apply this rule to extract data from all pages of the same type that come from that site. Since these news sites are typically generated from a database, or according to some other rule set, the HTML is at least consistent on a given site even if it is liable to change without warning.

That’s the key – rule creation and subsequent application has to be automatic.

I think I know what to do to analyse the tree and find unique portions of it. I can find the node with the headline we searched for and then look for other instances of it. If there are no other instances then that’s my rule right there. If there are then I can look at the parent and siblings and see if other instances of the same node have the same parents and siblings, and so on until I have a unique description.

The question is, is it worth the effort? How would you tackle this? It seems incredible to me that Google would do something similar for their news service; their sources must number in the many tens of thousands and to manage them all would probably require some significant effort. Whatever they use does make mistakes even though it mostly seems to get it right. One site in particular (Manchester Evening News) usually comes up with ‘Site created and maintained by…’ as the headline of the story. Looking at the text of the HTML it’s not falling back on the <TITLE> or anything like that, it is extracting that from some point deep within the HTML tree as though some rules are guiding it to that point.

Suggestions and hints on how to proceed are very welcome. Thanks for your time.

Comment on Extracting arbitrary data from HTML

Replies are listed 'Best First'.
Re: Extracting arbitrary data from HTML by kvale (Monsignor) on Apr 21, 2004 at 13:03 UTC
Although HTML is a structured language, your task is essentially to parse nautral language. As you say, this is a very hard task. Most natural language parsers I have seen use heurisitcs plus some manual touch-up, or they use a statistical approach. For instance, in classifying email as spam or ham, Mail::SpamAssassin uses a combination of regexps and parsing to catch common spam phrases or email structure and uses a linear weighting (i.e., a perceptron) for classification. More recently, it has incorporated a a Bayesian classifier to create a customized, adaptive component to the classification. I think you could use a similar approach here. The first thing to do is look at a bunch of these sites, and identify likely patterns to locate and extract titles. Program these in, most common to least common. As a backup, train a Naive Bayes calssifier on the context surrounding a title string with classes of tile/no_title. After training it up, run your HTML through it and pick the title based on the most probable title context. Whether all this work is less than that of maintaining a custom parsing of all your sites is something you will have to decide. For sorting spam, it is a clear win. -Mark	[reply]
Re: Extracting arbitrary data from HTML by TilRMan (Friar) on Apr 22, 2004 at 05:27 UTC
I don't know Newick, but you could have the user app spit out a Perl source file with a subroutine in it. That subroutine, given the story page content, would return the title (or undef if it can't find it.) Then your scraper would pull in all of the files (with `require`), call each subroutine, and use the first (or best) result. For actually finding the title, I suggest going a step farther on the assumption that "sites are typically generated from a database." Instead of looking at the HTML structure for a pattern, use the raw text. Once the user selects the node that contains the title, capture some number of characters of context before and after the title. As with the HTML pattern you described, you might refine how much context you keep based upon it uniqueness. Once you have the context, you can spit out the new subroutine as little more than a regexp match. Or go one step further and use String::Approx to do approximate matching. To reduce false positives, you could look for a signature on a web site that identifies it as coming from a particular sourt. For instance, a copyright notice won't often change. When you create the rule, you also include a check for the copyright, and return undef immediate if it's not there. Another way to improve accuracy is a feedback loop with the users. Give each subroutine a weight. If more than one subroutine gives you back an answer, use the one with the highest weight. However, also include links on the jump page (I assume you have a web version of the RDF feed) like "Should this have been titled 'Such-and-such'?" When clicked, it increases the weight of the subroutine that gave the right answer and decreases the wrong one. (But beware malicious users.) One more: Have each subroutine return a confidence value (perhaps the Levenshtein edit distance of the context, inverted). Then use the one with the highest confidence.	[reply] [d/l]