Hi,

I suppose this isn’t really a Perl question as such; it’s more a question of how best to proceed.

My problem is this. For the last few months I’ve been using Google News to search for news stories about Rugby League and Rugby League clubs. The intention has always been to collect stories and publish them on a Rugby League site in RDF format, but before I go live with it I want to take Google News out of the loop and harvest news stories from the sites directly.

Google News has been incredibly useful. I have a large body of documents that I can use to train AI::Categorizer to recognise valid Rugby League stories. I have a huge collection of sites that actively publish stories about Rugby League. Most important of all, at least as far as any future users of this are concerned, Google gives me consistent HTML to parse and understand irrespective of which site the story came from. In Google’s markup the headline of a story is the text of the link to that story, so I find the headline, which ultimately becomes the link in the RDF file, simply by finding that link.

Here’s the problem. How do I extract such information from the raw HTML of the source site? I cannot rely on the tag that contains the information, since it may not be unique within the document. Use regexes, you say? Well, I can’t really do that either. The HTML of one of the source sites will change at the least opportune moment, and someone will then have to sit down and work out a new regex to extract the information. I don’t want that person to be me. It could be absolutely anyone involved in running our Rugby League site. It’s a community effort, and experience could range from IT project manager to an interested school kid who is looking for experience or just wants to help.

My plan thus far is as follows:

Using Perl/Tk I’ve built a small app which downloads a page and displays the HTML as a tree, a bit like the DOM Inspector in Mozilla does. I can look at the page directly in a browser and copy the text of the headline into my application. The application then searches down through the tree from the root to find the node that contains that text.
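To make that search step concrete, here is a minimal sketch of it. It is not the Perl/Tk app itself: it is written in Python with only the standard library so that it runs without extra modules, and the `Node` and `TreeBuilder` classes, the `find_deepest` helper and the sample page are all invented for illustration. In Perl, HTML::TreeBuilder would play the role of the tree builder.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.content = []  # strings and child Nodes, in document order

    def kids(self):
        return [c for c in self.content if isinstance(c, Node)]

    def text(self):
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.content)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; tolerant of stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.content.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.content.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def normalise(s):
    """Collapse runs of whitespace so pasted text matches the page."""
    return " ".join(s.split())


def find_deepest(node, wanted):
    """Return the deepest element whose text contains `wanted`, else None."""
    if wanted not in normalise(node.text()):
        return None
    for child in node.kids():
        hit = find_deepest(child, wanted)
        if hit is not None:
            return hit
    return node


# Hypothetical page, standing in for a downloaded story.
PAGE = """<html><head><title>Evening News</title></head><body>
<div class="nav"><a href="/">Home</a></div>
<div class="story"><h2><a href="/s/1">Saints edge out   Wigan</a></h2>
<p>Match report text.</p></div>
</body></html>"""

node = find_deepest(parse(PAGE), normalise("Saints edge out Wigan"))
print(node.tag, node.attrs, "inside", node.parent.tag)
```

Descending to the deepest matching element matters because every ancestor of the headline also contains its text; stopping at the first match from the root would always return the `html` element.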

Thus far that’s all this helper application does. What I want to do next is, using that node as a starting point, discover what it is about the parent, siblings and children in the local part of the tree that makes it unique within the document. When that is done I can analyse that particular chunk of tree and devise a rule to describe it, possibly just dumping it out in Newick format or something like it. I can then apply this rule to extract data from all pages of the same type that come from that site. Since these news sites are typically generated from a database, or according to some other rule set, the HTML is at least consistent on a given site even if it is liable to change without warning.

That’s the key: rule creation and subsequent application have to be automatic.

I think I know what to do to analyse the tree and find unique portions of it. I can find the node with the headline we searched for and then look for other instances of the same kind of node. If there are no other instances, then that’s my rule right there. If there are, then I can look at the parent and siblings and see whether the other instances have the same parents and siblings, and so on until I have a unique description.

The question is, is it worth the effort? How would you tackle this? It seems incredible to me that Google would do something similar for their news service; their sources must number in the many tens of thousands, and managing them all would probably require significant effort. Whatever they use does make mistakes, even though it mostly seems to get it right. One site in particular (Manchester Evening News) usually comes up with ‘Site created and maintained by…’ as the headline of the story. Looking at the HTML source, it’s not falling back on the <TITLE> or anything like that; it is extracting that text from some point deep within the HTML tree, as though some rules were guiding it to that point.

Suggestions and hints on how to proceed are very welcome. Thanks for your time.

In reply to Extracting arbitrary data from HTML by vbfg
