Although HTML is a structured language, your task is essentially to parse natural language. As you say, this is a very hard task.

Most natural language parsers I have seen use heuristics plus some manual touch-up, or they use a statistical approach. For instance, in classifying email as spam or ham, Mail::SpamAssassin uses a combination of regexps and parsing to catch common spam phrases or email structure, then uses a linear weighting (i.e., a perceptron) for classification. More recently, it has incorporated a Bayesian classifier to add a customized, adaptive component to the classification.
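To make the "rules plus linear weighting" idea concrete, here is a minimal sketch in Python (the logic ports directly to Perl). The rules, weights, and threshold are made up for illustration; they are not SpamAssassin's actual rules or scores.

```python
import re

# Hypothetical rules and weights, chosen only to illustrate the idea.
RULES = [
    (re.compile(r"\bfree money\b", re.I), 2.5),
    (re.compile(r"\bclick here\b", re.I), 1.5),
    (re.compile(r"^Subject:.*!{3,}", re.I | re.M), 1.0),
]
THRESHOLD = 3.0  # assumed cutoff; a real system would tune this


def spam_score(message):
    """Sum the weights of every rule whose pattern matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message):
    """Classify by comparing the linear score against the threshold."""
    return spam_score(message) >= THRESHOLD
```

Each rule is a cheap test, and the weights carry the knowledge about how much each test matters. That separation is what lets the weights be fitted from labeled mail instead of set by hand.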

I think you could use a similar approach here. The first thing to do is look at a bunch of these sites, and identify likely patterns to locate and extract titles. Program these in, most common to least common.
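A rough sketch of that ordered-pattern step, in Python for brevity. The three patterns and their order are assumptions for illustration; the real list would come from looking at your actual sites. Regexps on HTML are fragile, so a proper HTML parser would be a safer base for production use.

```python
import re

# Hypothetical title patterns, ordered from most to least common.
TITLE_PATTERNS = [
    re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
    re.compile(r'<[^>]+class="[^"]*\btitle\b[^"]*"[^>]*>(.*?)</', re.I | re.S),
    re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
]


def extract_title(html):
    """Return the first non-empty match, trying patterns in priority order."""
    for pattern in TITLE_PATTERNS:
        match = pattern.search(html)
        if match:
            # Drop nested tags and collapse whitespace in the captured text.
            text = re.sub(r"<[^>]+>", "", match.group(1))
            text = " ".join(text.split())
            if text:
                return text
    return None  # no pattern matched; fall through to another method
```

Returning None when nothing matches leaves room for a fallback method to take over.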

As a backup, train a Naive Bayes classifier on the context surrounding a title string, with classes title/no_title. After training, run your HTML through it and pick the title based on the most probable title context.
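Here is a small sketch of such a classifier, again in Python, assuming each candidate string has already been reduced to a list of context tokens. The feature names (tag=h1, pos=top, and so on) are hypothetical; which context features work best is something you would have to find by experiment.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.class_counts = Counter()             # label -> example count
        self.vocab = set()

    def train(self, tokens, label):
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_prob(self, tokens, label):
        # Requires at least one training example for this label.
        total_docs = sum(self.class_counts.values())
        logp = math.log(self.class_counts[label] / total_docs)
        denom = sum(self.token_counts[label].values()) + len(self.vocab)
        for token in tokens:
            logp += math.log((self.token_counts[label][token] + 1) / denom)
        return logp

    def title_log_odds(self, tokens):
        """Positive when the context looks more like a title than not."""
        return self.log_prob(tokens, "title") - self.log_prob(tokens, "no_title")


def pick_title(classifier, candidates):
    """candidates: list of (text, context_tokens); return the likeliest title."""
    return max(candidates, key=lambda c: classifier.title_log_odds(c[1]))[0]
```

Ranking candidates by log odds rather than taking a hard yes/no decision means you always get the single best guess per page, even when no context looks strongly title-like.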

Whether all this work is less than that of maintaining custom parsers for all your sites is something you will have to decide. For sorting spam, it is a clear win.

-Mark

In reply to Re: Extracting arbitrary data from HTML by kvale
in thread Extracting arbitrary data from HTML by vbfg
