PerlMonks  


I don't know Newick, but you could have the user app spit out a Perl source file with a subroutine in it. That subroutine, given the story page content, would return the title (or undef if it can't find one). Then your scraper would pull in all of the files (with require), call each subroutine, and use the first (or best) result.
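
A minimal sketch of that loading step, under some assumptions of mine: each generated file ends with an anonymous sub, the files live in one directory, and the names load_rules and extract_title are made up for illustration. It uses do FILE rather than require only because do hands back the file's last value on every call; require would also work for a one-time load.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical rule file, as the user app might generate it. The file's
# last expression is an anonymous sub, so loading it yields a code ref.
my $dir       = tempdir(CLEANUP => 1);
my $rule_path = "$dir/example_rule.pl";
open my $fh, '>', $rule_path or die "Cannot write $rule_path: $!";
print {$fh} <<'RULE';
sub {
    my ($page) = @_;
    return $page =~ m{<h1>(.*?)</h1>}s ? $1 : undef;
};
RULE
close $fh or die "Cannot close $rule_path: $!";

# Load every rule file in a directory and collect the code refs.
sub load_rules {
    my ($rules_dir) = @_;
    my @rules;
    for my $file (sort glob("$rules_dir/*.pl")) {
        my $rule = do $file;
        die "Cannot load $file: " . ($@ || $!) unless ref($rule) eq 'CODE';
        push @rules, $rule;
    }
    return @rules;
}

# Return the first title any rule finds, or undef if none match.
sub extract_title {
    my ($page, @rules) = @_;
    for my $rule (@rules) {
        my $title = $rule->($page);
        return $title if defined $title;
    }
    return undef;
}

my @rules = load_rules($dir);
my $title = extract_title('<html><h1>Big News</h1></html>', @rules);
print defined $title ? "$title\n" : "no title found\n";
```

Since the rule files are executed as Perl, this only makes sense if you trust whatever generates them.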

For actually finding the title, I suggest going a step further on the assumption that "sites are typically generated from a database." Instead of looking at the HTML structure for a pattern, use the raw text. Once the user selects the node that contains the title, capture some number of characters of context before and after the title. As with the HTML pattern you described, you might refine how much context you keep based upon its uniqueness.

Once you have the context, you can spit out the new subroutine as little more than a regexp match. Or go one step further and use String::Approx to do approximate matching.
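
Here is roughly what I mean by turning the context into a regexp. This is only a sketch: build_title_regex and the fixed context width are my own inventions, and it does exact matching, not the approximate matching String::Approx would give you.

```perl
use strict;
use warnings;

# Build a regexp from up to $width characters of raw text on each side
# of the title the user selected. Returns undef if the title is absent.
sub build_title_regex {
    my ($page, $title, $width) = @_;
    $width = 20 unless defined $width;

    my $start = index($page, $title);
    return undef if $start < 0;

    my $before_len = $start < $width ? $start : $width;
    my $before = substr($page, $start - $before_len, $before_len);
    my $after  = substr($page, $start + length($title), $width);

    # quotemeta keeps markup characters from acting as regexp syntax.
    my $before_q = quotemeta($before);
    my $after_q  = quotemeta($after);
    return qr/$before_q(.+?)$after_q/s;
}

# Learn the rule from one sample page, then apply it to another page
# generated from the same (hypothetical) template.
my $sample   = '<td class="hd"><b>Perl 6 Released</b></td><td>by staff</td>';
my $new_page = '<td class="hd"><b>Monks Rejoice</b></td><td>by staff</td>';

my $re = build_title_regex($sample, 'Perl 6 Released', 12);
my ($found) = $new_page =~ $re;
print defined $found ? "$found\n" : "no title found\n";
```

To refine the width, you could grow it until the context matches only once in the sample page.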

To reduce false positives, you could look for a signature on a web page that identifies it as coming from a particular source. For instance, a copyright notice won't often change. When you create the rule, you also include a check for the copyright, and return undef immediately if it's not there.
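
The signature check can be bolted onto any rule with a small wrapper. Again just a sketch: with_signature and the copyright text are hypothetical.

```perl
use strict;
use warnings;

# Wrap a rule so it only runs on pages that carry the site's signature.
sub with_signature {
    my ($signature, $rule) = @_;
    return sub {
        my ($page) = @_;
        return undef if index($page, $signature) < 0;   # not this site
        return $rule->($page);
    };
}

my $rule = with_signature(
    'Copyright Example News Inc.',    # hypothetical signature text
    sub { $_[0] =~ m{<h1>(.*?)</h1>}s ? $1 : undef },
);

my $ours   = '<h1>Big News</h1><p>Copyright Example News Inc.</p>';
my $theirs = '<h1>Other Story</h1><p>Copyright Someone Else</p>';

for my $page ($ours, $theirs) {
    my $title = $rule->($page);
    print defined $title ? "$title\n" : "skipped\n";
}
```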

Another way to improve accuracy is a feedback loop with the users. Give each subroutine a weight. If more than one subroutine gives you back an answer, use the one with the highest weight. However, also include links on the jump page (I assume you have a web version of the RDF feed) like "Should this have been titled 'Such-and-such'?" When clicked, it increases the weight of the subroutine that gave the right answer and decreases the weight of the one that gave the wrong answer. (But beware malicious users.)
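
A rough sketch of the weighting, assuming the weights live in a plain hash keyed by rule name. The rule names, the step of 1, and both function names are made up, and it does nothing about malicious users.

```perl
use strict;
use warnings;

# Pick the rule with the highest weight among those that answered.
# Ties are broken by rule name so the result is deterministic.
sub pick_by_weight {
    my ($weights, %answers) = @_;    # %answers: rule name => title
    my ($best) = sort { $weights->{$b} <=> $weights->{$a} || $a cmp $b }
                 keys %answers;
    return $best;
}

# Apply one piece of user feedback: reward the rule that was right,
# penalize the rules that were wrong.
sub record_feedback {
    my ($weights, $right, @wrong) = @_;
    $weights->{$right}++;
    $weights->{$_}-- for @wrong;
    return;
}

my %weight  = (rule_a => 1, rule_b => 1);
my %answers = (rule_a => 'Related Links', rule_b => 'Big News');

my $first_pick = pick_by_weight(\%weight, %answers);
print "before feedback: $first_pick\n";

# A user clicks "Should this have been titled 'Big News'?"
record_feedback(\%weight, 'rule_b', 'rule_a');

my $second_pick = pick_by_weight(\%weight, %answers);
print "after feedback: $second_pick\n";
```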

One more: Have each subroutine return a confidence value (perhaps the Levenshtein edit distance of the context, inverted). Then use the one with the highest confidence.
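
One way to read "inverted" is to scale the edit distance by the longer string's length and subtract it from 1, so identical context scores 1 and completely different context scores 0. That scaling is my assumption, as is the hand-rolled levenshtein below (written out so the sketch has no CPAN dependencies).

```perl
use strict;
use warnings;

# Classic dynamic-programming Levenshtein distance, one row at a time.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;     # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Confidence in [0, 1]: 1 means the context seen on the page is
# identical to the context stored with the rule.
sub confidence {
    my ($expected, $seen) = @_;
    my $longest = length($expected) > length($seen)
                ? length($expected)
                : length($seen);
    return 1 if $longest == 0;
    return 1 - levenshtein($expected, $seen) / $longest;
}

my $stored = '<td class="hd"><b>';
my $seen   = '<td class="hx"><b>';
printf "confidence: %.2f\n", confidence($stored, $seen);
```

Each rule would return its title together with this number, and the scraper would keep the title with the highest one.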


In reply to Re: Extracting arbitrary data from HTML by TilRMan
in thread Extracting arbitrary data from HTML by vbfg
