PerlMonks  

Re: Re: HTML content extractor

by Nooks (Monk)
on Feb 11, 2001 at 08:21 UTC


in reply to Re: HTML content extractor
in thread HTML content extractor

I actually ran this. I'm not sure what it's supposed to do that makes it superior to the HTML::Parser quickies being kicked around, but it doesn't.

Thanks for running the software. For an idea of what it is supposed to do, download the source of, say, a Wired or CNN news article, and run that past the program. Those are two types of input documents that I know work well.

Yes, unfortunately it is far from perfect. The intent is to use it on busy weblog and news portal sites: automatically download a page and trim out things like sidebars, boxes interrupting the flow of text, headers, and footers. So I'm not surprised it didn't do too well on a POD page. It assumes there's something to be found, and that assumption doesn't hold for a document that is pretty much all content and no distraction.

What's supposed to make it superior to HTML::Parser quickies (and I've written a few of them in my time) is that it doesn't have to be told how to interpret a given page. This may have to change in the future (the range of HTML out there is pretty big!) but I'm confident the approach is robust enough that with work it'll be a killer. If anyone has an HTML::Parser quickie that works in the general case, I'd be very pleased to see it.
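
For readers wondering what a page-agnostic trimmer could look like: the post does not say how the program decides what to keep, so the following is only a sketch of one common heuristic (drop blocks that are short or mostly link text), not the program's actual method. It is written in Python with the standard library's html.parser rather than HTML::Parser, and every name in it (BlockCollector, extract_content, the tag sets, the thresholds) is made up for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries. This set is an assumption; real pages vary.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote", "pre"}
# Tags whose text is never page content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting how many words sit inside links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # list of (text, link_word_count)
        self._text = []        # text pieces of the block being built
        self._link_words = 0   # words seen inside <a> in the current block
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_words))
        self._text = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_words += len(data.split())


def extract_content(html, min_words=10, max_link_density=0.3):
    """Return the blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    kept = []
    for text, link_words in parser.blocks:
        words = len(text.split())
        if words >= min_words and link_words / words <= max_link_density:
            kept.append(text)
    return kept


if __name__ == "__main__":
    import sys
    # Usage: python extract.py < saved_article.html
    for block in extract_content(sys.stdin.read()):
        print(block)
        print()
```

A heuristic like this shows the failure mode described above: on a page that is all content, such as POD output, there is little or nothing for the link-density test to reject, and the word-count threshold may even throw away short but legitimate paragraphs.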

The error you got is very unfortunate and wholly my fault for posting something so premature.
