PerlMonks
Re: Re: HTML content extractor
by Nooks (Monk) on Feb 11, 2001 at 18:19 UTC ( [id://57738]=note )
"If all you want to do is extract the text content from an HTML document, you can use YAPE::HTML like so:"

Yes, and if extracting text were all I wanted to do, that's how I'd do it. The point of this CUFP is to extract content (the important text that would appear in a rendered HTML page) as opposed to non-content: the comments, the JavaScript, the unnecessary tags and other fluff. Non-content can't be reliably removed without some idea of the document structure. That structure is readily available from a parse tree or similar, but not from a simple variation on HTML::Parser, which can't easily provide context or easy document manipulation.

Usually a parse tree would be readily available through a DOM or XSLT, or a DTD or something, but most HTML is not written well enough to manipulate this way. So I'm using HTML::TreeBuilder to create the parse tree for me, since it provides excellent support for parsing ambiguous elements the way a browser would.

Obviously I am not communicating my idea well, or this code is not as good as I think it is, or something. To try to alleviate this problem, I'll include the POD for the program here:
For anyone still interested in looking at the output of the program, I recommend either the lynx or w3m text browsers, which will render as text to a terminal or tty if passed the -dump argument.
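
For readers who want to see the tree-based idea in miniature, here is a rough sketch. It is not the program discussed in this node, only an illustration under my own assumptions: the sub name extract_content is hypothetical, and the choice to prune only script and style elements is a simplification of "non-content". It builds a parse tree with HTML::TreeBuilder, deletes subtrees that never appear in a rendered page, and returns what is left as text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of the original program): return the
# visible text of an HTML string after pruning script/style subtrees.
sub extract_content {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(0);    # the default: comments never enter the tree
    $tree->parse($html);
    $tree->eof;

    # Having a tree means whole subtrees can be removed in one step,
    # which is the context a flat HTML::Parser callback does not give you.
    $_->delete for $tree->look_down(
        sub { $_[0]->tag =~ /^(?:script|style)$/ }
    );

    my $text = $tree->as_text;
    $tree->delete;               # release the tree's circular references

    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^ | $//g;         # trim the ends
    return $text;
}
```

Calling extract_content on a page's source should give back only the text a browser would show. Note that as_text joins text nodes without separators, so a real extractor would probably walk the tree itself to keep paragraph breaks.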
In Section: Cool Uses for Perl