Parsing HTML

Gorby has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing HTML by davido (Cardinal) on Jun 13, 2004 at 16:12 UTC
HTML is the markup language used to format the text. If you take it out, you can approximate some things (paragraph breaks, for example), but you can't have bold, italics, tables, etc., at least not in an OS-independent way. If formatting isn't an absolute requirement, read on...
As far as removing all HTML goes, I happen to like HTML::Strip. It's easy to use and produces pretty readable output. It has a habit of indenting a lot, but that's easy to strip out too if you want. Here's the synopsis from its POD: `use HTML::Strip; my $hs = HTML::Strip->new(); my $clean_text = $hs->parse( $raw_html ); $hs->eof;` $clean_text will now contain the HTML-free version of $raw_html. It's as easy to use as LWP::Simple.
Dave
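The "easy to strip out" indentation cleanup mentioned above could look something like the sketch below. It uses core Perl only and works on whatever text the tag stripper returns. The helper name `tidy_whitespace` and the sample string are made up for illustration; they are not part of HTML::Strip.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): tidies the text left over
# once the tags have been removed.
sub tidy_whitespace {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;        # normalise line endings to \n
    $text =~ s/^[ \t]+//mg;       # drop leading indentation on every line
    $text =~ s/[ \t]+$//mg;       # drop trailing spaces and tabs
    $text =~ s/[ \t]{2,}/ /g;     # squeeze runs of spaces/tabs to one space
    $text =~ s/\n{3,}/\n\n/g;     # keep at most one blank line in a row
    $text =~ s/\A\n+|\n+\z//g;    # trim blank lines at both ends
    return $text;
}

# Sample input standing in for the output of a tag stripper (an assumption).
my $stripped = "   Title  \n\n\n\n      First   paragraph.\n   Second line.\n\n";
print tidy_whitespace($stripped), "\n";
```

In practice you would pass `$clean_text` from the synopsis to the helper. Whether to keep blank lines as paragraph breaks is a matter of taste; the sketch keeps one.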
Re: Parsing HTML by atcroft (Abbot) on Jun 13, 2004 at 16:06 UTC
One possible way, shown in the Perl Cookbook (1st ed.), problem 20.6, is: `use HTML::Parse; use HTML::FormatText; $plain_text = HTML::FormatText->new->format(parse_html($html_text));` (I thought there was a function in LWP::Simple that would just return the text, but I can't recall for sure.) Hope this helps.
Re: Parsing HTML by thunders (Priest) on Jun 14, 2004 at 06:05 UTC
Many of the modules that inherit from HTML::Parser are quite capable of this kind of task. I like HTML::TokeParser::Simple. `use strict; use warnings; use LWP::Simple; use HTML::TokeParser::Simple; my $site = $ARGV[0]; my $content = get($site) or die "Could not fetch $site\n"; my $parser = HTML::TokeParser::Simple->new(\$content); while ( my $token = $parser->get_token ) { next unless $token->is_text; print $token->as_is; }`
Re: Parsing HTML by injunjoel (Priest) on Jun 14, 2004 at 00:09 UTC
Greetings all, I know there are modules out there to do this, but this regexp has yet to let me down for stripping HTML-style tags. `$strContainingHTML =~ s|<[^>]+>||sg;` ...and you should check some of them out :} -injunjoel
Update: Thanks dragonchild for pointing out the error of my assumptive ways. I guess I've usually (not always) deferred JavaScript logic to functions and let the event handlers simply trigger the call to them. A case where personal style has saved me many potential headaches. Good catch.
Re^2: Parsing HTML by dragonchild (Archbishop) on Jun 14, 2004 at 01:42 UTC
Sorry, but I couldn't resist. `use strict; use warnings; my $x = "<html onLoad='if (x > 5) { exit; }' />"; print "'$x'\n"; $x =~ s|</?[^>]+>||sg; print "'$x'\n";`
Output: `'<html onLoad='if (x > 5) { exit; }' />'` followed by `' 5) { exit; }' />'`
------ We are the carpenters and bricklayers of the Information Age. Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
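To make the failure above concrete: the simple pattern stops at the first `>` it sees, even when that `>` sits inside a quoted attribute value. The sketch below shows one possible patch, a pattern that skips over single- and double-quoted values. The helper name `strip_tags_quote_aware` and the trailing `Hello` in the sample are made up for illustration. This is not a full HTML parser: comments, script and style bodies, CDATA and malformed markup are not handled, so the parser modules recommended elsewhere in the thread remain the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: strips tags while treating quoted attribute values
# as opaque, so a '>' inside quotes does not end the tag early.
sub strip_tags_quote_aware {
    my ($html) = @_;
    $html =~ s{
        <                 # start of a tag
        (?: [^>'"]        # any ordinary character
          | '[^']*'       # or a whole single-quoted value
          | "[^"]*"       # or a whole double-quoted value
        )*
        >                 # the real end of the tag
    }{}gsx;
    return $html;
}

my $x = qq{<html onLoad='if (x > 5) { exit; }' />Hello};

# The simple pattern from the thread, for comparison.
(my $naive = $x) =~ s|<[^>]+>||sg;

print "naive:       '$naive'\n";
print "quote-aware: '", strip_tags_quote_aware($x), "'\n";
```

The naive substitution leaves the tail of the attribute behind, while the quote-aware one removes the whole tag and keeps only the text after it.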

