HTML is the markup that formats the text. Once you take it out, you can approximate some things (paragraph breaks, for example), but you can't have bold, italics, tables, etc. ...not in an OS-independent way, at least. If formatting isn't an absolute requirement, read on...
As far as removing all HTML goes, I happen to like HTML::Strip. It's easy to use and produces pretty readable output. It has a habit of indenting a lot, but that's easy to strip out too if you want. Here's the synopsis from its POD:
use HTML::Strip;
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
$clean_text will now contain the HTML-free version of $raw_html. It's as easy to use as LWP::Simple.
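If the extra indentation bothers you, a few substitutions will tidy it up. This is only a rough sketch: tidy_text is a made-up helper name, not something HTML::Strip provides, and the sample string just stands in for whatever $hs->parse gave you.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): tidy up the
# whitespace in text returned by $hs->parse.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/^[ \t]+//mg;      # drop leading indentation on every line
    $text =~ s/[ \t]+$//mg;      # drop trailing spaces and tabs
    $text =~ s/[ \t]{2,}/ /g;    # collapse runs of spaces inside a line
    $text =~ s/\n{3,}/\n\n/g;    # allow at most one blank line in a row
    return $text;
}

# Stand-in for the output of $hs->parse( $raw_html ).
my $sample = "   Hello   world  \n\n\n\n\t second line\n";
print tidy_text($sample);
```

Adjust the rules to taste; if you care about preformatted text, collapsing inner spaces is probably not what you want.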
use HTML::Parse;
use HTML::FormatText;
my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
(I thought there was a function in LWP::Simple that would just return the text, but I can't recall for sure.)
Hope this helps.
Many of the modules that inherit from HTML::Parser are quite capable for this kind of task. I like HTML::TokeParser::Simple.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser::Simple;
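my $site = $ARGV[0] or die "Usage: $0 URL\n";
my $content = get($site);
# get() returns undef on failure, so check before parsing
die "Couldn't fetch $site\n" unless defined $content;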
my $parser = HTML::TokeParser::Simple->new(\$content);
while ( my $token = $parser->get_token ) {
next unless $token->is_text;
print $token->as_is;
}
Sorry, but I couldn't resist.
use strict;
use warnings;
my $x = "<html onLoad='if (x > 5) { exit; }' />";
print "'$x'\n";
$x =~ s|</?[^>]*>||sg;
print "'$x'\n";
-----
'<html onLoad='if (x > 5) { exit; }' />'
' 5) { exit; }' />'
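For what it's worth, a pattern that knows about quoted attribute values gets past this particular case. Treat it as a sketch, not a recommendation: strip_tags_quote_aware is a name made up for illustration, and the pattern still knows nothing about comments, script bodies, or tags with an unbalanced quote, which is why the parser modules above are the safer bet.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not from any module.
sub strip_tags_quote_aware {
    my ($html) = @_;
    # A tag is '<', then any mix of: characters that are neither
    # quotes nor '>', a complete '...' string, or a complete "..."
    # string, and finally the closing '>'.
    $html =~ s/<(?:[^>'"]|'[^']*'|"[^"]*")*>//g;
    return $html;
}

my $x = "<html onLoad='if (x > 5) { exit; }' />";
print "'", strip_tags_quote_aware($x), "'\n";
```

Here the whole tag goes away because the '>' inside the quoted attribute no longer ends the match.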
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested